Open-Source AI Language Model with a Distinctly European Perspective

Author: ChemistryViews

The OpenGPT-X research project has developed Teuken-7B, an open-source AI language model with a distinctly European perspective, for use in academia and industry. Teuken-7B is currently one of the few large language models developed multilingually from the ground up. Approximately 50 % of its pre-training data is non-English, and it has been trained in all 24 official languages of the European Union. Teuken-7B was trained on the JUWELS supercomputer at Jülich Research Center.

The OpenGPT-X team has also focused on how to train and operate multilingual AI language models in a more energy- and cost-efficient way. To this end, they developed a multilingual “tokenizer”. A tokenizer breaks down words into individual word components (tokens). The fewer tokens a text requires, the more quickly and energy-efficiently a language model can generate an answer. According to the consortium, the new tokenizer reduces training costs compared to other multilingual tokenizers, such as those of Llama3 or Mistral. This is particularly valuable for languages with long words, such as German, Finnish, or Hungarian.
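The effect of a tokenizer's vocabulary on token counts can be illustrated with a small sketch. This is not the OpenGPT-X tokenizer: the greedy longest-match splitting, the two toy vocabularies, and the names `tokenize` and `tokens_per_word` are all assumptions made purely for illustration. The sketch compares how many tokens per word a vocabulary without German word parts needs against one that contains them.

```python
def tokenize(word, vocab):
    """Split a word greedily into the longest pieces found in vocab.

    Illustrative only: characters not covered by the vocabulary
    fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens needed per whitespace-separated word."""
    words = text.lower().split()
    total = sum(len(tokenize(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical vocabularies, invented for this example.
small_piece_vocab = {"die", "kr", "an", "ken", "ver", "si", "ch",
                     "er", "ung", "za", "hl", "t"}
german_aware_vocab = {"die", "kranken", "versicherung", "zahlt"}

text = "Die Krankenversicherung zahlt"

print(tokenize("krankenversicherung", small_piece_vocab))
print(tokenize("krankenversicherung", german_aware_vocab))
print(tokens_per_word(text, small_piece_vocab))   # 4.0
print(tokens_per_word(text, german_aware_vocab))  # about 1.33
```

In this toy setting, the long German compound costs eight tokens with the first vocabulary and two with the second. Since a model processes and generates text one token at a time, a lower tokens-per-word ratio means less computation for the same text.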

Teuken-7B is accessible via the Gaia-X infrastructure and can be downloaded from Hugging Face. Unlike existing cloud solutions, Gaia-X is a federated ecosystem that allows service providers and data owners to connect. Data remains securely with its owners and is only shared under defined conditions. The Gaia-X standards guarantee that data is stored and processed in accordance with the strictest European data protection and security regulations.

The model is available in two versions: one for research purposes and the other under an Apache 2.0 license for commercial use. The performance of the two models is roughly comparable, but some of the datasets used for instruction tuning preclude commercial use and were, therefore, not used in the Apache 2.0 version.

The OpenGPT-X project, funded by the German Federal Ministry of Economic Affairs and Climate Action (BMWK) with approximately €14 million, started on 1 January 2022 and will end on 31 March 2025. This means that further optimizations and evaluations of the models can still take place.

The ten project partners include Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Fraunhofer Institute for Integrated Circuits (IIS), IONOS (a European cloud infrastructure provider), German Research Center for Artificial Intelligence (DFKI), Aleph Alpha (German AI start-up), Jülich Research Center, TU Dresden, ControlExpert (company specializing in AI-based solutions for the automotive industry), Westdeutscher Rundfunk (WDR – West German Broadcasting), and KI Bundesverband (German AI Association).


 
