Arthur Spirling, New York University, NY, USA, a political and data scientist who uses and teaches about language models, has raised concerns about the use of proprietary and closed large language models (LLMs) in research. These models are developed and run by companies that do not disclose their underlying models for independent inspection or verification. It is unclear on which documents these models have been trained, and the companies might change the training data and their business models at any time. Spirling argues that this threatens progress on research ethics and reproducibility of results.
Instead, Spirling suggests that researchers should collaborate to develop open-source LLMs that are transparent and not dependent on a corporation's favor. One example of an open-source LLM is BLOOM, which was built by the AI company Hugging Face, New York City, NY, USA, in collaboration with over 1,000 volunteer researchers and partially funded by the French government.
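To illustrate what such openness means in practice, the minimal sketch below loads an openly published BLOOM checkpoint with the Hugging Face transformers library and runs it locally. The specific checkpoint (the small bigscience/bloom-560m variant) and the generation settings are illustrative assumptions, not part of Spirling's article.

```python
# Minimal sketch: running an open-source LLM (BLOOM) locally.
# Assumes the "transformers" and "torch" packages are installed; the small
# bigscience/bloom-560m checkpoint is chosen purely for illustration --
# any openly published checkpoint can be used the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "Open-source language models matter for science because"
inputs = tokenizer(prompt, return_tensors="pt")

# Because the weights and tokenizer are fully downloadable, the run is
# reproducible and does not depend on a vendor's hosted service.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model files themselves are archived and versioned, a researcher can pin one checkpoint and obtain the same behavior later, which is exactly the reproducibility that closed, continuously updated services cannot guarantee.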
However, open-source LLMs are generally far less well-funded than proprietary models, so more collaboration and pooling of international resources and expertise are needed. Academic codes of conduct and regulations for working with LLMs are also necessary; Spirling believes that drawing these up will take time and that such regulations will probably be clumsy at first and slow to take effect. Another challenge is that the field moves so fast that versions of LLMs become obsolete within weeks or months. Therefore, the more academics who join this effort, the better.
Currently, researchers have access to open LLMs developed by private organizations, such as LLaMA, developed by Facebook parent company Meta in Menlo Park, CA, USA. LLaMA was originally released to researchers on a case-by-case basis only, but the full model was later leaked online; both LLaMA and Meta's earlier model OPT-175B are now free to use. However, this leaves science relying on corporations' benevolence, which is an unstable situation in the long run.
Why open-source generative AI models are an ethical way forward for science,
Arthur Spirling,
Nature 2023.
https://doi.org/10.1038/d41586-023-01295-4