BigCode is an open scientific collaboration working on the responsible development of large language models for code, empowering the machine learning and open source communities through open governance.
Code LLMs enable the completion and synthesis of code, both from other code snippets and natural language descriptions, and work across a wide range of domains, tasks, and programming languages. These models can, for example, assist professional and citizen developers with building new applications.
BigCode invites AI researchers to work together on the development of state-of-the-art code LLMs, and collaborate on research topics such as:
- Constructing a representative evaluation suite for code LLMs, covering a diverse set of tasks and programming languages
- Developing new methods for faster training and inference of LLMs
- Examining the legal, ethical, and governance aspects of code LLMs
The BigCode project is conducted in the spirit of Open Science. Datasets, models, and experiments are developed through open collaboration and released back to the community under permissive licenses. While the project has corporate support from ServiceNow and Hugging Face (for example, for hosting models and datasets and for training compute), all technical governance takes place within working groups and task forces across the community.
As code LLMs are developed with data from the open-source community, we believe open governance can help ensure that these models benefit the wider developer community. We are working on tools that give code creators agency over whether their source code is included in the training data, and that give attribution to developers when models output near-copies of the training data.