Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling!
SantaCoder is trained on Python, Java, and JavaScript and considerably outperforms other large multilingual models such as InCoder (6.7B) or CodeGen-multi (2.7B)!
A lot of pieces from a lot of collaborators came together to get to that result:
The foundation to train SantaCoder is The Stack (v1.1) dataset. Given the relatively small size of our model (1B parameters) we chose three popular programming languages: Python, Java, and JavaScript.
Before training any models we looked into removing sensitive information from code such as email addresses, secret keys and IP addresses. For that purpose we annotated 400 samples and then built and continuously refined RegEx rules to remove the information before training.
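A minimal sketch of what such a redaction pass looks like (illustrative patterns only, not the exact rules used for The Stack):

```python
import re

# Illustrative PII-redaction patterns; the real pipeline was iterated
# against 400 annotated samples and is more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(code: str) -> str:
    """Replace emails and IPv4 addresses with placeholder tokens."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    code = IPV4_RE.sub("<IP_ADDRESS>", code)
    return code

print(redact_pii("contact: jane@example.com at 192.168.0.1"))
```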
In our experiments, we explored two questions:
First, can we use Multi Query Attention (MQA) together with Fill-in-the-middle (FIM) without performance loss?
Second, what's the best data filtering procedure for code models?
In Multi-Head Attention every head has its own queries, keys, and values. In MQA, each head keeps its own queries while keys and values are shared across all heads. This shrinks the KV cache, saving memory and speeding up inference for large batches. We found it had only a minor impact on performance:
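A toy NumPy sketch of the idea (shapes only, no learned projections): queries stay per-head, while a single key/value head is broadcast across all query heads.

```python
import numpy as np

# Toy Multi-Query Attention: per-head queries, one shared K/V head.
n_heads, seq, d_head = 8, 16, 32
rng = np.random.default_rng(0)

q = rng.normal(size=(n_heads, seq, d_head))  # one set of queries per head
k = rng.normal(size=(seq, d_head))           # shared keys (1 head, not 8)
v = rng.normal(size=(seq, d_head))           # shared values

scores = q @ k.T / np.sqrt(d_head)           # (n_heads, seq, seq), K broadcast
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)    # softmax over key positions
out = weights @ v                            # (n_heads, seq, d_head)

# KV cache per token drops from n_heads * 2 * d_head to 2 * d_head floats.
```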
Fill-in-the-Middle is a clever approach where a sequence is reordered such that prefix|middle|suffix becomes prefix|suffix|middle. With that you can use normal left-to-right generation to fill the middle part.
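The reordering can be sketched like this (the sentinel strings below match the SantaCoder tokenizer's special tokens, but treat the details as illustrative):

```python
import random

# FIM sentinels marking the reordered segments.
FIM_PRE, FIM_SUF, FIM_MID = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"

def fim_transform(doc: str, rng: random.Random) -> str:
    """Split a document at two random points and reorder
    prefix|middle|suffix into prefix|suffix|middle, so the middle
    can be generated left-to-right at inference time."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_PRE}{prefix}{FIM_SUF}{suffix}{FIM_MID}{middle}"
```

At inference you prompt with `prefix` and `suffix` up to the `<fim-middle>` sentinel and let the model complete the middle.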
While some claim FIM comes for free, we found it's rather FIM-for-cheap:
In addition to the standard near-deduplication and heuristics pipeline, we ran 4 filtering experiments: GitHub stars, tokenizer fertility, comment-to-code ratio and more near-deduplication.
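To make one of these signals concrete, here is a rough sketch of a comment-to-code ratio for Python files using the stdlib tokenizer (an illustration, not the exact filter we ran):

```python
import io
import tokenize

# Structural tokens that are neither comments nor code content.
_NON_CODE = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
             tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)

def comment_ratio(source: str) -> float:
    """Fraction of meaningful tokens that are comments."""
    toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comments = sum(1 for t in toks if t.type == tokenize.COMMENT)
    code = sum(1 for t in toks if t.type not in _NON_CODE)
    return comments / max(comments + code, 1)

print(comment_ratio("# add two numbers\ndef add(a, b):\n    return a + b\n"))
```

Files with a ratio near zero (no documentation at all) or near one (mostly commented-out code) are candidates for filtering.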
Filtering for GitHub stars hurts performance while comments and near-dedup help!
The thorough evaluation was made possible by the evaluation working group. They evaluated on the MultiPL-E benchmark (multilingual HumanEval and MBPP) and CodeXGLUE and extended the code evaluation harness!
With these new insights we trained a final model called SantaCoder. We applied both the extra near-deduplication and comment-to-code ratio filters and trained for 600K steps (236B tokens). The result is an efficient (MQA) and flexible (FIM) multilingual model:
We release all models and intermediate checkpoints on the Hugging Face Hub; you can load them by passing the corresponding revision:
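With 🤗 transformers this looks roughly as follows (the hub id is `bigcode/santacoder`; the branch name passed as `revision` is illustrative, check the model card for the available checkpoints):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "bigcode/santacoder"

def load_checkpoint(revision: str = "main"):
    """Load a SantaCoder checkpoint by its branch name on the Hub."""
    tok = AutoTokenizer.from_pretrained(CKPT, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        CKPT,
        revision=revision,
        trust_remote_code=True,  # custom architecture with MQA
    )
    return tok, model

# e.g. tok, model = load_checkpoint("main")
```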
The compute for these experiments was sponsored by @ServiceNowRSRCH's research cluster.
The SantaCoder models are licensed under an open & responsible AI license (OpenRAIL). These are AI-specific licenses enabling free use and distribution of the model while setting specific use restrictions (e.g. malware generation). cc @ResponsibleAIL
An important aspect of using these models is that they can copy code from the training data that requires attribution. To help users navigate this we built a search index of the pretraining data.
Between now and Christmas🎄 we are running a series of experiments to figure out what the best pre-processing is for code datasets such as The Stack. We'll share the W&B dashboards of these 🎅-models so if you are interested you can follow along!
We are training ~1B parameter models on the Python/Java/JavaScript subset of The Stack. On the architecture side we want to evaluate the Fill-in-the-Middle (FIM) objective, as well as multi-query attention.
For preprocessing we are looking at GitHub stars/forks, tokenizer fertility, comment/code ratio, more near-deduplication and other heuristics. The experiments use the new version of The Stack that we'll release soon, which removes opted-out repositories and refines the license filters.
Dataset collection: Using gharchive.org, over 220M repos were identified and 137M successfully cloned, yielding over 50B files and 90TB of data. Filtering by extension and permissive license brings this to 3TB of data. We also make a near-deduplicated version (1.5TB) available.
The dataset includes ~30 programming languages covering common languages such as Java, C/C++ and Python as well as lower resource languages (2GB of Dockerfiles 🐳). If you'd like to see a new language added, feel free to add it in this issue:
Excited to announce the BigCode project led by @ServiceNowRSRCH and @huggingface! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way.
🌸Language models for code (Codex, CodeGen) and the applications they power (AI assisted programming) are gaining traction. Some models have been released, but there are still questions around data governance, robustness of evaluation benchmarks, and the engineering behind them.
📚The first goal of BigCode is to develop and release a dataset large enough to train a state-of-the-art language model for code. We will also ensure that only files from repositories with permissive licenses go into the dataset.