BigCode · Dec 22 · 15 tweets · 7 min read
Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling!

Demo: hf.co/spaces/bigcode…
Paper: hf.co/datasets/bigco…
Attribution: hf.co/spaces/bigcode…

A🧵:
SantaCoder is trained on Python, Java, and JavaScript and considerably outperforms much larger multilingual models such as InCoder (6.7B) and CodeGen-multi (2.7B)!

A lot of pieces from a lot of collaborators came together to get to that result:
The foundation for training SantaCoder is The Stack (v1.1) dataset. Given the relatively small size of our model (1.1B parameters), we chose three popular programming languages: Python, Java, and JavaScript.

You can check if your code was used for training here: huggingface.co/spaces/bigcode…
Before training any models, we looked into removing sensitive information from the code, such as email addresses, secret keys, and IP addresses. For that purpose, we annotated 400 samples and then built and continuously refined regex rules to strip this information before training.
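As a rough illustration of what such rules look like, here is a minimal sketch of regex-based redaction. The patterns and placeholder tokens below are illustrative stand-ins, not the actual BigCode redaction rules, which were developed and refined against the 400 annotated samples:

```python
import re

# Illustrative patterns only -- the real redaction rules are more refined.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Very rough stand-in for secret-key detection (long base64/hex-like blobs).
KEY_RE = re.compile(r"\b[A-Za-z0-9+/]{32,}={0,2}\b")

def redact(code: str) -> str:
    """Replace sensitive substrings with placeholder tokens before training."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    code = IPV4_RE.sub("<IP_ADDRESS>", code)
    code = KEY_RE.sub("<KEY>", code)
    return code

print(redact("contact: jane@example.com, host=192.168.0.1"))
# contact: <EMAIL>, host=<IP_ADDRESS>
```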
In our experiments, we explored two questions:

First, can we use Multi Query Attention (MQA) together with Fill-in-the-middle (FIM) without performance loss?

Secondly, what's the best data filtering procedure for code models?
In multi-head attention, every head has its own set of queries, keys, and values. In MQA, each head keeps its own queries while the keys and values are shared across heads. This saves memory and speeds up inference for large batches. We found it had only a minor impact on performance:
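For intuition, here is a minimal PyTorch sketch of MQA (not the SantaCoder implementation): it omits causal masking, the output projection, and the KV cache that makes MQA fast at inference, and the weight shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Per-head queries, a single shared key/value head.

    x:   (batch, seq, d_model)
    w_q: (d_model, n_heads * d_head)  -- one query projection per head
    w_k: (d_model, d_head)            -- shared key projection
    w_v: (d_model, d_head)            -- shared value projection
    """
    b, t, _ = x.shape
    d_head = w_k.shape[1]

    q = (x @ w_q).view(b, t, n_heads, d_head).transpose(1, 2)  # (b, h, t, d)
    k = (x @ w_k).unsqueeze(1)  # (b, 1, t, d) -- broadcast over all heads
    v = (x @ w_v).unsqueeze(1)  # (b, 1, t, d)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (b, h, t, t)
    out = F.softmax(scores, dim=-1) @ v               # (b, h, t, d)
    return out.transpose(1, 2).reshape(b, t, n_heads * d_head)

x = torch.randn(2, 16, 256)
out = multi_query_attention(x, torch.randn(256, 8 * 32),
                            torch.randn(256, 32), torch.randn(256, 32), n_heads=8)
print(out.shape)  # torch.Size([2, 16, 256])
```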
Fill-in-the-Middle (FIM) is a clever approach in which a sequence is reordered so that prefix|middle|suffix becomes prefix|suffix|middle. With that, normal left-to-right generation can fill in the middle part.

While some claim FIM comes for free, our results suggest it is more accurate to say FIM comes cheap:
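To make the reordering concrete, here is a simplified character-level sketch of the FIM transformation and of an infilling prompt. The sentinel token strings are placeholders for illustration; the exact special tokens are defined by the model's tokenizer:

```python
import random

# Placeholder sentinel tokens -- check the tokenizer for the exact strings.
PRE, SUF, MID = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"

def to_fim(document: str, rng: random.Random) -> str:
    """Reorder prefix|middle|suffix into prefix|suffix|middle so a standard
    left-to-right LM learns to generate the middle last."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

# At inference time, build the same layout, end the prompt with the middle
# sentinel, and let the model generate the missing span:
prompt = f"{PRE}def hello():\n    print({SUF})\n{MID}"
```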
In addition to the standard near-deduplication and heuristics pipeline, we ran 4 filtering experiments: GitHub stars, tokenizer fertility, comment-to-code ratio and more near-deduplication.

Filtering by GitHub stars hurts performance, while the comment-to-code and extra near-deduplication filters help!
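As an example of what one of these filters could look like, here is a rough comment-to-code ratio for Python files built on the standard tokenize module. The exact definition and thresholds used in the experiments are described in the report; this is only a sketch:

```python
import io
import tokenize

def comment_to_code_ratio(source: str) -> float:
    """Fraction of Python tokens that are comments -- a rough documentation proxy."""
    comments = code = 0
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comments += 1
            elif tok.type not in skip:
                code += 1
    except (tokenize.TokenError, IndentationError):
        return 0.0
    return comments / max(comments + code, 1)

sample = "def add(a, b):\n    # add two numbers\n    return a + b\n"
print(comment_to_code_ratio(sample))  # ~0.08: one comment among thirteen tokens
```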
The thorough evaluation was made possible by the evaluation working group, which evaluated the models on the MultiPL-E benchmark (multilingual HumanEval and MBPP) and on CodeXGLUE, and extended the code evaluation harness!

github.com/bigcode-projec…
With these new insights we trained a final model called SantaCoder. We applied both the extra near-deduplication and the comment-to-code ratio filters and trained for 600K steps (236B tokens). The result is an efficient (MQA) and flexible (FIM) multilingual model:
We release all models and intermediate checkpoints on the Hugging Face Hub; specific checkpoints can be loaded via the revision argument:

huggingface.co/bigcode/santac…
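Loading a checkpoint with 🤗 Transformers looks roughly like the sketch below. The intermediate-checkpoint branch names are listed on the model page; `revision="main"` here is just the default final model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    revision="main",          # or the branch name of an intermediate checkpoint
    trust_remote_code=True,   # the MQA architecture ships as custom model code
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```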

The compute for these experiments was sponsored by @ServiceNowRSRCH's research cluster.
The SantaCoder models are licensed under an open & responsible AI license (OpenRAIL). These are AI-specific licenses enabling free use and distribution of the model while setting specific use restrictions (e.g., against malware generation). cc @ResponsibleAIL

FAQ: bigcode-project.org/docs/pages/mod…
An important aspect of using these models is that they can copy code from the training data that requires attribution. To help users navigate this, we built a search index of the pretraining data.

hf.co/spaces/bigcode…
Finally, we summarized our findings in a technical report with a wonderful group of collaborators:

Paper: hf.co/datasets/bigco…

So what's next? Scaling to larger models and training more languages early next year! 🚀

More from @BigCodeProject

Nov 29
Between now and Christmas🎄 we are running a series of experiments to figure out the best pre-processing for code datasets such as The Stack. We'll share the W&B dashboards of these 🎅-models, so if you are interested you can follow along!
We are training ~1B parameter models on the Python/Java/JavaScript subset of The Stack. On the architecture side we want to evaluate the Fill-in-the-Middle (FIM) objective, as well as multi-query attention.
For preprocessing we are looking at GitHub stars/forks, tokenizer fertility, comment-to-code ratio, more near-deduplication, and other heuristics. The experiments use the new version of The Stack that we'll release soon, which honors opt-out requests and refines the license filters.
Oct 27
Introducing 📑 The Stack - a 3TB dataset of permissively licensed code in 30 programming languages.

hf.co/datasets/bigco…

You want your code excluded from the model training? There is an opt-out form and data governance plan:

bigcode-project.org/docs/about/the…

Let's take a tour🧵
Dataset collection: Using gharchive.org, over 220M repositories were identified and 137M successfully cloned, yielding over 50B files and 90TB of data. Filtering by file extension and permissive licenses reduces this to 3TB. We also make a near-deduplicated version (1.5TB) available.
The dataset includes ~30 programming languages covering common languages such as Java, C/C++ and Python as well as lower resource languages (2GB of Dockerfiles 🐳). If you'd like to see a new language added, feel free to add it in this issue:

github.com/orgs/bigcode-p…
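If you want to poke at the data itself, a single-language subset can be streamed with the datasets library instead of downloading all 3TB. This sketch assumes the data/<language> directory layout described on the dataset card and that you have accepted the dataset's terms on the Hub:

```python
from datasets import load_dataset

# Stream the Dockerfile subset rather than downloading the full dataset.
# Requires `huggingface-cli login` and accepting the dataset's terms of use.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/dockerfile",
    split="train",
    streaming=True,
)

for example in ds.take(3):
    print(example["content"][:200])
```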
Sep 26
print("Hello world! 🎉")

Excited to announce the BigCode project led by @ServiceNowRSRCH and @huggingface! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way.

Join here: bigcode-project.org/docs/about/joi…

A thread with our goals🧵
🌸Language models for code (Codex, CodeGen) and the applications they power (AI assisted programming) are gaining traction. Some models have been released, but there are still questions around data governance, robustness of evaluation benchmarks, and the engineering behind them.
📚The first goal of BigCode is to develop and release a dataset large enough to train a state-of-the-art language model for code. We will also ensure that only files from repositories with permissive licenses go into the dataset.