This pipeline illustrates a tiny portion of the data preparation of StarCoder, an open-source version of Github CoPilot, trained as part of the the BigCode project.
The pipeline is based on this repository.
The pipeline includes the following components:
- loading a code dataset from the Hugging Face hub
- filtering code based on comment to code ratio
- filtering code based on line length
- detecting and replacing PII (personal identifiable information) from code.