Starcoder pipeline

This pipeline illustrates a small part of the data preparation for StarCoder, an open-source alternative to GitHub Copilot, trained as part of the BigCode project.

The pipeline is based on this repository.

The pipeline includes the following components:

loading a code dataset from the Hugging Face Hub
filtering code based on comment-to-code ratio
filtering code based on line length
detecting and replacing PII (personally identifiable information) in code.
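
The filtering and PII steps listed above can be sketched in plain Python. This is a minimal illustration, not the code of the pipeline's components: every function name, threshold value, and regular expression below is an assumption chosen for the example. In particular, it treats only `#` lines as comments and uses simple regexes as a stand-in for PII detection.

```python
import re

# Hypothetical thresholds; the real components may use different values.
MIN_COMMENT_RATIO = 0.1
MAX_COMMENT_RATIO = 0.9
MAX_AVG_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000

# Simplified patterns used as a stand-in for real PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def comment_to_code_ratio(source: str) -> float:
    """Fraction of non-blank lines that start with '#' (a rough heuristic)."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for line in lines if line.startswith("#"))
    return comments / len(lines)


def passes_comment_filter(source: str) -> bool:
    """Keep files whose comment ratio is neither too low nor too high."""
    ratio = comment_to_code_ratio(source)
    return MIN_COMMENT_RATIO <= ratio <= MAX_COMMENT_RATIO


def passes_line_length_filter(source: str) -> bool:
    """Keep files whose average and maximum line lengths stay within bounds."""
    lengths = [len(line) for line in source.splitlines()]
    if not lengths:
        return False
    average = sum(lengths) / len(lengths)
    return average <= MAX_AVG_LINE_LENGTH and max(lengths) <= MAX_LINE_LENGTH


def redact_pii(source: str) -> str:
    """Replace e-mail addresses and IPv4 addresses with placeholder tokens."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    return IPV4_RE.sub("<IP_ADDRESS>", source)


def prepare(samples: list[str]) -> list[str]:
    """Apply both filters, then redact PII in the samples that remain."""
    return [
        redact_pii(sample)
        for sample in samples
        if passes_comment_filter(sample) and passes_line_length_filter(sample)
    ]
```

Calling `prepare` on a list of source strings returns only the samples that pass both filters, with matching e-mail and IPv4 addresses replaced by placeholders. The filters run before redaction so that no work is spent on files that would be dropped anyway.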
