Welcome to the Taiga site repository!
Here, as well as on our website, you can explore our documentation, leave feedback, open issues, and create pull requests.
The Taiga corpus is an ambitious project that aims to become the largest fully available web corpus built from open text sources. The Taiga corpus is:
- open source
- big: about 6 billion words so far
- sorted into datasets applicable to different machine learning tasks
- made by linguists experienced in text crawling, parsing, and filtering
- rich in meta-information
- POS-tagged and syntactically annotated following Universal Dependencies
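To give a feel for working with Universal Dependencies annotation, here is a minimal sketch of a reader for the CoNLL-U format, the standard plain-text serialization of UD. It assumes the annotated files are distributed as CoNLL-U, which is not confirmed here; the function name `read_conllu` and the sample sentence are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            # Skip lines that do not have exactly ten columns.
            continue
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # Emit the last sentence if the text has no trailing blank line.
        yield sentence


# Hypothetical sample, not taken from the corpus itself.
SAMPLE = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE):
        for tok in tokens:
            print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

Each token carries its part-of-speech tag (`upos`), the index of its syntactic head (`head`), and the dependency relation (`deprel`), which is enough to rebuild the syntax tree of a sentence.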
A carefully constructed web corpus has many more potential applications than it is traditionally credited with. The “web as corpus” paradigm has recently found a natural continuation in the formulation “web as training set”. Open-source websites provide ample opportunities for NLP developers and computational linguists, who nevertheless have to gather all the relevant data themselves, repeating the same cleaning and de-duplication steps, because traditional web corpora provide only a search interface and give no access to the full data. The "Taiga" corpus project unites the needs of developers, machine learning practitioners, and computational linguists by serving as a web corpus for large-scale linguistic data analysis and for modeling practical NLP and NLU systems. Its main aim is to influence the culture of corpus research for the Russian language and to reflect the paradigm shift in linguistic methodology.
- Tatiana Shavrina ([email protected])
- Yana Kurmachova ([email protected])
Under the inspiring supervision of Olga Lyashevskaya
- Shavrina T., Shapovalova O. (2017) To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proc. of "CORPORA2017", international conference, Saint Petersburg.
- Shavrina T. (2018) Differential approach to webcorpus construction. In Dialogue, Russian International Conference on Computational Linguistics, RSUH, Moscow.