Taiga corpus

Welcome to the taiga site repository!

Here, as well as on our website, you can explore our documentation, leave feedback, open issues and create pull requests.

About the project

Taiga corpus is an ambitious project that aims to become the largest fully available web corpus constructed from open text sources. Taiga corpus is:

open source
big: about 6 billion words so far
sorted into datasets applicable to different machine learning tasks
made by linguists experienced in text crawling, parsing and filtering
rich with metainformation
POS-tagged and syntactically tagged in Universal Dependencies
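
As an illustration of what Universal Dependencies annotation looks like in practice, here is a minimal sketch that reads POS tags and dependency relations from CoNLL-U text. It assumes the annotation is available in CoNLL-U, the standard ten-column UD format; this README does not state the actual file layout of the corpus. The function name `read_conllu` and the sample sentence are hypothetical and are not taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            # Skip malformed lines instead of failing.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


# Illustrative sample only; not an excerpt from the corpus.
SAMPLE = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

Each token carries its universal POS tag (`upos`), the index of its syntactic head (`head`) and the dependency relation to that head (`deprel`).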

Our motivation

A well-constructed web corpus has many more potential applications than are traditionally attributed to it. The “web as corpus” paradigm has recently found a natural continuation in the formulation “web as train set”. Open-source websites offer ample opportunities for NLP developers and computational linguists, who nevertheless have to gather all the data themselves, repeating the same cleaning and de-duplication steps, because traditional web corpora provide only a search interface and give no access to the full data. The "Taiga" corpus project brings together the needs of developers, machine learning practitioners and computational linguists: it is a web corpus for large-scale linguistic data analysis and for modeling practical NLP and NLU systems. Its main aim is to influence the culture of corpus research for the Russian language and to reflect the paradigm shift in linguistic methodology.

Project creators

Tatiana Shavrina ([email protected])
Yana Kurmachova ([email protected])

Under the inspiring supervision of Olga Lyashevskaya

References:

Shavrina T., Shapovalova O. (2017) To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proc. of "CORPORA2017", international conference, Saint Petersburg, 2017.
Shavrina T. (2018) Differential approach to webcorpus construction. In Dialogue, Russian International Conference on Computational Linguistics, RSUH, Moscow.

