Taiga is a corpus in which text sources and their metainformation are collected and organized around popular ML tasks.

Each text in the corpus is provided both as plain text and with morphological and syntactic annotation (UDPipe, homonymy resolved automatically), and comes with metainformation such as date, theme, authorship and text difficulty, depending on the source.
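As a rough illustration of working with the annotation, here is a minimal sketch of a reader for annotated text. It assumes the annotated files follow the tab-separated CoNLL-U layout that UDPipe writes by default; the function name `read_conllu` and the sample sentence are hypothetical and hand-written for this example, not taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (one dict per word line)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # skip malformed lines rather than guess at their content
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Hand-written sample (an assumption, not real corpus data).
SAMPLE = "\n".join([
    "# text = Мама мыла раму.",
    "\t".join(["1", "Мама", "мама", "NOUN", "_",
               "Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "мыла", "мыть", "VERB", "_",
               "Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act",
               "0", "root", "_", "_"]),
    "\t".join(["3", "раму", "рама", "NOUN", "_",
               "Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing", "2", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE))
lemmas = [token["lemma"] for token in sentences[0]]
```

Each token dict exposes the lemma, part of speech, features and dependency head as strings, which is usually enough for quick filtering; a dedicated CoNLL-U library would be a better fit for anything more involved.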

So far, the corpus holds about 5 billion words: 77% literary texts (33 literary magazines), 19% naive poetry, 2% news (4 popular sites) and 2% other sources (popular science, culture magazines, social networks, amateur poems and prose). Documentation is available.

See also: the Omnia Russica corpus - a bigger version of Taiga is available! It contains 33 billion words from Taiga, Common Crawl, Wikipedia and the Aranea corpus.

Segment information

<iframe src="https://cdn.datamatic.io/runtime/echarts/3.7.2_230/embedded/index.html#id=115038797393892898117/1XxvinvhVz-Gh0WJzjQ_0sD5_f7coQueI" frameborder="0" width="687" height="493" allowtransparency="true"></iframe>

We have gathered the resources with popular NLP problems in mind:

  • thematic modelling - news with theme tags, and all the sites that sort their texts into rubrics (news, poems, prose)
  • readability of texts - the popular science magazine NPlus1 has a readability metric for each text, provided by the editor
  • NER and fact extraction - news with references to the mentioned person’s page or wiki information, news with personalia tags
  • keyword extraction - news with keyword tags, hashtags on social media
  • authorship attribution - all the texts with author information: magazines, news and, more importantly, social media with gender, age, city, time and education mark-up
  • chat-bot training - open-source film subtitles
  • text generation - any resource, depending on genre
  • study of rare words, frequency dictionaries - literary magazines, social media
  • morphological and syntactic parsers - any resource, depending on genre
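To make one of the tasks above concrete, here is a minimal sketch of building a frequency dictionary. It assumes the lemmas have already been extracted from the annotation into a plain list; the function name `frequency_dictionary` and the toy input are hypothetical, and the instances-per-million (ipm) normalization is one common convention rather than anything prescribed by the corpus.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_dictionary(lemmas: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (lemma, raw count, instances per million) sorted by descending count."""
    counts = Counter(lemma.lower() for lemma in lemmas)
    total = sum(counts.values())
    # An empty input yields an empty list, so the division below is never reached.
    return [(lemma, n, n * 1_000_000 / total) for lemma, n in counts.most_common()]


# Toy input (an assumption, not real corpus data).
toy_lemmas = ["тайга", "лес", "Тайга", "река"]
freq = frequency_dictionary(toy_lemmas)

# Lemmas that occur exactly once are candidates for rare-word study.
hapaxes = [lemma for lemma, n, _ in freq if n == 1]
```

Normalizing to ipm makes counts comparable across segments of very different sizes, for example literary magazines versus social media.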

The Taiga corpus is an ambitious project that aims to become the largest fully available web corpus constructed from open text sources. The Taiga corpus is:

  • open source, CC BY-SA 3.0
  • big - about 5 billion words so far
  • sorted into datasets applicable to different machine learning tasks
  • made by linguists, experienced in text crawling, parsing and filtering
  • rich with metainformation
  • POS-tagged and syntactically annotated in Universal Dependencies

With these principles, we believe it is possible to create a corpus that meets the modern requirements of corpus linguistics: it will not be a black box, it will reflect modern language and its features, it will be unbiased, and it will encourage more cooperation between developers and linguists.

This project is part of the HSE Compling framework.

Project creators

Under the inspiring supervision of Olga Lyashevskaya

References:

Shavrina T., Shapovalova O. (2017) TO THE METHODOLOGY OF CORPUS CONSTRUCTION FOR MACHINE LEARNING: «TAIGA» SYNTAX TREE CORPUS AND PARSER. In Proc. of "CORPORA2017", international conference, Saint Petersburg, 2017.

Support or Contact

Check out our documentation or contact us and we’ll help you sort it out.

Users are welcome to ask questions on Google Groups!