This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

id: pretrained-vectors
title: Wiki word vectors

We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. The vectors have 300 dimensions and were trained with the skip-gram model described in Bojanowski et al. (2016), using default parameters.

Please note that a newer version of the multilingual word vectors is available at: Word vectors for 157 languages.

Models

The models can be downloaded from:

Abkhazian: bin+text, text Acehnese: bin+text, text Adyghe: bin+text, text
Afar: bin+text, text Afrikaans: bin+text, text Akan: bin+text, text
Albanian: bin+text, text Alemannic: bin+text, text Amharic: bin+text, text
Anglo_Saxon: bin+text, text Arabic: bin+text, text Aragonese: bin+text, text
Aramaic: bin+text, text Armenian: bin+text, text Aromanian: bin+text, text
Assamese: bin+text, text Asturian: bin+text, text Avar: bin+text, text
Aymara: bin+text, text Azerbaijani: bin+text, text Bambara: bin+text, text
Banjar: bin+text, text Banyumasan: bin+text, text Bashkir: bin+text, text
Basque: bin+text, text Bavarian: bin+text, text Belarusian: bin+text, text
Bengali: bin+text, text Bihari: bin+text, text Bishnupriya Manipuri: bin+text, text
Bislama: bin+text, text Bosnian: bin+text, text Breton: bin+text, text
Buginese: bin+text, text Bulgarian: bin+text, text Burmese: bin+text, text
Buryat: bin+text, text Cantonese: bin+text, text Catalan: bin+text, text
Cebuano: bin+text, text Central Bicolano: bin+text, text Chamorro: bin+text, text
Chavacano: bin+text, text Chechen: bin+text, text Cherokee: bin+text, text
Cheyenne: bin+text, text Chichewa: bin+text, text Chinese: bin+text, text
Choctaw: bin+text, text Chuvash: bin+text, text Classical Chinese: bin+text, text
Cornish: bin+text, text Corsican: bin+text, text Cree: bin+text, text
Crimean Tatar: bin+text, text Croatian: bin+text, text Czech: bin+text, text
Danish: bin+text, text Divehi: bin+text, text Dutch: bin+text, text
Dutch Low Saxon: bin+text, text Dzongkha: bin+text, text Eastern Punjabi: bin+text, text
Egyptian Arabic: bin+text, text Emilian_Romagnol: bin+text, text English: bin+text, text
Erzya: bin+text, text Esperanto: bin+text, text Estonian: bin+text, text
Ewe: bin+text, text Extremaduran: bin+text, text Faroese: bin+text, text
Fiji Hindi: bin+text, text Fijian: bin+text, text Finnish: bin+text, text
Franco_Provençal: bin+text, text French: bin+text, text Friulian: bin+text, text
Fula: bin+text, text Gagauz: bin+text, text Galician: bin+text, text
Gan: bin+text, text Georgian: bin+text, text German: bin+text, text
Gilaki: bin+text, text Goan Konkani: bin+text, text Gothic: bin+text, text
Greek: bin+text, text Greenlandic: bin+text, text Guarani: bin+text, text
Gujarati: bin+text, text Haitian: bin+text, text Hakka: bin+text, text
Hausa: bin+text, text Hawaiian: bin+text, text Hebrew: bin+text, text
Herero: bin+text, text Hill Mari: bin+text, text Hindi: bin+text, text
Hiri Motu: bin+text, text Hungarian: bin+text, text Icelandic: bin+text, text
Ido: bin+text, text Igbo: bin+text, text Ilokano: bin+text, text
Indonesian: bin+text, text Interlingua: bin+text, text Interlingue: bin+text, text
Inuktitut: bin+text, text Inupiak: bin+text, text Irish: bin+text, text
Italian: bin+text, text Jamaican Patois: bin+text, text Japanese: bin+text, text
Javanese: bin+text, text Kabardian: bin+text, text Kabyle: bin+text, text
Kalmyk: bin+text, text Kannada: bin+text, text Kanuri: bin+text, text
Kapampangan: bin+text, text Karachay_Balkar: bin+text, text Karakalpak: bin+text, text
Kashmiri: bin+text, text Kashubian: bin+text, text Kazakh: bin+text, text
Khmer: bin+text, text Kikuyu: bin+text, text Kinyarwanda: bin+text, text
Kirghiz: bin+text, text Kirundi: bin+text, text Komi: bin+text, text
Komi_Permyak: bin+text, text Kongo: bin+text, text Korean: bin+text, text
Kuanyama: bin+text, text Kurdish (Kurmanji): bin+text, text Kurdish (Sorani): bin+text, text
Ladino: bin+text, text Lak: bin+text, text Lao: bin+text, text
Latgalian: bin+text, text Latin: bin+text, text Latvian: bin+text, text
Lezgian: bin+text, text Ligurian: bin+text, text Limburgish: bin+text, text
Lingala: bin+text, text Lithuanian: bin+text, text Livvi_Karelian: bin+text, text
Lojban: bin+text, text Lombard: bin+text, text Low Saxon: bin+text, text
Lower Sorbian: bin+text, text Luganda: bin+text, text Luxembourgish: bin+text, text
Macedonian: bin+text, text Maithili: bin+text, text Malagasy: bin+text, text
Malay: bin+text, text Malayalam: bin+text, text Maltese: bin+text, text
Manx: bin+text, text Maori: bin+text, text Marathi: bin+text, text
Marshallese: bin+text, text Mazandarani: bin+text, text Meadow Mari: bin+text, text
Min Dong: bin+text, text Min Nan: bin+text, text Minangkabau: bin+text, text
Mingrelian: bin+text, text Mirandese: bin+text, text Moksha: bin+text, text
Moldovan: bin+text, text Mongolian: bin+text, text Muscogee: bin+text, text
Nahuatl: bin+text, text Nauruan: bin+text, text Navajo: bin+text, text
Ndonga: bin+text, text Neapolitan: bin+text, text Nepali: bin+text, text
Newar: bin+text, text Norfolk: bin+text, text Norman: bin+text, text
North Frisian: bin+text, text Northern Luri: bin+text, text Northern Sami: bin+text, text
Northern Sotho: bin+text, text Norwegian (Bokmål): bin+text, text Norwegian (Nynorsk): bin+text, text
Novial: bin+text, text Nuosu: bin+text, text Occitan: bin+text, text
Old Church Slavonic: bin+text, text Oriya: bin+text, text Oromo: bin+text, text
Ossetian: bin+text, text Palatinate German: bin+text, text Pali: bin+text, text
Pangasinan: bin+text, text Papiamentu: bin+text, text Pashto: bin+text, text
Pennsylvania German: bin+text, text Persian: bin+text, text Picard: bin+text, text
Piedmontese: bin+text, text Polish: bin+text, text Pontic: bin+text, text
Portuguese: bin+text, text Quechua: bin+text, text Ripuarian: bin+text, text
Romani: bin+text, text Romanian: bin+text, text Romansh: bin+text, text
Russian: bin+text, text Rusyn: bin+text, text Sakha: bin+text, text
Samoan: bin+text, text Samogitian: bin+text, text Sango: bin+text, text
Sanskrit: bin+text, text Sardinian: bin+text, text Saterland Frisian: bin+text, text
Scots: bin+text, text Scottish Gaelic: bin+text, text Serbian: bin+text, text
Serbo_Croatian: bin+text, text Sesotho: bin+text, text Shona: bin+text, text
Sicilian: bin+text, text Silesian: bin+text, text Simple English: bin+text, text
Sindhi: bin+text, text Sinhalese: bin+text, text Slovak: bin+text, text
Slovenian: bin+text, text Somali: bin+text, text Southern Azerbaijani: bin+text, text
Spanish: bin+text, text Sranan: bin+text, text Sundanese: bin+text, text
Swahili: bin+text, text Swati: bin+text, text Swedish: bin+text, text
Tagalog: bin+text, text Tahitian: bin+text, text Tajik: bin+text, text
Tamil: bin+text, text Tarantino: bin+text, text Tatar: bin+text, text
Telugu: bin+text, text Tetum: bin+text, text Thai: bin+text, text
Tibetan: bin+text, text Tigrinya: bin+text, text Tok Pisin: bin+text, text
Tongan: bin+text, text Tsonga: bin+text, text Tswana: bin+text, text
Tulu: bin+text, text Tumbuka: bin+text, text Turkish: bin+text, text
Turkmen: bin+text, text Tuvan: bin+text, text Twi: bin+text, text
Udmurt: bin+text, text Ukrainian: bin+text, text Upper Sorbian: bin+text, text
Urdu: bin+text, text Uyghur: bin+text, text Uzbek: bin+text, text
Venda: bin+text, text Venetian: bin+text, text Vepsian: bin+text, text
Vietnamese: bin+text, text Volapük: bin+text, text Võro: bin+text, text
Walloon: bin+text, text Waray: bin+text, text Welsh: bin+text, text
West Flemish: bin+text, text West Frisian: bin+text, text Western Punjabi: bin+text, text
Wolof: bin+text, text Wu: bin+text, text Xhosa: bin+text, text
Yiddish: bin+text, text Yoruba: bin+text, text Zazaki: bin+text, text
Zeelandic: bin+text, text Zhuang: bin+text, text Zulu: bin+text, text

Format

The word vectors come in both of fastText's default formats, binary and text. In the text format, each line contains a word followed by its vector, with the values separated by spaces. Words are sorted by frequency in descending order.
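As an illustration of the text format, here is a minimal sketch of a reader in plain Python. The function name `load_text_vectors` and its `limit` parameter are hypothetical and not part of fastText. The sketch also assumes that the file may begin with a header line holding two integers (vocabulary size and dimension), which it skips if present. Because words are sorted by descending frequency, stopping after `limit` entries keeps the most frequent words.

```python
import io


def load_text_vectors(lines, limit=None):
    """Parse text-format word vectors from an iterable of lines.

    Returns a dict mapping each word to a list of floats. Hypothetical
    helper for illustration only; it is not part of the fastText API.
    """
    vectors = {}
    for index, line in enumerate(lines):
        tokens = line.rstrip().split(" ")
        # Skip blank lines.
        if not tokens[0]:
            continue
        # Assumption: an optional first line "<vocab_size> <dimension>".
        if index == 0 and len(tokens) == 2 and all(t.isdigit() for t in tokens):
            continue
        # Words come in descending frequency, so an early stop keeps
        # the most frequent ones.
        if limit is not None and len(vectors) >= limit:
            break
        word, values = tokens[0], tokens[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory sample standing in for a downloaded text file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nof -0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
```

For a real file, the same function could be called on an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to one of the downloaded text files. With 300 dimensions per word, passing a `limit` is a simple way to keep memory use down.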

License

The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

References

If you use these word vectors, please cite the following paper:

P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}