An online interface of this resource index is also available HERE.
This repository collects resources for NLP in MSA (Modern Standard Arabic), Levantine and Egyptian Arabic, as part of the National NLP Plan. Resources are divided into folders by type. If you have a resource to contribute that can be released under an open license, please submit a pull request, or contact us at [email protected].
When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python] or [?] for an unknown programming language. Please add hosting mirrors in pointy brackets, e.g. <Zenodo mirror>.
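
As a rough illustration of the entry conventions above, the following Python sketch extracts the license, language and mirror markers from a single list entry. The names `parse_entry` and `license_status` are hypothetical helpers written for this example; they are not part of this repository, and the regular expressions only assume the `{...}`, `[...]` and `<...>` markers described in the contribution guidelines.

```python
import re

# Hypothetical helpers, not part of this repository: they only assume the
# {license}, [language] and <mirror> markers described in the guidelines.
LICENSE_RE = re.compile(r"\{([^{}]+)\}")
LANGUAGE_RE = re.compile(r"\[([^\[\]]+)\]")
MIRROR_RE = re.compile(r"<([^<>]+)>")


def parse_entry(line):
    """Return the name, license, language and mirrors found in one entry."""
    text = line.strip()
    if text.startswith("- "):
        text = text[2:]
    license_match = LICENSE_RE.search(text)
    language_match = LANGUAGE_RE.search(text)
    # The name is whatever precedes the first marker or " - " separator.
    name = re.split(r"\s[\{\[<]|\s-\s", text, maxsplit=1)[0].strip()
    return {
        "name": name,
        "license": license_match.group(1) if license_match else None,
        "language": language_match.group(1) if language_match else None,
        "mirrors": MIRROR_RE.findall(text),
    }


def license_status(license_tag):
    """Map the special tags {?} and {X} to a readable status."""
    if license_tag is None or license_tag == "?":
        return "unknown"
    if license_tag == "X":
        return "closed"
    return "stated"


example = "- CAMeL Tools {MIT} [Python] - an open-source toolkit <Zenodo mirror>"
entry = parse_entry(example)
print(entry["name"], entry["license"], entry["language"], entry["mirrors"])
print(license_status(entry["license"]))
```

A check like this could be run over new entries before a pull request is submitted, to spot entries that are missing a license marker.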
Contents
- 1 Corpora
- 1.1 Unannotated Corpora
- 1.2 Multilingual Corpora
- 1.3 Annotated Datasets by Task
- 1.3.1 Content Moderation
- 1.3.2 Dependency Treebanks
- 1.3.3 Dialect Identification
- 1.3.4 Diacritization/Vocalization
- 1.3.5 Morphological Analysis
- 1.3.6 Named Entity Recognition (NER)
- 1.3.7 Part-of-speech (POS) Tagging
- 1.3.8 Question Answering (QA)
- 1.3.9 Question Classification
- 1.3.10 Sentiment Analysis
- 1.3.11 Emotion Detection
- 1.3.12 Topic Classification
- 1.3.13 Text Summarization
- 1.3.14 Transliteration
- 1.3.15 Semantic Role Labeling (SRL)
- 1.3.16 Coreference Resolution
- 1.3.17 Relation Extraction
- 1.3.18 Text Classification
- 1.3.19 Discourse Analysis
- 1.3.20 Dialogue Modeling
- 1.3.21 Machine Translation
- 1.3.22 Readability Assessment
- 1.4 Aligned/Parallel Corpora
- 1.5 Recorded Speech and Audio Corpora
- 2 Lexical Resources
- 3 Models and Tools
- 3.1 Models and Tools by Task
- 3.1.1 Text Preprocessing and Morphological Analysis
- 3.1.1.1 Tokenization
- 3.1.1.2 Transliteration
- 3.1.1.3 Morphological Analysis
- 3.1.1.4 Morphological Inflection
- 3.1.1.5 Morphological Segmentation
- 3.1.1.6 Part-of-speech (POS) Tagging
- 3.1.1.7 Stemming and Lemmatization
- 3.1.1.8 Dependency Parsing
- 3.1.1.9 Spell Checking and Correction
- 3.1.1.10 Diacritization/Vocalization
- 3.1.1.11 Stopwords Removal
- 3.1.1.12 Language modeling
- 3.1.1.13 Text Normalization
- 3.1.2 Text Analysis
- 3.1.2.1 Content Moderation
- 3.1.2.2 Dialect Identification
- 3.1.2.3 Question Answering (QA)
- 3.1.2.4 Sentiment Analysis
- 3.1.2.5 Emotion Detection
- 3.1.2.6 Text Summarization
- 3.1.2.7 Text Classification
- 3.1.2.8 Topic Classification
- 3.1.2.9 Topic Modeling
- 3.1.2.10 Irony/Sarcasm Detection
- 3.1.2.11 Discourse Analysis
- 3.1.2.12 Dialogue Modeling
- 3.1.3 Information Extraction
- 3.1.4 Speech and Image Processing
- 3.2 Models by Type
- 4 Commercial and Online Services
- 5 Annotation Tools
- 6 Evaluation
- 7 Labs & Organizations
- 8 Courses, Presentations and Meetups
- Arabic Stories {Apache License 2.0} - 146 Arabic children stories (MSA).
- OSAC {?} - 22,000 text documents, each belonging to 1 of 10 categories: Economics, History, Entertainments, etc (MSA).
- Shami {Apache License 2.0} - A Corpus of Levantine Arabic Dialects. 117,805 natural sentences from conversations in various Levantine dialects: Jordanian, Palestinian, Lebanese, Syrian.
- Abuelkhair Corpus {?} - More than 5 million newspaper articles in MSA.
- ArCOV-19 {?} - The First Arabic COVID-19 Twitter Dataset with Propagation Networks. About 3.2M tweets in mixed dialect Arabic associated with COVID-19, an ongoing collection starting in January 2020.
- Habibi {?} - a multi Dialect multi National Arabic Song Lyrics Corpus. More than 30,000 Arabic song lyrics in 6 Arabic dialects (Egyptian, Levantine, etc.) for singers from 18 different Arabic countries, segmented into sentences and words and labeled with song information.
- ArabicWeb16 {?} - A New Crawl for Today’s Arabic Web. 150M Arabic Web pages with high coverage of dialectal Arabic, Egyptian, Gulf, Levantine (~7M) and Maghrebi, as well as MSA, from a variety of sources - Wikipedia, Alexa, ArClueWeb09, and Twitter, etc.
- Arabic Wiki Data Dumps {?} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
- OSCAR {CC BY 4.0} - OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
- CC100 {MIT} - This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages, including Arabic, and was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots.
- WikiQAar {?} - a bilingual English-Arabic Question Answering corpus built on top of WIKIQA. See also: https://huggingface.co/datasets/wiki_qa_ar
- AraCOVID19-MFH {CC BY-NC-SA 4.0} - Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. 10,828 mixed dialect Arabic tweets annotated with 10 different labels concerning fake news and hate speech.
- L-HSAB {?} - A Levantine Twitter Dataset for Hate Speech and Abusive Language. 5,846 Syrian/Lebanese political tweets labeled as normal, abusive or hate.
- Let-Mi {?} - An Arabic Levantine Twitter Dataset for Misogynistic Language. 6,603 tweets in Levantine Arabic annotated as either non-misogynistic or one of seven misogynistic language categories.
- MPOLD {Apache License 2.0} - Arabic Offensive Comments dataset from Multiple Social Media Platforms. Annotated social media comment dataset with (not) offensive language tags for Arabic social media comments collected from three different online platforms: Twitter, Facebook and YouTube.
- A Corpus of Offensive Language in Arabic {?} - 16,000 comments on YouTube videos from different nationalities annotated for offensive language.
- Religious Hate Speech Detection for Arabic Tweets {?} - Tweets in MSA and Dialectal Arabic annotated for hate speech. The training dataset contains 5,569 examples and the testing dataset contains 567 examples.
- COVID-FAKES {?} - Bilingual (Arabic/English) COVID-19 Twitter dataset for misleading information detection. Automatically annotated using information shared on the official websites and Twitter accounts of the WHO, UNICEF, and UN as a source of reliable information; tweets were annotated using 13 different machine learning algorithms and 7 different feature extraction techniques.
- Adult Content Detection on Arabic Twitter {?} - 6k manually annotated Twitter accounts that post adult content and 44k ordinary Twitter accounts, in addition to a tweet from each account, in mixed dialectal Arabic.
- Fine-Grained Hate Speech Detection on Arabic Twitter {CC BY 4.0} - 12,700 tweets in mixed dialect Arabic, with no bias towards specific topics, genres, or dialects. Each tweet was judged by 3 annotators for offensiveness, classified into one of the hate speech types (Race, Religion, Ideology, Disability, Social Class, and Gender), and judged for vulgar language or violence.
- ArCOV19-Rumors {?} - An Arabic COVID-19 Twitter dataset for misinformation detection. Covers 138 verified claims, mostly from popular fact-checking websites, and 9.4K tweets relevant to those claims, manually annotated by veracity to support research on misinformation detection.
- AraFacts Dataset {CC BY-NC 4.0} - an Arabic dataset of naturally-occurring professionally-verified claims. A dataset of 6,222 claims collected from 5 Arabic fact-checking websites: Misbar, Verify-sy, Fatabyyano, FactuelAFP and Maharat-news, that have been standardized and made available for research purposes.
- Prague Arabic Dependency Treebank 1.0 {Custom Terms of Use} - Language resource for Arabic natural language processing (NLP), a collection of parsed sentences annotated with syntactic structures.
- UD_Arabic-PADT {CC BY-NC-SA 3.0} - The Arabic-PADT UD treebank is based on the Prague Arabic Dependency Treebank (PADT), created at the Charles University in Prague. The treebank consists of 7,664 sentences (282,384 tokens) and its domain is mainly newswire.
- PADIC {GPLv3} - A multilingual Parallel Arabic DIalectal Corpus. A parallel corpus of 6,400 sentences in multiple Arabic dialects: Algerian, Maghreb, Syrian, Palestinian and MSA, for dialect detection and machine translation.
- DART {?} - A Large Dataset of Dialectal Arabic Tweets. About 25K tweets that are annotated via crowdsourcing for 5 Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.
- The MADAR Arabic Dialect Corpus {Custom Terms of Use} - A collection of ~12,000 parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and MSA.
- AOC {?} - Arabic Commentary Dataset. 108K sentences in mixed-dialect informal Arabic labeled for dialectal content.
- MSDA {?} - An open access NLP dataset for Arabic dialects. +50K tweets in five (5) national dialects, labeled for several applications: dialect detection, topic detection and sentiment analysis.
- Dialectal Arabic Code-Switching Dataset {MIT} - Transcribed audio in Egyptian dialect annotated at word-level for Code Switching (CS).
- BAEC {?} - The Bangor Arabic–English Code-switching (BAEC) corpus. 45,251 words manually annotated for code-switching between Saudi, Egyptian and MSA Arabic and English.
- Tashkeela {GPLv2} - Arabic diacritization corpus. A collection of vocalized Arabic texts covering modern and classical Arabic. The data contains over 75 million fully vocalized words obtained from 97 books, structured in text files. The corpus is collected mostly from Islamic classical books, using a semi-automatic web crawling process. Modern Standard Arabic texts crawled from the Internet represent 1.15% of the corpus (about 867,913 words), while the largest part, 98.85% (74,762,008 words contained in 97 books), is collected from the Shamela Library.
- Annotated Shami Corpus {?} - Lebanese Arabic corpus annotated for numerous morphological features and for orthography standardization.
- ANERcorp {CC BY-SA 4.0} - 300 documents annotated for entity recognition.
- KALIMAT {?} - 20,200 articles from the Omani newspaper Al Watan with summaries, named entities, part-of-speech tagging, and morphological analysis.
- KALIMAT {?} - 20,200 articles from the Omani newspaper Al Watan with summaries, named entities, part-of-speech tagging, and morphological analysis.
- Dialectal Arabic Datasets {Apache License 2.0} - 1,400 manually segmented and POS tagged tweets in four dialects, Egyptian, Levantine, Gulf, and Maghrebi.
- WikiQAar {?} - a bilingual English-Arabic Question Answering corpus built on top of WIKIQA. See also: https://huggingface.co/datasets/wiki_qa_ar
- ARCD {MIT} - Wikipedia open-domain Question Answering. 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of SQuAD.
- DAWQAS {MIT} - A Dataset for Arabic Why Question Answering System. 3,205 why question-answer pairs scraped from public Arabic websites.
- TyDiQA {Apache License 2.0} - A question answering dataset covering typologically diverse languages. The Arabic dataset is 15,645 question-answer pairs.
- AQAD {?} - 17,000+ Arabic Questions & Answers dataset. 17,000+ questions, collected via a fully automated data collector on a set of Arabic Wikipedia articles, for the extractive question answering task.
- Journalist Questions on Twitter {?} - 10,000 mixed dialect Arabic tweets manually annotated for question type.
- Arabic 100k Reviews {?} - Reviews with three classes from different services. 100k good, bad and medium reviews in Arabic from different services.
- HARD <https://github.com/elnagara/HARD-Arabic-Dataset> - Hotel Arabic-Reviews Dataset.
- BRAD <https://github.com/elnagara/BRAD-Arabic-Dataset> - Books Reviews in Arabic Dataset.
- Large Arabic Sentiment Analysis Resources <https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces/tree/master/datasets> - 3K automatically annotated reviews in the domains of movies, hotels, restaurants and products.
- AGJT {?} - Arabic Twitter Corpus. 1,800 tweets in Modern Standard Arabic (MSA) or Jordanian dialect, annotated as positive or negative.
- ArSAS {?} - An Arabic Speech-Act and Sentiment Corpus of Tweets. 21,000 tweets manually annotated for six different classes of speech-act labels.
- ASTD {GPLv2} - Arabic Sentiment Tweets Dataset. 10,000 tweets classified as objective, subjective positive, subjective negative, and subjective mixed.
- AraSenCorpus {MIT} - 4.5 million tweets annotated positive, negative and neutral.
- LABR {GPLv2} - A Large-Scale Arabic Book Reviews Dataset. 63,000 book reviews in mixed dialect Arabic for sentiment analysis.
- Arabic Sentiment Analysis and Cross-lingual Sentiment Resources <https://saifmohammad.com/WebPages/ArabicSA.html> {Custom Terms of Use} - Includes the BBN Blog Posts Sentiment Corpus <https://saifmohammad.com/WebDocs/Arabic-Sentiment-Corpora/bbn_shared-2.xls>, a random subset of 1,200 Levantine dialectal sentences, and the Syria Tweets Sentiment Corpus <https://saifmohammad.com/WebDocs/Arabic-Sentiment-Corpora/syr_twts%20_shared.xlsx>, a dataset of 2,000 tweets originating from Syria.
- TEAD {GPLv3} - 6 million mixed dialect Arabic tweets with a vocabulary of 602,721 distinct entities, annotated by emojis and sentiment lexicon as subjective positive, subjective negative and neutral, dialectal tweets “translated” into MSA.
- MASC {?} - Multi-domain Arabic Sentiment Corpus. 8,860 positive and negative reviews from different domains, in a variety of dialects, as well as a list of 3,880 positive and negative synsets annotated with their part of speech, polarity scores, dialects synsets and inflected forms.
- MSDA {?} - An open access NLP dataset for Arabic dialects. +50K tweets in five (5) national dialects, labeled for several applications: dialect detection, topic detection and sentiment analysis.
- OCLAR {?} - Opinion Corpus for Lebanese Arabic Reviews. 3,900 Arabic customer reviews across a wide range of domains, including restaurants, hotels, hospitals, local shops, etc.
- Omcca {?} - Opinion Mining: Analysis of Comments Written in Arabic Colloquial. 28,576 reviews, which represents sentiments of 5,422 different reviewers, covering 27 different categories, collected from Jeeran web site, in Saudi and Jordanian Arabic.
- NSAR {?} - Negation and Speculation in Arabic Review. 3K review sentences annotated with negation and speculation in Egyptian dialect.
- DAICT {?} - A Dialectal Arabic Irony Corpus Extracted from Twitter. 5,588 tweets, written in both MSA and mixed dialectal Arabic, manually annotated for irony by two professional linguists from HBKU.
- IDAT {GPLv3} - Irony Detection in Arabic Tweets. ~5.5k mixed dialect Arabic tweets annotated by two native Arabic speakers, supplemented with another 5.5k tweets randomly sampled from the original unannotated corpus.
- iSarcasm {MIT} - A Dataset of Intended Sarcasm. Dataset of tweets in Arabic and English labeled for sarcasm directly by their authors.
- AraCovid19-SSD {CC BY-NC-SA 4.0} - Arabic COVID-19 Sentiment and Sarcasm Detection Dataset. Manually annotated multi-label Arabic COVID-19 Sentiment and Sarcasm Detection Dataset. The dataset contains 5,162 annotated tweets.
- Arabic Sentiment Analysis {Apache License 2.0} - 36K tweets labeled as positive or negative, annotated using distant supervision and self-training approaches. 8K tweets were manually annotated as a gold standard. The corpus was evaluated intrinsically by comparing it to human classification and pre-trained sentiment analysis models, and extrinsically on a sentiment analysis task, achieving an accuracy of 86%.
- MSDA {?} - An open access NLP dataset for Arabic dialects. +50K tweets in five (5) national dialects, labeled for several applications: dialect detection, topic detection and sentiment analysis.
- Kawarith {CC BY-NC 4.0} - an Arabic Twitter Corpus for Crisis Events. A large-scale crisis-related multi-dialect Arabic Twitter corpus of 1,658,795 unique tweets from 22 emergency events.
- Arabic Twitter Corpus for Flood Detection {?} - 4,037 human-labelled Arabic Twitter messages in Middle Eastern dialects, for four high-risk flood events that occurred in 2018, labelled based on relatedness to the crisis and information type.
- KALIMAT {?} - 20,200 articles from the Omani newspaper Al Watan with summaries, named entities, part-of-speech tagging, and morphological analysis.
- BOLT {Custom Terms of Use} - Egyptian Arabic SMS/Chat and Transliteration. 1,856 naturally-occurring Arabizi conversations transliterated from the original romanized Arabizi script into standard Arabic orthography.
- Satirical Fake News Dataset {?} - Scraped from two satirical news websites, Al-Hudood and Al-Ahram Al-Mexici, for training fake news classifier/identifier.
- ARC-WMI {CC BY-NC-SA 4.0} - Arabic collection of written medicine information annotated with readability levels, contains 4476 sentences with over 61k words, extracted from 94 sources of Arabic written medicine information, annotated and assigned a readability level by a panel of health-care professionals.
- DiaCorpus {CC BY-SA 4.0} - The DiaCorpus project is a collaboration between the Data Science Institute (DSI) and the Israeli Innovation Authority. The purpose of the project is to create a first-of-its-kind Arabic textual repository in a local dialect (Israeli/Palestinian). This project is part of the National Language Processing plan of Israel.
- QASR {?} - QCRI Al Jazeera Speech Resource. The largest transcribed Arabic speech corpus with around 2,000 hours with multi-layer annotation, in multi-dialect and code-switching speech, crawled from the Al Jazeera news channel, for speech recognition, dialect identification, punctuation restoration, speaker identification, speaker linking, etc.
- CHILDES {?} - Egyptian Arabic Salama Corpus. Transcribed utterances and audio of children speaking the Egyptian dialect.
- NileULex {?} - Nile University's Arabic sentiment Lexicon. Egyptian Arabic and Modern Standard Arabic sentiment words and their polarity, available for research, commercial use requires author permission.
- SenZi {?} - A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi) - Lebanese dialect Arabizi sentiment lexicon, sentiment annotated datasets, and a Facebook corpus.
- Maknuune {CC BY-SA 4.0} - A large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information.
- word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Arabic is one of the 62 supported languages, and thus word-to-word translation to/from Arabic is supported for 61 languages.
- Arabizi-Transliteration Corpus {?} - the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure highest quality, taken from Twitter, Facebook and ASK.
- Arabic Stop Words {?} - A list of ~750 possible stop words in Arabic.
- Buckwalter’s list of Arabic roots {?}
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering Named Entity Recognition (NER), Translation, Word and sentence embeddings, Stop words removal, Part-of-Speech (POS), and Lemmatization.
- DiaLex {?} - A testbank of word pairs for six syntactic and semantic relations across five important Arabic dialects was created, and used to evaluate a set of existing and new Arabic word embeddings.
- Light10 {Apache License 2.0} - A tokenizer and stemmer for Arabic based on Lucene's UTF-8 tokenizer and ArabicStemmer. The ArabicStemmer is Lucene's implementation of Larkey’s light stemmer Light10.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: three layers of regular language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to call the services layer directly.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that utilizes syntactic features to distinguish a word's sense. ElixirFM uses the correlation between Arabic grammar and morphology to improve the root extraction process; it uses Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with stem dictionary (ElixirFM lexicon) for additional morphological knowledge. The lexicon of ElixirFM is derived from the open source Buckwalter lexicon.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: three layers of regular language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to call the services layer directly.
- ATAR {?} - An ATtention-based LSTM model for ARabizi transliteration.
- Tafqit {MIT} - Transliteration of numbers to words.
- BAMA 2.0 {Custom Terms of Use} - Buckwalter Arabic Morphological Analyzer Version 2.0. A stem-based morphological analyzer (stemmer).
- SAMA 3.1 {Custom Terms of Use} - Standard Arabic Morphological Analyzer (SAMA) Version 3.1. The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 is based on, and updates, Buckwalter Arabic Morphological Analyzer (BAMA) 2.0. SAMA is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible prefix-stem-suffix segmentations, and lists all known/possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. The generated output may then be reviewed by users, and the most appropriate annotation selected from among several choices. The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in: 1) increased lexicon coverage in the dictionary files, 2) important changes and additions to the inventory of POS tags, and 3) more possible solutions generated for numerous word forms.
- AlKhalil {Apache License 2.0} - A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that utilizes syntactic features to distinguish a word's sense. ElixirFM uses the correlation between Arabic grammar and morphology to improve the root extraction process; it uses Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with stem dictionary (ElixirFM lexicon) for additional morphological knowledge. The lexicon of ElixirFM is derived from the open source Buckwalter lexicon.
- ADAM {Custom Terms of Use} - Analyzer for Dialectal Arabic Morphology. ADAM is built on the SAMA <https://catalog.ldc.upenn.edu/LDC2010L01> database, and can analyze both Egyptian and Levantine dialects.
- Qutuf {Apache License 2.0} - An Arabic Morphological Analyzer (Including Stemming and Root Extraction) and Part-Of-Speech Tagger as an Expert System. Qutuf aims to be the core of a framework for Arabic NLP. It introduces several new concepts, such as the First Normalization and Second Normalization text forms in the preprocessing phase, and Premature and Overdue Tagging in the part-of-speech tagging task. POS tagging is designed and implemented as a rule-based expert system, using a POS tagset built on a morphological feature tagset. Morphological analysis includes both stemming (light stemming) and root extraction (heavy stemming), achieved with finite state automata and agreement rules developed for cliticization parsing. It also uses an enriched version of the open source AlKhalil Morpho Sys database for root extraction, pattern matching, morphological feature and POS assignment, and closed nouns. See also online interface.
- Qalsadi {GPL} - Arabic morphological analyzer library for Python.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: three layers of regular language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to call the services layer directly.
- MADA+TOKAN {Custom Terms of Use} - A versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that utilizes syntactic features to distinguish a word's sense. ElixirFM uses the correlation between Arabic grammar and morphology to improve the root extraction process; it uses Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with stem dictionary (ElixirFM lexicon) for additional morphological knowledge. The lexicon of ElixirFM is derived from the open source Buckwalter lexicon.
- Qutrub {GPL} - An Arabic verb conjugation software.
- Tashaphyne {GPLv3} - Arabic light stemmer and segmenter. It mainly supports light stemming (removing prefixes and suffixes) and uses a modified finite state automaton to generate all possible segmentations of a given word. To extract the stem, Tashaphyne removes the longest affix from the word; the affixes can then be validated against a list of valid affixes.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering Named Entity Recognition (NER), Translation, Word and sentence embeddings, Stop words removal, Part-of-Speech (POS), and Lemmatization.
- AlKhalil {Apache License 2.0} - A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that utilizes syntactic features to distinguish a word's sense. ElixirFM uses the correlation between Arabic grammar and morphology to improve the root extraction process; it uses Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with stem dictionary (ElixirFM lexicon) for additional morphological knowledge. The lexicon of ElixirFM is derived from the open source Buckwalter lexicon.
- ADAM {Custom Terms of Use} - Analyzer for Dialectal Arabic Morphology. ADAM is built based on the `SAMA<https://catalog.ldc.upenn.edu/LDC2010L01>`_ database, and can analyze both Egyptian and Levantine dialects.
- Qutuf {Apache License 2.0} - An Arabic Morphological Analyzer (Including Stemming and Root Extraction) and Part-Of-Speech Tagger as an Expert System. Qutuf is aimed to be the Core of a Framework for Arabic NLP. At Qutuf, some new concepts have been identified and implemented. Like First Normalization and Second Normalization text forms at the preprocessing phase and the Premature and Overdue Tagging at the Part-Of-Speech tagging task. Moreover, the POS tagging is designed and implemented as a rule-based expert system. A POS tagset, which is built based on a morphological feature tagset, has been designed and used in Qutuf. Morphological Analysis Includes both Stemming (light stemming) and Root Extraction (heavy stemming). It achieves this by using finite state automata and rules for agreement developed for cliticization parsing. It also uses AlKhalil Morpho Sys open source database for root extraction, pattern matching, morphological feature and POS assignment and closed nouns after enriching it. See also online interface.
- Qalsadi {GPL} - Arabic morphological analyzer library for Python.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: the three regular layers of language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to use the services layer directly.
- MADA+TOKAN {Custom Terms of Use} - A versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- Khoja {Apache License 2.0} - A root-based stemmer (heavy stemming; root extractor; rule-based). The algorithm was widely used in Arabic IR. It reduces inflected word forms to their roots by first removing their longest prefixes and suffixes. The resulting word is then matched against predefined patterns and list-driven roots; the pattern selected depends on the length of the remaining word. Finally, the extracted root is checked against a list of roots to confirm its validity.
- Light10 {Apache License 2.0} - A tokenizer and stemmer for Arabic based on Lucene's UTF-8 tokenizer and ArabicStemmer. The ArabicStemmer is Lucene's implementation of Larkey’s light stemmer Light10.
- BAMA 2.0 {Custom Terms of Use} - Buckwalter Arabic Morphological Analyzer Version 2.0. A stem-based morphological analyzer (stemmer).
- SAMA 3.1 {Custom Terms of Use} - Standard Arabic Morphological Analyzer (SAMA) Version 3.1. The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 is based on, and updates, Buckwalter Arabic Morphological Analyzer (BAMA) 2.0. SAMA is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible prefix-stem-suffix segmentations, and lists all known/possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. The generated output may then be reviewed by users, and the most appropriate annotation selected from among several choices. The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in: 1) increased lexicon coverage in the dictionary files, 2) important changes and additions to the inventory of POS tags, and 3) more possible solutions generated for numerous word forms.
- Tashaphyne {GPLv3} - Arabic light stemmer and segmenter. It mainly supports light stemming (removing prefixes and suffixes) and uses a modified finite state automaton to generate all possible segmentations of a given word. To extract the stem, Tashaphyne removes the longest affix from the word; the affixes can then be validated against a list of valid affixes.
- Assem {Custom Terms of Use} - Assem's Arabic Light Stemmer is a Snowball-based stemming algorithm for Arabic aimed mainly at improving search. The Assem stemmer is fast and can be generated in many programming languages through Snowball (a small string processing language designed for creating stemming algorithms to be used in IR systems). It offers light stemming and text normalization. It can be configured to run as a root extractor or as a stemmer, but in two separate packages, because the Snowball framework does not support stemming and rooting at the same time. See code: https://arabicstemmer.com/
- MOTAZ {Apache License 2.0} - Motaz stemmer provides both root extraction and light stemming. The root extraction part is an implementation of the Khoja stemmer, the only difference being that it uses a different stopword list. The light stemming part is an implementation of the Light10 Arabic light stemming algorithm proposed by Larkey and colleagues. Before applying the Light10 algorithm, Motaz stemmer normalizes the input word by removing diacritics, replacing all forms of Hamza with ا, replacing ة with ه, and replacing ى with ي.
- Al-Stem (Darwish) {?} - Al-Stem is a light stemmer that chops off the following prefixes, in order from right to left (وال، فال، بال، بت، يت، لت، مت، وت، ست، نت، بم، لم، وم، كم، فم، ال، لل، في، وا، وا، فا، لا، با), plus the following suffixes, also from right to left (ات، وا، ون، وه، ان، تي، ته، تم، كم، هم، هن، ها، ية، تك، نا، ين، يه، ة، هـ، ي، ا). Darwish and Oard used Al-Stem in their experiment to develop a technique for Arabic-English cross-language information retrieval at TREC 2002. In cross-language IR, the query is written in a language different from that of the documents. Al-Stem was later modified by David Graff of the Linguistic Data Consortium (LDC) to strip off the suffixes (تا and ا) and the prefixes (سي and تت) from the list of suffixes in Al-Stem.
- Sebawai (Darwish) {?} - A root-based analyzer built on automatically derived rules and statistics. Sebawai has two main modules. The first module constructs a list of “word-root” pairs using a morphological analyzer called ALPNET, then extracts a list of prefixes, suffixes and stem templates and estimates the probability that each prefix, suffix or stem template occurs. The second module takes a word and produces the possible combinations of prefixes, suffixes and templates. These combinations are obtained by removing prefixes and suffixes from the word and comparing all the resulting stems to the templates. As a result, a ranked list of roots is produced. These roots are then matched automatically against a list of 10,000 roots extracted from an electronic copy of Lisan Al-Arab to confirm their existence.
- AlKhalil {Apache License 2.0} - A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that uses syntactic features to distinguish a word's sense. It exploits the correlation between Arabic grammar and morphology to improve the root extraction process, using the Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with its stem dictionary (the ElixirFM lexicon) for additional morphological knowledge. The ElixirFM lexicon is derived from the open source Buckwalter lexicon.
- ADAM {Custom Terms of Use} - Analyzer for Dialectal Arabic Morphology. ADAM is built on the SAMA database (https://catalog.ldc.upenn.edu/LDC2010L01) and can analyze both Egyptian and Levantine dialects.
- Qutuf {Apache License 2.0} - An Arabic morphological analyzer (including stemming and root extraction) and part-of-speech tagger built as an expert system. Qutuf is intended to be the core of a framework for Arabic NLP. It introduces several new concepts, such as the First Normalization and Second Normalization text forms in the preprocessing phase, and Premature and Overdue Tagging in the part-of-speech tagging task. The POS tagging is designed and implemented as a rule-based expert system, using a POS tagset built on a morphological feature tagset. Morphological analysis includes both stemming (light stemming) and root extraction (heavy stemming), achieved with finite state automata and agreement rules developed for cliticization parsing. It also uses an enriched version of the open source AlKhalil Morpho Sys database for root extraction, pattern matching, morphological feature and POS assignment, and closed nouns. See also online interface.
- Qalsadi {GPL} - Arabic morphological analyzer library for Python.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: the three regular layers of language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to use the services layer directly.
- MADA+TOKAN {Custom Terms of Use} - A versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that uses syntactic features to distinguish a word's sense. It exploits the correlation between Arabic grammar and morphology to improve the root extraction process, using the Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with its stem dictionary (the ElixirFM lexicon) for additional morphological knowledge. The ElixirFM lexicon is derived from the open source Buckwalter lexicon.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: the three regular layers of language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to use the services layer directly.
- Arabic-Tashkeela-Model {?} - A diacritization model for the Arabic language, trained on Tashkeela, the Arabic diacritization corpus available on Kaggle.
- AlKhalil {Apache License 2.0} - A diacritizer, POS-Tagger, root extractor, stemmer, lemmatizer, and morphosyntactic analyzer.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- MADA+TOKAN {Custom Terms of Use} - A versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: the three regular layers of language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to use the services layer directly.
- Assem {Custom Terms of Use} - Assem's Arabic Light Stemmer is a Snowball-based stemming algorithm for Arabic aimed mainly at improving search. The Assem stemmer is fast and can be generated in many programming languages through Snowball (a small string processing language designed for creating stemming algorithms to be used in IR systems). It offers light stemming and text normalization. It can be configured to run as a root extractor or as a stemmer, but in two separate packages, because the Snowball framework does not support stemming and rooting at the same time. See code: https://arabicstemmer.com/
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- Qutuf {Apache License 2.0} - An Arabic morphological analyzer (including stemming and root extraction) and part-of-speech tagger built as an expert system. Qutuf is intended to be the core of a framework for Arabic NLP. It introduces several new concepts, such as the First Normalization and Second Normalization text forms in the preprocessing phase, and Premature and Overdue Tagging in the part-of-speech tagging task. The POS tagging is designed and implemented as a rule-based expert system, using a POS tagset built on a morphological feature tagset. Morphological analysis includes both stemming (light stemming) and root extraction (heavy stemming), achieved with finite state automata and agreement rules developed for cliticization parsing. It also uses an enriched version of the open source AlKhalil Morpho Sys database for root extraction, pattern matching, morphological feature and POS assignment, and closed nouns. See also online interface.
- SAFAR (Demo) {Multiple} - Software Architecture For ARabic. It is open source, cross-platform and modular, and provides an integrated development environment (IDE). It includes: 1) resources needed for different Arabic NLP treatments, 2) modules for the basic levels of language, especially of Arabic, namely morphology, syntax and semantics, and 3) Arabic NLP (ANLP) applications. All integrated tools and resources remain under the copyright of their original authors. Each layer is developed as a set of reusable Java APIs: 1) Tools: a range of technical services (statistical functions, test tools, tokenization, sentence splitting, etc.). 2) Resource Services: access to language resources such as lexicons and corpora. 3) NLP Services: the three regular layers of language processing (morphology, syntax and semantics). 4) Applications: high-level applications that use the layers listed above. 5) Client: for users who need to use the services layer directly.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- AraBERT {Multiple} - Transformer-based Model for Arabic Language Understanding.
- AraELECTRA {Custom Terms of Use} - An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
- AraBERT {Multiple} - Transformer-based Model for Arabic Language Understanding.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- AraELECTRA {Custom Terms of Use} - An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- AraBERT {Multiple} - Transformer-based Model for Arabic Language Understanding.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- AraELECTRA {Custom Terms of Use} - An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- WhisperLevantineArabic {CC BY 4.0} - A fine-tuned version of the Whisper medium model, specifically optimized for transcribing Levantine Arabic with a focus on the Israeli dialect. This model aims to improve automatic speech recognition (ASR) performance for this specific variant of Arabic.
- AraGPT2 {Custom Terms of Use} - Pre-Trained Transformer for Arabic Language Generation.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- Turjuman {Apache License 2.0} - a neural machine translation toolkit. It translates from 20 languages into Modern Standard Arabic (MSA).
- AraBERT {Multiple} - Transformer-based Model for Arabic Language Understanding.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- AraELECTRA {Custom Terms of Use} - An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
- AraGPT2 {Custom Terms of Use} - Pre-Trained Transformer for Arabic Language Generation.
- QARiB {Apache License 2.0} - The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
- AraBERT {Multiple} - Transformer-based Model for Arabic Language Understanding.
- CAMeLBERT {MIT} - A collection of pre-trained models for Arabic NLP tasks. The models were fine-tuned for Sentiment Analysis, Dialect Identification, Poetry Classification, NER, POS Tagging.
- Spark NLP for Arabic {Multiple} - 45 pre-trained models covering named entity recognition (NER), translation, word and sentence embeddings, stop word removal, part-of-speech (POS) tagging, and lemmatization.
- ARBERT & MARBERT {?} - Large-scale pre-trained masked language models focused on both Dialectal Arabic (DA) and MSA; fine-tuned on ArBench: Sentiment Analysis, Social Meaning, Topic Classification, Dialect Identification, Named Entity Recognition.
- AraELECTRA {Custom Terms of Use} - An Arabic language representation model, pretrained using the replaced token detection objective on large Arabic text corpora. AraELECTRA’s performance is validated on three Arabic NLP tasks: question answering (QA), sentiment analysis (SA) and named-entity recognition (NER).
- BERT's multilingual model - Trained (also) on Arabic.
- MADAMIRA {Custom Terms of Use} - MADAMIRA is a morphological analyzer that provides tokenization, part-of-speech tagging, morphological disambiguation for the full range of morphological features, lemmatization, diacritization, named entity recognition and base phrase chunking.
- CAMeL Tools {MIT} - an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.
- ElixirFM {?} - ElixirFM is a functional morphological analyzer that uses syntactic features to distinguish a word's sense. It exploits the correlation between Arabic grammar and morphology to improve the root extraction process, using the Prague Arabic Dependency Treebank (PADT) to provide annotated syntactic features associated with its stem dictionary (the ElixirFM lexicon) for additional morphological knowledge. The ElixirFM lexicon is derived from the open source Buckwalter lexicon.
- AraGPT2 {Custom Terms of Use} - Pre-Trained Transformer for Arabic Language Generation.
- verbit.ai - Transcription.
- Text Analytics for health containers
- LightTag - A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video: https://www.youtube.com/watch?v=eTlrTC_n_yg
- Recogito [Scala, JavaScript, HTML] {Apache License 2.0} - A tool for linked data annotation.
- CATMA [HTML, Java] {?} - A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking. See the system itself here: http://portal.catma.de/catma/, and the code here: https://github.com/mpetris/catma
- WebAnno [Java] {Apache License 2.0} - Web-based. Supports RTL and project management. Repository: https://github.com/webanno/webanno
- Arethusa: Annotation Environment [JavaScript] {MIT} - A backend-independent client-side annotation framework. Repository here.
- rasa-nlu-trainer [JavaScript] {MIT} - A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.
- brat [Python, JavaScript] {MIT} - An online environment for collaborative text annotation. Does not support right-to-left. Repository here.
- openNLP [Java] {Apache License 2.0} - OpenNLP has a tagging tool.
- opeNER [Ruby, HTML, Java, Python] - opeNER has a tagging tool.
- pybossa [Python] {AGPL-3.0} - A framework for crowdsourcing of data analysis and enrichment tasks. GitHub.
- TextThrasher [JavaScript, Python] - A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).
- doccano {MIT} - An open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.
- CAMeL Labs - NLP projects by NYU Abu Dhabi.
- SIGARAB - The Special Interest Group of the Association for Computational Linguistics for researchers concerned with all aspects of Arabic NLP.
- Stanford University NLP Group - Articles and tools for Arabic NLP.
- John Snow Labs - Spark NLP for Arabic.
- Project RBZ - Resourcing Arabizi for NLP.
- CADIM Consortium - Computational Approaches to Arabic & Arabic Dialect Modeling Consortium.
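
As a rough illustration of the normalization and light-stemming steps described in the MOTAZ, Light10 and Al-Stem entries above, the sketch below removes diacritics, normalizes Hamza-carrying alef forms, ة and ى, and then strips at most one prefix and one suffix. It is a minimal sketch, not the code of any listed tool: the affix lists are a small illustrative subset chosen here, the `min_len` threshold is an assumption, and only the alef variants (أ, إ, آ) are treated as "forms of Hamza". The function names `normalize` and `light_stem` are hypothetical.

```python
import re

# Arabic diacritics (tashkeel): fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Character normalization in the spirit of the MOTAZ entry.
# Assumption: only the alef variants are treated as "forms of Hamza".
NORMALIZE_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
})

# Illustrative affix subsets (assumption), longest candidates first.
PREFIXES = [
    "\u0648\u0627\u0644",  # waw + alef + lam
    "\u0628\u0627\u0644",  # beh + alef + lam
    "\u0641\u0627\u0644",  # feh + alef + lam
    "\u0627\u0644",        # alef + lam (definite article)
]
SUFFIXES = [
    "\u0627\u062A",  # alef + teh
    "\u0648\u0646",  # waw + noon
    "\u064A\u0646",  # yeh + noon
    "\u0647\u0627",  # heh + alef
]


def normalize(word):
    """Remove diacritics and unify alef, teh marbuta and alef maksura forms."""
    return DIACRITICS.sub("", word).translate(NORMALIZE_MAP)


def light_stem(word, min_len=2):
    """Strip at most one listed prefix and one listed suffix.

    An affix is removed only if at least `min_len` characters remain.
    """
    stem = normalize(word)
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[:-len(suffix)]
            break
    return stem
```

For example, `light_stem("المعلمون")` first drops the article ال and then the plural suffix ون. Real light stemmers use longer affix lists and additional conditions, so the listed tools should be preferred for actual use.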