From 2f7d2d826d600e06bb95a2ea7ce7a7507148c6d2 Mon Sep 17 00:00:00 2001
From: gramirez-prompsit <32385845+gramirez-prompsit@users.noreply.github.com>
Date: Tue, 29 Aug 2023 15:36:09 +0200
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a7b9a7d4..66905302 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Code and data are located in `/work`
 - Sentence length distribution: tokens per sentence for each language, showing total, unique and duplicate sentences.
 - Language distribution: shows percentage of automatically identified languages.
 - Quality Score distribution: as per language models (monolingual) or bicleaner scores (tool that computes the likelihood of two sentences of being mutual translations)
-- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, sentences containing poor language, etc.)
+- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, bad encoding, sentences containing poor language, etc.)
 - Common n-grams: 1-5 more frequent n-grams
 - MORE TO BE ADDED, SUGGESTIONS WELCOME!