N-gram frequencies

A simple utility that finds the most common word n-grams in a corpus. Useful for language learning, among other things.

Usage

1. Find a monolingual corpus in your target language. I recommend the OpenSubtitles corpus: click on your language in the leftmost column of the lower table to download a .txt file.
2. Create a folder for your language in the repo and put the corpus file in it. Example: en/en.txt
3. Run ngram_freqs.py with the following arguments:
- language - your target language (must be available in NLTK), e.g. english
- corpus - path to your data, e.g. en/en.txt
- output_folder - path where the resulting n-grams are stored, e.g. en/results/ (be sure to include the trailing slash)

Example command: python ngram_freqs.py english en/en.txt en/results/

How it works

The script reads your corpus file line by line and tokenizes each line with NLTK, collecting all tokens in a single list. Depending on the size of your file, this can use a lot of memory and take several hours. To help you follow the progress, the script prints a checkpoint every 100 000 lines. The tokens are then used to build 3-grams, 4-grams, 5-grams and 6-grams. By default, the 1000 most common n-grams of each size are kept (you can change this with the top_n argument of the write_ngrams function).
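The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in ngram_freqs.py: the function names tokenize, ngrams and most_common_ngrams are hypothetical, and a plain whitespace split stands in for the NLTK tokenizer so the sketch runs without extra dependencies.

```python
from collections import Counter
from typing import Iterable, Iterator


def tokenize(line: str) -> list[str]:
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    # The actual script tokenizes with NLTK for the chosen language.
    return line.lower().split()


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    # Slide a window of length n over the token list.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def most_common_ngrams(lines: Iterable[str], sizes=(3, 4, 5, 6), top_n=1000):
    # Collect every token first, as in the description above.
    tokens = []
    for line_number, line in enumerate(lines, start=1):
        tokens.extend(tokenize(line))
        if line_number % 100_000 == 0:
            # Progress checkpoint every 100 000 lines.
            print(f"processed {line_number} lines")
    # For each n-gram size, keep only the top_n most frequent n-grams.
    return {n: Counter(ngrams(tokens, n)).most_common(top_n) for n in sizes}


sample = ["I do not know", "I do not know what you mean"]
result = most_common_ngrams(sample, top_n=2)
for n, top in result.items():
    print(n, top)
```

Because the whole token list is held in memory before any counting starts, memory use grows with the size of the corpus, which matches the warning above.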

Contribution

Feel free to make a pull request if you have any improvements, especially regarding memory usage. Thanks!
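As one possible direction for reducing memory usage, the n-grams could be counted while the file is being read, keeping only the last few tokens instead of the full token list. The sketch below is an untested idea rather than part of the script: stream_ngram_counts is a hypothetical name, and a whitespace split again stands in for the NLTK tokenizer.

```python
from collections import Counter, deque
from typing import Iterable


def stream_ngram_counts(lines: Iterable[str], sizes=(3, 4, 5, 6)):
    # One counter per n-gram size.
    counts = {n: Counter() for n in sizes}
    # Only the last max(sizes) tokens are kept in memory.
    window = deque(maxlen=max(sizes))
    for line in lines:
        # Stand-in tokenizer (assumption); the script itself uses NLTK.
        for token in line.lower().split():
            window.append(token)
            recent = tuple(window)
            for n in sizes:
                if len(recent) >= n:
                    # Count the n-gram that ends at the current token.
                    counts[n][recent[-n:]] += 1
    return counts


sample = ["I do not know", "I do not know what you mean"]
counts = stream_ngram_counts(sample)
print(counts[3].most_common(2))
```

Any iterable of lines works, so an open file object can be passed directly and the corpus is never loaded in full. The counters themselves can still grow large when the corpus contains many distinct n-grams, so this only removes the cost of the token list.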
