Datasets built from various Japanese language corpora
https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This README covers only the technical aspects.
You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data
You'll need Node.js 18 or later.
See the scripts section in package.json.
Aozora:
aozora:download - use the crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset
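As a rough illustration of what a counting step such as aozora:count has to do, the sketch below tallies kanji in a string. It is not the repository's actual implementation: the function name countKanji is hypothetical, and "kanji" is approximated by the Unicode Han script property, which may be broader or narrower than what the real scripts count.

```javascript
// Hypothetical helper: count kanji occurrences and sort by frequency.
// Assumption: "kanji" means any character with the Unicode Han script property.
function countKanji(text) {
  const counts = new Map();
  for (const match of text.matchAll(/\p{Script=Han}/gu)) {
    const kanji = match[0];
    counts.set(kanji, (counts.get(kanji) ?? 0) + 1);
  }
  // Descending by count; ties keep first-seen order because sort is stable.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countKanji("日本語の本を日本で読む"));
// → [ [ '本', 3 ], [ '日', 2 ], [ '語', 1 ], [ '読', 1 ] ]
```

Hiragana, katakana, and punctuation are skipped because they do not match the Han script property.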
Wikipedia:
wikipedia:fetch - fetch random pages using the MediaWiki API
wikipedia:count - create the dataset
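Fetching random pages through the MediaWiki API could look roughly like the sketch below. The function names and the choice of the Japanese Wikipedia endpoint are assumptions made for illustration; the repository's own script may use different parameters or fetch page content rather than titles.

```javascript
// Hypothetical helper: build a MediaWiki API URL for a batch of random pages.
// Assumption: the Japanese Wikipedia endpoint; the real script may differ.
function buildRandomPagesUrl(limit = 10, endpoint = "https://ja.wikipedia.org/w/api.php") {
  const url = new URL(endpoint);
  url.search = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // main (article) namespace only
    rnlimit: String(limit),
    format: "json",
  }).toString();
  return url.toString();
}

// Hypothetical helper: fetch random article titles.
// The global fetch function is available in Node.js 18 and later.
async function fetchRandomTitles(limit = 10) {
  const response = await fetch(buildRandomPagesUrl(limit));
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const data = await response.json();
  return data.query.random.map((page) => page.title);
}
```

Calling fetchRandomTitles(5).then(console.log) would print five random article titles. Passing a Wikinews endpoint as the second argument of buildRandomPagesUrl would target that wiki instead.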
News:
news:wikinews:fetch - fetch random pages from Wikinews using the MediaWiki API
news:count - create the dataset
news:dates - create an additional file with the dates of the articles
See the Astro docs and the scripts section in package.json.