This repository formats data from Aozora Bunko (青空文庫), a website that compiles public domain books in Japan. The data is converted into a convenient, user-friendly format suited to machine learning applications.
The dataset processed by this code is made available on HuggingFace: globis-university/aozorabunko-clean.
[For Japanese readers] An overview in Japanese is available on Qiita: https://qiita.com/akeyhero/items/b53eae1c0bc4d54e321f
First, the CSV file that lists all works is downloaded. This information is then incorporated into the `meta` field. Non-public-domain books are filtered out.
The main text of the book for each row in the CSV is retrieved and incorporated into the `text` field.
Entries where the `図書カードURL` (Library Card URL) in the CSV does not match the `作品ID` (Work ID) and `人物ID` (Person ID) are removed.
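
The consistency check described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes card URLs have the shape `.../cards/<人物ID>/card<作品ID>.html`, and both the `card_url_consistent?` helper and the sample rows are hypothetical.

```ruby
# Assumed URL shape: .../cards/<person id>/card<work id>.html
CARD_URL_PATTERN = %r{/cards/(\d+)/card(\d+)\.html\z}

# Hypothetical helper: true when the card URL agrees with the row's IDs.
def card_url_consistent?(row)
  match = CARD_URL_PATTERN.match(row['図書カードURL'].to_s)
  return false unless match

  # Compare as integers so zero-padded IDs such as "000879" still match.
  match[1].to_i == row['人物ID'].to_i && match[2].to_i == row['作品ID'].to_i
end

# Illustrative rows; the second one has a work ID that disagrees with its URL.
rows = [
  { '作品ID' => '000127', '人物ID' => '000879',
    '図書カードURL' => 'https://www.aozora.gr.jp/cards/000879/card127.html' },
  { '作品ID' => '000128', '人物ID' => '000879',
    '図書カードURL' => 'https://www.aozora.gr.jp/cards/000879/card127.html' }
]

kept = rows.select { |row| card_url_consistent?(row) }
puts kept.length
```

Comparing the IDs as integers sidesteps differences in zero padding between the CSV columns and the URL path.
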
In addition, any rows with text identical to those found earlier are discarded.
The `text` field data undergoes cleaning in the following sequence:
- Convert new lines to `\n`
- Remove headers
- Remove footnotes and add them to the `footnote` field
- Convert inserted notes into regular parenthetical text
- Remove ruby (phonetic guides)
- Convert specific characters, such as external characters and iteration marks, into standard Unicode characters
- Remove any remaining markup
- Remove leading and trailing new lines and horizontal rules
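
A few of the steps above can be sketched in Ruby as below. This is a simplified illustration rather than the repository's implementation: it assumes the usual Aozora Bunko notation in which ruby is written as `《...》` (optionally with `｜` marking where the base text starts) and annotations as `［＃...］`. The `clean_text_sketch` name is hypothetical, and header, footnote, inserted-note and external-character handling are left out.

```ruby
# Hypothetical, simplified cleaner covering only part of the sequence.
def clean_text_sketch(text)
  text
    .gsub(/\r\n?/, "\n")                          # normalize new lines to \n
    .gsub(/｜([^｜《》\n]+)《[^》\n]*》/, '\1')    # ruby with an explicit base marker
    .gsub(/《[^》\n]*》/, '')                      # remaining ruby readings
    .gsub(/［＃[^］]*］/, '')                      # remaining annotation markup
    .gsub(/\A\n+|\n+\z/, '')                      # leading and trailing new lines
end

sample = "\r\n吾輩《わがはい》は｜青空文庫《あおぞらぶんこ》を読む［＃「読む」に傍点］。\r\n"
puts clean_text_sketch(sample)
```

The order matters in this sketch: ruby with an explicit `｜` marker is handled before bare `《...》` readings so that the marker itself is not left behind.
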
bundle install
bundle exec ./save_as_jsonl.rb --public > tmp/aozorabunko.jsonl
Without the `--public` flag, the output will contain non-public-domain or non-CC data.
This removes all redundant entries with identical `text` field values.
bundle exec ./deduplicate_books.rb --in tmp/aozorabunko.jsonl > tmp/aozorabunko-dedupe.jsonl
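
The idea behind this step can be sketched as follows: keep the first record seen for each distinct `text` value and drop later ones. This is only an illustration under assumptions (one JSON object per line with a top-level `text` key); `deduplicate_by_text` is a hypothetical helper, not the script's actual code.

```ruby
require 'json'
require 'set'

# Hypothetical helper: keeps the first JSONL line for each distinct `text`.
def deduplicate_by_text(lines)
  seen = Set.new
  lines.select do |line|
    record = JSON.parse(line)
    # Set#add? returns nil when the value is already present,
    # so later duplicates are rejected by `select`.
    seen.add?(record['text'])
  end
end

# Illustrative input: the third line repeats the text of the first.
lines = [
  { 'text' => '同じ本文', 'meta' => { 'id' => 1 } }.to_json,
  { 'text' => '別の本文', 'meta' => { 'id' => 2 } }.to_json,
  { 'text' => '同じ本文', 'meta' => { 'id' => 3 } }.to_json
]

unique = deduplicate_by_text(lines)
puts unique.length
```

For a large corpus, storing a digest of each text instead of the full string would reduce memory use; the sketch keeps full strings for clarity.
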
This removes Aozora Bunko-specific markup from the `text` fields as far as possible.
bundle exec ./clean_text_in_jsonl.rb --in tmp/aozorabunko-dedupe.jsonl > tmp/aozorabunko-dedupe-clean.jsonl
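
The resulting file can be read one JSON object per line. The sketch below assumes each record carries top-level `text`, `footnote` and `meta` keys, as the field names in this README suggest; the exact record shape, the `read_records` helper and the sample values are assumptions for illustration.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: parses every non-empty line of a JSONL stream.
def read_records(io)
  io.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# In practice `io` would be something like
# File.open('tmp/aozorabunko-dedupe-clean.jsonl'); StringIO keeps the
# example self-contained.
sample = StringIO.new(<<~JSONL)
  {"text": "本文", "footnote": "底本：サンプル", "meta": {"作品ID": "000127"}}

  {"text": "別の本文", "footnote": "", "meta": {"作品ID": "000128"}}
JSONL

records = read_records(sample)
records.each do |record|
  puts "#{record['meta']['作品ID']}: #{record['text'].length} characters"
end
```
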
This collects chat data using a heuristic approach, specifically by collecting consecutive utterances enclosed in corner brackets (「...」).
bundle exec ./extract_chats.rb --in tmp/aozorabunko-dedupe-clean.jsonl > tmp/aozorabunko-dedupe-clean-chats.jsonl
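
One way to read this heuristic is sketched below: treat each line that consists solely of a 「...」 utterance as a turn, and collect runs of adjacent turns as one chat. This is an assumption about the approach, not the script's actual logic; `extract_chats_sketch` and the `min_turns` threshold are hypothetical.

```ruby
# A line counts as an utterance when it is entirely wrapped in 「 and 」.
UTTERANCE = /\A「(.+)」\z/

# Hypothetical helper: returns runs of consecutive utterances.
def extract_chats_sketch(text, min_turns: 2)
  chats = []
  current = []
  text.each_line(chomp: true) do |line|
    if (match = UTTERANCE.match(line))
      current << match[1]
    else
      # A non-utterance line ends the current run.
      chats << current if current.length >= min_turns
      current = []
    end
  end
  chats << current if current.length >= min_turns
  chats
end

text = <<~TEXT
  彼は立ち止まった。
  「どこへ行くのですか」
  「駅までです」
  「では一緒に」
  そう言って歩き出した。
  「待って」
TEXT

chats = extract_chats_sketch(text)
puts chats.length
```

The isolated final utterance is dropped here because a single turn does not form an exchange under the assumed `min_turns` threshold.
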