青空文庫 (Aozora Bunko) Extractor

This repository formats data from Aozora Bunko (青空文庫), a website that compiles public-domain books in Japan. The data is converted into a convenient, easy-to-use format suited to machine learning applications.

The dataset processed by this code is made available on HuggingFace: globis-university/aozorabunko-clean.

[For Japanese readers] An overview in Japanese is available on Qiita: https://qiita.com/akeyhero/items/b53eae1c0bc4d54e321f

Methodology

1. Data collection

First, the CSV file that lists all works is downloaded. This information is then incorporated into the meta field. Non-public-domain books are filtered out. The main text for each book corresponding to every row in the CSV is retrieved and incorporated into the text field.
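
The collection step can be pictured with the minimal sketch below. It is not the repository's code: the column names (作品ID, 作品名, 作品著作権フラグ), the flag value なし, and the fetch_text helper are assumptions made for illustration, and the CSV is an inline sample rather than the downloaded file.

```ruby
require "csv"
require "json"

# Hypothetical sample of the works list; the real CSV has many more columns.
SAMPLE_CSV = <<~ROWS
  作品ID,作品名,作品著作権フラグ
  000001,Public work,なし
  000002,Copyrighted work,あり
ROWS

# Hypothetical stand-in for retrieving the main text of one work.
def fetch_text(row)
  "(main text of #{row['作品名']})"
end

# Build one record per CSV row, keeping only rows assumed to be public domain.
# The whole row goes into "meta"; the retrieved body goes into "text".
def collect_records(csv_string)
  rows = CSV.parse(csv_string, headers: true).map(&:to_h)
  rows
    .select { |row| row["作品著作権フラグ"] == "なし" }
    .map { |row| { "text" => fetch_text(row), "meta" => row } }
end

records = collect_records(SAMPLE_CSV)
records.each { |record| puts JSON.generate(record) }
```

Keeping the full row in meta means later steps can filter on any CSV column without going back to the source file.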

2. Deduplication

Entries where the 図書カードURL (Library Card URL) in the CSV does not match the 作品ID (Work ID) and 人物ID (Person ID) are removed. In addition, any rows with text identical to those found earlier are discarded.
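
The URL consistency check might look like the sketch below. It assumes that a library card URL has the shape .../cards/<人物ID>/card<作品ID>.html and that the CSV stores zero-padded IDs, so the IDs are compared as integers. Both the URL shape and the name card_url_consistent? are assumptions, not taken from the repository.

```ruby
# Assumed URL shape: .../cards/<人物ID>/card<作品ID>.html
CARD_URL_PATTERN = %r{/cards/(\d+)/card(\d+)\.html\z}

# True when the IDs embedded in the card URL agree with the row's ID columns.
def card_url_consistent?(row)
  match = CARD_URL_PATTERN.match(row["図書カードURL"].to_s)
  return false if match.nil?

  # Compare as integers so differences in zero padding do not matter.
  match[1].to_i == row["人物ID"].to_i && match[2].to_i == row["作品ID"].to_i
end

# Made-up rows: the second lists the same card under a different person ID.
rows = [
  { "作品ID" => "000127", "人物ID" => "000879",
    "図書カードURL" => "https://www.aozora.gr.jp/cards/000879/card127.html" },
  { "作品ID" => "000127", "人物ID" => "000001",
    "図書カードURL" => "https://www.aozora.gr.jp/cards/000879/card127.html" },
]
kept = rows.select { |row| card_url_consistent?(row) }
```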

3. Cleaning

The text field data undergoes cleaning in the following sequence:

Convert new lines to \n
Remove headers
Remove footnotes and add them to the footnote field
Convert inserted notes into regular parenthetical text
Remove ruby (phonetic guides)
Convert specific characters, such as external characters and iteration marks, into standard Unicode characters
Remove any remaining markup
Remove leading and trailing new lines and horizontal rules
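
A few of these steps can be illustrated with the sketch below. It covers only a subset (line endings, footnotes, ruby, remaining annotations, surrounding blank lines) and rests on assumptions about Aozora Bunko notation: ruby is written as 《...》 with an optional ｜ marking where the base text starts, editor annotations as ［＃...］, and the trailing footnote block begins at a line starting with 底本：. The function names are hypothetical, and the repository's actual rules are likely more thorough.

```ruby
# Split the body from the trailing footnote block (assumed to start at "底本：").
def split_footnote(text)
  index = text.index(/^底本：/)
  return [text, ""] if index.nil?

  [text[0...index], text[index..-1]]
end

# Apply a simplified version of the cleaning sequence to one text.
def clean_sketch(raw)
  text = raw.gsub(/\r\n?/, "\n")        # normalize line endings to \n
  text, footnote = split_footnote(text) # move footnotes out of the body
  text = text.gsub(/《[^》]*》/, "")     # drop ruby readings
  text = text.delete("｜")              # drop ruby base-text markers
  text = text.gsub(/［＃[^］]*］/, "")   # drop remaining annotations
  text = text.gsub(/\A\n+|\n+\z/, "")   # trim leading and trailing new lines
  { "text" => text, "footnote" => footnote.strip }
end

sample = "吾輩《わがはい》は｜猫《ねこ》である［＃「である」に傍点］\r\n\r\n底本：「例」\r\n"
result = clean_sketch(sample)
```

The order matters in the same way as in the list above: footnotes are split off before ruby and annotations are stripped, so the footnote field keeps its original wording.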

Usage

1. Include submodule

The aozorabunko_text directory is a Git submodule (see .gitmodules). Fetch it before running the scripts, for example with git submodule update --init.

2. Bundle install

bundle install

3. Download Aozora Bunko data

bundle exec ./save_as_jsonl.rb --public > tmp/aozorabunko.jsonl

Without the --public flag, the output also includes works that are not in the public domain or not CC-licensed.
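
To inspect the resulting file, a small reader such as the one below may help. It is only a sketch: it assumes each line is one JSON object with text and meta fields, as described under Methodology, and the 作品名 key and the sample record are made up for illustration.

```ruby
require "json"
require "stringio"

# Yield each JSONL line as a Hash; blank lines are skipped.
def each_record(io)
  return enum_for(:each_record, io) unless block_given?

  io.each_line do |line|
    next if line.strip.empty?

    yield JSON.parse(line)
  end
end

# Stand-in for File.open("tmp/aozorabunko.jsonl"); the record content is made up.
sample = StringIO.new(%({"text":"本文","meta":{"作品名":"例"}}\n))
titles = each_record(sample).map { |record| record["meta"]["作品名"] }
```

Reading line by line keeps memory use flat, which matters when the file holds the full text of many books.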

4. Deduplicate books

This removes all redundant entries with identical text field values.

bundle exec ./deduplicate_books.rb --in tmp/aozorabunko.jsonl > tmp/aozorabunko-dedupe.jsonl
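
One way to drop records with identical text is sketched below. Hashing each text with SHA-256 and keeping the first occurrence is an assumption chosen here to avoid holding every full text in memory; deduplicate_books.rb may work differently.

```ruby
require "digest"
require "set"

# Keep the first record for each distinct "text" value.
def deduplicate_by_text(records)
  seen = Set.new
  records.select do |record|
    # Set#add? returns nil when the digest was already present.
    seen.add?(Digest::SHA256.hexdigest(record["text"]))
  end
end

# Made-up records: the first and third share the same text.
records = [
  { "id" => 1, "text" => "同じ本文" },
  { "id" => 2, "text" => "別の本文" },
  { "id" => 3, "text" => "同じ本文" },
]
unique = deduplicate_by_text(records)
```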

5. Clean up texts

This removes Aozora Bunko-specific markup from the text fields as far as possible.

bundle exec ./clean_text_in_jsonl.rb --in tmp/aozorabunko-dedupe.jsonl > tmp/aozorabunko-dedupe-clean.jsonl

Extra: Extract chats

This collects chat data heuristically: consecutive utterances enclosed in Japanese quotation brackets, 「...」, are gathered into one chat.

bundle exec ./extract_chats.rb --in tmp/aozorabunko-dedupe-clean.jsonl > tmp/aozorabunko-dedupe-clean-chats.jsonl
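
The heuristic could be approximated as in the sketch below. It assumes that an utterance is a line consisting solely of 「...」 and that a chat needs at least two consecutive utterances. The min_turns parameter and the function name are hypothetical, and extract_chats.rb may apply different rules.

```ruby
# A line that consists only of one bracketed utterance, ignoring surrounding spaces.
UTTERANCE = /\A[[:space:]]*「(.+)」[[:space:]]*\z/

# Group consecutive utterance lines into chats of at least min_turns turns.
def extract_chats(text, min_turns: 2)
  chats = []
  current = []
  # Close the current run, keeping it only if it is long enough.
  flush = lambda do
    chats << current if current.size >= min_turns
    current = []
  end

  text.each_line(chomp: true) do |line|
    match = UTTERANCE.match(line)
    if match
      current << match[1]
    else
      flush.call
    end
  end
  flush.call
  chats
end

sample = "彼は言った。\n「おはよう」\n「おはようございます」\n外は雨だった。\n「ひとりごと」\n"
chats = extract_chats(sample)
```

Any narration line ends the current run, so a single isolated utterance is never reported as a chat.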

Repository: globis-org/aozorabunko-extractor
