Commit
Merge pull request #40 from tma15/feature/jglue-example
Feature/jglue example
tma15 authored Feb 11, 2024
2 parents ead609b + bbbefa1 commit 6c61d2e
Showing 7 changed files with 206 additions and 10 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
*egg-info
build
dist
.mypy_cache

*cpp

42 changes: 32 additions & 10 deletions README.md
@@ -1,4 +1,6 @@
# Bunruija
[![PyPI version](https://badge.fury.io/py/bunruija.svg)](https://badge.fury.io/py/bunruija)

Bunruija is a text classification toolkit.
Bunruija aims to enable pre-processing, training, and evaluation of text classification models with **minimum coding effort**.
Bunruija mainly focuses on Japanese, though it is also applicable to other languages.
@@ -20,9 +22,9 @@ Example of `sklearn.svm.SVC`

```yaml
data:
train: train.csv
dev: dev.csv
test: test.csv
train: train.jsonl
dev: dev.jsonl
test: test.jsonl

output_dir: models/svm-model

@@ -51,9 +53,9 @@ Example of BERT
```yaml
data:
train: train.csv
dev: dev.csv
test: test.csv
train: train.jsonl
dev: dev.jsonl
test: test.jsonl

output_dir: models/transformer-model

@@ -94,9 +96,9 @@ You can set data-related settings in `data`.

```yaml
data:
train: train.csv # training data
dev: dev.csv # development data
test: test.csv # test data
train: train.jsonl # training data
dev: dev.jsonl # development data
test: test.jsonl # test data
label_column: label
text_column: text
```
@@ -127,8 +129,28 @@ Format of `jsonl`:
```
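
For reference, the JCoLA and MARC-ja conversion scripts added in this commit write one JSON object per line with `text` and `label` keys, matching the `label_column` and `text_column` settings above. The record below is only an illustrative example, not taken from the dataset:

```json
{"text": "これは受理可能な例文です。", "label": "acceptable"}
```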

### pipeline
You can set pipeline of your model in `pipeline`
You can set the pipeline of your model in the `pipeline` section.
It is a list of the components that make up your model.

For each component, `type` is a module path and `args` specifies the arguments passed to that module.
For instance, when you set the first component as follows, [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is instantiated with the given arguments and then applied to the data as the first step of your model.

```yaml
- type: sklearn.feature_extraction.text.TfidfVectorizer
  args:
    tokenizer:
      type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
      args:
        lemmatize: true
        exclude_pos:
          - 助詞
          - 助動詞
    max_features: 10000
    min_df: 3
    ngram_range:
      - 1
      - 3
```
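
Conceptually, the configuration above corresponds to roughly the following Python construction. This is only a sketch of how `type` and `args` map onto a class and its keyword arguments; the class paths and argument values are taken from the config above, and bunruija itself performs the actual instantiation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

from bunruija.tokenizers.mecab_tokenizer import MeCabTokenizer  # module path from the config above

# Build the tokenizer component from its `args`.
tokenizer = MeCabTokenizer(lemmatize=True, exclude_pos=["助詞", "助動詞"])

# Build the vectorizer component, passing the tokenizer and the remaining `args`.
vectorizer = TfidfVectorizer(
    tokenizer=tokenizer,
    max_features=10000,
    min_df=3,
    ngram_range=(1, 3),
)
```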

## Prediction using the trained classifier in Python code
After you have trained a classification model, you can use it for prediction as follows:
38 changes: 38 additions & 0 deletions example/jglue/jcola/README.md
@@ -0,0 +1,38 @@
# Evaluation Results

## Linear SVM
### Config
```yaml
pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞
            - 助動詞
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.LinearSVC
    args:
      verbose: 10
      C: 10.
```
### Results
```
F-score on dev: 0.7514450867052023
precision recall f1-score support

acceptable 0.86 0.85 0.85 733
unacceptable 0.20 0.21 0.21 132

accuracy 0.75 865
macro avg 0.53 0.53 0.53 865
weighted avg 0.76 0.75 0.75 865
```
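
The table above is in the format of scikit-learn's `classification_report`. A minimal sketch of how such a report is produced, with dummy labels standing in for the actual dev-set predictions (this is illustrative only, not bunruija's evaluation code):

```python
from sklearn.metrics import classification_report, f1_score

# Dummy gold and predicted labels for illustration; in practice they come from
# the dev set and the trained model.
y_true = ["acceptable", "acceptable", "unacceptable", "acceptable"]
y_pred = ["acceptable", "unacceptable", "unacceptable", "acceptable"]

# Assumption: the reported "F-score on dev" is an averaged F1 over the dev set.
print(f1_score(y_true, y_pred, average="micro"))
print(classification_report(y_true, y_pred))
```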
36 changes: 36 additions & 0 deletions example/jglue/jcola/create_jcola_data.py
@@ -0,0 +1,36 @@
import json
from argparse import ArgumentParser
from pathlib import Path

from datasets import Dataset, load_dataset
from loguru import logger # type: ignore


def write_json(ds: Dataset, name: Path):
    # Convert each sample to {"text": ..., "label": ...} and write one JSON object per line.
    with open(name, "w") as f:
        for sample in ds:
            # Map the integer label id to its string name (e.g. "acceptable").
            category: str = ds.features["label"].names[sample["label"]]
            sample_ = {
                "text": sample["sentence"],
                "label": category,
            }
            print(json.dumps(sample_), file=f)
    logger.info(f"{name}")


def main():
    parser = ArgumentParser()
    parser.add_argument("--output_dir", default="example/jglue/jcola/data", type=Path)
    args = parser.parse_args()

    if not args.output_dir.exists():
        args.output_dir.mkdir(parents=True)

    # Download JCoLA from the Hugging Face Hub via shunk031/JGLUE.
    dataset = load_dataset("shunk031/JGLUE", name="JCoLA")

    write_json(dataset["train"], args.output_dir / "train.jsonl")
    write_json(dataset["validation"], args.output_dir / "dev.jsonl")


if __name__ == "__main__":
    main()
38 changes: 38 additions & 0 deletions example/jglue/marc_ja/README.md
@@ -0,0 +1,38 @@
# Evaluation Results

## Linear SVM
### Config
```yaml
pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞
            - 助動詞
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.LinearSVC
    args:
      verbose: 10
      C: 10.
```
### Results
```
F-score on dev: 0.9225327201980899
precision recall f1-score support

negative 0.56 0.85 0.68 542
positive 0.98 0.93 0.96 5112

accuracy 0.92 5654
macro avg 0.77 0.89 0.82 5654
weighted avg 0.94 0.92 0.93 5654
```
36 changes: 36 additions & 0 deletions example/jglue/marc_ja/create_marc_ja_data.py
@@ -0,0 +1,36 @@
import json
from argparse import ArgumentParser
from pathlib import Path

from datasets import Dataset, load_dataset
from loguru import logger # type: ignore


def write_json(ds: Dataset, name: Path):
    # Convert each sample to {"text": ..., "label": ...} and write one JSON object per line.
    with open(name, "w") as f:
        for sample in ds:
            # Map the integer label id to its string name (e.g. "positive" / "negative").
            category: str = ds.features["label"].names[sample["label"]]
            sample_ = {
                "text": sample["sentence"],
                "label": category,
            }
            print(json.dumps(sample_), file=f)
    logger.info(f"{name}")


def main():
    parser = ArgumentParser()
    parser.add_argument("--output_dir", default="example/jglue/marc_ja/data", type=Path)
    args = parser.parse_args()

    if not args.output_dir.exists():
        args.output_dir.mkdir(parents=True)

    # Download MARC-ja from the Hugging Face Hub via shunk031/JGLUE.
    dataset = load_dataset("shunk031/JGLUE", name="MARC-ja")

    write_json(dataset["train"], args.output_dir / "train.jsonl")
    write_json(dataset["validation"], args.output_dir / "dev.jsonl")


if __name__ == "__main__":
    main()
25 changes: 25 additions & 0 deletions example/jglue/settings/svm.yaml
@@ -0,0 +1,25 @@
data:
  train: data/train.jsonl
  dev: data/dev.jsonl

output_dir: models/svm-model

pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞
            - 助動詞
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.LinearSVC
    args:
      verbose: 10
      C: 10.
