Fix dataset doc and fix roberta tokenizer and update SQuAD example (PaddlePaddle#42)

* fix dataset doc and fix roberta tokenizer and update SQuAD example

* Change to relative link

* Update annotation.

* Minor fix
smallv0221 authored Feb 28, 2021
1 parent 2dcef6a commit 1e955cb
Showing 9 changed files with 794 additions and 576 deletions.
8 changes: 4 additions & 4 deletions docs/datasets.md
@@ -34,7 +34,7 @@ PaddleNLP provides
| ---- | --------- | ------ |
| [Conll05](https://www.cs.upc.edu/~srlconll/spec.html) | Semantic role labeling dataset | `paddle.text.datasets.Conll05st`|
| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA named entity recognition dataset | `paddlenlp.datasets.MSRA_NER`|
| [Express_Ner](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Express waybill named entity recognition dataset | [express_ner](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/named_entity_recognition/express_ner/data)|
| [Express_Ner](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Express waybill named entity recognition dataset | [express_ner](../examples/named_entity_recognition/express_ner/data)|

## Machine Translation

@@ -47,13 +47,13 @@ PaddleNLP provides

| Dataset | Description | Usage |
| ---- | --------- | ------ |
| [CSSE COVID-19](https://github.com/CSSEGISandData/COVID-19) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/time_series)|
| [CSSE COVID-19](../examples/time_series) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/time_series)|
| [UCIHousing](https://archive.ics.uci.edu/ml/datasets/Housing) | Boston housing price prediction dataset | `paddle.text.datasets.UCIHousing`|

## Corpora

| Dataset | Description | Usage |
| ---- | --------- | ------ |
| [yahoo](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1) | Yahoo English corpus | [VAE](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/text_generation/vae-seq2seq)|
| [yahoo](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1) | Yahoo English corpus | [VAE](../examples/text_generation/vae-seq2seq)|
| [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.PTB`|
| [1 Billion words](https://opensource.google/projects/lm-benchmark) | 1 Billion Word Language Model Benchmark R13 Output benchmark corpus | [ELMo](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/language_model/elmo)|
| [1 Billion words](https://opensource.google/projects/lm-benchmark) | 1 Billion Word Language Model Benchmark R13 Output benchmark corpus | [ELMo](../examples/language_model/elmo)|
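
The relative links introduced in `docs/datasets.md` are resolved against the directory that contains the document. A minimal sketch of that resolution, using only the Python standard library, is shown here; the helper name `resolve_relative_link` is hypothetical and is not part of PaddleNLP.

```python
import posixpath


def resolve_relative_link(doc_path: str, link: str) -> str:
    """Resolve a relative Markdown link against the directory of its document.

    Hypothetical helper for illustration only.
    """
    doc_dir = posixpath.dirname(doc_path)
    return posixpath.normpath(posixpath.join(doc_dir, link))


# Relative targets taken from the diff of docs/datasets.md.
links = [
    "../examples/named_entity_recognition/express_ner/data",
    "../examples/text_generation/vae-seq2seq",
    "../examples/language_model/elmo",
]

for link in links:
    # Each link climbs out of docs/ and lands under examples/ at the repo root.
    print(resolve_relative_link("docs/datasets.md", link))
```

Because the targets resolve inside the same repository, the links keep working if the repository is forked or moved, which an absolute URL to another repository would not.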
401 changes: 311 additions & 90 deletions examples/experimental/run_squad_test.py

Large diffs are not rendered by default.

8 changes: 7 additions & 1 deletion examples/machine_reading_comprehension/SQuAD/README.md
@@ -41,7 +41,7 @@ SQuAD v2.0


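### Data preparation
To make testing easier for developers, we provide a built-in data download script. Users can pass `--version_2_with_negative` on the command line to select the SQuAD dataset version they need, and can also pass the location of a local dataset via `--data_path`. The dataset must follow the same format as the SQuAD dataset.
To make testing easier for developers, we provide a built-in data download script. Users can pass `--version_2_with_negative` on the command line to select the SQuAD dataset version they need, and can also pass the location of a local dataset via `--train_file` and `--prediction_file`. The dataset must follow the same format as the SQuAD dataset.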


### Fine-tune
@@ -61,12 +61,16 @@ python -u ./run_squad.py \
--warmup_proportion 0.1 \
--weight_decay 0.01 \
--output_dir ./tmp/squad/ \
--do_train \
--do_pred \
--n_gpu 1
```

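* `model_type`: The type of pretrained model, e.g. bert, ernie, roberta.
* `model_name_or_path`: The specific name of the pretrained model, e.g. bert-base-uncased or bert-large-cased, or a local path to the model files.
* `output_dir`: The path where model checkpoints are saved.
* `do_train`: Whether to run training.
* `do_pred`: Whether to run prediction.

After training finishes, the model automatically evaluates the results and produces output similar to the following: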

@@ -97,6 +101,8 @@ python -u ./run_squad.py \
--weight_decay 0.01 \
--output_dir ./tmp/squad/ \
--n_gpu 1 \
--do_train \
--do_pred \
--version_2_with_negative
```
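
The README requires local data files to follow the SQuAD format. As a rough sketch of what that means, the following standalone check walks the public SQuAD JSON layout (`data` → `paragraphs` → `qas` → `answers`). The function `check_squad_format` and its `allow_no_answer` parameter are hypothetical, chosen to mirror `--version_2_with_negative`; they are not part of `run_squad.py`.

```python
def check_squad_format(dataset: dict, allow_no_answer: bool = False) -> int:
    """Count the questions in a SQuAD-style dict.

    Raises ValueError on an inconsistent entry and KeyError on a missing field.
    Hypothetical helper for illustration only.
    """
    count = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if not qa["question"]:
                    raise ValueError(f"question {qa['id']} is empty")
                answers = qa.get("answers", [])
                # SQuAD v2.0 marks unanswerable questions with is_impossible.
                unanswerable = qa.get("is_impossible", False)
                if not answers and not (allow_no_answer and unanswerable):
                    raise ValueError(f"question {qa['id']} has no answer")
                for answer in answers:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The answer text must appear in the context at its offset.
                    if context[start:end] != answer["text"]:
                        raise ValueError(f"answer offset mismatch in {qa['id']}")
                count += 1
    return count


# A tiny made-up sample in the SQuAD v2.0 layout.
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "PaddleNLP is an NLP library.",
            "qas": [
                {"id": "q1", "question": "What is PaddleNLP?",
                 "answers": [{"text": "an NLP library", "answer_start": 13}],
                 "is_impossible": False},
                {"id": "q2", "question": "Who wrote it?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(check_squad_format(sample, allow_no_answer=True))
```

With `allow_no_answer=False` the same sample is rejected because `q2` has no answer, which is the distinction the `--version_2_with_negative` flag exists to express.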

16 changes: 13 additions & 3 deletions examples/machine_reading_comprehension/SQuAD/args.py
@@ -4,10 +4,17 @@
def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--data_path",
        "--train_file",
        type=str,
        required=False,
        default=None,
        help="Directory of all the data for train, valid, test.")
        help="Train data path.")
    parser.add_argument(
        "--predict_file",
        type=str,
        required=False,
        default=None,
        help="Predict data path.")
    parser.add_argument(
        "--model_type",
        default=None,
@@ -123,6 +130,9 @@ def parse_args():
        action='store_true',
        help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true."
    )

    parser.add_argument(
        "--do_train", action='store_true', help="Whether to train the model.")
    parser.add_argument(
        "--do_pred", action='store_true', help="Whether to predict.")
    args = parser.parse_args()
    return args
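
To see how the four arguments touched by this diff behave together, here is a self-contained sketch that reproduces only those flags, with the same names, defaults, and help strings. The `build_parser` and `planned_phases` helpers are hypothetical: how `run_squad.py` actually consumes `do_train` and `do_pred` is not shown in this commit view, so the gating logic is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser holding only the flags added or renamed in this diff."""
    parser = argparse.ArgumentParser(description="Sketch of the SQuAD data and mode flags.")
    parser.add_argument("--train_file", type=str, required=False, default=None,
                        help="Train data path.")
    parser.add_argument("--predict_file", type=str, required=False, default=None,
                        help="Predict data path.")
    # store_true flags default to False and become True when present.
    parser.add_argument("--do_train", action="store_true",
                        help="Whether to train the model.")
    parser.add_argument("--do_pred", action="store_true",
                        help="Whether to predict.")
    return parser


def planned_phases(args: argparse.Namespace) -> list:
    """Hypothetical helper: list the phases the flags would enable."""
    phases = []
    if args.do_train:
        phases.append("train")
    if args.do_pred:
        phases.append("predict")
    return phases


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(["--train_file", "train.json", "--do_train"])
print(args.train_file, args.predict_file, planned_phases(args))
```

Since both mode flags use `store_true`, omitting them leaves both phases disabled, which is why the updated README commands pass `--do_train` and `--do_pred` explicitly.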