Fix dataset doc and fix roberta tokenizer and update SQuAD example (PaddlePaddle#42)

* fix dataset doc and fix roberta tokenizer and update SQuAD example
* Change to relative link
* Update annotation.
* Minor fix
wangxicoding · Feb 28, 2021 · 1e955cb
1 parent 2dcef6a
Showing 9 changed files with 794 additions and 576 deletions.
diff --git a/docs/datasets.md b/docs/datasets.md
@@ -34,7 +34,7 @@ PaddleNLP provides
 |  ----  | --------- | ------ |
 |  [Conll05](https://www.cs.upc.edu/~srlconll/spec.html) | Semantic role labeling dataset | `paddle.text.datasets.Conll05st`|
 |  [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA named entity recognition dataset | `paddlenlp.datasets.MSRA_NER`|
-|  [Express_Ner](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Express waybill named entity recognition dataset | [express_ner](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/named_entity_recognition/express_ner/data)|
+|  [Express_Ner](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Express waybill named entity recognition dataset | [express_ner](../examples/named_entity_recognition/express_ner/data)|
 
 ## Machine Translation
 
@@ -47,13 +47,13 @@ PaddleNLP provides
 
 | Dataset | Description | Usage |
 | ----  | --------- | ------ |
-|  [CSSE COVID-19](https://github.com/CSSEGISandData/COVID-19) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/time_series)|
+|  [CSSE COVID-19](https://github.com/CSSEGISandData/COVID-19) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](../examples/time_series)|
 |  [UCIHousing](https://archive.ics.uci.edu/ml/datasets/Housing) | Boston housing price prediction dataset | `paddle.text.datasets.UCIHousing`|
 
 ## Corpora
 
 | Dataset | Description | Usage |
 | ----  | --------- | ------ |
-|  [yahoo](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1) | Yahoo English corpus | [VAE](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/text_generation/vae-seq2seq)|
+|  [yahoo](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1) | Yahoo English corpus | [VAE](../examples/text_generation/vae-seq2seq)|
 |  [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.PTB`|
-|  [1 Billion words](https://opensource.google/projects/lm-benchmark) | 1 Billion Word Language Model Benchmark R13 Output benchmark corpus | [ELMo](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/examples/language_model/elmo)|
+|  [1 Billion words](https://opensource.google/projects/lm-benchmark) | 1 Billion Word Language Model Benchmark R13 Output benchmark corpus | [ELMo](../examples/language_model/elmo)|
diff --git a/examples/experimental/run_squad_test.py b/examples/experimental/run_squad_test.py
diff --git a/examples/machine_reading_comprehension/SQuAD/README.md b/examples/machine_reading_comprehension/SQuAD/README.md
@@ -41,7 +41,7 @@ SQuAD v2.0
 
 
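 ### Data Preparation
-To make testing easier for developers, a data download script is built in. Users can pass `--version_2_with_negative` on the command line to select the SQuAD dataset version, or pass `--data_path` to point to a local dataset, which must follow the SQuAD dataset format.
+To make testing easier for developers, a data download script is built in. Users can pass `--version_2_with_negative` on the command line to select the SQuAD dataset version, or pass `--train_file` and `--predict_file` to point to local datasets, which must follow the SQuAD dataset format.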
 
 
 ### Fine-tune
@@ -61,12 +61,16 @@ python -u ./run_squad.py \
     --warmup_proportion 0.1 \
     --weight_decay 0.01 \
     --output_dir ./tmp/squad/ \
+    --do_train \
+    --do_pred \
     --n_gpu 1
  ```
 
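 * `model_type`: The type of pretrained model, e.g. bert, ernie, roberta.
 * `model_name_or_path`: The specific name of the pretrained model, e.g. bert-base-uncased or bert-large-cased, or a local path to the model files.
 * `output_dir`: The directory where model checkpoints are saved.
+* `do_train`: Whether to run training.
+* `do_pred`: Whether to run prediction.
 
 After training finishes, the model automatically evaluates the results and produces output similar to the following: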
 
@@ -97,6 +101,8 @@ python -u ./run_squad.py \
     --weight_decay 0.01 \
     --output_dir ./tmp/squad/ \
     --n_gpu 1 \
+    --do_train \
+    --do_pred \
     --version_2_with_negative
  ```
 

diff --git a/examples/machine_reading_comprehension/SQuAD/args.py b/examples/machine_reading_comprehension/SQuAD/args.py
@@ -4,10 +4,17 @@
 def parse_args():
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument(
-        "--data_path",
+        "--train_file",
         type=str,
+        required=False,
         default=None,
-        help="Directory of all the data for train, valid, test.")
+        help="Train data path.")
+    parser.add_argument(
+        "--predict_file",
+        type=str,
+        required=False,
+        default=None,
+        help="Predict data path.")
     parser.add_argument(
         "--model_type",
         default=None,
@@ -123,6 +130,9 @@ def parse_args():
         action='store_true',
         help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true."
     )
-
+    parser.add_argument(
+        "--do_train", action='store_true', help="Whether to train the model.")
+    parser.add_argument(
+        "--do_pred", action='store_true', help="Whether to predict.")
     args = parser.parse_args()
     return args
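
The `args.py` change above replaces the single `--data_path` option with separate `--train_file` and `--predict_file` paths and adds two `store_true` switches, `--do_train` and `--do_pred`. As a minimal sketch of how a script might consume them, the snippet below rebuilds only those four options and maps them to an ordered list of stages. `build_parser` and `plan_stages` are hypothetical names used for illustration; they are not part of this commit, and the real `run_squad.py` may dispatch differently.

```python
import argparse


def build_parser():
    # Only the four options touched by this commit; the real args.py
    # defines many more (model_type, output_dir, and so on).
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_file", type=str, required=False, default=None,
        help="Train data path.")
    parser.add_argument(
        "--predict_file", type=str, required=False, default=None,
        help="Predict data path.")
    parser.add_argument(
        "--do_train", action="store_true", help="Whether to train the model.")
    parser.add_argument(
        "--do_pred", action="store_true", help="Whether to predict.")
    return parser


def plan_stages(args):
    """Hypothetical helper: list the stages selected by the flags, in order."""
    stages = []
    if args.do_train:
        # A path of None means no local file was given; this sketch assumes
        # the caller then falls back to the built-in dataset.
        stages.append(("train", args.train_file))
    if args.do_pred:
        stages.append(("predict", args.predict_file))
    return stages


# Example: predict only, using a local SQuAD-format file.
example_args = build_parser().parse_args(
    ["--do_pred", "--predict_file", "dev-v1.1.json"])
example_stages = plan_stages(example_args)
```

Because both switches use `action='store_true'`, omitting them leaves both stages disabled, which is why the README commands in this commit now pass `--do_train` and `--do_pred` explicitly.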