This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Fix][Docker] Fix the docker image + Fix pretrain_corpus document. #1378

Merged on Oct 15, 2020 (115 commits)
Commits
- df1480f: update (sxjscience, Sep 29, 2020)
- 120c4f4: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Sep 29, 2020)
- 9f0b129: fix the docker image (sxjscience, Sep 30, 2020)
- 47c1676: Update README.md (sxjscience, Sep 30, 2020)
- be6aa35: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Sep 30, 2020)
- 07d9e0f: Update README.md (sxjscience, Sep 30, 2020)
- 3d18977: fix readme (sxjscience, Oct 1, 2020)
- 146b826: Add CPU DockerFile (sxjscience, Oct 8, 2020)
- 487e88e: update (sxjscience, Oct 8, 2020)
- 0fbecd4: update (sxjscience, Oct 8, 2020)
- 9b454bd: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Oct 8, 2020)
- 0e6d40a: update (sxjscience, Oct 9, 2020)
- 4d221cf: prepare to add TVM to docker (sxjscience, Oct 9, 2020)
- 029cb05: try to update (sxjscience, Oct 10, 2020)
- 5a69ff8: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Oct 10, 2020)
- 35c3e1c: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Oct 10, 2020)
- fc66551: Update install_openmpi.sh (sxjscience, Oct 10, 2020)
- 2006d0b: update (sxjscience, Oct 10, 2020)
- 8f0fa41: Create install_llvm.sh (sxjscience, Oct 10, 2020)
- 80bc071: Update ubuntu18.04-base-gpu.Dockerfile (sxjscience, Oct 10, 2020)
- ee3d27b: Update ubuntu18.04-base-gpu.Dockerfile (sxjscience, Oct 10, 2020)
- 5790d6b: Update run_squad2_albert_base.sh (sxjscience, Oct 10, 2020)
- ae8b2cc: Update prepare_squad.py (sxjscience, Oct 10, 2020)
- 0555216: Update prepare_squad.py (sxjscience, Oct 10, 2020)
- 43d4198: Update prepare_squad.py (sxjscience, Oct 10, 2020)
- 4dc0024: fix (sxjscience, Oct 10, 2020)
- 8d8fbb7: Update README.md (sxjscience, Oct 11, 2020)
- 5aa0fcb: update (sxjscience, Oct 11, 2020)
- 704117d: update (sxjscience, Oct 11, 2020)
- be03a49: Update README.md (sxjscience, Oct 11, 2020)
- eb7d782: Update README.md (sxjscience, Oct 11, 2020)
- 515dd10: Update ubuntu18.04-devel-gpu.Dockerfile (sxjscience, Oct 11, 2020)
- 202d89f: update (sxjscience, Oct 11, 2020)
- 633005e: Update README.md (sxjscience, Oct 11, 2020)
- 8fd9db7: fix (sxjscience, Oct 11, 2020)
- bc72cbe: Update ubuntu18.04-base-cpu.Dockerfile (sxjscience, Oct 11, 2020)
- 0f6067b: update (sxjscience, Oct 11, 2020)
- 2620dfd: add tvm to lazy import (sxjscience, Oct 11, 2020)
- 2d58e0c: update (sxjscience, Oct 11, 2020)
- 8234215: Update README.md (sxjscience, Oct 11, 2020)
- 7dada1d: update (sxjscience, Oct 11, 2020)
- 9fbaf77: Update README.md (sxjscience, Oct 11, 2020)
- c62639d: Update run_squad2_albert_base.sh (sxjscience, Oct 11, 2020)
- 7e810ad: update (sxjscience, Oct 11, 2020)
- 2cb007d: update (sxjscience, Oct 11, 2020)
- d52075d: update (sxjscience, Oct 11, 2020)
- 028a0e5: update (sxjscience, Oct 11, 2020)
- f448df5: update (sxjscience, Oct 11, 2020)
- 83e96c0: Update README.md (sxjscience, Oct 11, 2020)
- ed80b9f: Update install_ubuntu18.04_core.sh (sxjscience, Oct 11, 2020)
- f8d09a0: update (sxjscience, Oct 11, 2020)
- 26ef33c: update (sxjscience, Oct 11, 2020)
- d33834b: update (sxjscience, Oct 11, 2020)
- a689265: fix (sxjscience, Oct 11, 2020)
- 9653d7a: Update README.md (sxjscience, Oct 11, 2020)
- 0b8f37d: Update run_batch_squad.sh (sxjscience, Oct 11, 2020)
- 8c38f98: update (sxjscience, Oct 11, 2020)
- a605e3a: Update run_batch_squad.sh (sxjscience, Oct 11, 2020)
- 36628ac: Update run_batch_squad.sh (sxjscience, Oct 11, 2020)
- d850924: update (sxjscience, Oct 12, 2020)
- d629235: Update README.md (sxjscience, Oct 12, 2020)
- ab0a183: fix (sxjscience, Oct 12, 2020)
- 74e2966: Update gluon_nlp_job.sh (sxjscience, Oct 12, 2020)
- ab24028: update (sxjscience, Oct 12, 2020)
- 2f0c048: Update README.md (sxjscience, Oct 12, 2020)
- 296bc7e: Update README.md (sxjscience, Oct 12, 2020)
- cc62fde: Update README.md (sxjscience, Oct 12, 2020)
- 0650674: update (sxjscience, Oct 12, 2020)
- 644618a: Update README.md (sxjscience, Oct 12, 2020)
- 7b7f42f: update (sxjscience, Oct 12, 2020)
- 0e169c9: Update install_python_packages.sh (sxjscience, Oct 12, 2020)
- 49d1453: Update install_llvm.sh (sxjscience, Oct 12, 2020)
- c6c131d: Update install_python_packages.sh (sxjscience, Oct 12, 2020)
- efbd7f5: Update install_llvm.sh (sxjscience, Oct 12, 2020)
- 522fa85: update (sxjscience, Oct 12, 2020)
- 6d53466: Update install_ubuntu18.04_core.sh (sxjscience, Oct 12, 2020)
- 1fcf8a3: fix (sxjscience, Oct 12, 2020)
- 450d08e: Update submit-job.py (sxjscience, Oct 13, 2020)
- 207d0d0: Update submit-job.py (sxjscience, Oct 13, 2020)
- ad7dd82: Update README.md (sxjscience, Oct 13, 2020)
- d751387: Update README.md (sxjscience, Oct 13, 2020)
- 73437bc: Update prepare_gutenberg.py (sxjscience, Oct 13, 2020)
- ae137d2: Delete gluon_nlp_cpu_job.sh (sxjscience, Oct 13, 2020)
- 7e8947a: Update prepare_gutenberg.py (sxjscience, Oct 13, 2020)
- c512fac: Update prepare_gutenberg.py (sxjscience, Oct 13, 2020)
- 0ebfcd7: Update prepare_gutenberg.py (sxjscience, Oct 13, 2020)
- cd4b24d: Update conf.py (sxjscience, Oct 13, 2020)
- 19324d9: update (sxjscience, Oct 13, 2020)
- 6532042: Update generate_commands.py (sxjscience, Oct 13, 2020)
- 33e2575: fix readme (sxjscience, Oct 13, 2020)
- 8e439c4: use os.link for hard link (sxjscience, Oct 13, 2020)
- 276d6d1: Update README.md (sxjscience, Oct 13, 2020)
- 9127fde: Update README.md (sxjscience, Oct 13, 2020)
- 5ff701d: Update gluon_nlp_job.sh (sxjscience, Oct 13, 2020)
- bc07886: Update __init__.py (sxjscience, Oct 13, 2020)
- 9233326: Update benchmark_utils.py (sxjscience, Oct 13, 2020)
- 6c604ea: try to use multi-stage build (sxjscience, Oct 13, 2020)
- fe4d089: Update benchmark_utils.py (sxjscience, Oct 14, 2020)
- c381eae: multi-stage build (sxjscience, Oct 14, 2020)
- eadf268: Update README.md (sxjscience, Oct 14, 2020)
- aadd03d: Update README.md (sxjscience, Oct 14, 2020)
- 207d018: update (sxjscience, Oct 14, 2020)
- 2c9e84e: Update submit-job.py (sxjscience, Oct 14, 2020)
- bee78a6: fix documentation (sxjscience, Oct 14, 2020)
- e9889ec: fix (sxjscience, Oct 14, 2020)
- f52fbf6: update (sxjscience, Oct 14, 2020)
- bbe13f7: Update test.sh (sxjscience, Oct 14, 2020)
- b8046a0: Update test.sh (sxjscience, Oct 14, 2020)
- ce551c8: Update test.sh (sxjscience, Oct 14, 2020)
- 3ac97b6: Update test.sh (sxjscience, Oct 14, 2020)
- d34d693: Update README.md (sxjscience, Oct 14, 2020)
- 899c613: Update test.sh (sxjscience, Oct 14, 2020)
- c73d3ed: fix (sxjscience, Oct 14, 2020)
- 42c8e41: Update README.md (sxjscience, Oct 14, 2020)
- 3e1a326: Update gluon_nlp_job.sh (sxjscience, Oct 14, 2020)
README.md (13 changes: 9 additions & 4 deletions)
@@ -34,16 +34,16 @@ First of all, install the latest MXNet. You may use the following commands:

```bash
# Install the version with CUDA 10.0
-python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.1
-python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
-python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20200926" -f https://dist.mxnet.io/python

# Install the cpu-only version
-python3 -m pip install -U --pre "mxnet>=2.0.0b20200802" -f https://dist.mxnet.io/python
+python3 -m pip install -U --pre "mxnet>=2.0.0b20200926" -f https://dist.mxnet.io/python
```
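After installing one of the nightly builds above, a quick smoke test catches a stale or mismatched wheel early. A minimal sketch; the exact version string depends on the build date:

```python
import mxnet as mx

# The 2.0 nightlies enable NumPy-semantics arrays (mx.np) by default.
print(mx.__version__)             # e.g. 2.0.0b20200926
print(mx.np.ones((2, 3)).sum())   # expect 6.0 with no import or runtime errors
```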


@@ -92,8 +92,13 @@ You may go to [tests](tests) to see how to run the unittests.
You can use Docker to launch a JupyterLab development environment with GluonNLP installed.

```
# GPU Instance
docker pull gluonai/gluon-nlp:gpu-latest
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=4g gluonai/gluon-nlp:gpu-latest

# CPU Instance
docker pull gluonai/gluon-nlp:cpu-latest
docker run --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=4g gluonai/gluon-nlp:cpu-latest
```

For more details, you can refer to the guidance in [tools/docker](tools/docker).
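Once a container is up, a quick check from the JupyterLab terminal (or via `docker exec`) confirms the preinstalled stack. A minimal sketch; the versions printed depend on the image tag you pulled:

```python
# Verify that GluonNLP and MXNet are importable inside the container.
import gluonnlp
import mxnet as mx

print('gluonnlp', gluonnlp.__version__)
print('mxnet   ', mx.__version__)
```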
docs/conf.py (8 changes: 4 additions & 4 deletions)
@@ -234,10 +234,10 @@ def setup(app):
'auto_doc_ref': True
}, True)
app.add_transform(AutoStructify)
-app.add_javascript('google_analytics.js')
-app.add_javascript('hidebib.js')
-app.add_javascript('install-options.js')
-app.add_stylesheet('custom.css')
+app.add_js_file('google_analytics.js')
+app.add_js_file('hidebib.js')
+app.add_js_file('install-options.js')
+app.add_css_file('custom.css')


sphinx_gallery_conf = {
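The rename tracks the Sphinx 1.8 API, where `add_javascript`/`add_stylesheet` were deprecated in favor of `add_js_file`/`add_css_file` (and later removed). If the docs still had to build on an older Sphinx, a version-tolerant `setup()` could look like this; a sketch, not part of the PR:

```python
def setup(app):
    # Prefer the Sphinx >= 1.8 names; fall back to the deprecated ones.
    add_js = getattr(app, 'add_js_file', None) or app.add_javascript
    add_css = getattr(app, 'add_css_file', None) or app.add_stylesheet
    add_js('google_analytics.js')
    add_js('hidebib.js')
    add_js('install-options.js')
    add_css('custom.css')
```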
scripts/benchmarks/benchmark_utils.py (9 changes: 1 addition & 8 deletions)
@@ -792,12 +792,9 @@ def train_step():
raise NotImplementedError
timeit.repeat(train_step, repeat=1, number=3)
mxnet.npx.waitall()
-for ctx in mx_all_contexts:
-    ctx.empty_cache()
runtimes = timeit.repeat(train_step, repeat=self._repeat, number=3)
mxnet.npx.waitall()
-for ctx in mx_all_contexts:
-    ctx.empty_cache()
+ctx.empty_cache()
mxnet.npx.waitall()
# Profile memory
if self._use_gpu:
@@ -844,8 +841,6 @@ def run(self):
infer_time = np.nan
infer_memory = np.nan
inference_result[model_name][workload] = (infer_time, infer_memory)
-for ctx in mx_all_contexts:
-    ctx.empty_cache()
mxnet.npx.waitall()
self.save_to_csv(inference_result, self._inference_out_csv_file)
if self._profile_train:
@@ -858,8 +853,6 @@
train_time = np.nan
train_memory = np.nan
train_result[model_name][workload] = (train_time, train_memory)
-for ctx in mx_all_contexts:
-    ctx.empty_cache()
mxnet.npx.waitall()
self.save_to_csv(train_result, self._train_out_csv_file)

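The deleted loops released the memory pool on every known context even when only one was in use; after the change a single `empty_cache()` on the active context suffices. A sketch of the resulting timing pattern, assuming MXNet 2.0's `npx` API and one GPU context named `ctx` (the `train_step` body is a placeholder):

```python
import timeit
import mxnet

ctx = mxnet.gpu(0)  # assumed single active context

def train_step():
    ...  # placeholder: one forward/backward pass executed on ctx

# Warm up once, then drain MXNet's async engine so the timed runs start clean.
timeit.repeat(train_step, repeat=1, number=3)
mxnet.npx.waitall()
runtimes = timeit.repeat(train_step, repeat=3, number=3)
mxnet.npx.waitall()   # ensure all pending kernels finish before reading timers
ctx.empty_cache()     # release pooled GPU memory on the one context we used
```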
scripts/datasets/general_nlp_benchmark/README.md (14 changes: 7 additions & 7 deletions)
@@ -112,13 +112,13 @@ benchmarking. We select the classical datasets that are also used in

| Dataset | #Train | #Test | Columns | Metrics |
|---------------|---------|---------|-----------------|-----------------|
-| AG            | 120000  | 7600   | content, label | acc |
-| IMDB          | 25000   | 25000  | content, label | acc |
-| DBpedia       | 560000  | 70000  | content, label | acc |
-| Yelp2         | 560000  | 38000  | content, label | acc |
-| Yelp5         | 650000  | 50000  | content, label | acc |
-| Amazon2       | 3600000 | 400000 | content, label | acc |
-| Amazon5       | 3000000 | 650000 | content, label | acc |
+| AG            | 120,000   | 7,600   | content, label | acc |
+| IMDB          | 25,000    | 25,000  | content, label | acc |
+| DBpedia       | 560,000   | 70,000  | content, label | acc |
+| Yelp2         | 560,000   | 38,000  | content, label | acc |
+| Yelp5         | 650,000   | 50,000  | content, label | acc |
+| Amazon2       | 3,600,000 | 400,000 | content, label | acc |
+| Amazon5       | 3,000,000 | 650,000 | content, label | acc |

To obtain the datasets, run:

scripts/datasets/pretrain_corpus/README.md (12 changes: 8 additions & 4 deletions)
@@ -2,9 +2,11 @@

We provide a series of shared scripts for downloading/preparing the text corpus for pretraining NLP models.
This helps create a unified text corpus for studying the performance of different pretraining algorithms.
-When releasing the datasets, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
+When picking the datasets to support, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
i.e., the dataset needs to be findable, accessible, interoperable, and reusable.

+For all scripts, we can either use `nlp_data SCRIPT_NAME`, or directly call the script.

## Gutenberg BookCorpus
Unfortunately, we are unable to provide the [Toronto BookCorpus dataset](https://yknzhu.wixsite.com/mbweb) due to licensing issues.

@@ -16,14 +18,14 @@ Thus, we utilize the [Project Gutenberg](https://www.gutenberg.org/) as an alternative.
You can use the following command to download and prepare the Gutenberg corpus.

```bash
-python3 prepare_bookcorpus.py --dataset gutenberg
+python3 prepare_gutenberg.py --save_dir gutenberg
```

Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) for using the data.

## Wikipedia

-Please install [attardi/wikiextractor](https://github.com/attardi/wikiextractor) for preparing the data.
+We used the [attardi/wikiextractor](https://github.com/attardi/wikiextractor) package for preparing the data.

```bash
# Download
python3 prepare_wikipedia.py --mode download --lang en --date latest -o ./
python3 prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./

```
-The process of downloading and formatting is time consuming, and we offer an alternative solution to download the prepared raw text file from S3 bucket. This raw text file is in English and was dumped at 2020-06-20 being formated by the above very process (` --lang en --date 20200620`).
+The process of downloading and formatting is time consuming, so we offer an alternative:
+download the prepared raw text file from our S3 bucket. This raw text file is in English,
+was dumped on 2020-06-20, and was formatted by the above process (`--lang en --date 20200620`).

```bash
python3 prepare_wikipedia.py --mode download_prepared -o ./
```
scripts/datasets/pretrain_corpus/prepare_gutenberg.py (7 changes: 5 additions & 2 deletions)
@@ -3,7 +3,7 @@
import zipfile
from gluonnlp.base import get_data_home_dir
from gluonnlp.utils.misc import download, load_checksum_stats

+import shutil

_CITATIONS = r"""
@InProceedings{lahiri:2014:SRW,
@@ -59,11 +59,14 @@ def main(args):
save_dir = args.dataset if args.save_dir is None else args.save_dir
if not os.path.exists(save_dir):
os.makedirs(save_dir, exist_ok=True)
+print(f'Save to {save_dir}')
with zipfile.ZipFile(target_download_location) as f:
for name in f.namelist():
if name.endswith('.txt'):
filename = os.path.basename(name)
-f.extract(name, os.path.join(save_dir, filename))
+with f.open(name) as in_file:
+    with open(os.path.join(save_dir, filename.replace(' ', '_')), 'wb') as out_file:
+        shutil.copyfileobj(in_file, out_file)


def cli_main():
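The replaced `f.extract(...)` call treated its second argument as an extraction directory, recreating the archive's internal paths under a directory named after the file; streaming each member with `shutil.copyfileobj` writes flat files and also replaces spaces in names. A self-contained sketch of the same pattern (paths are illustrative):

```python
import os
import shutil
import zipfile

def extract_txt_flat(zip_path: str, save_dir: str) -> None:
    """Extract every .txt member of `zip_path` flat into `save_dir`."""
    os.makedirs(save_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith('.txt'):
                continue
            # Flatten the member path and avoid spaces in output filenames.
            out_name = os.path.basename(name).replace(' ', '_')
            with zf.open(name) as src, \
                 open(os.path.join(save_dir, out_name), 'wb') as dst:
                shutil.copyfileobj(src, dst)
```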
scripts/datasets/question_answering/README.md (8 changes: 5 additions & 3 deletions)
@@ -1,5 +1,6 @@
# Question Answering


## SQuAD
The SQuAD datasets are distributed under the [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

@@ -39,7 +40,7 @@ python3 prepare_searchqa.py
nlp_data prepare_searchqa
```

-Directory structure of the searchqa dataset will be as follows
+Directory structure of the SearchQA dataset will be as follows
```
searchqa
├── train.txt
├── val.txt
└── test.txt
```

## TriviaQA
-[TriviaQA](https://nlp.cs.washington.edu/triviaqa/) is an open domain QA dataset. See more useful scripts in [Offical Github](https://github.com/mandarjoshi90/triviaqa)
+[TriviaQA](https://nlp.cs.washington.edu/triviaqa/) is an open-domain QA dataset.
+See more useful scripts in the [official GitHub repository](https://github.com/mandarjoshi90/triviaqa).

-Run the following command to download triviaqa
+Run the following command to download TriviaQA:

```bash
python3 prepare_triviaqa.py --version rc # Download TriviaQA version 1.0 for RC (2.5G)
```
scripts/datasets/question_answering/prepare_searchqa.py (8 changes: 4 additions & 4 deletions)
@@ -1,7 +1,7 @@
import os
import argparse
from gluonnlp.utils.misc import download, load_checksum_stats
-from gluonnlp.base import get_data_home_dir
+from gluonnlp.base import get_data_home_dir, get_repo_url

_CURR_DIR = os.path.realpath(os.path.dirname(os.path.realpath(__file__)))
_BASE_DATASET_PATH = os.path.join(get_data_home_dir(), 'searchqa')
@@ -20,9 +20,9 @@
"""

_URLS = {
-'train': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/train.txt',
-'val': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/val.txt',
-'test': 's3://gluonnlp-numpy-data/datasets/question_answering/searchqa/test.txt'
+'train': get_repo_url() + 'datasets/question_answering/searchqa/train.txt',
+'val': get_repo_url() + 'datasets/question_answering/searchqa/val.txt',
+'test': get_repo_url() + 'datasets/question_answering/searchqa/test.txt'
}


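Switching from `s3://` URIs to `get_repo_url()` means the files download over plain HTTPS (the S3 transfer-acceleration endpoint seen in the checksum file below) rather than requiring S3 tooling. A sketch of the resulting URL construction; the `get_repo_url` body here is a hypothetical stand-in for the real `gluonnlp.base.get_repo_url`, which reads the configured endpoint:

```python
# Hypothetical stand-in for gluonnlp.base.get_repo_url().
def get_repo_url() -> str:
    return 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/'

_URLS = {split: get_repo_url() + f'datasets/question_answering/searchqa/{split}.txt'
         for split in ('train', 'val', 'test')}
print(_URLS['train'])
```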
scripts/datasets/question_answering/prepare_squad.py (17 changes: 11 additions & 6 deletions)
@@ -1,5 +1,6 @@
import os
import argparse
+import shutil
from gluonnlp.utils.misc import download, load_checksum_stats
from gluonnlp.base import get_data_home_dir

@@ -58,14 +59,18 @@ def main(args):
download(dev_url, path=os.path.join(args.cache_path, dev_file_name))
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
-if not os.path.exists(os.path.join(args.save_path, train_file_name))\
+if not os.path.exists(os.path.join(args.save_path, train_file_name)) \
        or (args.overwrite and args.save_path != args.cache_path):
-    os.symlink(os.path.join(args.cache_path, train_file_name),
-               os.path.join(args.save_path, train_file_name))
+    os.link(os.path.join(args.cache_path, train_file_name),
+            os.path.join(args.save_path, train_file_name))
+else:
+    print(f'Found {os.path.join(args.save_path, train_file_name)}...skip')
-if not os.path.exists(os.path.join(args.save_path, dev_file_name))\
+if not os.path.exists(os.path.join(args.save_path, dev_file_name)) \
        or (args.overwrite and args.save_path != args.cache_path):
-    os.symlink(os.path.join(args.cache_path, dev_file_name),
-               os.path.join(args.save_path, dev_file_name))
+    os.link(os.path.join(args.cache_path, dev_file_name),
+            os.path.join(args.save_path, dev_file_name))
+else:
+    print(f'Found {os.path.join(args.save_path, dev_file_name)}...skip')


def cli_main():
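Per the commit message ("use os.link for hard link"), symlinks were swapped for hard links, so the copy in `save_path` stays valid even if the cache directory is later moved or cleaned. A sketch of the pattern with an explicit overwrite path; the `place_file` helper is illustrative, not part of the PR:

```python
import os

def place_file(cache_path: str, save_path: str, fname: str,
               overwrite: bool = False) -> None:
    src = os.path.join(cache_path, fname)
    dst = os.path.join(save_path, fname)
    if os.path.exists(dst):
        if overwrite and save_path != cache_path:
            os.remove(dst)          # os.link fails on an existing target
        else:
            print(f'Found {dst}...skip')
            return
    os.link(src, dst)               # hard link: same inode, no dangling link
```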
scripts/datasets/url_checksums/searchqa.txt (6 changes: 3 additions & 3 deletions)
@@ -1,3 +1,3 @@
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/train.txt c7e1eb8c34d0525547b91e18b3f8f4d855e35c16 1226681217
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/test.txt 08a928e0f8c129d5b3ca43bf46df117e38be0c27 332064988
-s3://gluonnlp-numpy-data/datasets/question_answering/searchqa/val.txt c2f65d6b83c26188d5998ab96bc6a38c1a127fcc 170835902
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/train.txt c7e1eb8c34d0525547b91e18b3f8f4d855e35c16 1226681217
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/test.txt 08a928e0f8c129d5b3ca43bf46df117e38be0c27 332064988
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/question_answering/searchqa/val.txt c2f65d6b83c26188d5998ab96bc6a38c1a127fcc 170835902
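Entries in these checksum files follow a `URL SHA1 SIZE` layout. A sketch of how a downloader can verify a fetched file against such an entry; the helper name is illustrative:

```python
import hashlib
import os

def verify_download(path: str, expected_sha1: str, expected_size: int) -> bool:
    """Check the file size first (cheap), then the SHA-1 digest."""
    if os.path.getsize(path) != expected_size:
        return False
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):  # 1 MiB chunks
            sha1.update(chunk)
    return sha1.hexdigest() == expected_sha1

# Example against the train.txt entry above:
# verify_download('searchqa/train.txt',
#                 'c7e1eb8c34d0525547b91e18b3f8f4d855e35c16', 1226681217)
```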
scripts/question_answering/commands/README.md (8 changes: 8 additions & 0 deletions, new file)
@@ -0,0 +1,8 @@
# Commands For Training on SQuAD

All commands are generated by parsing the template in [run_squad.template](run_squad.template).
To generate all commands, use the following code.

```bash
python3 generate_commands.py
```
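The template-expansion step itself is not shown in this diff. A hypothetical sketch of what a script like `generate_commands.py` might do, assuming a `run_squad.template` with `$MODEL`-style placeholders; the model names and variables here are illustrative, not the script's actual values:

```python
import itertools
from string import Template

# Illustrative grid; the real template and model list live in this directory.
MODELS = ['albert_base', 'electra_small']
VERSIONS = ['2.0']

with open('run_squad.template') as f:
    template = Template(f.read())

for model, version in itertools.product(MODELS, VERSIONS):
    script = template.substitute(MODEL=model, VERSION=version)
    out_name = f'run_squad2_{model}.sh'
    with open(out_name, 'w') as out:
        out.write(script)
    print('wrote', out_name)
```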