forked from opensourceware/Neural-ParsCit
Improve the setup documentation for the pristine version (#2)
* Python dependencies for pristine Neural-ParsCit
* Docker configuration for pristine Neural-ParsCit
* Updated to include instruction to install using virtualenv and Docker
* Configure Theano to use OpenBLAS and default float precision is set to 32-bit
Showing 6 changed files with 34 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
*.pyc | ||
|
||
.venv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
FROM python:2 | ||
|
||
WORKDIR /usr/src | ||
|
||
RUN apt-get update \ | ||
&& apt-get install -y libopenblas-dev \ | ||
&& apt-get clean | ||
|
||
RUN pip install --no-cache-dir Theano==0.10.0beta4 numpy==1.13.3 gensim==0.13.2 | ||
|
||
RUN echo "[global]\nfloatX = float32" >> ~/.theanorc | ||
RUN echo "[blas]\nldflags = -lblas -lgfortran" >> ~/.theanorc |
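As a quick sanity check on the two `echo` lines above, the sketch below parses the text they are expected to append to `~/.theanorc`. It assumes the image's `/bin/sh` expands `\n` inside `echo`, which is an assumption about the base image and not something the Dockerfile states. The helper name `parse_theanorc` is illustrative only and is not part of Neural-ParsCit or Theano.

```python
import io

try:
    import configparser  # Python 3
except ImportError:  # Python 2.7, as used by this project
    import ConfigParser as configparser

# Text the two RUN echo lines are expected to append, assuming "\n" is expanded.
EXPECTED_THEANORC = u"[global]\nfloatX = float32\n[blas]\nldflags = -lblas -lgfortran\n"


def parse_theanorc(text):
    """Return {section: {option: value}} for an INI-style .theanorc body."""
    parser = configparser.RawConfigParser()
    # read_file() is the Python 3 name; readfp() is the Python 2 equivalent.
    read = getattr(parser, "read_file", None) or parser.readfp
    read(io.StringIO(text))
    return {section: dict(parser.items(section)) for section in parser.sections()}


settings = parse_theanorc(EXPECTED_THEANORC)
# The parser lower-cases option names, so floatX is stored as "floatx".
print(settings["global"]["floatx"])
print(settings["blas"]["ldflags"])
```

Inside the running container, printing `theano.config.floatX` from a Python session is a more direct way to confirm that the file was picked up.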
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,6 +7,18 @@ Neural ParsCit is a citation string parser which parses reference strings into i | |
|
||
To use the tagger, you need Python 2.7, with Numpy, Theano and Gensim installed. | ||
|
||
### Using virtualenv in Linux systems | ||
|
||
``` | ||
virtualenv -ppython2.7 .venv | ||
source .venv/bin/activate | ||
pip install -r requirements.txt | ||
``` | ||
|
||
### Using Docker | ||
|
||
1. Build the image: `docker build -t theano-gensim - < Dockerfile` | ||
1. Run the repo mounted to the container: `docker run -it -v /path/to/Neural-ParsCit:/usr/src --name np theano-gensim:latest /bin/bash` | ||
|
||
## Parse citation strings | ||
|
||
|
@@ -16,7 +28,7 @@ The fastest way to use the parser is to run state-of-the-art pretrained model as | |
./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run shell | ||
./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run file -i input_file -o output_file | ||
``` | ||
The script can run interactively or input can be passed in a file. In the interactive session, the strings are passed one by one. The result is displayed on standard output. If the file option is chosen, the input is given in a file specified by -i option and the output is stored in the directed file. Using the file option, multiple citation strings can be parsed. | ||
The script can run interactively or input can be passed in a file. In the interactive session, the strings are passed one by one. The result is displayed on standard output. If the file option is chosen, the input is given in a file specified by -i option and the output is stored in the directed file. Using the file option, multiple citation strings can be parsed. | ||
|
||
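The two ways of invoking `run.py` described above (interactive shell, or file input with `-i` and `-o`) can be sketched as a small helper that assembles the argument list. This is only an illustration: `build_run_command` is a hypothetical name, `vectors.bin` stands in for the `<vectors.bin>` placeholder, and only flags that appear in the README are used.

```python
def build_run_command(pre_emb, mode="shell", input_file=None, output_file=None,
                      model_path="models/neuralParsCit/"):
    """Return the argument list for ./run.py in 'shell' or 'file' mode."""
    if mode not in ("shell", "file"):
        raise ValueError("mode must be 'shell' or 'file'")
    command = ["./run.py", "--model_path", model_path,
               "--pre_emb", pre_emb, "--run", mode]
    if mode == "file":
        # File mode reads citation strings from -i and writes results to -o.
        if not input_file or not output_file:
            raise ValueError("file mode needs both input_file and output_file")
        command += ["-i", input_file, "-o", output_file]
    return command


shell_cmd = build_run_command("vectors.bin")
file_cmd = build_run_command("vectors.bin", mode="file",
                             input_file="input_file", output_file="output_file")
print(" ".join(shell_cmd))
print(" ".join(file_cmd))
```

The returned list can be handed to `subprocess.call` from the repository root, which avoids quoting problems when paths contain spaces.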
The state-of-the-art trained model is provided in the models folder and is named neuralParsCit. The binary file for word embeddings is provided in the docker image of the current version of neural ParsCit. The hyper parameter ```discarded``` is the number of embeddings not used in our model. Retained words have a frequency of more than 0 in the ACM citation literature from 1994-2014. | ||
|
||
|
@@ -39,7 +51,7 @@ There are many parameters you can tune (CRF, dropout rate, embedding dimension, | |
Input files for the training script have to follow the following format: each word of the citation string and its corresponding tag has to be on a separate line. All citation strings must be separated by a blank line. | ||
|
||
|
||
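The training input format described above (one word and its tag per line, citation strings separated by a blank line) can be read with a few lines of Python. This is a minimal sketch, not the repository's loader: it assumes the word and tag are separated by whitespace with the tag as the last field, and the tag names in the sample (`author`, `date`, `title`) are made up for illustration.

```python
def read_tagged_citations(lines):
    """Group (word, tag) pairs into one list per citation string."""
    citations, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # A blank line closes the current citation string.
            if current:
                citations.append(current)
                current = []
            continue
        if len(parts) < 2:
            raise ValueError("expected a word and a tag, got: %r" % line)
        # Assumption: the tag is the last whitespace-separated field.
        current.append((" ".join(parts[:-1]), parts[-1]))
    if current:
        citations.append(current)
    return citations


sample = """Smith author
2014 date

Parsing title
citations title
"""
parsed = read_tagged_citations(sample.splitlines())
print(len(parsed))
```

Passing an open file object instead of `sample.splitlines()` works the same way, since the function only iterates over lines.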
If you want to use the word embeddings trained on ACM references, and the freq., please download from WING homepage: http://wing.comp.nus.edu.sg/?page_id=158 (currently not available due to space issue, mail [email protected], [email protected] for a copy) | ||
If you want to use the word embeddings trained on ACM references, and the freq., please download from WING homepage: http://wing.comp.nus.edu.sg/?page_id=158 (currently not available due to space issue, mail [email protected], [email protected] for a copy) | ||
|
||
Details about the training data and experiments can be found in the following article. Training data and the CRF baseline can be downloaded from https://github.com/knmnyn/ParsCit. Please consider citing the following publication(s) if you use Neural ParsCit: | ||
``` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
-r requirements/prod.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
-r prod.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
gensim==0.13.2 | ||
theano==0.10.b4 | ||
numpy==1.13.3 |