Analyzer - multiple languages and nlp engines #312

Merged on Jul 22, 2020 (16 commits)
2 changes: 2 additions & 0 deletions Dockerfile.python.deps
@@ -25,6 +25,8 @@ RUN pip install pipenv
RUN pip install --upgrade setuptools
# Installing specified packages from Pipfile.lock
RUN bash -c 'PIPENV_VENV_IN_PROJECT=1 pipenv sync'
# Install for tests, consider making this optional
RUN pipenv run python -m spacy download en_core_web_lg

# Print to screen the installed packages for easy debugging
RUN pipenv run pip freeze
20 changes: 12 additions & 8 deletions build.sh
@@ -7,18 +7,22 @@

# Build the images

export DOCKER_REGISTRY=presidio
export PRESIDIO_LABEL=latest
DOCKER_REGISTRY=${DOCKER_REGISTRY:-presidio}
PRESIDIO_LABEL=${PRESIDIO_LABEL:-latest}
make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build-deps
make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build

# Run the containers

docker network create mynetwork
docker run --rm --name redis --network mynetwork -d -p 6379:6379 redis
docker run --rm --name presidio-analyzer --network mynetwork -d -p 3000:3000 -e GRPC_PORT=3000 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-analyzer:${PRESIDIO_LABEL}
docker run --rm --name presidio-anonymizer --network mynetwork -d -p 3001:3001 -e GRPC_PORT=3001 ${DOCKER_REGISTRY}/presidio-anonymizer:${PRESIDIO_LABEL}
docker run --rm --name presidio-recognizers-store --network mynetwork -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=redis:6379 ${DOCKER_REGISTRY}/presidio-recognizers-store:${PRESIDIO_LABEL}
NETWORKNAME=${NETWORKNAME:-presidio-network}
if [[ ! "$(docker network ls)" =~ (^|[[:space:]])"$NETWORKNAME"($|[[:space:]]) ]]; then
    docker network create $NETWORKNAME
fi
docker run --rm --name redis --network $NETWORKNAME -d -p 6379:6379 redis
docker run --rm --name presidio-analyzer --network $NETWORKNAME -d -p 3000:3000 -e GRPC_PORT=3000 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-analyzer:${PRESIDIO_LABEL}
docker run --rm --name presidio-anonymizer --network $NETWORKNAME -d -p 3001:3001 -e GRPC_PORT=3001 ${DOCKER_REGISTRY}/presidio-anonymizer:${PRESIDIO_LABEL}
docker run --rm --name presidio-recognizers-store --network $NETWORKNAME -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=redis:6379 ${DOCKER_REGISTRY}/presidio-recognizers-store:${PRESIDIO_LABEL}

echo "waiting 30 seconds for analyzer model to load..."
sleep 30 # Wait for the analyzer model to load
docker run --rm --name presidio-api --network mynetwork -d -p 8080:8080 -e WEB_PORT=8080 -e ANALYZER_SVC_ADDRESS=presidio-analyzer:3000 -e ANONYMIZER_SVC_ADDRESS=presidio-anonymizer:3001 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-api:${PRESIDIO_LABEL}
docker run --rm --name presidio-api --network $NETWORKNAME -d -p 8080:8080 -e WEB_PORT=8080 -e ANALYZER_SVC_ADDRESS=presidio-analyzer:3000 -e ANONYMIZER_SVC_ADDRESS=presidio-anonymizer:3001 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-api:${PRESIDIO_LABEL}
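
A note on the fixed `sleep 30` above: a fixed delay can be too short on a slow machine and is wasted time on a fast one. The sketch below shows one possible alternative, polling a readiness check until it passes or a timeout expires. The names `wait_until` and `port_is_open` are hypothetical and not part of this PR. A successful TCP connect to a published Docker port is only a rough signal, since the port may accept connections before the analyzer model has finished loading.

```python
import socket
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll check() until it returns True; return False once the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # an error from the check is treated as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def port_is_open(host: str, port: int, connect_timeout: float = 1.0) -> bool:
    """Rough readiness check (assumption): can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False
```

For the script above, a call such as `wait_until(lambda: port_is_open("localhost", 3000), timeout=120)` could stand in for the sleep; a check that issues a real `analyze` request would be a stronger signal.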
57 changes: 46 additions & 11 deletions docs/development.md
@@ -54,21 +54,24 @@ Most of Presidio's services are written in Go. The `presidio-analyzer` module, i
Additional installation instructions: https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv

3. Create virtualenv for the project and install all requirements in the Pipfile, including dev requirements. In the `presidio-analyzer` folder, run:

```
pipenv install --dev --sequential --skip-lock
```

4. Run all tests
4. Download spacy model
```
pipenv run python -m spacy download en_core_web_lg
```

```
pipenv run pytest
```
5. Run all tests
```
pipenv run pytest
```

5. To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:
1. `pipenv run flake8 analyzer --exclude "*pb2*.py"`
2. `pipenv run pylint analyzer`
3. `pipenv run pip freeze`
6. To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:
1. `pipenv run flake8 analyzer --exclude "*pb2*.py"`
2. `pipenv run pylint analyzer`
3. `pipenv run pip freeze`

#### Alternatively, activate the virtual environment and use the commands by starting a pipenv shell:

@@ -144,13 +147,13 @@ pipenv install --dev --sequential
3. If you want to experiment with `analyze` requests, navigate into the `analyzer` folder and start serving the analyzer service:

```sh
pipenv run python __main__.py serve --grpc-port 3000
pipenv run python app.py serve --grpc-port 3000
```

4. In a new `pipenv shell` window you can run `analyze` requests, for example:

```
pipenv run python __main__.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
pipenv run python app.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
```

## Load test
@@ -175,3 +178,35 @@ Edit [charts/presidio/values.yaml](../charts/presidio/values.yaml) to:
- Setup secret name (for private registries)
- Change presidio services version
- Change default scale


## NLP Engine Configuration

1. The NLP engines are configured at startup from the YAML configuration files in `presidio-analyzer/conf/`. The default, set in `default.yaml`, is the spaCy engine with the large English model (`en_core_web_lg`).

2. The format of the YAML file is as follows:

```yaml
nlp_engine_name: spacy # {spacy, stanza}
models:
  -
    lang_code: en # code corresponds to `supported_language` in any custom recognizers
    model_name: en_core_web_lg # the name of the SpaCy or Stanza model
  -
    lang_code: de # more than one model is optional, just add more items
    model_name: de
```

3. By default, we call the `load_predefined_recognizers` method of the `RecognizerRegistry` class to load both language-specific and language-agnostic recognizers.

4. Downloading additional engines.
* SpaCy NLP Models: [models download page](https://spacy.io/usage/models)
* Stanza NLP Models: [models download page](https://stanfordnlp.github.io/stanza/available_models.html)

```sh
# download models - tldr
# spacy
python -m spacy download en_core_web_lg
# stanza
python -c 'import stanza; stanza.download("en");'
```
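
The configuration format described in this section can be checked with a few lines of Python. This is a minimal sketch, not Presidio's actual loader: `models_by_language` and `SUPPORTED_ENGINES` are hypothetical names, and the input is assumed to be the dictionary a YAML parser returns for one of the files in `presidio-analyzer/conf/`.

```python
from typing import Dict

# Engine names taken from the comment in the YAML example: {spacy, stanza}
SUPPORTED_ENGINES = {"spacy", "stanza"}


def models_by_language(config: dict) -> Dict[str, str]:
    """Validate a parsed NLP engine config and map each lang_code to its model_name."""
    engine = config.get("nlp_engine_name")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported nlp_engine_name: {engine!r}")
    models = config.get("models")
    if not models:
        raise ValueError("config must list at least one model")
    result: Dict[str, str] = {}
    for entry in models:
        lang, name = entry.get("lang_code"), entry.get("model_name")
        if not lang or not name:
            raise ValueError(f"model entry needs lang_code and model_name: {entry!r}")
        if lang in result:
            raise ValueError(f"duplicate lang_code: {lang!r}")
        result[lang] = name
    return result


# The parsed form of the two-language YAML example (assumed parser output)
example = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "de", "model_name": "de"},
    ],
}
print(models_by_language(example))
```

Rejecting duplicate `lang_code` entries is a design choice of this sketch; the PR does not say how the analyzer treats them.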
2 changes: 1 addition & 1 deletion docs/interpretability_logs.md
@@ -50,7 +50,7 @@ The `textual_explanation` field in `AnalysisExplanation` class allows you to add
Interpretability traces are enabled by default. Disable App Tracing by setting the `enabled` constructor parameter to `False`.
PII entities are not stored in the traces by default. To enable this, either set the environment variable `ENABLE_TRACE_PII` to `True`, or pass the `enable-trace-pii` argument on the command line, as follows:
```bash
pipenv run python __main__.py serve --grpc-port 3001 --enable-trace-pii True
pipenv run python app.py serve --grpc-port 3001 --enable-trace-pii True
```
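
For illustration, a boolean environment variable such as `ENABLE_TRACE_PII` could be read as sketched below. This is not the analyzer's actual code: `env_flag` is a hypothetical helper, and the set of strings treated as true is an assumption, since the docs only mention the value `True`.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment; fall back to default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: accept a few common spellings of "true", case-insensitively
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

With the variable unset, `env_flag("ENABLE_TRACE_PII")` returns `False`, which matches the documented default of not storing PII in traces.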

## Notes
1 change: 1 addition & 0 deletions pipelines/templates/build-python-template.yaml
@@ -64,6 +64,7 @@ steps:
# regex
pipenv sync --dev --sequential
pipenv install --dev --skip-lock regex pytest-azurepipelines
pipenv run python -m spacy download en_core_web_lg
- task: Bash@3
displayName: 'Lint'
inputs:
3 changes: 2 additions & 1 deletion presidio-analyzer/Dockerfile
@@ -19,6 +19,7 @@ FROM ${REGISTRY}/presidio-python-deps:${PRESIDIO_DEPS_LABEL}

ARG NAME=presidio-analyzer
ADD ./${NAME}/presidio_analyzer /usr/bin/${NAME}/presidio_analyzer
ADD ./${NAME}/conf /usr/bin/${NAME}/presidio_analyzer/conf
WORKDIR /usr/bin/${NAME}/presidio_analyzer

CMD pipenv run python __main__.py serve --env-grpc-port
CMD pipenv run python app.py serve --env-grpc-port
3 changes: 2 additions & 1 deletion presidio-analyzer/Dockerfile.local
@@ -28,6 +28,7 @@ FROM ${REGISTRY}/presidio-python-deps:${PRESIDIO_DEPS_LABEL}

ARG NAME=presidio-analyzer
ADD ./${NAME}/presidio_analyzer /usr/bin/${NAME}/presidio_analyzer
ADD ./${NAME}/conf /usr/bin/${NAME}/presidio_analyzer/conf
WORKDIR /usr/bin/${NAME}/presidio_analyzer

CMD pipenv run python __main__.py serve --env-grpc-port
CMD pipenv run python app.py serve --env-grpc-port
3 changes: 1 addition & 2 deletions presidio-analyzer/Pipfile
@@ -5,8 +5,7 @@ name = "pypi"

[packages]
cython = "*"
spacy = "==2.2.3"
en_core_web_lg = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz"}
spacy = "==2.2.4"
regex = "*"
pyre2 = {file = "https://github.com/torosent/pyre2/archive/release/0.2.23.zip"}
grpcio = "*"
6 changes: 6 additions & 0 deletions presidio-analyzer/conf/default.yaml
@@ -0,0 +1,6 @@
nlp_engine_name: spacy
models:
  -
    lang_code: en
    model_name: en_core_web_lg

5 changes: 5 additions & 0 deletions presidio-analyzer/conf/spacy.yaml
@@ -0,0 +1,5 @@
nlp_engine_name: spacy
models:
  -
    lang_code: en
    model_name: en_core_web_sm
8 changes: 8 additions & 0 deletions presidio-analyzer/conf/spacy_multilingual.yaml
@@ -0,0 +1,8 @@
nlp_engine_name: spacy
models:
  -
    lang_code: en
    model_name: en
  -
    lang_code: de
    model_name: de
6 changes: 6 additions & 0 deletions presidio-analyzer/conf/stanza.yaml
@@ -0,0 +1,6 @@
nlp_engine_name: stanza
models:
  -
    lang_code: en
    model_name: en

9 changes: 9 additions & 0 deletions presidio-analyzer/conf/stanza_multilingual.yaml
@@ -0,0 +1,9 @@
nlp_engine_name: stanza
models:
  -
    lang_code: en
    model_name: en
  -
    lang_code: de
    model_name: de

161 changes: 0 additions & 161 deletions presidio-analyzer/presidio_analyzer/__main__.py

This file was deleted.
