Add conversational entity linking into REL #150

Merged

merged 16 commits on Jan 24, 2023
71 changes: 71 additions & 0 deletions docs/tutorials/conversations.md
@@ -0,0 +1,71 @@
# Conversational entity linking

The `crel` submodule provides the conversational entity linking tool trained on the [ConEL-2 dataset](https://github.com/informagi/conversational-entity-linking-2022#conel-2-conversational-entity-linking-dataset).

Unlike existing EL methods, `crel` is developed to identify both named entities and concepts.
It also uses coreference resolution techniques to identify personal entities and references to explicit entity mentions in the conversation.

This tutorial describes how to start with conversational entity linking on a local machine.

For more information, see the original [repository on conversational entity linking](https://github.com/informagi/conversational-entity-linking-2022).

## Start with your local environment

### Step 1: Download models

First, download the models below:

- **MD for concepts and NEs**:
    + [Click here to download models](https://drive.google.com/file/d/1OoC2XZp4uBy0eB_EIuIhEHdcLEry2LtU/view?usp=sharing)
    + Extract `bert_conv-td` to your `base_url`
- **Personal Entity Linking**:
    + [Click here to download models](https://drive.google.com/file/d/1-jW8xkxh5GV-OuUBfMeT2Tk7tEzvH181/view?usp=sharing)
    + Extract `s2e_ast_onto` to your `base_url`

Additionally, conversational entity linking uses the wiki_2019 dataset. For more information on where to place the data and on the `base_url`, check out [this page](../how_to_get_started). If set up correctly, your `base_url` should contain these directories:


```bash
.
├── bert_conv-td
├── s2e_ast_onto
└── wiki_2019
```
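
As a quick sanity check before loading any models, the short sketch below (not part of the tool itself; the path is only a placeholder for your own `base_url`) verifies that the three directories are in place:

```python
from pathlib import Path

base_url = Path("C:/path/to/base_url/")  # same path you later pass to ConvEL

# The conversational pipeline expects these three directories under base_url
for name in ("bert_conv-td", "s2e_ast_onto", "wiki_2019"):
    status = "ok" if (base_url / name).is_dir() else "MISSING"
    print(f"{name}: {status}")
```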


### Step 2: Example code

This example shows how to link a short conversation. Note that the speakers must be named "USER" and "SYSTEM".


```python
>>> from REL.crel.conv_el import ConvEL
>>>
>>> cel = ConvEL(base_url="C:/path/to/base_url/")
>>>
>>> conversation = [
>>> {"speaker": "USER",
>>> "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",},
>>>
>>> {"speaker": "SYSTEM",
>>> "utterance": "Some people are allergic to histamine in tomatoes.",},
>>>
>>> {"speaker": "USER",
>>> "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",},
>>> ]
>>>
>>> annotated = cel.annotate(conversation)
>>> [item for item in annotated if item['speaker'] == 'USER']
[{'speaker': 'USER',
  'utterance': 'I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.',
  'annotations': [[17, 8, 'tomatoes', 'Tomato'],
                  [54, 19, 'Italian restaurants', 'Italian_cuisine'],
                  [82, 6, 'London', 'London']]},
 {'speaker': 'USER',
  'utterance': 'Talking of food, can you recommend me a restaurant in my city for our anniversary?',
  'annotations': [[11, 4, 'food', 'Food'],
                  [40, 10, 'restaurant', 'Restaurant'],
                  [54, 7, 'my city', 'London']]}]

```
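
Each annotation is a list of the form `[start, length, mention, entity]`, where `start` and `length` are character offsets into the utterance and `entity` is the title of the linked Wikipedia page. As an illustration (this helper is not part of REL, and the URL prefix is only one way you might render the links), the annotations from the example above can be turned into clickable links:

```python
def to_links(turn):
    """Convert one annotated turn into (surface text, Wikipedia URL) pairs."""
    links = []
    for start, length, mention, entity in turn["annotations"]:
        surface = turn["utterance"][start : start + length]  # should equal `mention`
        links.append((surface, f"https://en.wikipedia.org/wiki/{entity}"))
    return links

for turn in (t for t in annotated if t["speaker"] == "USER"):
    print(to_links(turn))
```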

1 change: 1 addition & 0 deletions docs/tutorials/index.md
@@ -14,3 +14,4 @@ The remainder of the tutorials are optional and for users who wish to e.g. train
5. [Reproducing our results](reproducing_our_results/)
6. [REL as systemd service](systemd_instructions/)
7. [Notes on using custom models](custom_models/)
8. [Conversational entity linking](conversations/)
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
- tutorials/reproducing_our_results.md
- tutorials/systemd_instructions.md
- tutorials/custom_models.md
- tutorials/conversations.md
- Python API reference:
- api/entity_disambiguation.md
- api/generate_train_test.md
@@ -72,11 +73,10 @@ plugins:
- https://numpy.org/doc/stable/objects.inv
- https://docs.scipy.org/doc/scipy/objects.inv
- https://pandas.pydata.org/docs/objects.inv
selection:
options:
docstring_style: sphinx
docstring_options:
ignore_init_summary: yes
rendering:
show_submodules: no
show_source: true
docstring_section_style: list
9 changes: 6 additions & 3 deletions requirements.txt
@@ -1,7 +1,10 @@
anyascii
colorama
konoha
fastapi
flair>=0.11
konoha
nltk
pydantic
segtok
torch
nltk
anyascii
uvicorn
9 changes: 6 additions & 3 deletions scripts/efficiency_test.py
@@ -1,12 +1,15 @@
import numpy as np
import requests
import os

from REL.training_datasets import TrainingEvaluationDatasets

np.random.seed(seed=42)

base_url = "/Users/vanhulsm/Desktop/projects/data/"
wiki_version = "wiki_2014"
base_url = os.environ.get("REL_BASE_URL")
wiki_version = "wiki_2019"
host = 'localhost'
port = '5555'
datasets = TrainingEvaluationDatasets(base_url, wiki_version).load()["aida_testB"]

# random_docs = np.random.choice(list(datasets.keys()), 50)
@@ -40,7 +43,7 @@
print(myjson)

print("Output API:")
print(requests.post("http://192.168.178.11:1235", json=myjson).json())
print(requests.post(f"http://{host}:{port}", json=myjson).json())
print("----------------------------")


62 changes: 62 additions & 0 deletions scripts/test_server.py
@@ -0,0 +1,62 @@
import os
import requests

# Script for testing the implementation of the conversational entity linking API
#
# To run:
#
# python .\src\REL\server.py $REL_BASE_URL wiki_2019
# or
# python .\src\REL\server.py $env:REL_BASE_URL wiki_2019
#
# Set $REL_BASE_URL to where your data are stored (`base_url`)
#
# These paths must exist:
# - `$REL_BASE_URL/bert_conv-td`
# - `$REL_BASE_URL/s2e_ast_onto`
#
# (see https://github.com/informagi/conversational-entity-linking-2022/tree/main/tool#step-1-download-models)
#


host = 'localhost'
port = '5555'

text1 = {
    "text": "REL is a modular Entity Linking package that can both be integrated in existing pipelines or be used as an API.",
    "spans": []
}

conv1 = {
    "text": [
        {
            "speaker": "USER",
            "utterance": "I am allergic to tomatoes but we have a lot of famous Italian restaurants here in London.",
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Some people are allergic to histamine in tomatoes.",
        },
        {
            "speaker": "USER",
            "utterance": "Talking of food, can you recommend me a restaurant in my city for our anniversary?",
        },
    ]
}


for endpoint, myjson in (
    ('', text1),
    ('conversation/', conv1)
):
    print('Input API:')
    print(myjson)
    print()
    print('Output API:')
    print(requests.post(f"http://{host}:{port}/{endpoint}", json=myjson).json())
    print('----------------------------')

10 changes: 7 additions & 3 deletions setup.cfg
@@ -43,13 +43,17 @@ package_dir =
= src
include_package_data = True
install_requires =
anyascii
colorama
konoha
fastapi
flair>=0.11
konoha
nltk
pydantic
segtok
spacy
torch
nltk
anyascii
uvicorn

[options.extras_require]
develop =
Empty file added src/REL/crel/__init__.py
Empty file.
94 changes: 94 additions & 0 deletions src/REL/crel/bert_md.py
@@ -0,0 +1,94 @@
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


class BERT_MD:
    def __init__(self, file_pretrained):
        """

        Args:
            file_pretrained = "./tmp/ft-conel/"

        Note:
            The output of self.ner_model(s_input) is like
            - s_input: e.g., 'Burger King franchise'
            - return: e.g., [{'entity': 'B-ment', 'score': 0.99364895, 'index': 1, 'word': 'Burger', 'start': 0, 'end': 6}, ...]
        """

        model = AutoModelForTokenClassification.from_pretrained(file_pretrained)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(file_pretrained)
        self.ner_model = pipeline(
            "ner",
            model=model,
            tokenizer=tokenizer,
            device=device.index if device.index is not None else -1,
            ignore_labels=[],
        )

    def md(self, s, flag_warning=False):
        """Perform mention detection

        Args:
            s: input string
            flag_warning: if True, print warning message

        Returns:
            REL style annotation results: [(start_position, length, mention), ...]
            E.g., [[0, 15, 'The Netherlands'], ...]
        """

        ann = self.ner_model(s)  # Get ann results from BERT-NER model

        ret = []
        pos_start, pos_end = -1, -1  # Initialize variables

        for i in range(len(ann)):
            w, ner = ann[i]["word"], ann[i]["entity"]
            assert ner in [
                "B-ment",
                "I-ment",
                "O",
            ], f"Unexpected ner tag: {ner}. If you use BERT-NER as it is, then you should flag_use_normal_bert_ner_tag=True."
            if ner == "B-ment" and w[:2] != "##":
                if (pos_start != -1) and (pos_end != -1):  # If B-ment is already found
                    ret.append(
                        [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
                    )  # save the previously identified mention
                    pos_start, pos_end = -1, -1  # Initialize
                pos_start, pos_end = ann[i]["start"], ann[i]["end"]

            elif ner == "B-ment" and w[:2] == "##":
                if (ann[i]["index"] == ann[i - 1]["index"] + 1) and (
                    ann[i - 1]["entity"] != "B-ment"
                ):  # If previous token has an entity (ner) label and it is NOT "B-ment" (i.e., ##xxx should not be the begin of the entity)
                    if flag_warning:
                        print(
                            f"WARNING: ##xxx (in this case {w}) should not be the begin of the entity"
                        )

            elif (
                i > 0
                and (ner == "I-ment")
                and (ann[i]["index"] == ann[i - 1]["index"] + 1)
            ):  # If w is I-ment and previous word's index (i.e., ann[i-1]['index']) is also a mention
                pos_end = ann[i]["end"]  # update pos_end

            # This only happens when flag_ignore_o is False
            elif (
                ner == "O"
                and w[:2] == "##"
                and (
                    ann[i - 1]["entity"] == "B-ment" or ann[i - 1]["entity"] == "I-ment"
                )
            ):  # If w is "O" and ##xxx, and previous token's index (i.e., ann[i-1]['index']) is B-ment or I-ment
                pos_end = ann[i]["end"]  # update pos_end

        # Append remaining ment
        if (pos_start != -1) and (pos_end != -1):
            ret.append(
                [pos_start, pos_end - pos_start, s[pos_start:pos_end]]
            )  # Save last mention

        return ret
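
For reference, a minimal usage sketch of the new `BERT_MD` class on its own (not part of this diff; the model path below is a placeholder pointing at the `bert_conv-td` checkpoint downloaded in Step 1 of the tutorial):

```python
from REL.crel.bert_md import BERT_MD

# Placeholder path: wherever the fine-tuned mention-detection model was extracted
md_model = BERT_MD("/path/to/base_url/bert_conv-td")

mentions = md_model.md("I love the Italian restaurants in London.")
print(mentions)  # REL-style triples: [[start, length, mention], ...]
```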