What is the expected result for ORGs with apostrophe s? #939

Closed
pwichmann opened this issue Jul 31, 2019 · 4 comments
Labels
question Further information is requested

Comments

@pwichmann

pwichmann commented Jul 31, 2019

What is the expected NE segmentation for ORG entities followed by an apostrophe or apostrophe "s"?

E.g. "Toyota's headquarters is not here." --> Named entity = Toyota or Toyota's?
I found that Flair often includes the apostrophe s in the named entity text and often even mistakes a country + apostrophe s for an ORG entity, e.g. "China's".

I used the same tokenizer that Flair uses. Could this also be caused by non-standard apostrophes, like ’ ? E.g. the named entity text for GE’s was GE’s, not just GE. But Russia's, with a plain apostrophe, led to the same result: the whole string (Russia's) was detected as one ORG entity, rather than just Russia being detected as a country.
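
To rule out the typographic apostrophe as a cause, one option is to normalise apostrophes before tokenising. The following is only a minimal sketch: `normalize_apostrophes` and `APOSTROPHE_MAP` are hypothetical names that are not part of Flair or segtok, and the set of characters to map is an assumption. Each replacement is one character for one character, so character offsets into the text stay valid.

```python
# Map common typographic apostrophes to the plain ASCII apostrophe.
# The choice of characters is an assumption; extend it as needed.
APOSTROPHE_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark, as in GE’s
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
})


def normalize_apostrophes(text: str) -> str:
    """Return text with typographic apostrophes replaced by ASCII '."""
    return text.translate(APOSTROPHE_MAP)


print(normalize_apostrophes("GE\u2019s headquarters"))  # GE's headquarters
```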

pwichmann added the question label Jul 31, 2019
@stefan-it
Member

stefan-it commented Aug 1, 2019

Hi @pwichmann,

that's a good question! I looked at the CoNLL-2003 dataset for English and found some examples:

Germany NNP I-NP I-LOC
's POS B-NP O
representative NN I-NP O
to TO I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
's POS B-NP O
veterinary JJ I-NP O
committee NN I-NP O

So the 's is split off as a separate token, and that token gets the O (outside) tag.
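
As an illustration of how the tagged rows above translate into entity spans, here is a small sketch that groups consecutive tokens by their NER tag. It assumes the four-column layout shown above (token, POS tag, chunk tag, NER tag) and treats a `B-` prefix or a change of entity type as the start of a new entity; `extract_entities` is a hypothetical helper, not a Flair function.

```python
def extract_entities(rows):
    """Group consecutive non-O tokens of the same type into (text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for row in rows:
        token, _pos, _chunk, tag = row.split()
        # "O" means outside any entity; otherwise the tag looks like "I-LOC".
        ent_type = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = ent_type is not None and (
            tag.startswith("B-") or ent_type != current_type
        )
        # Close the running entity when we leave it or a new one begins.
        if current_tokens and (ent_type is None or starts_new):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if ent_type is not None:
            current_tokens.append(token)
            current_type = ent_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


rows = [
    "Germany NNP I-NP I-LOC",
    "'s POS B-NP O",
    "representative NN I-NP O",
    "to TO I-PP O",
    "the DT I-NP O",
    "European NNP I-NP I-ORG",
    "Union NNP I-NP I-ORG",
    "'s POS B-NP O",
    "veterinary JJ I-NP O",
    "committee NN I-NP O",
]
print(extract_entities(rows))
# [('Germany', 'LOC'), ('European Union', 'ORG')]
```

In both cases the 's token falls outside the entity, which is the segmentation a model trained on this data would be expected to reproduce when the input is tokenised the same way.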

Let's take this input sentence as an example:

from flair.data import Sentence
s = Sentence("Germany's weather.", use_tokenizer=True)

This will tokenize the sentence into four tokens:

Sentence: "Germany 's weather ." - 4 Tokens

Your example sentence will be split into the following tokens:

Sentence: "Toyota 's headquarters is not here ." - 7 Tokens

:)

@pwichmann
Author

pwichmann commented Aug 1, 2019

Interesting. I use segtok.tokenizer to tokenise my text before I feed it into Flair. I do this to make sure I get the token positions right and that Flair does not change my tokens internally without my seeing the tokenised sentence. This tokenizer behaves differently: it does not split off the apostrophe or apostrophe s.

Do you get the same result if you use:

from segtok.tokenizer import word_tokenizer
print(word_tokenizer("Germany's weather."))

I certainly don't. I only get three tokens. Germany's is one token.

I had read that Flair uses the segtok one internally (#394).

This is curious. Also, it causes massive headaches at my end because the apostrophes indicate possessives that I need for my relation extraction. If apostrophes and apostrophe s become part of the named entity, they become invisible for my relation classifier.
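
If an entity span has already absorbed the possessive, one possible workaround is to trim it off afterwards while keeping character offsets. This is a sketch under assumptions, not Flair behaviour: `split_possessive` and `POSSESSIVE_RE` are hypothetical names, spans are assumed to be (start, end) character offsets into the original text, and only a trailing 's, ’s, or a bare apostrophe after an s is treated as a possessive marker.

```python
import re

# Trailing possessive marker: 's or ’s, or a bare apostrophe after an "s".
POSSESSIVE_RE = re.compile("(?:['\u2019]s|(?<=s)['\u2019])$")


def split_possessive(text, start, end):
    """Split text[start:end] into (entity_span, possessive_span or None)."""
    match = POSSESSIVE_RE.search(text[start:end])
    if match is None or match.start() == 0:
        # No marker, or the span is nothing but the marker: leave it alone.
        return (start, end), None
    cut = start + match.start()
    return (start, cut), (cut, end)


text = "Toyota's headquarters is not here."
entity, possessive = split_possessive(text, 0, 8)
print(text[entity[0]:entity[1]])          # Toyota
print(text[possessive[0]:possessive[1]])  # 's
```

The possessive marker is returned as its own span, so a downstream relation classifier can still see it.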

@alanakbik
Collaborator

Hello @pwichmann, you can get the same results by calling segtok the same way we do. Specifically, we don't use only the word_tokenizer function: we also use split_contractions to split off the apostrophe parts and split_single to split the text into sentences. Here's an example script:

# your example sentence
example_text = "Germany's weather."

# option 1: only use word_tokenizer
from segtok.tokenizer import word_tokenizer
print(word_tokenizer(example_text))

# option 2: use split_single to detect sentences, then use both split_contractions and word_tokenizer
from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions

tokens = []
sentences = split_single(example_text)
for sentence in sentences:
    contractions = split_contractions(word_tokenizer(sentence))
    tokens.extend(contractions)

print(tokens)

@pwichmann
Author

Thank you so much!

3 participants