What is the expected result for ORGs with apostrophe s? #939
Hi @pwichmann, that's a good question! I looked at the CoNLL-2003 dataset for English and found some examples.

Let's take this input sentence as an example:

```python
s = Sentence("Germany's weather.", use_tokenizer=True)
```

This will tokenize the sentence into four tokens:

```
Sentence: "Germany 's weather ." - 4 Tokens
```

Your example sentence will be split into the following tokens:

```
Sentence: "Toyota 's headquarters is not here ." - 7 Tokens
```

:)
Interesting. I used the segtok.tokenizer to tokenise my text before feeding it into Flair. I do this to make sure I get the token positions right and Flair does not internally alter my tokens without me seeing the tokenised sentence. And this tokenizer behaves differently: it does not split off the apostrophe or apostrophe s. Do you get the same result if you use:
I certainly don't. I only get three tokens: Germany's is one token. I had read that Flair uses the segtok one internally (#394). This is curious. It also causes massive headaches on my end, because the apostrophes indicate possessives that I need for my relation extraction. If the apostrophe or apostrophe s becomes part of the named entity, it becomes invisible to my relation classifier.
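Since the possessive marker matters downstream, one workaround is to strip a trailing possessive from predicted entity spans before they reach the relation classifier. A minimal sketch (`strip_possessive` is a hypothetical helper, not part of Flair or segtok):

```python
def strip_possessive(entity_text: str) -> str:
    """Remove a trailing possessive marker ("'s" or the typographic "’s")
    from an entity span, e.g. "Toyota's" -> "Toyota"."""
    for marker in ("'s", "\u2019s"):  # ASCII and right single quotation mark
        if entity_text.endswith(marker):
            return entity_text[: -len(marker)]
    return entity_text

print(strip_possessive("Toyota's"))  # Toyota
```

This keeps the possessive signal available (the original span still ends in "'s") while the entity string itself stays clean.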
Hello @pwichmann, you can get the same results by calling:

```python
# your example sentence
example_text = "Germany's weather."

# option 1: only use word_tokenizer
from segtok.tokenizer import word_tokenizer
print(word_tokenizer(example_text))

# option 2: use split_single to detect sentences, then use both
# split_contractions and word_tokenizer
from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions

tokens = []
sentences = split_single(example_text)
for sentence in sentences:
    contractions = split_contractions(word_tokenizer(sentence))
    tokens.extend(contractions)
print(tokens)
```
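To illustrate why option 2 yields four tokens where option 1 yields three, here is a stdlib-only approximation of the contraction split (an illustration of the behaviour, not segtok's actual implementation):

```python
import re

def split_contractions_approx(tokens):
    """Split a trailing contraction like "'s" or "'t" into its own token,
    roughly mimicking what segtok.tokenizer.split_contractions does."""
    out = []
    for tok in tokens:
        m = re.match(r"^(.+?)('(?:s|re|ve|ll|d|m|t))$", tok, re.IGNORECASE)
        if m:
            out.extend(m.groups())
        else:
            out.append(tok)
    return out

print(split_contractions_approx(["Germany's", "weather", "."]))
# ['Germany', "'s", 'weather', '.']
```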
Thank you so much!
What is the expected result of the NE segmentation for ORG entities with an apostrophe or apostrophe "s"?
E.g. "Toyota's headquarters is not here." --> Named entity = Toyota or Toyota's?
I found that Flair often includes the apostrophe s in the named entity text, and often even confuses country + apostrophe s with an ORG entity, e.g. "China's".
I used the tokenizer that Flair uses. Might this also be caused by non-standard apostrophes, like ’? E.g. the named entity text for GE’s was GE’s, not just GE. But then Russia's led to the same result: the whole string (Russia's) was detected as one ORG entity, rather than just Russia being detected as a country.
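Regarding the non-standard apostrophes: contraction-splitting tokenizers typically key on the ASCII apostrophe, so normalizing typographic variants before tokenization can make "GE’s" behave like "GE's". A minimal sketch (the variant list is an assumption, covering the common Unicode apostrophe look-alikes):

```python
# Map common typographic apostrophes to the plain ASCII apostrophe.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u2032": "'",  # prime
}

def normalize_apostrophes(text: str) -> str:
    """Replace typographic apostrophe variants with the ASCII apostrophe."""
    for variant, ascii_apos in APOSTROPHE_VARIANTS.items():
        text = text.replace(variant, ascii_apos)
    return text

print(normalize_apostrophes("GE\u2019s revenue"))  # GE's revenue
```

Running this over the input before tokenization would at least make the GE’s case consistent with the Russia's case; it does not by itself fix the ORG/LOC confusion.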