What is the expected result for ORGs with apostrophe s? #939
Hi @pwichmann, that's a good question! I looked at the CoNLL-2003 dataset for English and found some examples.

Let's take this input sentence as an example:

```python
s = Sentence("Germany's weather.", use_tokenizer=True)
```

This will tokenize the sentence into four tokens:

```
Sentence: "Germany 's weather ." - 4 Tokens
```

Your example sentence will be split into the following tokens:

```
Sentence: "Toyota 's headquarters is not here ." - 7 Tokens
```

:)
Interesting. I used the segtok.tokenizer to tokenise my text before feeding it into Flair. I do this to make sure I get the token positions right and Flair does not internally alter my tokens without me seeing the tokenised sentence. And this tokenizer behaves differently: it does not split off the apostrophe or apostrophe s. Do you get the same result if you use:
I certainly don't. I only get three tokens: Germany's is one token. I had read that Flair uses the segtok one internally (#394). This is curious. It also causes massive headaches on my end, because the apostrophes indicate possessives that I need for my relation extraction. If the apostrophe or apostrophe s becomes part of the named entity, it becomes invisible to my relation classifier.
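Since the possessive marker matters downstream, one workaround is to strip a trailing possessive from predicted entity spans before they reach the relation classifier. A minimal sketch (`strip_possessive` is a hypothetical helper, not part of Flair or segtok):

```python
def strip_possessive(entity_text: str) -> str:
    """Remove a trailing possessive marker ("'s" or the typographic "’s")
    from an entity span, e.g. "Toyota's" -> "Toyota"."""
    for marker in ("'s", "\u2019s"):  # ASCII and right single quotation mark
        if entity_text.endswith(marker):
            return entity_text[: -len(marker)]
    return entity_text

print(strip_possessive("Toyota's"))  # Toyota
```

This keeps the possessive signal available (the original span still ends in "'s") while the entity string itself stays clean.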
Hello @pwichmann, you can get the same results by calling:

```python
# your example sentence
example_text = "Germany's weather."

# option 1: only use word_tokenizer
from segtok.tokenizer import word_tokenizer
print(word_tokenizer(example_text))

# option 2: use split_single to detect sentences, then use both
# split_contractions and word_tokenizer
from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions

tokens = []
sentences = split_single(example_text)
for sentence in sentences:
    contractions = split_contractions(word_tokenizer(sentence))
    tokens.extend(contractions)
print(tokens)
```
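To illustrate why option 2 yields four tokens where option 1 yields three, here is a stdlib-only approximation of the contraction split (an illustration of the behaviour, not segtok's actual implementation):

```python
import re

def split_contractions_approx(tokens):
    """Split a trailing contraction like "'s" or "'t" into its own token,
    roughly mimicking what segtok.tokenizer.split_contractions does."""
    out = []
    for tok in tokens:
        m = re.match(r"^(.+?)('(?:s|re|ve|ll|d|m|t))$", tok, re.IGNORECASE)
        if m:
            out.extend(m.groups())
        else:
            out.append(tok)
    return out

print(split_contractions_approx(["Germany's", "weather", "."]))
# ['Germany', "'s", 'weather', '.']
```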
Thank you so much!
What is the expected result of the NE segmentation for ORG entities with an apostrophe or apostrophe "s"?
E.g. "Toyota's headquarters is not here." --> Named entity = Toyota or Toyota's?
I found that Flair often includes the apostrophe s in the named entity text, and often even confuses country + apostrophe s with an ORG entity, e.g. "China's".
I used the tokenizer that Flair uses. Might this also be caused by non-standard apostrophes, like ’? E.g. the named entity text for GE’s was GE’s, not just GE. But then Russia's led to the same result: the whole string (Russia's) was detected as one ORG entity, rather than just Russia being detected as a country.
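Regarding the non-standard apostrophes: contraction-splitting tokenizers typically key on the ASCII apostrophe, so normalizing typographic variants before tokenization can make "GE’s" behave like "GE's". A minimal sketch (the variant list is an assumption, covering the common Unicode apostrophe look-alikes):

```python
# Map common typographic apostrophes to the plain ASCII apostrophe.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u2032": "'",  # prime
}

def normalize_apostrophes(text: str) -> str:
    """Replace typographic apostrophe variants with the ASCII apostrophe."""
    for variant, ascii_apos in APOSTROPHE_VARIANTS.items():
        text = text.replace(variant, ascii_apos)
    return text

print(normalize_apostrophes("GE\u2019s revenue"))  # GE's revenue
```

Running this over the input before tokenization would at least make the GE’s case consistent with the Russia's case; it does not by itself fix the ORG/LOC confusion.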