NER misses - appears much worse than https://demos.explosion.ai/displacy-ent/ #977

Closed
is55555 opened this issue Apr 13, 2017 · 4 comments
Labels: lang / en (English language data and models), models (Issues related to the statistical models)

is55555 commented Apr 13, 2017

Bug (?)

python -m spacy info --markdown

Info about spaCy

  • spaCy version: 1.7.3
  • Platform: Darwin-16.5.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md, en_depent_web_md, en_vectors_glove_md

(replicated in Linux as well)


I'm trying spaCy for ORG and PERSON detection after seeing satisfactory results on the demo site ( https://demos.explosion.ai/displacy-ent/ ).

As a test, I ran spaCy over a large repository with companies tagged by OpenCalais and got very poor overlap (around 85% complete misses). Going through examples manually, I found that the demo website does much better. I'll include a concrete example in a comment so that this one doesn't get too long.
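A minimal sketch of how an overlap figure like this could be computed. The function name, the exact-match rule (case-insensitive, whitespace-normalized) and the sample names are assumptions for illustration only, not the script or data used in this report.

```python
def complete_miss_rate(reference, predicted):
    """Fraction of reference names with no exact match among the predictions.

    Matching is case-insensitive and ignores extra whitespace; this rule is
    an assumption, since the report does not define a "complete miss".
    """
    def normalize(name):
        return " ".join(name.lower().split())

    ref = {normalize(name) for name in reference}
    pred = {normalize(name) for name in predicted}
    if not ref:
        return 0.0
    return len(ref - pred) / len(ref)


# Hypothetical data: names tagged by a reference tagger vs. names a model returned.
reference_orgs = ["Google", "Facebook", "Amazon", "California Institute of Technology"]
predicted_orgs = ["google ", "Van Horn"]
rate = complete_miss_rate(reference_orgs, predicted_orgs)
print("complete misses: {:.0%}".format(rate))
```

A stricter or looser matching rule (for example, partial overlap of tokens) would change the number, so the figure is only comparable across runs that use the same rule.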

So I tried manually adding a few company names to see whether that would improve things, and ran into a possibly related problem that others have hit as well. In issue #105 (closed), the bottom comment by m93s from 21 days ago describes the same output I'm getting, rather than the output the example code is supposed to produce. The matcher misses even the match added in the code itself, so something definitely seems off.

Running https://github.com/explosion/spaCy/blob/master/examples/matcher_example.py I get this output:

Before
Google Now PERSON ['NNP', 'RB']
After
Google Now PERSON ['NNP', 'RB']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True
the Brisbane Broncos ORG

And with 'en_depent_web_md' I get:
Before
Google PERSON ['NNP']
After
Google PERSON ['NNP']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True

Note that neither run produces the expected Google Now PRODUCT [u'NNP', u'RB'] line, even though that entity is added in the example itself. Maybe there has been a change that affects the models?

is55555 commented Apr 13, 2017

Running the DisplaCy demo website with the following text:

--

Five years ago, AI was struggling to identify cats. Now it’s trying to tackle 5000 species In 2012, Google made a breakthrough: It trained its AI to recognize cats in YouTube videos. Google’s neural network, software which uses statistics to approximate how the brain learns, taught itself to detect the shapes of cats and humans with more than 70% accuracy. It was a 70% improvement over any other machine learning at the time. Five years later, a contest Google is sponsoring speaks volumes about the field’s advancement. Instead of finding cats, researchers will be required to train an AI to identify more than 5000 different species of plants and animals. The contest, called iNat, will open in June and conclude in July. “Over the last five years it’s been pretty incredible, the progress of deep [neural] nets,” says Grant Van Horn, lead competition organizer and graduate student at California Institute of Technology. “I think bigger and more complex datasets are the way to go make sure we keep making progress.” To get an idea of what competitors will face, iNat organizers trained their own network on the data and turned in an impressive performance. Using open-source neural networks from Google, the team achieved up to 60% accuracy when given one chance to predict the answer, and more than 80% when given five chances. (For the AI-knowledgable, these results are on Google’s Inception network, and results were measured against the validation set.) Benchmarks like the famous ImageNet competition improve by a few percentage points each year, and some argue that simple competitions can’t recognize the true “intelligence” of an algorithm. But the competitions do indicate overall trends. Van Horn says this latest Google competition differs from ImageNet, which forces algorithms to identify a wide variety of objects like cars and houses and boats, because iNat requires AI to examine the “nitty-gritty details” that separate one species from another. 
This field is called fine-grain image classification. The data is provided by iNaturalist.org, a website used by nature enthusiasts to upload pictures in order to correctly identify different species as a community. The dataset consists of more than 575,000 images, and nearly 100,000 images to validate that the AI actually learned. On a scale from general image recognition (ImageNet) to specific (facial recognition,where most faces generally look the same and only slight variations matter), iNat lies somewhere in the middle, Van Horn says. Artificial intelligence research has progressed more in the last five years than in the last 50 in part because so much more data is available to use in training the AI. Much of that progress can be seen in products from Google, Amazon, and Facebook: Your photos can be tagged automatically, your email app knows how you like to respond to emails, or a new smart speaker can use AI to recognize what you’re saying. Van Horn, who has specialized in building AI that distinguishes differences between birds, said that the iNat competition illustrates how AI is beginning to help people learn about the world around them, rather than just help them organize their photos, for instance. iNat may build the software into an app that could help people identify plants and animals by just taking a picture. “You start building algorithms that can actually answer the questions that people have,” Van Horn says. “Like what bird am I actually looking at? I know it’s a bird. Please computer, don’t just tell me it’s a bird.”

--

The demo tags Facebook as an ORG, which I haven't been able to replicate locally with any of the new models (I tried them all): they consistently tag it as a PERSON. In general I'm getting worse results than the website.
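For anyone trying to reproduce the comparison, a small helper along these lines could list the labels a given pipeline assigns to specific names. It relies only on `doc.ents`, `ent.text` and `ent.label_`; the helper name `labels_for` and its arguments are hypothetical, and this is a sketch rather than the code used for the results above.

```python
def labels_for(nlp, text, names):
    """Return {name: sorted list of entity labels} for each name of interest.

    `nlp` is any callable that returns a document with an `ents` sequence
    whose items expose `text` and `label_` (for example a loaded spaCy pipeline).
    Names that are never recognized map to an empty list.
    """
    doc = nlp(text)
    found = {name: set() for name in names}
    for ent in doc.ents:
        if ent.text in found:
            found[ent.text].add(ent.label_)
    return {name: sorted(labels) for name, labels in found.items()}
```

With spaCy installed, this could be called with a loaded pipeline, e.g. `labels_for(spacy.load('en'), text, ['Facebook', 'Google'])`, once per installed model, to see which models disagree with the demo.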

is55555 commented Apr 13, 2017

Hi @honnibal, on the performance note: is it really a performance problem when the matcher fails to find a string you just gave it in a controlled example (see the last part of the original message)? That seems like a bug to me.

ines added the docs, models and lang / en labels and removed the docs and models labels on May 13, 2017
ines (Member) commented May 13, 2017

Closing this and making #1057 the master issue – work in progress for spaCy v2.0!

ines closed this as completed on May 13, 2017