"instance" linked in "For instance" #43
Comments
Moving to 3.0 milestone.
I am revisiting the accuracy issues at the moment. This report reduces to a linguistic deficiency - NNexus does not currently recognize prepositional phrases. It is easy to imagine a document where "instance" is used as a term, and additionally "for instance" is used separately to introduce an example. So this is a legitimate bug that requires enhancing NNexus with more linguistic capabilities. With the exception of phrases containing pronouns, most prepositional phrases form a closed set in English and are relatively well captured by Wiktionary (they have 701 of them here). I was reading recently that the mantra which works for a lot of startups is "do the simplest approach first", so introducing a hardcoded list of phrases to avoid (ignoring pronoun variation for the moment) could be the easiest solution here. The "correct" solution of course is to have part-of-speech information and only treat regular Noun Phrases (NPs) as concept candidates. But we don't have a reliable part-of-speech tagger for mathematical texts yet.
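To make the "hardcoded list of phrases to avoid" idea concrete, here is a minimal Python sketch. It is not NNexus code (NNexus is written in Perl); the function name `find_candidates` and the four-entry `AVOID_PHRASES` set are hypothetical and only illustrate the approach. A real list would presumably be seeded from the Wiktionary category mentioned above.

```python
import re

# Hypothetical stoplist for illustration only; a real one could be
# seeded from Wiktionary's list of English prepositional phrases.
AVOID_PHRASES = {"for instance", "for example", "in general", "in particular"}


def find_candidates(text, concepts, avoid_phrases=AVOID_PHRASES):
    """Return sorted (start, end, concept) spans for concept matches
    that do not lie inside any phrase from the stoplist."""
    lowered = text.lower()

    # Collect the character spans covered by phrases we want to avoid.
    blocked = []
    for phrase in avoid_phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        blocked.extend(m.span() for m in re.finditer(pattern, lowered))

    hits = []
    for concept in concepts:
        pattern = r"\b" + re.escape(concept.lower()) + r"\b"
        for m in re.finditer(pattern, lowered):
            start, end = m.span()
            # Skip a match that is fully contained in a blocked span.
            if any(b0 <= start and end <= b1 for b0, b1 in blocked):
                continue
            hits.append((start, end, concept))
    return sorted(hits)


text = "For instance, every instance of the schema is valid."
for start, end, concept in find_candidates(text, ["instance"]):
    print(start, end, text[start:end])
```

In this sketch the "instance" inside "For instance" is skipped, while the later standalone "instance" remains a link candidate. Pronoun variation is ignored here, as suggested above.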
So, on the POS tagger front, the conventionally accepted "best" free tool is the Stanford tagger. An important discovery I just made is that someone has gone through the effort of creating a self-contained Perl wrapper around the Stanford Core NLP tools (133 MB in size!) and published it on CPAN. So that makes it easy to acquire a tagger as a dependency. I am currently trying that out.
But it also requires a Java SDK, so NNexus gets a total of ~200 MB heavier in size. It will be interesting to see whether we gain anything as a result.
@dginev - if we ever get around to integrating the "recommender system" that I worked on in my Day Job (2013 edition), https://github.com/kmi/decipher, we would also have a Java dependency there. I can imagine having a dedicated (virtual) server for running web services.
I have found a possibly perfect match for augmenting NNexus with POS tags, namely the SENNA toolkit. It is efficient and has state-of-the-art precision and recall. Using native C I could process a large arXiv document (6500 words) in 3 seconds, including the parsing overhead. So I have the feeling that for regular NNexus jobs the POS parsing might be only an insignificant hit to the overall runtime. I am currently writing a Perl wrapper for the library, in order to easily leverage SENNA in NNexus. My other experiments were performed in the context of LLaMaPUn and my general PhD work.
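As a rough illustration of how POS tags could filter concept candidates once a tagger is available, here is a small Python sketch. It does not call SENNA or any other tagger: the input is assumed to be already tagged, the tags in the example are hand-written Penn Treebank tags, and `noun_candidates` is a hypothetical helper, not part of NNexus.

```python
# Penn Treebank tags for common and proper nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_candidates(tagged_tokens, concepts):
    """Return (index, word) pairs for tokens that match a known concept
    and carry a noun tag. `tagged_tokens` is a list of (word, tag) pairs
    assumed to come from some external POS tagger."""
    wanted = {concept.lower() for concept in concepts}
    return [
        (index, word)
        for index, (word, tag) in enumerate(tagged_tokens)
        if word.lower() in wanted and tag in NOUN_TAGS
    ]


# Hand-written tags for illustration; not actual tagger output.
tagged = [
    ("We", "PRP"), ("set", "VBP"), ("x", "NN"), ("to", "TO"),
    ("zero", "CD"), ("in", "IN"), ("the", "DT"), ("set", "NN"),
    ("S", "NNP"),
]
print(noun_candidates(tagged, ["set"]))
```

Only the noun use of "set" survives the filter. Note that a token-level noun tag alone would probably not rule out "for instance", since "instance" is still a noun there; that case needs phrase-level information or the avoid-list approach.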
Target of the link is: http://planetmath.org/substitutionsinpropositionallogic
Source article (place where the link lives) is: http://planetmath.org/topicentryoncomplexanalysis