class attributes for links #29
or for the "light-weight linguistic tagging" application, to distinguish between various kinds of NNexus-created links, e.g. |
Joe, currently, and probably in general, NNexus won't be able to distinguish between degrees of formality of the phrases it annotates, at least given my current knowledge. The closest line of work that I am aware of (and please feel free to bring any other resources to my attention) is "statistical term extraction" such as the C/NC-value algorithm, that Magda introduced me to back in 2009. In truth, I would love to have Magda's guidance once we get into doing statistical methods. In short, those methods assign a probability between 0 and 1 indicating their confidence that a certain text fragment is a term. I wouldn't model those with separate class attributes though. I have added a generic |
I'm fine with moving more detailed classifications to NNexus 3.0. I do have a couple preliminary questions and thoughts in mind though! One of these might impact the basic data model: Does NNexus keep track of the different words that link to a given concept? Of course it has some concept of the words-to-be-linked (title, defines, synonyms), but does it know about the actual linking words? If so, we could compute the entropy of the inbound links (entropy.pl, c/o Linguistics 696f). For some of the work I did in my thesis, I looked at actual link words -- both autolinked terms and human-added. I feel like we might be able to make a preliminary slice into the problem by computing the ratio of # of links to perplexity (exponentiated entropy). This could allow us to identify links that are "hot" and unstable vs links that are "cool" and stable.
Group
AbelianGroup2
AlgebraicStructure
InfixNotation
(I'm not sure that the formula links/perplexity is actually the best one to use, but the basic point is that it would be nice to have some per-concept and per-link measures, and this is something that we should be able to do right away without using any intensive linguistics, as long as the data model is rich enough.) |
updated math above to use perplexity instead of entropy, to avoid division by zero and replace it with a division by one. |
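To make the links/perplexity idea concrete, here is a minimal Python sketch. It assumes the data model can hand us, for one concept, the list of actual words that were used to link to it; the function name `link_word_stats` and that input shape are hypothetical and not part of NNexus.

```python
import math
from collections import Counter


def link_word_stats(link_words):
    """Return (n_links, entropy_bits, perplexity, links_per_perplexity).

    link_words: the actual linking words observed for ONE concept
    (hypothetical input; NNexus would have to record these).
    """
    counts = Counter(link_words)
    n_links = sum(counts.values())
    if n_links == 0:
        raise ValueError("need at least one inbound link")
    # Shannon entropy (in bits) of the distribution of linking words.
    entropy = sum(-(c / n_links) * math.log2(c / n_links)
                  for c in counts.values())
    # Perplexity is exponentiated entropy; it is 1 when every link uses
    # the same word, so the ratio below never divides by zero.
    perplexity = 2 ** entropy
    return n_links, entropy, perplexity, n_links / perplexity
```

With four links that all use the same word, the entropy is 0, the perplexity is 1, and the ratio equals the link count. With four links that each use a different word, the perplexity is 4 and the ratio drops to 1. Whether this ratio is the right measure is left open, as noted above.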
Frequency-based metrics like the two you mention are exactly what's behind the C-value algorithm that I, in turn, linked to. So I am all for experimenting with such things. My biggest problem with the idea (both for term extraction and the metrics you mention) is that one typically needs to have a $total over an ideally large corpus, again ideally close in content to what will be auto-linked in the future. That's something that can bog NNexus down, to the point of it getting unreasonably slow, but more importantly - it does not feel like part of the core NNexus processing, somehow. I would prefer preparing any statistical honey pots in advance (via a separate script, or via LaMaPUn) and using that data in NNexus, the way I simply used the MSC similarity metric, but did not estimate it in NNexus itself. |
I feel like entropy could be a derived statistic, coming from information we already have, or that we've explicitly thrown away. Even so, I do agree with you that there could be a good "in principle" argument to do this sort of stuff with another tool -- as long as NNexus actually does its part, by gathering and sharing the data needed by the other services. What I think I'd like to move toward would be some short one- (or two-) line specifications of the services that NNexus and LaMaPUn will provide (along the lines of the "PlanetMath is..." one-liners I added here: holtzermann17/planetmath-docs#34 (comment)). Even better, we might devise a UML-style diagram showing the sorts of services we'd like to build (holtzermann17/planetmath-docs#18). I'll give it a try for NNexus:
|
Incidentally, @dginev - I think these entropy-based measures would be quite interesting to look at on a per-MSC basis. Thus, "group" would presumably mostly be linked to by objects from algebra. It might be mis-linked from some article on basic arithmetic or whatever. It could eventually be quite fun to have a visual map of the "spread" of terms/concepts -- some would be highly-specialized, others would be more generic. As we saw with the example of "continuous function", we do need to distinguish between terms and concepts, since the 03-XX version of continuous function and the 54-XX/26-XX version are quite different. But a term-based entropy model might discover the two "clusters". |
The real problem with that is that we don't have enough data - PlanetMath is too small to reliably extract statistical information for such metrics. I was thinking of using arXiv and ZBL in order to identify more commonly used words, but I never thought to make finer distinctions based on the MSC classes - the data is just too little to do that reliably. |
...Probably too small to be useful for making serious linguistics conclusions - but I think it's enough to provide some "heuristic" information for users. We could even try to do some visualization work with the new http://map.mathweb.org/ to show the diffusion of terms defined in one area to other areas. And here's another related application, which I think users would like.
A new user might like to know things like: "the last time someone with an interest in hyperbolic functions logged on was 3.5 minutes ago, but the last time a user with interest in variational problems in infinite-dimensional spaces logged on was never." |
Bruce made a great point that people might want to style the NNexus-created links to distinguish them from the links originally placed in the article.
Essentially NNexus needs to deposit links that have a class attribute:
We might leverage parts of this for the multi-link situations where we want to deposit links for more than one domain at the same location.
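As an illustration only, here is a minimal Python sketch of depositing a link with a class attribute. The class names (`nnexus-link`, `nnexus-planetmath`), the helper `render_link`, and the example URL are all hypothetical; they are not the markup proposed in this issue and not an existing NNexus API.

```python
from html import escape


def render_link(text, href, classes=("nnexus-link",)):
    """Render an anchor whose class attribute marks it as auto-generated.

    All class names here are placeholders chosen for this sketch.
    """
    class_attr = " ".join(classes)
    return '<a class="{}" href="{}">{}</a>'.format(
        escape(class_attr, quote=True),
        escape(href, quote=True),
        escape(text),
    )


# A stylesheet could then target only the auto-generated links,
# e.g. with a selector such as a.nnexus-link (hypothetical).
print(render_link("group", "https://example.org/Group"))
```

For the multi-link case, one possible approach would be to pass an extra per-domain class, e.g. `classes=("nnexus-link", "nnexus-planetmath")`, so that each deposited link can be styled or grouped by its target domain.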