class attributes for links #29

Closed
dginev opened this issue Apr 25, 2013 · 9 comments

dginev commented Apr 25, 2013

Bruce made a great point that people might want to style the NNexus-created links, to distinguish them from the links originally placed in the article.

Essentially NNexus needs to deposit links that have a class attribute:

<a href="url" class="nnexus_concept">phrase</a>

We might leverage parts of this for the multi-link situations where we want to deposit links for more than one domain at the same location.

ghost assigned dginev Apr 25, 2013
@holtzermann17

or for the "light-weight linguistic tagging" application, to distinguish between various kinds of NNexus-created links, e.g. class="nnexus_technicalterm", class="nnexus_semitechnical", etc.

dginev commented May 3, 2013

Joe, currently, and probably in general, NNexus won't be able to distinguish between degrees of formality of the phrases it annotates, at least given my current knowledge.

The closest line of work that I am aware of (and please feel free to bring any other resources to my attention) is "statistical term extraction", such as the C/NC-value algorithm, which Magda introduced me to back in 2009. In truth, I would love to have Magda's guidance once we get into doing statistical methods.

In short, those methods assign a probability between 0 and 1 indicating their confidence that a certain text fragment is a term. I wouldn't model those with separate class attributes though.

I have added a generic nnexus_concept class to all NNexus-created links now, so I am closing here. We can continue the discussion in some issue of the NNexus 3 milestone, when the time is ripe.

dginev closed this as completed May 3, 2013
@holtzermann17

I'm fine with moving more detailed classifications to NNexus 3.0. I do have a couple of preliminary questions and thoughts in mind, though! One of these might impact the basic data model: does NNexus keep track of the different words that link to a given concept? Of course it has some concept of the words-to-be-linked (title, defines, synonyms), but does it know about the actual linking words?

If so, we could compute the entropy of the inbound links (entropy.pl, c/o Linguistics 696f).

For some of the work I did in my thesis, I looked at actual link words -- both autolinked terms and human-added ones. I feel like we might be able to make a preliminary slice into the problem by computing the ratio of # of links to perplexity (exponentiated entropy). This could allow us to identify links that are "hot" and unstable vs. links that are "cool" and stable.

Group: # links 895, perplexity 3.4648743465749, ratio 258
AbelianGroup2: # links 246, perplexity 3.6928049373895, ratio 66
AlgebraicStructure: # links 114, perplexity 8.71083640268101, ratio 13
InfixNotation: # links 13, perplexity 2.62234281337809, ratio 5

(I'm not sure that the formula links/perplexity is actually the best one to use, but the basic point is that it would be nice to have some per-concept and per-link measures, and this is something that we should be able to do right away without using any intensive linguistics, as long as the data model is rich enough.)
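
Here is a minimal sketch of how figures like the ones above could be computed, assuming we can get at the counts of distinct anchor texts linking to a concept (the counts below are invented, and the hash is not the real NNexus data model):

    # Toy sketch (not NNexus code): given counts of the distinct anchor texts
    # that link to one concept, compute entropy, perplexity and the
    # "# links / perplexity" ratio discussed above.
    use strict;
    use warnings;
    use List::Util qw(sum);

    my %anchor_counts = (            # invented numbers for one concept
      'group'           => 700,
      'groups'          => 150,
      'group structure' => 45,
    );

    my $links   = sum values %anchor_counts;
    my $entropy = 0;
    for my $count (values %anchor_counts) {
      my $p = $count / $links;
      $entropy -= $p * log($p) / log(2);   # entropy in bits (base 2 assumed)
    }
    my $perplexity = 2 ** $entropy;        # exponentiated entropy, always >= 1
    my $ratio      = $links / $perplexity;
    printf "# links: %d  perplexity: %.4f  ratio: %.0f\n",
           $links, $perplexity, $ratio;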

@holtzermann17
Copy link
Collaborator

updated the math above to use perplexity instead of entropy, to avoid division by zero (the worst case becomes a division by one).
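
(For reference: perplexity = b^H, where H = -sum_i p_i log_b p_i is the entropy of the anchor-text distribution in base b, and entropy.pl presumably uses b = 2. H can be 0 when every inbound link uses the same anchor text, but b^H is then exactly 1, so the "# links / perplexity" ratio is always defined.)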

dginev commented May 3, 2013

Frequency-based metrics like the two you mention are exactly what's behind the C-value algorithm that I, in turn, linked to. So I am all for experimenting with such things.

My biggest problem with the idea (both for term extraction and the metrics you mention) is that one typically needs a total count over an ideally large corpus, again ideally close in content to what will be auto-linked in the future. That is something that can bog NNexus down to the point of it becoming unreasonably slow, but more importantly, it does not feel like part of the core NNexus processing, somehow. I would prefer preparing any statistical honey pots in advance (via a separate script, or via LaMaPUn) and using that data in NNexus, the way I simply used the MSC similarity metric but did not estimate it in NNexus itself.
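
Just to illustrate the kind of separate, offline pass I have in mind, a rough sketch (the input format, output file name and flat frequency table are all assumptions for illustration, not an actual NNexus or LaMaPUn interface):

    # Offline sketch: count word frequencies over a plain-text corpus dump read
    # from STDIN, and write a frequency table that a linker could load later.
    use strict;
    use warnings;

    my %freq;
    while (my $line = <STDIN>) {
      $freq{ lc $1 }++ while $line =~ /([[:alpha:]][[:alpha:]'-]{2,})/g;
    }

    open my $out, '>', 'term_frequencies.tsv' or die "cannot write: $!";
    print {$out} "$_\t$freq{$_}\n"
      for sort { $freq{$b} <=> $freq{$a} } keys %freq;
    close $out;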

@holtzermann17

I feel like entropy could be a derived statistic, coming from information we already have, or that we've explicitly thrown away.

Even so, I do agree with you that there could be a good "in principle" argument for doing this sort of thing with another tool -- as long as NNexus actually does its part, by gathering and sharing the data needed by the other services. What I'd like to move toward would be some short one- (or two-) line specifications of the services that NNexus and LaMaPUn will provide (along the lines of the "PlanetMath is..." one-liners I added here: holtzermann17/planetmath-docs#34 (comment)). Even better, we might devise a UML-style diagram showing the sorts of services we'd like to build (holtzermann17/planetmath-docs#18).

I'll give it a try for NNexus:

NNexus is a tool for automatically linking technical terms in math articles, using a metadata store that defines the terms to be linked. It has a plug-in architecture that can be used to route links based on other metadata, like categories, clusters, and previous links.

@holtzermann17

Incidentally, @dginev - I think these entropy-based measures would be quite interesting to look at on a per-MSC basis. Thus, "group" would presumably mostly be linked to by objects from algebra. It might be mis-linked from some article on basic arithmetic or whatever. It could eventually be quite fun to have a visual map of the "spread" of terms/concepts -- some would be highly specialized, others more generic. As we saw with the example of "continuous function", we do need to distinguish between terms and concepts, since the 03-XX version of continuous function and the 54-XX/26-XX version are quite different. But a term-based entropy model might discover the two "clusters".

dginev commented May 10, 2013

The real problem with that is that we don't have enough data - PlanetMath is too small to reliably extract statistical information for such metrics. I was thinking of using arXiv and ZBL in order to identify more commonly used words, but I never thought to make finer distinctions based on the MSC classes - there is just too little data to do that reliably.

@holtzermann17

...Probably too small to be useful for drawing serious linguistic conclusions - but I think it's enough to provide some "heuristic" information for users. We could even try to do some visualization work with the new http://map.mathweb.org/ to show the diffusion of terms defined in one area into other areas.

And here's another related application, which I think users would like.

  1. Let's imagine that we can assign an MSC class or set of classes to a given text by using NNexus to spot terms in the text. I'm sure we can do this with some degree of accuracy.
  2. We should then also be able to assign an MSC class to a given user (perhaps an average across all of their contributions, or a portion of a "path" through the MathMap linked above).
  3. We could then use this to automatically identify "interests" of users.
  4. The "entropy" of a user then becomes a meaningful thing to compute. I met one mathematician who said his interests were "an inch thick and a mile wide." We could see if his interactions actually map to something similar.
  5. We could then provide a sort of "heat map" of PlanetMath, showing "recently active" topics in various ways, for instance, approximating them by "recently active users" and the historical spread of interests from step 4.

A new user might like to know things like: "the last time someone with an interest in hyperbolic functions logged on was 3.5 minutes ago, but the last time a user with interest in variational problems in infinite-dimensional spaces logged on was never."
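
Here is a rough sketch of steps 1-4, assuming each linked concept carries an MSC class (the data structures and numbers are invented for illustration and don't reflect the real NNexus schema):

    # Toy sketch: build an MSC profile for one text from the concepts linked in
    # it, then compute the entropy of that profile. A user profile (step 2)
    # would average such distributions over all of the user's contributions.
    use strict;
    use warnings;

    my @linked_concepts = (                   # invented example links
      { concept => 'group',               msc => '20-XX' },
      { concept => 'abelian group',       msc => '20-XX' },
      { concept => 'continuous function', msc => '54-XX' },
    );

    my %msc_counts;
    $msc_counts{ $_->{msc} }++ for @linked_concepts;

    my $total   = scalar @linked_concepts;
    my $entropy = 0;
    for my $count (values %msc_counts) {
      my $p = $count / $total;
      $entropy -= $p * log($p) / log(2);      # low entropy = narrow interests
    }
    printf "MSC interest entropy: %.2f bits\n", $entropy;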
