Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idf problem #48

Open
msjoberg opened this issue Feb 15, 2016 · 4 comments
Open

idf problem #48

msjoberg opened this issue Feb 15, 2016 · 4 comments
Assignees

Comments

@msjoberg
Copy link

When a documents get a lot of ReadingEvents it (or parts of it) will be indexed many times, thus reducing the inverse document frequency. This should probably be fixed in DiMe's indexing.

One way is to somehow modify the idf function in Lucene: http://www.lucenetutorial.com/advanced-topics/scoring.html

@mvsjober mvsjober self-assigned this Feb 15, 2016
@agisbrec
Copy link
Contributor

I am not sure whether this is necessary. If a document is accessed often, then it is probably highly relevant and should be marked as such. Where it might be problematic, is if a document is accessed on a regular basis, i.e. every day, then an upper limit to the doc frequency might be necessary.

An alternative would be to mark this document as such in dime and not index it in Lucene every time.

@jmakoske
Copy link

Yes, but the effect is inverse: the ReadingEvents reduce the idf score of the terms appearing in them, as every ReadingEvent is a new document and idf gives high values to terms that appear only in a small number of documents.

@agisbrec
Copy link
Contributor

You are right, thanks for the clarification. One could overload idf() in TFIDFSimilarity to return a constant value for all terms.
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Unfortunately this would brake the search for all documents. I think this is not the right way to go.

Is it necessary to index the reading events? Would it make sense to enter a query and get multiple copies of the same document? I think the document should be indexed only once and the reading events should be treated as feedback, that the document is relevant.

@mvsjober
Copy link
Member

As it is currently implemented, ReadingEvents are indexed, but you do not retrieve multiple documents, instead we map them to their corresponding documents and the highest ranking version of that document is retained. In this way if the search query is matched particularly well with the text read in the ReadingEvent it will push the corresponding document up.

Of course this effect could be gained with some other mechanism, but I don't want to lose the connection to the ReadingEvent, for example some application may want to highlight the read passage for matching documents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants