idf problem #48
Comments
I am not sure whether this is necessary. If a document is accessed often, then it is probably highly relevant and should be marked as such. It might become problematic if a document is accessed on a regular basis, e.g. every day; in that case an upper limit on the document frequency might be necessary. An alternative would be to mark the document as such in DiMe and not index it in Lucene every time.
Yes, but the effect is the inverse: the ReadingEvents reduce the idf score of the terms appearing in them, because every ReadingEvent is indexed as a new document and idf gives high values to terms that appear in only a small number of documents.
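To make the effect concrete, here is a small sketch that does not depend on Lucene. It assumes the idf formula of Lucene's classic TF-IDF similarity in older releases, idf = 1 + ln(numDocs / (docFreq + 1)); newer releases use a slightly different variant. The class name, index size and event counts are made up for illustration.

```java
// Illustration only: the idf formula is reimplemented here so that the
// numbers can be checked without a Lucene dependency.
class IdfDilution {

    // Assumed formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long baseDocs = 1000;   // hypothetical index size before any ReadingEvents
        long baseDocFreq = 1;   // the term occurs in a single document
        for (long events : new long[] {0, 10, 100}) {
            // Each ReadingEvent containing the term is indexed as one more
            // document, so both docFreq and numDocs grow by the event count.
            double value = idf(baseDocFreq + events, baseDocs + events);
            System.out.printf("readingEvents=%d idf=%.3f%n", events, value);
        }
    }
}
```

Under these assumptions the idf of the term falls as more ReadingEvents for the same document are indexed, although the term is no more common across distinct documents than before.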
You are right, thanks for the clarification. One could override idf() in a TFIDFSimilarity subclass to return a constant value for all terms. Is it necessary to index the reading events? Would it make sense to enter a query and get multiple copies of the same document? I think the document should be indexed only once, and the reading events should be treated as feedback that the document is relevant.
As it is currently implemented, ReadingEvents are indexed, but you do not retrieve multiple copies of a document. Instead, we map each ReadingEvent to its corresponding document and retain only the highest-ranking version of that document. This way, if the search query matches the text read in a ReadingEvent particularly well, it pushes the corresponding document up. Of course this effect could be achieved with some other mechanism, but I don't want to lose the connection to the ReadingEvent; for example, some application may want to highlight the read passage for matching documents.
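The mapping step described here could look roughly like the following sketch. It is not DiMe's actual code: the `Hit` class, its fields and the `collapse` method are hypothetical names chosen for illustration. The idea is to keep, for each document, only the best-scoring hit, while still remembering which ReadingEvent (if any) produced it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BestHitPerDocument {

    // Hypothetical search hit: the document itself or one of its ReadingEvents.
    static final class Hit {
        final String documentId;     // id of the document the hit belongs to
        final String readingEventId; // null when the hit is the document itself
        final float score;

        Hit(String documentId, String readingEventId, float score) {
            this.documentId = documentId;
            this.readingEventId = readingEventId;
            this.score = score;
        }
    }

    // Keep only the highest-scoring hit per document, sorted by score descending.
    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit hit : hits) {
            // On a collision, retain whichever hit has the higher score.
            best.merge(hit.documentId, hit, (a, b) -> b.score > a.score ? b : a);
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }
}
```

Because the retained hit keeps its `readingEventId`, an application could still highlight the passage that was read.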
When a document gets a lot of ReadingEvents, it (or parts of it) will be indexed many times, thus reducing the inverse document frequency of its terms. This should probably be fixed in DiMe's indexing.
One way is to somehow modify the idf function in Lucene: http://www.lucenetutorial.com/advanced-topics/scoring.html