idf problem #48

msjoberg · 2016-02-15T17:53:57Z

When a document gets a lot of ReadingEvents, it (or parts of it) will be indexed many times, which reduces the inverse document frequency (idf) of its terms. This should probably be fixed in DiMe's indexing.

One way is to somehow modify the idf function in Lucene: http://www.lucenetutorial.com/advanced-topics/scoring.html

agisbrec · 2016-02-16T10:51:23Z

I am not sure whether this is necessary. If a document is accessed often, then it is probably highly relevant and should be marked as such. It might become problematic if a document is accessed on a regular basis, e.g. every day; in that case an upper limit on the document frequency might be necessary.

An alternative would be to mark such a document in DiMe and not index it in Lucene every time.

jmakoske · 2016-02-16T10:57:54Z

Yes, but the effect is inverse: the ReadingEvents reduce the idf score of the terms appearing in them, as every ReadingEvent is a new document and idf gives high values to terms that appear only in a small number of documents.
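
To make the inverse effect concrete, here is a small sketch. It assumes the idf formula `1 + ln(numDocs / (docFreq + 1))`, which is what Lucene's default TF-IDF similarity used in the 5.x line as far as I know; the index sizes and the number of ReadingEvents are made-up numbers for illustration only.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf in the style of Lucene's classic TF-IDF similarity (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: 1000 documents, and a term that occurs in 10 of them.
idf_before = classic_idf(doc_freq=10, num_docs=1000)

# 50 ReadingEvents containing the term are indexed as 50 new documents,
# so both the document frequency and the index size grow by 50.
reading_events = 50
idf_after = classic_idf(doc_freq=10 + reading_events,
                        num_docs=1000 + reading_events)

print(f"idf before: {idf_before:.2f}, idf after: {idf_after:.2f}")
```

The term's idf drops after the ReadingEvents are added, even though it still occurs in the same 10 underlying documents.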

agisbrec · 2016-02-16T12:20:20Z

You are right, thanks for the clarification. One could override idf() in TFIDFSimilarity to return a constant value for all terms.
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Unfortunately this would break the search for all documents. I think this is not the right way to go.

Is it necessary to index the reading events? Would it make sense to enter a query and get multiple copies of the same document? I think the document should be indexed only once, and the reading events should be treated as feedback that the document is relevant.
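
A rough sketch of what "count the document only once" could mean for the statistics: the document frequency is computed over distinct underlying documents rather than over every indexed entry. This is not DiMe or Lucene code; the function name, the `(parent_doc_id, terms)` entry shape, and the sample data are all hypothetical.

```python
def document_frequencies(entries):
    """Return (raw, distinct) document frequencies per term.

    entries: iterable of (parent_doc_id, set_of_terms), where a ReadingEvent
    carries the id of the document it belongs to (hypothetical layout).
    """
    raw = {}            # every indexed entry counts, as happens now
    docs_per_term = {}  # term -> set of distinct parent document ids
    for doc_id, terms in entries:
        for term in terms:
            raw[term] = raw.get(term, 0) + 1
            docs_per_term.setdefault(term, set()).add(doc_id)
    distinct = {term: len(ids) for term, ids in docs_per_term.items()}
    return raw, distinct


entries = [
    ("doc1", {"lucene", "index"}),  # the document itself
    ("doc1", {"lucene"}),           # a ReadingEvent of doc1
    ("doc1", {"lucene"}),           # another ReadingEvent of doc1
    ("doc2", {"index"}),
]
raw, distinct = document_frequencies(entries)
print(raw["lucene"], distinct["lucene"])
```

With the raw count, "lucene" looks three times as common as it is; with the distinct count, the ReadingEvents no longer inflate its document frequency.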

mvsjober · 2016-02-16T12:45:02Z

As it is currently implemented, ReadingEvents are indexed, but you do not retrieve multiple copies of a document. Instead, we map the ReadingEvents to their corresponding documents and retain the highest-ranking version of each document. This way, if the search query matches the text read in a ReadingEvent particularly well, it will push the corresponding document up.

Of course this effect could be achieved with some other mechanism, but I don't want to lose the connection to the ReadingEvent. For example, some application may want to highlight the read passage for matching documents.
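
For illustration, the collapsing step described above could look roughly like this. It is a simplified sketch, not the actual DiMe implementation: the `Hit` type, its fields, and the sample scores are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    doc_id: str                      # underlying document the hit maps to
    score: float                     # search score of this indexed entry
    reading_event_id: Optional[str]  # None if the hit is the document itself


def collapse_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only the highest-scoring hit per document, best score first."""
    best = {}
    for hit in hits:
        current = best.get(hit.doc_id)
        if current is None or hit.score > current.score:
            best[hit.doc_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


hits = [
    Hit("doc1", 1.2, None),
    Hit("doc1", 3.4, "event7"),  # the read passage matches the query well
    Hit("doc2", 2.0, None),
    Hit("doc1", 0.9, "event3"),
]
results = collapse_hits(hits)
for hit in results:
    print(hit.doc_id, hit.score, hit.reading_event_id)
```

Because the retained hit still carries its `reading_event_id`, an application could use it to highlight the passage that was read.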

mvsjober self-assigned this Feb 15, 2016
