IRWS-Homework

Repository containing all programming assignments from the Information Retrieval and Web Search course at the University of Mannheim, Spring Term 2017.

Homework 1: Minimum Edit Distance

For Homework 1, the (Damerau-)Levenshtein distance is implemented both with dynamic programming and with recursion.

Several flags customize what happens at runtime:

  • original: left side of the comparison
  • compare: right side of the comparison
  • recursive: true if a recursive version shall be used
  • damerau: true if the Damerau-Levenshtein distance shall be used
  • weigths: true if custom weights for transposition/replacement shall be used

The implementation is written in Go. A few tests show sample output, and benchmark tests compare the runtime of the recursive and dynamic programming versions.
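
As a rough illustration of the dynamic programming variant, here is a minimal sketch. The names `Weights`, `UnitWeights` and `EditDistance` are hypothetical and not taken from the repository, and the Damerau case is assumed to be the restricted "optimal string alignment" form, in which only adjacent characters may be swapped.

```go
package main

import "fmt"

// Weights holds the cost of each edit operation.
// Hypothetical type for this sketch, not the repository's own.
type Weights struct {
	Insert, Delete, Replace, Transpose int
}

// UnitWeights charges 1 for every operation.
var UnitWeights = Weights{Insert: 1, Delete: 1, Replace: 1, Transpose: 1}

// minOf returns the smallest of its arguments.
func minOf(first int, rest ...int) int {
	m := first
	for _, v := range rest {
		if v < m {
			m = v
		}
	}
	return m
}

// EditDistance computes the edit distance between original and compare with
// dynamic programming. If damerau is true, swapping two adjacent characters
// counts as one transposition (optimal string alignment variant).
func EditDistance(original, compare string, damerau bool, w Weights) int {
	a, b := []rune(original), []rune(compare)
	// d[i][j] is the cost of turning a[:i] into b[:j].
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i * w.Delete
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j * w.Insert
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			sub := d[i-1][j-1]
			if a[i-1] != b[j-1] {
				sub += w.Replace
			}
			best := minOf(sub, d[i-1][j]+w.Delete, d[i][j-1]+w.Insert)
			if damerau && i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				best = minOf(best, d[i-2][j-2]+w.Transpose)
			}
			d[i][j] = best
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(EditDistance("kitten", "sitting", false, UnitWeights)) // 3
	fmt.Println(EditDistance("ab", "ba", false, UnitWeights))          // 2
	fmt.Println(EditDistance("ab", "ba", true, UnitWeights))           // 1
}
```

A recursive version computes the same recurrence top-down without the table, so it recomputes overlapping subproblems; that is the kind of runtime difference the benchmark tests are meant to expose.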

Homework 2: Vector Space and Probabilistic Retrieval

  • Term weighting: Compute TF-IDF for a toy document collection with different definitions for TF and IDF and rank the documents given a query with cosine similarity.
  • Distance/similarity metrics: Rank documents for a given query using 'raw Euclidean distance', 'normalized Euclidean distance' and 'cosine similarity'.
  • Optimizing vector space model: Given a toy collection of TF-IDF vectors, perform random projections to reduce computation costs. Pre-cluster the documents using a given set of leader vectors. Finally, retrieve the top 5 documents for a query vector using the random projection vectors and the leader vectors with their clusters.
  • Classic probabilistic retrieval: Given a query, rank documents with the 'Binary independence model', the 'Two-Poisson model' and 'BM25'.
  • Unigram Likelihood Model for Information Retrieval: For the programming assignment, the task was to build a query likelihood model based on a unigram language model for the 20 News corpus that takes ad-hoc queries and ranks the documents by relevance. This part is implemented using Scala and the Spark API.
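
To make the term weighting and cosine ranking steps concrete, here is a small sketch in Go. The assignment used several TF and IDF definitions; this sketch assumes just one of them (raw term count for TF, ln(N/df) for IDF) and whitespace tokenization. The function names `tfidf`, `cosine` and `Rank` are illustrative only.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// tfidf builds a sparse TF-IDF vector for a tokenized text.
// Assumption: TF is the raw term count, IDF is ln(N/df).
func tfidf(tokens []string, idf map[string]float64) map[string]float64 {
	vec := make(map[string]float64)
	for _, t := range tokens {
		vec[t] += idf[t] // adding the IDF once per occurrence yields tf * idf
	}
	return vec
}

// cosine returns the cosine similarity of two sparse vectors,
// or 0 if either vector has zero length.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for t, x := range a {
		dot += x * b[t]
		na += x * x
	}
	for _, y := range b {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Rank returns document indices ordered by descending cosine similarity
// to the query. Ties keep their original order.
func Rank(docs []string, query string) []int {
	tokenized := make([][]string, len(docs))
	df := make(map[string]int) // number of documents containing each term
	for i, d := range docs {
		tokenized[i] = strings.Fields(strings.ToLower(d))
		seen := make(map[string]bool)
		for _, t := range tokenized[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	idf := make(map[string]float64)
	for t, n := range df {
		idf[t] = math.Log(float64(len(docs)) / float64(n))
	}
	q := tfidf(strings.Fields(strings.ToLower(query)), idf)
	scores := make([]float64, len(docs))
	order := make([]int, len(docs))
	for i := range docs {
		scores[i] = cosine(q, tfidf(tokenized[i], idf))
		order[i] = i
	}
	sort.SliceStable(order, func(x, y int) bool {
		return scores[order[x]] > scores[order[y]]
	})
	return order
}

func main() {
	docs := []string{
		"information retrieval and web search",
		"edit distance with dynamic programming",
		"web search engines rank documents",
	}
	fmt.Println(Rank(docs, "edit distance")) // [1 0 2]
}
```

Query terms that do not occur in the collection get an IDF of 0 here and therefore do not influence the ranking.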

Homework 3: Semantic Retrieval, Text Clustering, and IR Evaluation

  • Latent Semantic Indexing: Computing the similarity of latent vectors for a toy collection of documents and a query
  • Text Clustering: Using 'K-Means' and 'Single Pass Clustering' to cluster a toy collection of TF-IDF vectors
  • IR Evaluation: Calculating precision, recall, F1, P@k, R-precision, average precision and mean average precision for a toy collection of retrieval results and their relevance ratings
  • Semantic Retrieval with Word-Embeddings: Implementation of a simple retrieval engine based on aggregation of word embeddings using the pretrained 'GloVe' word embeddings and a random subsample of 500 documents from the '20 News Groups dataset'
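
The evaluation measures listed above can be sketched in a few lines of Go. This is an illustration under assumptions: relevance is binary, a ranked result list is represented as a `[]bool` (true meaning the result at that rank is relevant), and all function names are hypothetical.

```go
package main

import "fmt"

// PrecisionAtK is the fraction of the top k results that are relevant.
// rel[i] reports whether the result at rank i+1 is relevant.
func PrecisionAtK(rel []bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits := 0
	for i := 0; i < k && i < len(rel); i++ {
		if rel[i] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// Recall is the fraction of all relevant documents that were retrieved.
func Recall(rel []bool, totalRelevant int) float64 {
	if totalRelevant == 0 {
		return 0
	}
	hits := 0
	for _, r := range rel {
		if r {
			hits++
		}
	}
	return float64(hits) / float64(totalRelevant)
}

// F1 is the harmonic mean of precision and recall.
func F1(p, r float64) float64 {
	if p+r == 0 {
		return 0
	}
	return 2 * p * r / (p + r)
}

// AveragePrecision sums P@k over every rank k that holds a relevant result
// and divides by the total number of relevant documents.
func AveragePrecision(rel []bool, totalRelevant int) float64 {
	if totalRelevant == 0 {
		return 0
	}
	hits, sum := 0, 0.0
	for i, r := range rel {
		if r {
			hits++
			sum += float64(hits) / float64(i+1)
		}
	}
	return sum / float64(totalRelevant)
}

// MeanAveragePrecision averages AveragePrecision over several queries.
func MeanAveragePrecision(runs [][]bool, totals []int) float64 {
	if len(runs) == 0 {
		return 0
	}
	sum := 0.0
	for i, rel := range runs {
		sum += AveragePrecision(rel, totals[i])
	}
	return sum / float64(len(runs))
}

func main() {
	rel := []bool{true, false, true, false} // ranked results for one query
	p := PrecisionAtK(rel, len(rel))
	r := Recall(rel, 2)
	fmt.Printf("P=%.3f R=%.3f F1=%.3f\n", p, r, F1(p, r))
	fmt.Printf("P@2=%.3f AP=%.3f\n", PrecisionAtK(rel, 2), AveragePrecision(rel, 2))
}
```

R-precision follows from the same helper as `PrecisionAtK(rel, totalRelevant)`, that is, the precision at the rank equal to the number of relevant documents.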