GitHub - mahatt/Wikipedia-Search-Engine: Information Retrieval and Extraction Project

Wikipedia Search Engine

Part 1: Preprocessing for the IR system (case folding, stemming, lemmatization, normalization), followed by parametric indexing of a 40 GB Wikipedia dump. At the end of processing, a two-level index is generated for the keywords and the title list. The system is a Parser-Indexer pair mapped onto a Producer-Consumer design to make full use of the CPU. Performance: 100 MB processed in less than 50 seconds.
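As an illustration of the Parser-Indexer split, here is a minimal sketch of a producer-consumer pipeline in Python. It is not the repository's code: the function names, the `(doc_id, title, text)` tuple layout, and the queue size are assumptions, and the tokenizer does only case folding (no stemming or lemmatization). It also uses a single consumer thread, which in CPython will not saturate multiple cores; a process pool would be one way to get closer to the full CPU utilization described above.

```python
import queue
import re
import threading
from collections import defaultdict

SENTINEL = None  # marks the end of the page stream


def tokenize(text):
    """Case-fold and split on non-alphanumeric characters (no stemming here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def producer(pages, q):
    """Parser stage: push (doc_id, title, text) tuples onto the queue."""
    for page in pages:
        q.put(page)
    q.put(SENTINEL)


def consumer(q, index, titles):
    """Indexer stage: build term -> {doc_id: term frequency} postings."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        doc_id, title, text = item
        titles[doc_id] = title
        for term in tokenize(text):
            index[term][doc_id] += 1


def build_index(pages, queue_size=100):
    """Run the parser and indexer concurrently over an iterable of pages."""
    index = defaultdict(lambda: defaultdict(int))
    titles = {}
    q = queue.Queue(maxsize=queue_size)  # bounded, so the parser cannot run far ahead
    worker = threading.Thread(target=consumer, args=(q, index, titles))
    worker.start()
    producer(pages, q)
    worker.join()
    return index, titles
```

Calling `build_index` with a list of `(doc_id, title, text)` tuples returns the keyword postings and a separate `doc_id -> title` map, loosely mirroring the keyword index and title list mentioned above.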

Part 2: A search model based on TF-IDF ranking, with defined weights on the indexed fields [outlink, title, text, info]. Query processing is a full-text search that returns the titles of the 10 highest-ranked documents. Performance: query results are produced in less than 1 second.
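A field-weighted TF-IDF ranking can be sketched as follows. The weight values, the index layout (`term -> field -> {doc_id: tf}`), and the log-scaled term frequency are assumptions made for illustration; the README does not state the actual weights or the exact TF-IDF variant used.

```python
import math
from collections import defaultdict

# Hypothetical field weights; the README does not give the actual values.
FIELD_WEIGHTS = {"title": 4.0, "info": 2.0, "text": 1.0, "outlink": 0.5}


def search(query_terms, index, num_docs, top_k=10):
    """Rank documents for a query.

    index maps term -> field -> {doc_id: term frequency}.
    Returns up to top_k doc_ids, highest score first.
    """
    scores = defaultdict(float)
    for term in query_terms:
        fields = index.get(term, {})
        # Document frequency: documents containing the term in any field.
        docs_with_term = set()
        for postings in fields.values():
            docs_with_term.update(postings)
        if not docs_with_term:
            continue
        idf = math.log(num_docs / len(docs_with_term))
        for field, postings in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for doc_id, tf in postings.items():
                # Log-scaled term frequency, weighted by the field it occurs in.
                scores[doc_id] += weight * (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by doc_id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

The returned doc_ids would then be mapped to titles through a title list to produce the final top-10 result.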

Addition: nearest-word suggestion for misspelled keywords (a time-consuming process).
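The README does not say how the nearest word is found. One common approach, shown here purely as an assumption, is to scan the vocabulary and pick the term with the smallest Levenshtein edit distance. The linear scan over every indexed term also suggests why such a feature can be slow on a large vocabulary. The names `edit_distance`, `suggest`, and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the vocabulary term closest to word, or None if none is within max_distance."""
    scored = [(edit_distance(word, candidate), candidate) for candidate in vocabulary]
    scored = [pair for pair in scored if pair[0] <= max_distance]
    # min() picks the smallest distance, breaking ties alphabetically.
    return min(scored)[1] if scored else None
```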

Folders and files:
DBS
IRE-LAB-1-PC
IRE-LAB-1
PC-TEST
UniCodeNormalizer
README.md
