This repository includes code for a Web Crawler and a Search Engine.
The web crawler systematically explores the web and collects information from the pages it visits.
The search engine locates crawled pages that match the user's query.
The crawler adheres to ethical web-crawling practice by requesting only URLs permitted by each site's robots.txt file.
- HTML5
- CSS3
- PHP 8.1.25
- PostgreSQL v12
- XAMPP Server
- The web crawler starts with a seed URL.
- A queue is maintained to keep track of which URLs have already been crawled, preventing the same URL from being requested twice.
- For each URL, a user-defined function checks whether the robots.txt file in the root directory of the website allows crawling that URL.
- cURL, which stands for 'Client for URLs', a command-line tool and library for sending and receiving data over HTTP(S) or FTP, is used to fetch each page.
- The DOMDocument utility is used to parse the fetched HTML document and extract the links it contains.
- The crawler then recursively takes each of these URLs and repeats the same process, up to a certain depth (see the crawler sketch after this list).
- PostgreSQL's full-text search functionality is used for the search engine: the input query is matched against auto-generated columns of type tsvector in the database, as explained in the next points.
- Each URL visited is stored in the Postgres database with the following schema (see the schema sketch after this list):
-- URL (TEXT, primary key)
-- title (TEXT)
-- content (TEXT)
-- updated_on (TIMESTAMP)
-- search_content (TSVECTOR)
- The updated_on field helps the crawler refresh content, i.e. recrawl certain URLs once they go stale.
- The title and text content of a page are stored so that whenever a search query returns a particular URL, its contents can be served straight from the database without requesting the same URL again.
- search_content is an auto-generated column of type tsvector, which stores the content as a sorted list of distinct lexemes: words normalized to their base form so that different variants of the same word are merged. Sorting and duplicate elimination are done automatically on input.
- Upon a search query, the input query string is converted to a tsquery, a data type that stores the lexemes to be searched for (see the search sketch after this list).
- All the stored web pages matching the query string are returned as search results.
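A minimal sketch of the crawl loop described above, assuming PHP 8 with the cURL extension. The function names (isAllowedByRobots, fetchPage, crawl) and the simplified robots.txt parsing are illustrative assumptions, not the project's actual code:

```php
<?php
// Sketch: seed URL -> robots.txt check -> cURL fetch -> DOMDocument link
// extraction -> recursion up to a fixed depth. Names are illustrative.

function isAllowedByRobots(string $url): bool
{
    // Very simplified check: disallow if any "Disallow:" prefix matches the
    // URL path. A real implementation should honour user-agents, Allow rules,
    // and wildcards.
    $parts = parse_url($url);
    $robots = @file_get_contents($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');
    if ($robots === false) {
        return true; // no robots.txt: assume crawling is allowed
    }
    $path = $parts['path'] ?? '/';
    foreach (explode("\n", $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
            && strpos($path, $m[1]) === 0) {
            return false;
        }
    }
    return true;
}

function fetchPage(string $url): ?string
{
    // cURL requests the page over HTTP(S).
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

function crawl(string $url, int $depth, array &$visited): void
{
    // The visited set prevents requesting the same URL twice.
    if ($depth <= 0 || isset($visited[$url]) || !isAllowedByRobots($url)) {
        return;
    }
    $visited[$url] = true;

    $html = fetchPage($url);
    if ($html === null) {
        return;
    }

    // DOMDocument parses the HTML so links can be extracted.
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed HTML
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (str_starts_with($href, 'http')) {
            crawl($href, $depth - 1, $visited); // recurse to the next depth
        }
    }
}

$visited = [];
crawl('https://example.com', 2, $visited); // seed URL, depth 2
```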
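The table described above can be created with a generated tsvector column, which PostgreSQL supports natively from version 12. Below is a sketch via PDO; the connection parameters, table name (pages), English text-search configuration, and GIN index are assumptions for illustration:

```php
<?php
// Sketch: create the table with an auto-generated tsvector column.
$pdo = new PDO('pgsql:host=localhost;dbname=websearch', 'postgres', 'secret');

$pdo->exec(<<<'SQL'
CREATE TABLE IF NOT EXISTS pages (
    url            TEXT PRIMARY KEY,
    title          TEXT,
    content        TEXT,
    updated_on     TIMESTAMP,
    -- Generated column (PostgreSQL 12+): recomputed whenever content changes.
    -- to_tsvector normalizes words to lexemes, sorts them, drops duplicates.
    search_content TSVECTOR GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))
    ) STORED
);
-- A GIN index makes matching against the tsvector column fast.
CREATE INDEX IF NOT EXISTS pages_search_idx ON pages USING GIN (search_content);
SQL);
```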
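Searching then reduces to a single SELECT. A sketch under the same assumed connection details and table name; plainto_tsquery is the built-in PostgreSQL function that converts a raw query string into a tsquery:

```php
<?php
// Sketch: full-text search against the generated search_content column.
$pdo = new PDO('pgsql:host=localhost;dbname=websearch', 'postgres', 'secret');

$stmt = $pdo->prepare(
    "SELECT url, title, content
       FROM pages
      WHERE search_content @@ plainto_tsquery('english', :query)"
);
$stmt->execute(['query' => 'web crawler']);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['url'], ' - ', $row['title'], PHP_EOL;
}
```

If relevance ordering is wanted, PostgreSQL's ts_rank function can be added in an ORDER BY clause.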
-
Integrated logging functionality using Monolog (see the sketch below).
Monolog - the de facto standard logging library for PHP.
Dotenv - a library that loads environment variables from a .env file in PHP.
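A sketch of how the two libraries might be wired together at startup; the channel name, log file path, and the Monolog 2.x / vlucas/phpdotenv 5.x APIs used here are assumptions:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// Load settings (e.g. DB credentials) from the .env file into $_ENV.
$dotenv = Dotenv\Dotenv::createImmutable(__DIR__);
$dotenv->load();

// One Monolog channel for the crawler, writing INFO and above to a file.
// Channel name and log path are illustrative assumptions.
$log = new Logger('crawler');
$log->pushHandler(new StreamHandler(__DIR__ . '/crawler.log', Logger::INFO));

$log->info('Crawl started', ['seed' => 'https://example.com']);
$log->warning('Blocked by robots.txt', ['url' => 'https://example.com/private']);
```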
You should have PHP, Composer, the XAMPP server, and PostgreSQL version 12+ installed.
- Navigate to the \htdocs folder and clone the repository:
    cd C:\xampp\htdocs
    git clone https://github.com/Zainab-Saad/web-search
- Install the Composer dependencies used in the project:
    composer install
- Create a .env file containing the database host, username, database name, and password (a sample is shown after these steps).
- Run the XAMPP server.
- In your browser, open
http://localhost/web-crawler/index.php
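For reference, the .env file created above might look like the following; the variable names are illustrative assumptions and must match whatever keys the project's code actually reads:

```
DB_HOST=localhost
DB_NAME=websearch
DB_USER=postgres
DB_PASSWORD=secret
```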