Web Crawler in Go

Overview

This project is a web crawler written in Go that extracts the text of websites and makes it searchable. It uses goroutines for concurrent crawling, supports optional recursive crawling, and persists the extracted data in a SQLite database. Search results are presented on an HTML and CSS webpage. Key features include word extraction, bigram support, wildcard search, and TF-IDF-based result sorting.

Features

1. Web Crawling

Concurrency: The crawler uses goroutines to fetch multiple pages concurrently, so a crawl is not limited to one download at a time.
Recursive Crawling: Users can enable or disable recursive crawling, allowing for in-depth exploration of linked pages.
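
As a rough illustration of how concurrent and recursive crawling can fit together, here is a minimal sketch. It is not the code in this repository: `crawl`, `fetchFunc`, and the in-memory `site` map are hypothetical names, and the map stands in for real HTTP downloads so the example runs offline.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetchFunc returns the links found on a page. It is a hypothetical stand-in
// for downloading and parsing a real page.
type fetchFunc func(url string) []string

// crawl visits start and, when recursive is true, every page reachable from it.
// Each page is fetched in its own goroutine; a mutex-guarded set ensures that
// no URL is fetched twice. The visited URLs are returned in sorted order.
func crawl(start string, recursive bool, fetch fetchFunc) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{start: true}
	)

	var visit func(url string)
	visit = func(url string) {
		defer wg.Done()
		links := fetch(url)
		if !recursive {
			return // recursion disabled: do not follow links
		}
		for _, link := range links {
			mu.Lock()
			seen := visited[link]
			visited[link] = true
			mu.Unlock()
			if !seen {
				wg.Add(1)
				go visit(link)
			}
		}
	}

	wg.Add(1)
	go visit(start)
	wg.Wait()

	pages := make([]string, 0, len(visited))
	for url := range visited {
		pages = append(pages, url)
	}
	sort.Strings(pages)
	return pages
}

func main() {
	// Hypothetical link graph used instead of the network.
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/"},
		"/b": {"/c"},
		"/c": nil,
	}
	fetch := func(url string) []string { return site[url] }

	fmt.Println(crawl("/", true, fetch))  // all four pages
	fmt.Println(crawl("/", false, fetch)) // only the start page
}
```

This sketch starts one goroutine per page with no upper bound. A crawler aimed at real sites would normally cap the number of concurrent requests and check robots.txt before fetching.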

2. Database Integration

SQLite: The crawler maintains a persistent database using SQLite to store extracted words and relevant metadata.

3. User Interface

HTML and CSS Webpage: Results are presented on a simple and visually appealing HTML and CSS webpage.

4. Search Functionality

Word and Bigram Search: Users can search for a single word or a bigram (a two-word phrase) to retrieve relevant results.
Wildcard Search: Users can search for a base word and receive results for terms that extend it (e.g., "water" also matches "watercolor").
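
The two search modes above can be sketched as follows. This is an assumption-laden illustration, not the repository's implementation: `tokenize`, `bigrams`, and `wildcard` are hypothetical helpers, and wildcard matching is assumed to be a simple prefix match, which is consistent with the "water" and "watercolor" example but may differ from what the project actually does.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it into words on every rune that is
// neither a letter nor a digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// bigrams returns each pair of adjacent words joined by a single space.
func bigrams(words []string) []string {
	if len(words) < 2 {
		return nil
	}
	out := make([]string, 0, len(words)-1)
	for i := 0; i+1 < len(words); i++ {
		out = append(out, words[i]+" "+words[i+1])
	}
	return out
}

// wildcard returns the sorted, de-duplicated terms that start with base.
// Prefix matching is an assumption made for this sketch.
func wildcard(base string, terms []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, t := range terms {
		if strings.HasPrefix(t, base) && !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	words := tokenize("Water and watercolor paint; water flows.")
	fmt.Println(words)
	fmt.Println(bigrams(words))
	fmt.Println(wildcard("water", words))
}
```

Indexing bigrams alongside single words lets a two-word query be looked up the same way as a one-word query, at the cost of a larger index.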

5. Result Sorting

TF-IDF Calculation: Results are sorted by TF-IDF score, so pages where the search term is frequent, relative to how common it is across all crawled pages, appear first.
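
To make the ranking concrete, here is a small sketch of TF-IDF scoring. It assumes one common formulation, where term frequency is the term's count divided by the page's word count and inverse document frequency is the natural log of the number of pages divided by the number of pages containing the term. The repository may use a different variant, and `rank` and `result` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// result pairs a page URL with its TF-IDF score for one search term.
type result struct {
	URL   string
	Score float64
}

// rank scores every page that contains term and returns the pages sorted by
// descending score, with ties broken by URL. docs maps a URL to its words.
func rank(term string, docs map[string][]string) []result {
	// Count occurrences per page; a page contributes to the document
	// frequency only if the term appears at least once.
	counts := map[string]int{}
	for url, words := range docs {
		for _, w := range words {
			if w == term {
				counts[url]++
			}
		}
	}
	if len(counts) == 0 {
		return nil // term does not occur anywhere
	}

	idf := math.Log(float64(len(docs)) / float64(len(counts)))

	results := make([]result, 0, len(counts))
	for url, n := range counts {
		tf := float64(n) / float64(len(docs[url]))
		results = append(results, result{URL: url, Score: tf * idf})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].URL < results[j].URL
	})
	return results
}

func main() {
	// Hypothetical, already tokenized pages.
	docs := map[string][]string{
		"a": {"water", "water", "paint"},
		"b": {"water", "paint", "paint", "paint"},
		"c": {"paint", "brush"},
	}
	for _, r := range rank("water", docs) {
		fmt.Printf("%s %.3f\n", r.URL, r.Score)
	}
}
```

With this data, page "a" ranks above page "b" because "water" makes up a larger share of its words. Under this formulation, a term that appears on every page has an IDF of zero, so all matching pages tie.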

Screenshots

Homepage

Searching

Bigram Search

Wildcard Search

Repository: ncavestany/Go-Web-Crawler
