This project is a multi-threaded web crawler built in C++. It uses the Curl library for making HTTP requests, the Gumbo library for parsing HTML, and the nlohmann/json library for JSON manipulation. The crawler can crawl websites up to a specified depth and save metadata such as page titles, descriptions, and links.
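To give a sense of how these three libraries fit together, here is a minimal single-page sketch (illustrative only, not the project's actual source): it fetches one URL with Curl, extracts the `<title>` with Gumbo, and emits the metadata as JSON with nlohmann/json.

```cpp
// Illustrative sketch only -- not the project's source.
// Build (assumed): g++ -std=c++17 sketch.cpp -lcurl -lgumbo
#include <curl/curl.h>
#include <gumbo.h>
#include <nlohmann/json.hpp>

#include <iostream>
#include <string>

// libcurl write callback: append the received bytes to a std::string.
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

// Depth-first search of the Gumbo tree for the first <title> element's text.
static std::string find_title(const GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) return "";
    const GumboVector& children = node->v.element.children;
    if (node->v.element.tag == GUMBO_TAG_TITLE && children.length > 0) {
        const GumboNode* text = static_cast<GumboNode*>(children.data[0]);
        if (text->type == GUMBO_NODE_TEXT) return text->v.text.text;
    }
    for (unsigned int i = 0; i < children.length; ++i) {
        std::string title = find_title(static_cast<GumboNode*>(children.data[i]));
        if (!title.empty()) return title;
    }
    return "";
}

int main() {
    const std::string url = "https://example.com";
    std::string body;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    if (rc != CURLE_OK) {
        std::cerr << "fetch failed: " << curl_easy_strerror(rc) << "\n";
        return 1;
    }

    GumboOutput* parsed = gumbo_parse(body.c_str());
    nlohmann::json meta;
    meta["url"] = url;
    meta["title"] = find_title(parsed->root);
    gumbo_destroy_output(&kGumboDefaultOptions, parsed);

    std::cout << meta.dump(2) << std::endl;  // e.g. {"title": "...", "url": "..."}
    return 0;
}
```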
- Setup Instructions
- Installation Guide for Dependencies
- Cloning the Repository
- Building the Project
- Running the Web Crawler
- Getting Started
To set up and run the web crawler project, follow these instructions:
Make sure you have the following dependencies installed on your system:
- C++ compiler (e.g., GCC, Clang)
- CMake (version 3.14 or later)
- Curl library
- Gumbo library
- nlohmann/json library
To install the Curl library, use your package manager. For example, on Debian-based systems (like Ubuntu), run:
sudo apt-get install libcurl4-openssl-dev
On Windows, you can install Curl using vcpkg. First install vcpkg, then run:
vcpkg install curl
Install the Gumbo library using the following command:
sudo apt-get install libgumbo-dev
To install Gumbo on Windows, use vcpkg:
vcpkg install gumbo-parser
You can install the nlohmann/json library via a package manager:
sudo apt-get install nlohmann-json3-dev
Install the nlohmann/json library using vcpkg:
vcpkg install nlohmann-json
Make sure to configure your build system to include the paths to these libraries as needed.
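Since the project lists CMake 3.14+ as a dependency, a configuration along these lines would pull in all three libraries. The target and source names below are assumptions for illustration, not taken from the repository's own CMakeLists.txt.

```cmake
# Sketch of a possible CMakeLists.txt; target names and paths are assumptions.
cmake_minimum_required(VERSION 3.14)
project(web_crawler CXX)

find_package(CURL REQUIRED)               # libcurl
find_package(nlohmann_json REQUIRED)      # header-only JSON library
find_library(GUMBO_LIBRARY gumbo)         # Gumbo usually ships no CMake config, so locate it directly
find_path(GUMBO_INCLUDE_DIR gumbo.h)

add_executable(web_crawler src/main.cpp)  # adjust the source list to match the repository
target_include_directories(web_crawler PRIVATE ${GUMBO_INCLUDE_DIR})
target_link_libraries(web_crawler PRIVATE
    CURL::libcurl
    nlohmann_json::nlohmann_json
    ${GUMBO_LIBRARY})
```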
git clone https://github.com/your-username/web-crawler.git
cd web-crawler
Create an obj directory, generate the necessary files with make, and compile the project:
./make.sh
To start crawling, run the binary with a seed URL, the number of worker threads, and the maximum crawl depth:
./bin/web_crawler javatpoint.com 16 3
Replace javatpoint.com with the desired seed URL, and adjust the thread count (16) and maximum depth (3) as needed.
To contribute changes back to the project, a typical fork-and-rebase workflow looks like this:
# 1. Clone your forked repo
git clone https://github.com/your-username/repo-name.git
cd repo-name
# 2. Add upstream remote to keep up with the original repo
git remote add upstream https://github.com/original-owner/repo-name.git
git fetch upstream
# 3. Create a new branch for your feature/fix
git checkout -b feature-or-fix-description
# 4. Before making changes, ensure your local main branch is up-to-date
git checkout main
git fetch upstream
git rebase upstream/main
# 5. Rebase your feature branch onto the updated main branch
git checkout feature-or-fix-description
git rebase main
# 6. Make your changes, then stage and commit with a descriptive message
git add .
git commit -m "Description of changes"
# 7. Push your branch to your fork
git push origin feature-or-fix-description
# 8. If updates happen on upstream/main while your PR is open, keep your branch updated:
git fetch upstream # Get latest updates from the original repo
git rebase upstream/main # Rebase your feature branch onto it
git push origin feature-or-fix-description --force # Force-push updated branch
This web crawler project is a multi-threaded application that effectively crawls websites, extracts relevant metadata, and handles rate-limiting for domains. With its modular design, it is easy to extend or modify the functionality for additional use cases. You can adjust the number of worker threads, crawl depth, and domain request delay for optimized performance based on your requirements.
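As an illustration of how a per-domain request delay might be enforced across worker threads, here is a minimal thread-safe sketch; the class and member names are hypothetical and not taken from the project's source.

```cpp
// Hypothetical per-domain rate limiter -- illustrative, not the project's code.
#include <algorithm>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

class DomainRateLimiter {
public:
    explicit DomainRateLimiter(std::chrono::milliseconds delay) : delay_(delay) {}

    // Called by a worker thread right before it issues a request to `domain`.
    // Each caller reserves the next free time slot for that domain under the
    // lock, then sleeps outside the lock until its slot arrives.
    void wait(const std::string& domain) {
        std::chrono::steady_clock::time_point slot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto now = std::chrono::steady_clock::now();
            auto& next = next_allowed_[domain];   // starts at the epoch for a new domain
            slot = std::max(now, next);
            next = slot + delay_;                 // reserve the slot after ours
        }
        std::this_thread::sleep_until(slot);
    }

private:
    std::chrono::milliseconds delay_;
    std::mutex mutex_;
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> next_allowed_;
};
```

The workers would share one limiter, e.g. `DomainRateLimiter limiter(std::chrono::milliseconds(500));`, and call `limiter.wait(domain)` before each HTTP request; whether the project's own code takes exactly this approach would need to be checked against the source.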
Feel free to contribute to the project by submitting issues or pull requests. Happy crawling!