
Orion Crawler



This repository hosts a powerful web crawler specifically designed for monitoring activities on the hidden web. It leverages Docker Compose to seamlessly orchestrate multiple services, including MongoDB for data storage, Redis for caching and task management, and multiple Tor containers to ensure robust anonymity and secure communication. This setup provides a scalable and efficient framework for collecting and analyzing hidden web data while prioritizing privacy and security.


Features

1. Docker-Based Deployment: Quick setup and deployment using Docker.

2. Advanced Search Functionality: Provides comprehensive search capabilities with various filters and options to refine search results.

3. Data Visualization: Generates visual representations of the data, making it easier to analyze search results.

4. Customizable Search Parsers: Allows custom parsers to be integrated to refine data extraction from specific websites (a hedged sketch follows this list).

5. Integrated Machine Learning Models: Incorporates NLP and machine learning models to improve search relevance, categorize content, and detect specific data patterns.
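As an illustration of the custom-parser idea, here is a minimal sketch in Python. The BaseParser interface, the parse signature, and the extracted fields are hypothetical, not the repository's actual API; consult the code under app/ for the real extension points.

from abc import ABC, abstractmethod
from bs4 import BeautifulSoup  # assumes BeautifulSoup is installed

class BaseParser(ABC):
    """Hypothetical base class: extracts structured fields from a fetched page."""

    @abstractmethod
    def parse(self, url: str, html: str) -> dict:
        ...

class ExampleSiteParser(BaseParser):
    """Illustrative site-specific parser; the fields below are invented."""

    def parse(self, url: str, html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.string if soup.title else ""
        return {"url": url, "title": title}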

Prerequisites

Ensure you have the following installed on your system:

  • Python
  • Docker
  • Docker Compose

Installation

Step 1: Clone Repository

git clone https://github.com/msmannan00/Orion-Crawler.git
cd Orion-Crawler

Step 2: Build and Start the Docker Services

docker-compose up --build

This command builds and starts the following services:

  • API Service (api): The main web crawler service, which runs according to the predefined settings.
  • MongoDB (mongo): Database for storing crawled data.
  • Redis (redis_server): In-memory data store for caching and task queuing.
  • Tor Containers (tor-extend-*): Multiple Tor instances that route crawler traffic through different Tor exit nodes.
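For orientation, a compose file wiring these services together typically looks like the sketch below. The service names match the list above, but the build context, image names, and volumes are placeholders rather than the repository's actual docker-compose.yml.

services:
  api:
    build: .                 # placeholder build context for the crawler
    depends_on:
      - mongo
      - redis_server
  mongo:
    image: mongo
    volumes:
      - ./data/db:/data/db   # matches the data/db/ directory described below
  redis_server:
    image: redis
  tor-extend-1:
    image: example/tor       # placeholder; the real image is built from dockerFiles/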

Step 3: Run the Web Crawler

You can run the web crawler in two ways:

Direct Execution:

  • Copy the app/libs/nltk_data folder to the AppData directory on Windows or to your home directory on Linux (see the command below).
  • Navigate to the Orion-Crawler/app/ directory.
  • Run the web crawler directly:
python main_direct.py
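For the first step above, on Linux the copy can be done from the repository root with:

cp -r app/libs/nltk_data ~/

On Windows, copy the folder into %APPDATA% instead.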

Using Docker:

  • The web crawler can also be started with Docker, which uses the start_app.sh script:
docker-compose up --build

Project Structure

  • api/: Contains the web crawler source code.
  • data/db/: Directory where MongoDB stores its data.
  • dockerFiles/: Dockerfiles for building custom images.

Usage

Follow the installation steps to set up and run the web crawler. Once the services are running, the crawler automatically begins monitoring the specified dark web URLs through the Tor network and stores the results in MongoDB, while Redis handles caching and task management.
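To check that data is arriving, you can open a shell inside the MongoDB container and list the databases. The service name mongo comes from the compose setup above; the database and collection names depend on the crawler's configuration.

docker-compose exec mongo mongosh
show dbs    # run inside the Mongo shell

On older MongoDB images the shell binary is mongo rather than mongosh.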

Configuring Tor Instances

Each Tor container is configured to run as a separate instance, routing traffic through different Tor exit nodes. This increases anonymity and reduces the chances of IP bans.
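Running several isolated Tor clients generally comes down to giving each instance its own SocksPort and DataDirectory. A minimal torrc for one instance might look like the following; the port and paths are illustrative, and the repository's actual Tor configuration may differ.

# torrc for a single Tor instance (illustrative values)
SocksPort 0.0.0.0:9050
DataDirectory /var/lib/tor
Log notice stdout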

Scaling

You can scale the number of Tor instances by modifying the docker-compose.yml file and adding more tor-extend-* services as needed.
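Assuming the existing tor-extend-* services follow a common pattern, adding another instance amounts to appending a similarly named service block, for example (the image name and host port are placeholders):

services:
  tor-extend-4:
    image: example/tor       # placeholder; reuse the image of the existing tor-extend-* services
    ports:
      - "9053:9050"          # illustrative host:container SOCKS port mapping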

About

Orion Crawler is a powerful hidden web crawling tool built with Docker Compose, designed for secure and anonymous web scraping. It supports two variants: a generic crawler for broad data collection and a specific crawler that uses custom parsers for targeted, fine-tuned crawling.
