Web crawler

A simple library for crawling websites by following links, with minimal dependencies.

📦 Installation

It's best to use Composer for installation, and you can also find the package on Packagist and GitHub.

To install, simply use the command:

$ composer require baraja-core/webcrawler

You can use the package manually by creating an instance of the internal classes, or register a DIC extension to link the services directly to the Nette Framework.

How to use

The crawler can run without any additional dependencies.

With the default settings, create an instance and call the crawl() method:

$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('https://example.com');

The $result variable will contain an entity of type CrawledResult.

Advanced checking of multiple URLs

In practice, you often need to download multiple URLs within a single domain and check whether some specific URLs work.

Simple example:

$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawlList(
    'https://example.com', // Starting (main) URL
    [ // Additional URLs
        'https://example.com/error-404',
        '/robots.txt', // Relative links are also allowed
        '/web.config',
    ]
);

Note: The robots.txt file and the sitemap are downloaded automatically if they exist.
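To make the "relative links are also allowed" remark concrete, here is a minimal sketch of how a path such as /robots.txt could be resolved against the starting URL. The helper resolveAgainstStart() is hypothetical and is not part of baraja-core/webcrawler; the library's actual resolution logic may differ, and this sketch only handles absolute http(s) URLs and paths relative to the site root.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (NOT the library's implementation): builds an
 * absolute URL from the scheme, host and port of the starting URL.
 */
function resolveAgainstStart(string $startUrl, string $candidate): string
{
    // Already an absolute http(s) URL: keep it unchanged.
    if (preg_match('~^https?://~i', $candidate) === 1) {
        return $candidate;
    }

    $parts = parse_url($startUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException('Starting URL must be absolute: ' . $startUrl);
    }

    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Assumption of this sketch: every other value is a path from the site root.
    return $origin . '/' . ltrim($candidate, '/');
}

// The same list as in the crawlList() example above.
$urls = array_map(
    static fn (string $url): string => resolveAgainstStart('https://example.com', $url),
    ['https://example.com/error-404', '/robots.txt', '/web.config']
);

print_r($urls);
```

With crawlList() you do not need such a helper yourself; the sketch only illustrates what passing a relative entry is expected to mean.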

Settings

In the constructor of the Crawler service you can define your project-specific configuration.

For example:

$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);

No value is required. Pass the configuration as a key-value array.

Configuration options:

Option	Default value	Type and description
`followExternalLinks`	`false`	`Bool`: Follow links to other domains? When `false`, the crawler stays within the given domain.
`sleepBetweenRequests`	`1000`	`Int`: Pause between requests, in milliseconds.
`maxHttpRequests`	`1000000`	`Int`: Crawl budget, the maximum number of HTTP requests.
`maxCrawlTimeInSeconds`	`30`	`Int`: Crawling stops once this time limit is exceeded.
`allowedUrls`	`['.+']`	`String[]`: List of valid regular expressions describing allowed URL formats.
`forbiddenUrls`	`['']`	`String[]`: List of valid regular expressions describing forbidden URL formats.
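As an illustration of the options above, the following sketch configures a short, polite crawl. The concrete values and the autoloader path are assumptions chosen for the example, not recommendations from the library. The regex options are left at their defaults here because this README does not document how the patterns are delimited or anchored.

```php
<?php

declare(strict_types=1);

// Composer autoloader; the path is an assumption about your project layout.
require __DIR__ . '/vendor/autoload.php';

$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        'followExternalLinks' => false, // stay within the starting domain
        'sleepBetweenRequests' => 500,  // pause between requests, in milliseconds
        'maxHttpRequests' => 200,       // stop after this many HTTP requests
        'maxCrawlTimeInSeconds' => 60,  // stop once the time limit is exceeded
    ])
);

$result = $crawler->crawl('https://example.com');
```

Any option you omit keeps its default value from the table.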

📄 License

baraja-core/webcrawler is licensed under the MIT license. See the LICENSE file for more details.
