This is an elementary web site crawler written in C# on .NET Core. What do we mean by crawling a web site? No, we are not indexing the content of the pages.
This is a simple component which will crawl through a web site (example: www.cnn.com), find sub-links and in turn crawl those pages. Only links which fall under the domain of the parent site are accepted.
The accompanying code produces a command line application. The results are written directly to the console.
Here is the output (truncated) when https://www.cnn.com/ was crawled:
https://www.cnn.com/videos/world/2021/11/16/uganda-explosions-madowo-live-nr-ovn-intl-ldn-vpx.cnn
https://www.cnn.com/videos/world/2021/11/17/lagos-nigeria-erosion-climate-change-ctw-busari-pkg-intl-vpx.cnn
https://www.cnn.com/videos/world/2021/11/19/blinken-says-ethiopia-on-path-to-destruction-sot-intl-ctw-vpx.cnn
https://www.cnn.com/videos/world/2021/11/25/davido-birthday-donations-orphanage-oneworld-intl-intv-vpx.cnn
https://www.cnn.com/videos/world/2021/11/29/salim-abdool-karim-south-africa-omicron-variant-covid-19-travel-ban-sot-newday-vpx.cnn
https://www.cnn.com/videos/world/2021/12/02/botswana-president-omicron-coronavirus-travel-bans-oneworld-intl-vpx.cnn
https://www.cnn.com/vr
https://www.cnn.com/warnermediaprivacy.com/opt-out
https://www.cnn.com/weather
https://www.cnn.com/world
https://www.cnn.com/www.cnn.com/interactive/travel/best-beaches
https://www.cnn.com/www.cnnpartners.com
https://www.cnn.com/youtube.com/user/CNN
Found 663 sites in the Url:'https://www.cnn.com', after searching a maximum of 30 sites
Only links which reference pages on the same domain as the specified URL are extracted. The following kinds of links are ignored (a sketch of this filtering logic is shown after the list):
- Any external site. E.g. www.twitter.com
- Links using the mailto:, tel: or sms: schemes
- Bookmarks. Example:
<a href='#chapter1'>Chapter 1</a>
- Any link which produces a content type other than text/html. Example: a link to a PDF document
- Any link which is under a sub-domain of the root domain. E.g. if www.cnn.com was being crawled, then a link such as sports.cnn.com would be ignored

Relative links to pages on the same domain are acceptable:
<a href='/contactus.htm'>Contact us</a>
<a href='/Careers'>Careers</a>
So are absolute links on the same domain. E.g. if www.cnn.com was being crawled, then
<a href='https://www.cnn.com/Careers'>Careers</a>
would be acceptable.
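A minimal sketch of how this filtering could be implemented with System.Uri is shown below. The ShouldCrawl helper is illustrative, not the repository's exact code, and the text/html content-type check is not part of it because that test can only be made after the page has been fetched.

using System;

public static class LinkFilterSketch
{
    // Illustrative helper: returns true only for http(s) links that resolve to the
    // exact host of the root site. mailto:, tel:, sms:, bookmarks ('#...'),
    // external hosts and sub-domains all fall through to 'false'.
    public static bool ShouldCrawl(Uri rootSite, string href)
    {
        if (string.IsNullOrWhiteSpace(href) || href.StartsWith("#"))
            return false; // empty link or bookmark

        // Resolve relative links (e.g. '/Careers') against the root site
        if (!Uri.TryCreate(rootSite, href, out var absolute))
            return false;

        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return false; // mailto:, tel:, sms: and similar schemes

        // Same host only; sub-domains such as sports.cnn.com are rejected
        return string.Equals(absolute.Host, rootSite.Host, StringComparison.OrdinalIgnoreCase);
    }
}

For example, ShouldCrawl(new Uri("https://www.cnn.com"), "/Careers") returns true, while "#chapter1", "mailto:info@cnn.com" and "https://sports.cnn.com/" all return false.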
The current version has been built with Visual Studio 2019 and .NET Core 3.1. The crawler component itself targets .NET Standard 2.0. Example usage:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30
Here are some example invocations of the web site crawler:
WebsiteCrawler.exe --url https://www.bbc.co.uk --maxsites 5
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30
WebsiteCrawler.exe --maxsites 10 --url https://www.premierleague.com
The HtmlAgilityPack component is used for parsing links out of an HTML fragment. Refer to the class HtmlAgilityParser.cs.
One of the key principles of SOLID mandates that we code against interfaces as opposed to concrete implementations. This significantly simplifies unit testing via mocking/faking.
public interface IHtmlParser
{
    /// <summary>
    /// Returns all hyperlink URLs found in the given HTML document
    /// </summary>
    /// <param name="htmlContent">The HTML content to scan</param>
    /// <returns>The href values of all anchor elements</returns>
    List<string> GetLinks(string htmlContent);
}
I have provided a concrete implementation using the NuGet package HtmlAgilityPack.
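The actual HtmlAgilityParser.cs may differ in detail, but a minimal sketch of such an implementation (the class name HtmlAgilityParserSketch is illustrative) could look like this:

using System.Collections.Generic;
using HtmlAgilityPack;

public class HtmlAgilityParserSketch : IHtmlParser
{
    public List<string> GetLinks(string htmlContent)
    {
        var links = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(htmlContent);

        // SelectNodes returns null (not an empty collection) when no anchors are found
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href)) links.Add(href);
        }
        return links;
    }
}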
The command line executable uses log4net for logging.
private static ServiceProvider ConfigureServices()
{
    var serviceCollection = new ServiceCollection();
    serviceCollection.AddTransient<IWebSiteCrawler, SingleThreadedWebSiteCrawler>();
    serviceCollection.AddTransient<IHtmlParser, HtmlAgilityParser>();
    serviceCollection.AddTransient<HttpClient>();
    serviceCollection.AddLogging(builder => builder.AddLog4Net("log4net.config"));
    serviceCollection.AddTransient<ICrawlerResultsFormatter, CsvResultsFormatter>();
    return serviceCollection.BuildServiceProvider();
}
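For context, here is a sketch of how the console host might consume the container built above. The RunCrawlerAsync helper and the Run signature are assumptions for illustration, not the exact code in the repository.

// Sketch only: resolves the crawler from the container and runs it
private static async Task<int> RunCrawlerAsync(string url, int maxSites)
{
    using var serviceProvider = ConfigureServices();
    var crawler = serviceProvider.GetRequiredService<IWebSiteCrawler>();
    var results = await crawler.Run(url, maxSites);
    // The results would then be handed to the registered ICrawlerResultsFormatter
    return 0;
}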
If the crawler component were used by an ASP.NET Core worker service, then you could imagine logging to Azure Application Insights:
services.AddLogging(builder =>
{
    builder.AddApplicationInsights("<YourInstrumentationKey>");
});
The class HttpClient delegates its work to an HttpMessageHandler, whose protected method SendAsync can be intercepted with Moq. I found this article very helpful.
internal static class Mocks
{
    internal static Mock<HttpMessageHandler> CreateHttpMessageHandlerMock(
        string responseBody,
        HttpStatusCode responseStatus,
        string responseContentType)
    {
        var handlerMock = new Mock<HttpMessageHandler>();
        var response = new HttpResponseMessage
        {
            StatusCode = responseStatus,
            Content = new StringContent(responseBody),
        };
        response.Content.Headers.ContentType = new MediaTypeHeaderValue(responseContentType);
        handlerMock
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(response);
        return handlerMock;
    }
}
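The mocked handler can then be wrapped in a real HttpClient inside a unit test. The fragment below is a sketch; the sample HTML is illustrative, and in the actual tests the HttpClient would be passed to the class under test (e.g. SingleThreadedWebSiteCrawler) rather than called directly.

var handlerMock = Mocks.CreateHttpMessageHandlerMock(
    "<html><a href='/Careers'>Careers</a></html>",   // illustrative response body
    HttpStatusCode.OK,
    "text/html");

// Any request sent through this client now receives the canned response
var httpClient = new HttpClient(handlerMock.Object);

var response = await httpClient.GetAsync("https://www.cnn.com/");
string body = await response.Content.ReadAsStringAsync();   // contains the '/Careers' link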
The CommandLineParser component is being used. It interprets the command line arguments with the help of the following model class:
public class CmdLineArgumentModel
{
    [Option("maxsites", Required = true, HelpText = "An upper limit on the number of sites to search. Example: 30")]
    public int MaxSites { get; set; }

    [Option("url", Required = true, HelpText = "The URL of the site to search")]
    public string Url { get; set; }
}
Refer to the project site of this component for more information.
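For reference, a typical way to wire the model into the entry point with this library looks roughly like the sketch below; the RunCrawlerAsync helper (which would resolve and invoke the crawler as sketched earlier) is hypothetical.

using CommandLine;

private static async Task<int> Main(string[] args)
{
    return await Parser.Default.ParseArguments<CmdLineArgumentModel>(args)
        .MapResult(
            options => RunCrawlerAsync(options.Url, options.MaxSites),
            errors => Task.FromResult(1));   // the parser has already printed help/usage text
}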
All C# code is written with the expectation that a concrete implementation of ILogger<T> would be injected via dependency injection.
log4net is being used. In the current implementation, the logging output is displayed on the console.
Logging to the Output window of Visual Studio helps in examining the log output while debugging:
ILogger<SingleThreadedWebSiteCrawler> logger = CreateOutputWindowLogger<SingleThreadedWebSiteCrawler>();
/// <summary>
/// Helps you view logging results in the Output window of Visual Studio
/// </summary>
private ILogger<T> CreateOutputWindowLogger<T>()
{
    var serviceProvider = new ServiceCollection().AddLogging(builder => builder.AddDebug()).BuildServiceProvider();
    return serviceProvider.GetService<ILogger<T>>();
}
- Integration tests on the executable. We should programmatically launch the executable with various combinations of command line arguments and test for the expected output.
- Explore whether we have covered all HTTP status codes while doing a Polly retry.
- A better approach to recording errors. Currently they are only logged. There should be an errors collection, and this should be unit tested.
- A better approach to ignored links. Currently they are only logged. They could be incorporated as another collection in the results object.
The current implementation is single threaded. This is obviously slow. A more sophisticated implementation would be broadly as follows:
start ---> master thread manages a queue of work items ---> spawns child Tasks ---> each Task pulls the first available item from the work queue, parses its links and adds them to the same queue
Rewriting the component as a multithreaded one is possible, but it makes the coding and testing more complex. And what do we gain? Consider the scenario where we have multiple sites to crawl (www.site1.com, www.site2.com, www.site3.com). Why not create independent threads, one each for www.site1.com, www.site2.com and www.site3.com? This might be a win-win solution.
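For illustration, here is a rough sketch of the queue-based design described above. It is not production code: the class and parameter names are invented, and domain filtering, link normalization, error handling and politeness delays are all omitted.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class MultiThreadedCrawlerSketch
{
    private readonly HttpClient httpClient;
    private readonly IHtmlParser htmlParser;

    public MultiThreadedCrawlerSketch(HttpClient httpClient, IHtmlParser htmlParser)
    {
        this.httpClient = httpClient;
        this.htmlParser = htmlParser;
    }

    public async Task<IReadOnlyCollection<string>> Run(string rootUrl, int maxPagesToSearch, int workerCount = 4)
    {
        var workQueue = new ConcurrentQueue<string>();
        var visited = new ConcurrentDictionary<string, byte>();
        workQueue.Enqueue(rootUrl);

        // Child tasks: each pulls the first available item from the shared queue,
        // parses its links and pushes unseen (assumed absolute) links back onto the same queue
        var workers = Enumerable.Range(0, workerCount).Select(async _ =>
        {
            int idleSpins = 0;
            while (visited.Count < maxPagesToSearch && idleSpins < 20)
            {
                if (!workQueue.TryDequeue(out var url))
                {
                    idleSpins++;              // queue momentarily empty; wait for other workers
                    await Task.Delay(100);
                    continue;
                }
                idleSpins = 0;
                if (!visited.TryAdd(url, 0)) continue;   // another worker already crawled this page

                var html = await httpClient.GetStringAsync(url);
                foreach (var link in htmlParser.GetLinks(html))
                {
                    if (!visited.ContainsKey(link)) workQueue.Enqueue(link);
                }
            }
        }).ToList();

        await Task.WhenAll(workers);
        return visited.Keys.ToList();
    }
}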
We could provide the ability to produce output in JSON format as well. Example:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --format csv
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --format json
We could also save the generated results to a file (a sketch of the extended command line model follows these examples). Example:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --output c:\temp\results.csv
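One way to support both switches would be to extend the command line model. The --format and --output options below are hypothetical additions, not part of the current code.

public class CmdLineArgumentModel
{
    [Option("maxsites", Required = true, HelpText = "An upper limit on the number of sites to search. Example: 30")]
    public int MaxSites { get; set; }

    [Option("url", Required = true, HelpText = "The URL of the site to search")]
    public string Url { get; set; }

    // Hypothetical additions for the enhancements described above
    [Option("format", Required = false, Default = "csv", HelpText = "Output format: csv or json")]
    public string Format { get; set; }

    [Option("output", Required = false, HelpText = @"Optional file to write the results to. Example: c:\temp\results.csv")]
    public string Output { get; set; }
}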
Imagine crawling a deeply nested web site like www.cnn.com. This could run into thousands of pages. Beginning from the top level every time might not be a very efficient approach; searching in smaller batches (say 100 sites) might be more practical. To handle such a long running process, we could make the SingleThreadedWebSiteCrawler component stateful: pass in the last known state and resume from that point onwards. The queue data structure should be returned along with the results, and the entire results object should be made serializable.
public interface IWebSiteCrawler
{
    Task<SearchResults> Run(SearchResults lastKnownResult, string url, int maxPagesToSearch);
}
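Usage might then look like the following sketch, crawling in batches and persisting the returned state between runs. The batch size and the serialization step are illustrative.

// Sketch: crawl in batches of 100 pages, resuming each time from the previous state.
// Assumes SearchResults (including its internal work queue) is serializable.
SearchResults state = null;
for (int batch = 0; batch < 5; batch++)
{
    state = await crawler.Run(state, "https://www.cnn.com", maxPagesToSearch: 100);
    // 'state' could be serialized to disk or a database here and reloaded later
}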
- We could turn this into an ASP.NET Core worker service.
- The service would run as a web job in Azure or in a container.
- The worker service would passively wait for messages in a queue.
- Results would be written back to a database.
- Such a worker service could be made very resilient by managing state in an external data store. Example: consider incrementally crawling a large web site over several hours. A minimal sketch of such a worker follows.
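The sketch below assumes hypothetical ICrawlRequestQueue and ICrawlResultsRepository abstractions; neither exists in the current code, and the Run call uses the stateful signature proposed above.

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical abstractions, shown for illustration only
public interface ICrawlRequestQueue { Task<CrawlRequest> DequeueAsync(CancellationToken token); }
public interface ICrawlResultsRepository { Task SaveAsync(SearchResults results); }
public class CrawlRequest { public string Url { get; set; } public int MaxSites { get; set; } }

public class CrawlerWorker : BackgroundService
{
    private readonly ICrawlRequestQueue queue;
    private readonly IWebSiteCrawler crawler;
    private readonly ICrawlResultsRepository repository;

    public CrawlerWorker(ICrawlRequestQueue queue, IWebSiteCrawler crawler, ICrawlResultsRepository repository)
    {
        this.queue = queue;
        this.crawler = crawler;
        this.repository = repository;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Passively wait for the next crawl request from the queue
            var request = await queue.DequeueAsync(stoppingToken);

            // A null lastKnownResult starts a fresh crawl; previously saved state
            // could be reloaded from the repository to resume an interrupted crawl
            var results = await crawler.Run(null, request.Url, request.MaxSites);

            // Persist the results (and state) to a database or other external store
            await repository.SaveAsync(results);
        }
    }
}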