This is an elementary web site crawler written in C# on .NET Core. What do we mean by crawling a web site? No, we are not indexing the content of the pages.
This is a simple component which will crawl through a web site (example: www.cnn.com), find sub-links and in turn crawl those pages. Only links which fall under the domain of the parent site are accepted.
The accompanying code produces a command line application. The results are written directly to the console.
Here is the output (truncated) when https://www.cnn.com/ was crawled:
https://www.cnn.com/videos/world/2021/11/16/uganda-explosions-madowo-live-nr-ovn-intl-ldn-vpx.cnn
https://www.cnn.com/videos/world/2021/11/17/lagos-nigeria-erosion-climate-change-ctw-busari-pkg-intl-vpx.cnn
https://www.cnn.com/videos/world/2021/11/19/blinken-says-ethiopia-on-path-to-destruction-sot-intl-ctw-vpx.cnn
https://www.cnn.com/videos/world/2021/11/25/davido-birthday-donations-orphanage-oneworld-intl-intv-vpx.cnn
https://www.cnn.com/videos/world/2021/11/29/salim-abdool-karim-south-africa-omicron-variant-covid-19-travel-ban-sot-newday-vpx.cnn
https://www.cnn.com/videos/world/2021/12/02/botswana-president-omicron-coronavirus-travel-bans-oneworld-intl-vpx.cnn
https://www.cnn.com/vr
https://www.cnn.com/warnermediaprivacy.com/opt-out
https://www.cnn.com/weather
https://www.cnn.com/world
https://www.cnn.com/www.cnn.com/interactive/travel/best-beaches
https://www.cnn.com/www.cnnpartners.com
https://www.cnn.com/youtube.com/user/CNN
Found 663 sites in the Url:'https://www.cnn.com', after searching a maximum of 30 sites
Only links which reference pages on the same domain as the specified URL are extracted. The following kinds of links are ignored (a sketch of this filtering logic is shown after the list):
- Any external site. E.g. www.twitter.com
- Links using the mailto:, tel: or sms: schemes
- Bookmarks. Example:
<a href='#chapter1'>Chapter 1</a>
- Any link which produces a content type other than text/html. Example: a link to a PDF document
- Any link which is under a sub-domain of the root domain. E.g. if www.cnn.com was being crawled, then a link such as sports.cnn.com would be ignored

Relative links to pages on the same domain are acceptable:
<a href='/contactus.htm'>Contact us</a>
<a href='/Careers'>Careers</a>
So are absolute links on the same domain. E.g. if www.cnn.com was being crawled, then
<a href='https://www.cnn.com/Careers'>Careers</a>
would be acceptable.
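A minimal sketch of how this filtering could be implemented with System.Uri is shown below. The ShouldCrawl helper is illustrative, not the repository's exact code, and the text/html content-type check is not part of it because that test can only be made after the page has been fetched.

using System;

public static class LinkFilterSketch
{
    // Illustrative helper: returns true only for http(s) links that resolve to the
    // exact host of the root site. mailto:, tel:, sms:, bookmarks ('#...'),
    // external hosts and sub-domains all fall through to 'false'.
    public static bool ShouldCrawl(Uri rootSite, string href)
    {
        if (string.IsNullOrWhiteSpace(href) || href.StartsWith("#"))
            return false; // empty link or bookmark

        // Resolve relative links (e.g. '/Careers') against the root site
        if (!Uri.TryCreate(rootSite, href, out var absolute))
            return false;

        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return false; // mailto:, tel:, sms: and similar schemes

        // Same host only; sub-domains such as sports.cnn.com are rejected
        return string.Equals(absolute.Host, rootSite.Host, StringComparison.OrdinalIgnoreCase);
    }
}

For example, ShouldCrawl(new Uri("https://www.cnn.com"), "/Careers") returns true, while "#chapter1", "mailto:info@cnn.com" and "https://sports.cnn.com/" all return false.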
The current version has been built with Visual Studio 2019 and .NET Core 3.1. The crawler component itself targets .NET Standard 2.0. Example usage:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30
Here are some example invocations of the web site crawler:
WebsiteCrawler.exe --url https://www.bbc.co.uk --maxsites 5
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30
WebsiteCrawler.exe --maxsites 10 --url https://www.premierleague.com
The HtmlAgilityPack component is used for parsing links out of an HTML fragment. Refer to the class HtmlAgilityParser.cs.
One of the key principles of SOLID mandates that we code against interfaces as opposed to concrete implementations. This significantly simplifies unit testing via mocking/faking.
public interface IHtmlParser
{
    /// <summary>
    /// Returns all hyperlink URLs found in the given HTML document
    /// </summary>
    /// <param name="htmlContent">The HTML content to scan</param>
    /// <returns>The href values of all anchor elements</returns>
    List<string> GetLinks(string htmlContent);
}
I have provided a concrete implementation using the NuGet package HtmlAgilityPack.
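The actual HtmlAgilityParser.cs may differ in detail, but a minimal sketch of such an implementation (the class name HtmlAgilityParserSketch is illustrative) could look like this:

using System.Collections.Generic;
using HtmlAgilityPack;

public class HtmlAgilityParserSketch : IHtmlParser
{
    public List<string> GetLinks(string htmlContent)
    {
        var links = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(htmlContent);

        // SelectNodes returns null (not an empty collection) when no anchors are found
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href)) links.Add(href);
        }
        return links;
    }
}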
The command line executable uses log4net for logging.
private static ServiceProvider ConfigureServices()
{
    var serviceCollection = new ServiceCollection();
    serviceCollection.AddTransient<IWebSiteCrawler, SingleThreadedWebSiteCrawler>();
    serviceCollection.AddTransient<IHtmlParser, HtmlAgilityParser>();
    serviceCollection.AddTransient<HttpClient>();
    serviceCollection.AddLogging(builder => builder.AddLog4Net("log4net.config"));
    serviceCollection.AddTransient<ICrawlerResultsFormatter, CsvResultsFormatter>();
    return serviceCollection.BuildServiceProvider();
}
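For context, here is a sketch of how the console host might consume the container built above. The RunCrawlerAsync helper and the Run signature are assumptions for illustration, not the exact code in the repository.

// Sketch only: resolves the crawler from the container and runs it
private static async Task<int> RunCrawlerAsync(string url, int maxSites)
{
    using var serviceProvider = ConfigureServices();
    var crawler = serviceProvider.GetRequiredService<IWebSiteCrawler>();
    var results = await crawler.Run(url, maxSites);
    // The results would then be handed to the registered ICrawlerResultsFormatter
    return 0;
}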
If the crawler component were used by an ASP.NET Core worker service, then you could imagine logging to Azure Application Insights:
services.AddLogging(builder =>
{
    builder.AddApplicationInsights("<YourInstrumentationKey>");
});
The class HttpClient delegates its work to an HttpMessageHandler, whose protected method SendAsync can be intercepted with Moq. I found this article very helpful.
internal static class Mocks
{
    internal static Mock<HttpMessageHandler> CreateHttpMessageHandlerMock(
        string responseBody,
        HttpStatusCode responseStatus,
        string responseContentType)
    {
        var handlerMock = new Mock<HttpMessageHandler>();
        var response = new HttpResponseMessage
        {
            StatusCode = responseStatus,
            Content = new StringContent(responseBody),
        };
        response.Content.Headers.ContentType = new MediaTypeHeaderValue(responseContentType);
        handlerMock
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(response);
        return handlerMock;
    }
}
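The mocked handler can then be wrapped in a real HttpClient inside a unit test. The fragment below is a sketch; the sample HTML is illustrative, and in the actual tests the HttpClient would be passed to the class under test (e.g. SingleThreadedWebSiteCrawler) rather than called directly.

var handlerMock = Mocks.CreateHttpMessageHandlerMock(
    "<html><a href='/Careers'>Careers</a></html>",   // illustrative response body
    HttpStatusCode.OK,
    "text/html");

// Any request sent through this client now receives the canned response
var httpClient = new HttpClient(handlerMock.Object);

var response = await httpClient.GetAsync("https://www.cnn.com/");
string body = await response.Content.ReadAsStringAsync();   // contains the '/Careers' link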
The CommandLineParser component is being used. It interprets the command line arguments with the help of the following model class:
public class CmdLineArgumentModel
{
    [Option("maxsites", Required = true, HelpText = "An upper limit on the number of sites to search. Example: 30")]
    public int MaxSites { get; set; }

    [Option("url", Required = true, HelpText = "The URL of the site to search")]
    public string Url { get; set; }
}
Refer to the project site of this component for more information.
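For reference, a typical way to wire the model into the entry point with this library looks roughly like the sketch below; the RunCrawlerAsync helper (which would resolve and invoke the crawler as sketched earlier) is hypothetical.

using CommandLine;

private static async Task<int> Main(string[] args)
{
    return await Parser.Default.ParseArguments<CmdLineArgumentModel>(args)
        .MapResult(
            options => RunCrawlerAsync(options.Url, options.MaxSites),
            errors => Task.FromResult(1));   // the parser has already printed help/usage text
}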
All C# code is written with the expectation that a concrete implementation of ILogger<T> would be injected via dependency injection.
log4net is being used. In the current implementation, the logging output is displayed on the console.
Logging to the Output window of Visual Studio helps in examining the log output while debugging:
ILogger<SingleThreadedWebSiteCrawler> logger = CreateOutputWindowLogger<SingleThreadedWebSiteCrawler>();
/// <summary>
/// Helps you view logging results in the Output window of Visual Studio
/// </summary>
private ILogger<T> CreateOutputWindowLogger<T>()
{
    var serviceProvider = new ServiceCollection().AddLogging(builder => builder.AddDebug()).BuildServiceProvider();
    return serviceProvider.GetService<ILogger<T>>();
}
- Integration tests on the executable. We should programmatically launch the executable with various combinations of command line arguments and test for the expected output.
- Explore whether we have covered all HTTP status codes while doing a Polly retry.
- A better approach to recording errors. Currently they are only logged. There should be an errors collection, and this should be unit tested.
- A better approach to ignored links. Currently they are only logged. They could be incorporated as another collection in the results object.
The current implementation is single threaded. This is obviously slow. A more sophisticated implementation would be broadly as follows:
start ---> master thread manages a queue of work items ---> spawns child Tasks ---> each Task pulls the first available item from the work queue, parses its links and adds them to the same queue
Rewriting the component as a multithreaded one is possible, but it makes the coding and testing more complex. And what do we gain? Consider the scenario where we have multiple sites to crawl (www.site1.com, www.site2.com, www.site3.com). Why not create independent threads, one each for www.site1.com, www.site2.com and www.site3.com? This might be a win-win solution.
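For illustration, here is a rough sketch of the queue-based design described above. It is not production code: the class and parameter names are invented, and domain filtering, link normalization, error handling and politeness delays are all omitted.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class MultiThreadedCrawlerSketch
{
    private readonly HttpClient httpClient;
    private readonly IHtmlParser htmlParser;

    public MultiThreadedCrawlerSketch(HttpClient httpClient, IHtmlParser htmlParser)
    {
        this.httpClient = httpClient;
        this.htmlParser = htmlParser;
    }

    public async Task<IReadOnlyCollection<string>> Run(string rootUrl, int maxPagesToSearch, int workerCount = 4)
    {
        var workQueue = new ConcurrentQueue<string>();
        var visited = new ConcurrentDictionary<string, byte>();
        workQueue.Enqueue(rootUrl);

        // Child tasks: each pulls the first available item from the shared queue,
        // parses its links and pushes unseen (assumed absolute) links back onto the same queue
        var workers = Enumerable.Range(0, workerCount).Select(async _ =>
        {
            int idleSpins = 0;
            while (visited.Count < maxPagesToSearch && idleSpins < 20)
            {
                if (!workQueue.TryDequeue(out var url))
                {
                    idleSpins++;              // queue momentarily empty; wait for other workers
                    await Task.Delay(100);
                    continue;
                }
                idleSpins = 0;
                if (!visited.TryAdd(url, 0)) continue;   // another worker already crawled this page

                var html = await httpClient.GetStringAsync(url);
                foreach (var link in htmlParser.GetLinks(html))
                {
                    if (!visited.ContainsKey(link)) workQueue.Enqueue(link);
                }
            }
        }).ToList();

        await Task.WhenAll(workers);
        return visited.Keys.ToList();
    }
}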
We could provide the ability to produce output in JSON format as well. Example:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --format csv
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --format json
We could also save the generated results to a file (a sketch of the extended command line model follows these examples). Example:
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --output c:\temp\results.csv
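One way to support both switches would be to extend the command line model. The --format and --output options below are hypothetical additions, not part of the current code.

public class CmdLineArgumentModel
{
    [Option("maxsites", Required = true, HelpText = "An upper limit on the number of sites to search. Example: 30")]
    public int MaxSites { get; set; }

    [Option("url", Required = true, HelpText = "The URL of the site to search")]
    public string Url { get; set; }

    // Hypothetical additions for the enhancements described above
    [Option("format", Required = false, Default = "csv", HelpText = "Output format: csv or json")]
    public string Format { get; set; }

    [Option("output", Required = false, HelpText = @"Optional file to write the results to. Example: c:\temp\results.csv")]
    public string Output { get; set; }
}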
Imagine crawling a deeply nested web site like www.cnn.com. This could run into thousands of pages. Beginning from the top level every time might not be a very efficient approach; searching in smaller batches (say 100 sites) might be more practical. To handle such a long running process, we could make the SingleThreadedWebSiteCrawler component stateful: pass in the last known state and resume from that point onwards. The queue data structure should be returned along with the results, and the entire results object should be made serializable.
public interface IWebSiteCrawler
{
    Task<SearchResults> Run(SearchResults lastKnownResult, string url, int maxPagesToSearch);
}
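Usage might then look like the following sketch, crawling in batches and persisting the returned state between runs. The batch size and the serialization step are illustrative.

// Sketch: crawl in batches of 100 pages, resuming each time from the previous state.
// Assumes SearchResults (including its internal work queue) is serializable.
SearchResults state = null;
for (int batch = 0; batch < 5; batch++)
{
    state = await crawler.Run(state, "https://www.cnn.com", maxPagesToSearch: 100);
    // 'state' could be serialized to disk or a database here and reloaded later
}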
- We could turn this into an ASP.NET Core worker service.
- The service would run as a web job in Azure or in a container.
- The worker service would passively wait for messages in a queue.
- Results would be written back to a database.
- Such a worker service could be made very resilient by managing state in an external data store. Example: consider incrementally crawling a large web site over several hours. A minimal sketch of such a worker follows.
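The sketch below assumes hypothetical ICrawlRequestQueue and ICrawlResultsRepository abstractions; neither exists in the current code, and the Run call uses the stateful signature proposed above.

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical abstractions, shown for illustration only
public interface ICrawlRequestQueue { Task<CrawlRequest> DequeueAsync(CancellationToken token); }
public interface ICrawlResultsRepository { Task SaveAsync(SearchResults results); }
public class CrawlRequest { public string Url { get; set; } public int MaxSites { get; set; } }

public class CrawlerWorker : BackgroundService
{
    private readonly ICrawlRequestQueue queue;
    private readonly IWebSiteCrawler crawler;
    private readonly ICrawlResultsRepository repository;

    public CrawlerWorker(ICrawlRequestQueue queue, IWebSiteCrawler crawler, ICrawlResultsRepository repository)
    {
        this.queue = queue;
        this.crawler = crawler;
        this.repository = repository;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Passively wait for the next crawl request from the queue
            var request = await queue.DequeueAsync(stoppingToken);

            // A null lastKnownResult starts a fresh crawl; previously saved state
            // could be reloaded from the repository to resume an interrupted crawl
            var results = await crawler.Run(null, request.Url, request.MaxSites);

            // Persist the results (and state) to a database or other external store
            await repository.SaveAsync(results);
        }
    }
}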