Configuration options explained

Pierre-Louis Gottfrois edited this page Aug 28, 2015 · 16 revisions

This page explains in detail each configuration option available in LinkThumbnailer.

redirect_limit

Maximum number of HTTP redirections allowed. If LinkThumbnailer cannot resolve the given URL before redirect_limit is reached, it raises a LinkThumbnailer::RedirectLimit exception.

Default is 3.
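To illustrate the idea, here is a minimal sketch of redirect following with a limit. It does not use the gem or the network: `resolve`, `RedirectLimitReached` and the `redirects` hash are hypothetical stand-ins, and the exact point at which the gem gives up is an assumption.

```ruby
# Hypothetical stand-in for LinkThumbnailer::RedirectLimit.
class RedirectLimitReached < StandardError; end

# `redirects` maps a URL to the URL it redirects to. URLs absent from the
# map are treated as final destinations.
def resolve(url, redirects, redirect_limit: 3)
  hops = 0
  while redirects.key?(url)
    # Assumption: following one more redirect than the limit is an error.
    raise RedirectLimitReached, "gave up after #{hops} redirects" if hops >= redirect_limit

    url = redirects[url]
    hops += 1
  end
  url
end

chain = { 'http://a.test' => 'http://b.test', 'http://b.test' => 'http://c.test' }

puts resolve('http://a.test', chain) # two redirects, within the default limit

begin
  resolve('http://a.test', chain, redirect_limit: 1)
rescue RedirectLimitReached => e
  puts "Redirect limit hit: #{e.message}"
end
```

In an application you would rescue LinkThumbnailer::RedirectLimit in the same way around the call that fetches the page.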

user_agent

You can set the HTTP user agent used to resolve the given URL.

Default is link_thumbnailer.

verify_ssl

You can activate or deactivate SSL verification for every LinkThumbnailer request.

Default is true.

http_timeout

The amount of time in seconds to wait for a connection to be opened. If the HTTP object cannot open a connection in this many seconds, it raises a Net::OpenTimeout exception.

See here for more details.

Default is 5.
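The description above matches the open_timeout setting of Ruby's Net::HTTP. As a rough sketch, assuming http_timeout and verify_ssl are passed to Net::HTTP this way (the `build_http` helper is hypothetical and no connection is opened):

```ruby
require 'net/http'
require 'openssl'

# Build (but do not open) an HTTPS connection configured from the two options.
# Assumption: http_timeout maps to Net::HTTP#open_timeout and verify_ssl
# maps to the OpenSSL verify mode.
def build_http(host, http_timeout: 5, verify_ssl: true)
  http = Net::HTTP.new(host, 443)
  http.use_ssl = true
  http.open_timeout = http_timeout
  http.verify_mode = verify_ssl ? OpenSSL::SSL::VERIFY_PEER : OpenSSL::SSL::VERIFY_NONE
  http
end

http = build_http('example.com')
puts http.open_timeout
```

If opening the connection takes longer than open_timeout seconds, Net::HTTP raises Net::OpenTimeout when the request is started.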

blacklist_urls

This is a list of blacklisted URL patterns (regular expressions) to skip when LinkThumbnailer fetches the website's images. Use this option to filter out advertising images.

The defaults are well-known URLs:

^http://ad\.doubleclick\.net/
^http://b\.scorecardresearch\.com/
^http://pixel\.quantserve\.com/
^http://s7\.addthis\.com/
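The filtering these patterns perform can be sketched in plain Ruby. The `allowed_images` helper and the sample URLs are hypothetical; only the four patterns come from the list above.

```ruby
# Default patterns listed above, written as Ruby regular expressions.
BLACKLIST_URLS = [
  %r{^http://ad\.doubleclick\.net/},
  %r{^http://b\.scorecardresearch\.com/},
  %r{^http://pixel\.quantserve\.com/},
  %r{^http://s7\.addthis\.com/}
].freeze

# Keep only the image URLs that match none of the blacklisted patterns.
def allowed_images(urls, blacklist = BLACKLIST_URLS)
  urls.reject { |url| blacklist.any? { |pattern| url.match?(pattern) } }
end

images = [
  'http://example.com/cover.png',
  'http://ad.doubleclick.net/banner.gif',
  'https://ad.doubleclick.net/banner.gif'
]
puts allowed_images(images)
```

Note that the default patterns are anchored to http://, so the https:// URL in this sample is not filtered. Add your own patterns if you need to cover both schemes.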

attributes

This is a new option introduced in v2 of LinkThumbnailer that lets you state explicitly which attributes you expect to retrieve.

LinkThumbnailer will do its best to find all given attributes in the provided website using the following scrapers (order matters):

  1. OpenGraph protocol scraper
  2. Homemade custom scraper

Currently, only the following attributes are available:

  • title
  • description
  • images
  • videos
  • favicon

See here for more information about each attribute and what it does.

graders

This is a new option introduced with the v2 of LinkThumbnailer allowing you to customize how LinkThumbnailer selects the best description for a given website.

When fetching all possible description candidates for a given website, LinkThumbnailer computes the likelihood of each description being the best one by running every grader against each description.

See here for more information about graders and how to build your own.

The defaults are:

  • Length grader scores the description's length.
  • HtmlAttribute grader scores the class attribute of the description's HTML node.
  • HtmlAttribute grader scores the id attribute of the description's HTML node.
  • Position grader scores descriptions based on the order in which they appear on the page. The first ones are more likely to be reliable descriptions.
  • LinkDensity grader scores the description's link density.

Every grader can specify a probability weight, 1 by default. For example, the Position grader has a built-in weight of 3, since we consider the position of the text to be 3 times more important than its length.
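To make the effect of weights concrete, here is one possible way to combine grader results. This is a sketch under assumptions, not the gem's formula: it assumes each grader returns a probability between 0 and 1 and that a weight acts as an exponent. The `Grader` struct, the `score` method and both scoring lambdas are hypothetical.

```ruby
# Hypothetical grader: a name, a weight, and a lambda returning a probability in 0..1.
Grader = Struct.new(:name, :weight, :scorer) do
  def call(candidate)
    scorer.call(candidate)
  end
end

# Combine grader outputs as a weighted product: each probability is raised
# to the grader's weight, so a weight of 3 counts three times.
def score(candidate, graders)
  graders.reduce(1.0) { |total, grader| total * grader.call(candidate)**grader.weight }
end

graders = [
  # Longer text scores higher, capped at 1.0 (illustrative rule).
  Grader.new(:length, 1, ->(c) { [c[:text].length / 100.0, 1.0].min }),
  # Earlier position scores higher (illustrative rule).
  Grader.new(:position, 3, ->(c) { 1.0 / (c[:position] + 1) })
]

candidates = [
  { text: 'A' * 100, position: 1 },
  { text: 'B' * 50,  position: 0 }
]

best = candidates.max_by { |c| score(c, graders) }
puts best[:position]
```

With these illustrative rules, the shorter text that appears first wins over the longer text that appears second, because the position result is weighted three times as heavily.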

description_min_length

This is a new option introduced with the v2 of LinkThumbnailer allowing you to set the minimum length a description must have to be taken as a candidate.

Default is 25 characters.
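A minimal sketch of the threshold, assuming it is inclusive and applied to the text with surrounding whitespace stripped (both are assumptions, and `description_candidates` is a hypothetical helper):

```ruby
DESCRIPTION_MIN_LENGTH = 25

# Keep only the texts long enough to be considered as a description.
def description_candidates(texts, min_length = DESCRIPTION_MIN_LENGTH)
  texts.map(&:strip).select { |text| text.length >= min_length }
end

texts = ['Read more', '  A paragraph long enough to describe the page.  ']
puts description_candidates(texts)
```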

positive_regex

This is a new option introduced with the v2 of LinkThumbnailer allowing you to customize the words used to score the class and id attributes of an HTML node when using the HtmlAttribute grader. These are the positive keywords.

Default is /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i.

negative_regex

This is a new option introduced with the v2 of LinkThumbnailer allowing you to customize the words used to score the class and id attributes of an HTML node when using the HtmlAttribute grader. These are the negative keywords.

Default is /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget|modal/i.
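The two default expressions can be tried against class or id values in plain Ruby. The `attribute_score` helper and its +1/-1 scoring are hypothetical; the values the HtmlAttribute grader actually uses are not described on this page.

```ruby
POSITIVE_REGEX = /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i
NEGATIVE_REGEX = /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget|modal/i

# Illustrative scoring: +1 for a positive keyword, -1 for a negative one.
# A value matching both ends up at 0, as does a missing or unmatched value.
def attribute_score(value)
  return 0 if value.nil? || value.empty?

  score = 0
  score += 1 if value.match?(POSITIVE_REGEX)
  score -= 1 if value.match?(NEGATIVE_REGEX)
  score
end

%w[article-body sidebar post-comments nav].each do |value|
  puts "#{value}: #{attribute_score(value)}"
end
```

Both expressions match substrings and ignore case, so a value such as post-comments matches a positive and a negative keyword at the same time.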

image_limit

This is a new option introduced with the v2 of LinkThumbnailer allowing you to set the maximum number of images to fetch for a given website. Since fetching image information has a cost (an HTTP request is performed for each image), you should consider setting a limit here.

Please note that when setting an image_limit, the gem can't guarantee to return the "best" image describing the page. If you requested only 5 images and the 6th was the "best" image, it will not be returned. Only fetched images are compared to each other.

Default is 5 images.
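The trade-off described above can be sketched as follows. The `Image` struct, the `best_image` helper and the use of pixel area as the measure of "best" are all hypothetical; the point is only that candidates beyond the limit are never compared.

```ruby
Image = Struct.new(:url, :width, :height) do
  def area
    width * height
  end
end

# Only the first `image_limit` candidates are inspected.
def best_image(candidates, image_limit: 5)
  candidates.first(image_limit).max_by(&:area)
end

# Five small images followed by a sixth, much larger one.
images = (1..5).map { |i| Image.new("http://example.com/#{i}.png", 100 * i, 100) }
images << Image.new('http://example.com/hero.png', 1200, 630)

puts best_image(images).url                 # picks among the first five only
puts best_image(images, image_limit: 6).url # the sixth image is now considered
```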

image_stats

This is a new option introduced with the v2 of LinkThumbnailer allowing you to disable image size and type parsing. To retrieve an image's size and type, LinkThumbnailer performs an HTTP request using the image's URL. This can have a performance impact when parsing many images.

Set the value to false to improve performance by deactivating image stats retrieval.

raise_on_invalid_format

Whether LinkThumbnailer should raise an exception when the Content-Type of the HTTP response is not supported by the gem. Since LinkThumbnailer was built to work on HTML pages, passing a URL pointing to a PDF file, for example, might return unexpected results.
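Putting it together

As a recap, several of the options above could be set in one place. This fragment assumes the `LinkThumbnailer.configure` block shown in the gem's README and requires the gem to be installed. Values are the defaults quoted on this page, except where a comment says otherwise.

```ruby
require 'link_thumbnailer'

LinkThumbnailer.configure do |config|
  config.redirect_limit = 3
  config.user_agent = 'link_thumbnailer'
  config.verify_ssl = true
  config.http_timeout = 5
  # Illustrative subset of the available attributes.
  config.attributes = [:title, :description, :images]
  config.description_min_length = 25
  config.image_limit = 5
  # Illustrative values: the defaults for these two are not stated on this page.
  config.image_stats = true
  config.raise_on_invalid_format = false
end
```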