Integrate Mozilla's Readability.js #42

sebastian-nagel · 2021-04-14T14:24:07Z

- see https://github.com/mozilla/readability
- enabled by the command-line flag --readerView
- strips the boilerplate from text and HTML (done on a clone of the DOM)
- extracts article metadata (author, etc., if available)
- adds a readable 'article' object to records in pages.jsonl
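
The extraction step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `extractArticle` is hypothetical, and `Readability` and `isProbablyReaderable` are passed in as parameters so the sketch does not assume how the library is loaded. It relies only on the documented behaviour that `isProbablyReaderable(document)` returns a boolean and that `new Readability(document).parse()` returns an article object or null.

```javascript
// Sketch only: `extractArticle` is a hypothetical name. The two library
// entry points are injected so any loaded copy of Readability can be used.
function extractArticle(doc, { Readability, isProbablyReaderable }) {
  // Skip pages that Readability does not consider suitable for reader view.
  if (!isProbablyReaderable(doc)) {
    return null;
  }
  // Readability modifies the document it parses, so work on a deep clone
  // and leave the live DOM untouched.
  const clone = doc.cloneNode(true);
  // parse() returns an object with fields such as title, byline, content,
  // textContent, length, excerpt and siteName, or null on failure.
  return new Readability(clone).parse();
}
```

Returning null for non-readerable pages means a page record would simply carry no 'article' object in that case.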

The Readability library is useful for getting clean text from news articles or blog posts; see the example below. Of course, the results are not always perfect.

{
  "id": "...",
  "url": "https://www.theguardian.com/government-computing-network/2011/jul/25/national-archives-web-archiving-project",
  "title": "National Archives pilots council web archiving project | Guardian Government Computing | The Guardian",
  "text": "Advertisement\nNews\nOpinion\nSport\nCulture\nLifestyle\nShow\nMore\nShow More\nNews\nCoronavirus\nWorld news\nUK news\nEnvironment\nScience\nGlobal development\nFootball\nTech\nBusiness\nObituaries\nOpinion\nThe Guardian view\nColumnists\nCartoons\nOpinion videos\nLetters\nSport\nFootball\nCricket\nRugby union\nTennis\nCycling\nF1\nGolf\nUS sports\nCulture\nBooks\nMusic\nTV & radio\nArt & design\nFilm\nGames\nClassical\nStage\nLifestyle\nFashion\nFood\nRecipes\nLove & sex\nHealth & fitness\nHome & garden\nWomen\nMen\nFamily\nTravel\nMoney\nMake a contribution\nSubscribe\nSearch jobs\nHolidays\nDigital Archive\nGuardian Puzzles app\nThe Guardian app\nVideo\nPodcasts\nPictures\nNewsletters\nToday's paper\nInside the Guardian\nThe Observer\nGuardian Weekly\nCrosswords\nSearch jobs\nHolidays\nDigital Archive\nGuardian Puzzles app\nGuardian Government Computing\nThis article is more than\n9 years old\nNational Archives pilots council web archiving project\nThis article is more than 9 years old\nNew project will allow councils to preserve online information\nG\nu\na\nr\nd\ni\na\nn\nG\no\nv\ne\nr\nn\nm\ne\nn\nt\nC\no\nm\np\nu\nt\ni\nn\ng\nMon 25 Jul 2011 12.04 BST\n1\n1\nA web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.\nIt will run the pilots ...",
  "article": {
    "title": "National Archives pilots council web archiving project",
    "byline": null,
    "dir": null,
    "content": "<div id=\"readability-page-1\" class=\"page\"><div><p>A web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.</p><p>It will run the pilots ...",
    "textContent": "A web archiving model that allows local authorities to preserve important online information is to be piloted by the National Archives.It will run the pilots ...",
    "length": 3414,
    "excerpt": "New project will allow councils to preserve online information",
    "siteName": "the Guardian"
  }
}
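
For consumers of pages.jsonl, a record like the one above could be read along these lines. This is a sketch under assumptions: the helper names `bestText` and `parseJsonl` are hypothetical, the sample records are made up for illustration, and the file is assumed to hold one JSON object per line with the header record first.

```javascript
// Hypothetical helper: prefer the Readability text when a page has an
// 'article' object, otherwise fall back to the plain DOM text.
function bestText(record) {
  const article = record.article;
  if (article && typeof article.textContent === "string" && article.textContent.length > 0) {
    return article.textContent;
  }
  return record.text || "";
}

// Hypothetical helper: parse JSONL content, one JSON object per non-empty line.
function parseJsonl(content) {
  return content
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Made-up sample data standing in for the contents of a pages.jsonl file.
const sampleJsonl = [
  JSON.stringify({ format: "json-pages-1.0", id: "pages", title: "All Pages" }),
  JSON.stringify({ id: "1", url: "https://example.com/a", text: "Menu\nBody", article: { textContent: "Body" } }),
  JSON.stringify({ id: "2", url: "https://example.com/b", text: "Plain page" }),
].join("\n");

// Assumption: the first line is the header record, the rest are pages.
const [pagesHeader, ...pageRecords] = parseJsonl(sampleJsonl);
for (const page of pageRecords) {
  console.log(page.url, "=>", bestText(page));
}
```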

emmadickson · 2021-04-23T15:14:22Z

crawler.js

@@ -654,14 +683,20 @@ class Crawler {
      if (!fs.existsSync(this.pagesDir)) {
        fs.mkdirSync(this.pagesDir);
        const header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"};
+        header["hasText"] = this.params.text;
+        header["hasReaderView"] = this.params.readerView;


Instead of adding "hasReaderView", how about using "textSource"? That way we can record whether the text came from readability, the browser DOM, or boilerpipe. The boilerpipe library lives in py-wacz, and the DOM extraction method is used in browsertrix-crawler and in archiveweb.page. This way we can be specific about which method was used for text extraction.

So the current default would be 'browser-dom', and when reader view is set it would instead be 'readability'.

Sure, I can change this. Just to confirm, you mean the following?

header["textSource"] = (this.params.readerView ? "readability" : "browser-dom");

Also: currently, the article object is added to a page only if the reader view is available for it (see isProbablyReaderable). If the intention is to replace title and text with the readerable values, shouldn't the property be set per page?
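
To make the two options in this thread concrete, here is a rough sketch of both. It is illustrative only and not the code that was merged: `makePagesHeader` and `textSourceForPage` are hypothetical names, and `params` stands in for `this.params` in crawler.js.

```javascript
// Header-level variant: one "textSource" value for the whole pages.jsonl,
// derived from the --readerView setting as proposed above.
function makePagesHeader(params) {
  return {
    format: "json-pages-1.0",
    id: "pages",
    title: "All Pages",
    hasText: params.text,
    textSource: params.readerView ? "readability" : "browser-dom",
  };
}

// Per-page variant (hypothetical): record the source on each page, since an
// 'article' object only exists when the page was readerable.
function textSourceForPage(pageRecord) {
  return pageRecord.article ? "readability" : "browser-dom";
}
```

The header-level value says which method was enabled for the crawl, while the per-page value says which method actually produced usable text for a given page.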

emmadickson · 2021-04-23T15:14:59Z

Would you be willing to add tests and a line in the readme where we detail all available flags?

sebastian-nagel · 2021-04-27T14:48:35Z

Sure, I'll add a test. Is it possible to provide a static page source? I could take https://www.iana.org/domains/reserved (Firefox applies the reader view to it), but with a remote page source any subtle change in the page content or layout may break the test.

emmadickson · 2021-06-21T14:28:55Z

@sebastian-nagel sorry for the radio silence. We've been using www.example.com or example.org for tests.

- see https://github.com/mozilla/readability
- if enabled (command-line flag --readerView):
  - remove boilerplate from text and HTML
  - (if available) extract article metadat (author, etc.)
  - add readable 'article' object to page records in pages.jsonl

- indicate in header "textSource" from where the text extract could be taken (via readability or via DOM dump)

- add unit test reading https://www.iana.org/about

sebastian-nagel · 2021-06-22T15:58:11Z

@emmadickson: no problem. sorry as well, some time has passed. I've rebased the branch to the current main branch to start work.

We've been using www.example.com or example.org for tests.

Sure. The point is that the example.com page is so simple that there is no "readable" version of it: only a short and clean text, with no boilerplate to strip. So I've added a unit test reading https://www.iana.org/about. No objection if the tests should be limited to example.com.

stavares843

It's good practice to validate user input and handle any potential errors. In this case, there is no validation of the --readerView flag, and if the user enters an invalid value, the code will still run without showing any warnings or errors. It may be worth adding a check for invalid flag values and showing an error message to the user.
The Readability library is loaded from a local file path, which could cause issues if the file is missing or corrupted. It may be more reliable to include Readability as a package dependency and load it through require() instead of reading from a file.
There are some console log statements in the code, which are useful for debugging during development, but they should be removed before the code is deployed to production.
There is a typo in the commit message: "metadata" is misspelled as "metadat".
Other than these minor issues, the code changes look good.
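
As a rough illustration of the first point, a check for the flag value might look like the sketch below. This is an assumption-laden example: `parseReaderViewFlag` is a hypothetical helper, and how the crawler actually parses its command-line arguments is not shown in this conversation.

```javascript
// Hypothetical helper: normalise the --readerView value to a boolean and
// reject anything that is not clearly true or false.
function parseReaderViewFlag(value) {
  // Assumption: an absent flag arrives as undefined and means "disabled".
  if (value === undefined) return false;
  // Assumption: a bare --readerView arrives as boolean true.
  if (typeof value === "boolean") return value;
  if (value === "true") return true;
  if (value === "false") return false;
  throw new Error(
    `Invalid value for --readerView: ${JSON.stringify(value)} (expected true or false)`
  );
}
```

Throwing on unknown values stops the crawl early instead of silently continuing with reader view disabled.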

emmadickson reviewed Apr 23, 2021

sebastian-nagel force-pushed the mozilla-readability branch from 893f2b5 to 2acffd6 Compare June 22, 2021 12:38

sebastian-nagel added 2 commits June 22, 2021 16:12

Integrate Mozilla's Readibility.js

b0050c2

- indicate in header "textSource" from where the text extract could be taken (via readability or via DOM dump)

Integrate Mozilla's Readibility.js

7ab6709

- add unit test reading https://www.iana.org/about

stavares843 reviewed May 11, 2023