Speedreader refactor readability backend #8349
Conversation
The RcDom is deprecated and not recommended for production use. Servo provides no guarantees about the stability and security of RcDom (servo/html5ever#385).
This only applies to the RcDom.
This breaks the DOM structure and even gets rid of important content spaces. We have a test case to cover an `<a>` being deleted. Found on rpp.pe.
Refactor the test helper functions out of the main module.
This is another step towards algorithmic parity with Readability.js. In the current implementation, we root the new DOM at top scoring candidate. With this patch, we keep 1 top candidate and 4 alternative candidates. If 3/4 of those candidates have a score near the top candidate, we start iterating backwards starting from the top candidate. If there is a shared parent with any 3 of the alternative candidates, we make that the new root. This is important, because some articles on Vanity Fair score very high with a deeply nested top candidate, and real content is deleted from the new DOM.
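As a rough illustration of the alternative-candidate check described above, here is a minimal sketch. The 0.75 "near the top" ratio and the function name are assumptions for illustration, not the patch's actual constants:

```rust
// Hypothetical sketch: decide whether to search for a shared parent among
// the alternative candidates. The SCORE_RATIO cutoff is assumed, not the
// value used in the actual patch.
const SCORE_RATIO: f32 = 0.75;

/// Returns true when at least 3 of the 4 alternative candidates score
/// within SCORE_RATIO of the top candidate, signalling that a shared
/// parent should become the new readable root.
fn should_search_shared_parent(top_score: f32, alternatives: &[f32]) -> bool {
    let near_top = alternatives
        .iter()
        .filter(|&&s| s >= top_score * SCORE_RATIO)
        .count();
    near_top >= 3
}

fn main() {
    // Deeply nested top candidate with strong alternatives: widen the root.
    assert!(should_search_shared_parent(100.0, &[90.0, 85.0, 80.0, 20.0]));
    // Clear winner: keep the top candidate as the root.
    assert!(!should_search_shared_parent(100.0, &[30.0, 20.0, 10.0, 5.0]));
}
```

This kind of check is what prevents the Vanity Fair failure mode: when several candidates cluster near the top score, rooting at only the deepest one discards real content.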
Fix logic errors in the cleaning algorithm.
We are maintaining a very minimal fork of kuchiki-rs at https://github.com/keur/kuchiki/tree/speedreader. We expose two new fields to API users: `score` and `is_candidate`. This removes the need for both the nodes binary tree and the candidates binary tree. Kuchiki has nice helper methods for accessing parents, descendants, and ancestors, further removing the need for the `Path` lookups; the trees indexed by `Path` structs were mostly hacking around the lack of API extensibility.
After the new body node is chosen as the readable root, iterate over its sibling nodes, and append "related" siblings. This could be article preambles, content split out by ads that were removed during the preprocess step, etc.
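A hedged sketch of a sibling-append rule, loosely following the shape of Readability.js, which accepts a sibling whose score exceeds roughly max(10, top_score * 0.2). The constant and helper names here are illustrative, not the patch's actual values:

```rust
// Illustrative sibling-append threshold (constants assumed).
fn sibling_score_threshold(top_score: f32) -> f32 {
    (top_score * 0.2).max(10.0)
}

/// A sibling is "related" (e.g. an article preamble, or content split out
/// by a removed ad) when its score clears the threshold derived from the
/// top candidate's score.
fn is_related_sibling(sibling_score: f32, top_score: f32) -> bool {
    sibling_score >= sibling_score_threshold(top_score)
}

fn main() {
    // An ad-split preamble scoring 25 next to a top candidate of 100 is kept.
    assert!(is_related_sibling(25.0, 100.0));
    // A low-scoring widget is not.
    assert!(!is_related_sibling(5.0, 100.0));
}
```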
It is common for extra data to be in the extracted title, usually metadata after colons or separators (e.g. '|', '-', '»'). Try to strip those out, but be mindful of the resulting word length.
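A minimal sketch of the separator-stripping idea; the separator set and the 3-word minimum are assumptions for illustration, not the patch's actual heuristics:

```rust
// Illustrative separators; the real implementation may handle more,
// including colon-delimited metadata.
const SEPARATORS: [char; 3] = ['|', '-', '»'];

/// Strip trailing metadata after a separator, but only keep the stripped
/// candidate if it still looks like a title (assumed: at least 3 words).
fn clean_title(title: &str) -> String {
    for sep in SEPARATORS {
        if let Some(idx) = title.find(sep) {
            let candidate = title[..idx].trim();
            // Be mindful of the resulting word length: don't strip to a stub.
            if candidate.split_whitespace().count() >= 3 {
                return candidate.to_string();
            }
        }
    }
    title.trim().to_string()
}

fn main() {
    assert_eq!(
        clean_title("The Article Everyone Read | Example News"),
        "The Article Everyone Read"
    );
    // Too short after stripping, so the full title is kept.
    assert_eq!(clean_title("Home - Example News"), "Home - Example News");
}
```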
Some sites, like Slate and the BBC, have <div> elements enclosing single <p> nodes. This prevents the paragraph's score from propagating higher in the DOM during the scoring step. In the pre-process step, replace such a <div> with its single <p> child. TODO: We may also want to consider doing the same thing for divs with no block-level elements. I didn't implement it here because the function _hasChildBlockElement() in Readability.js checks for non-block elements, so the function is confusing and may be buggy? Not sure.
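The replacement condition can be sketched as a check over a node's element children (names assumed; the real implementation operates on DOM nodes, not tag-name slices):

```rust
/// Illustrative check for the pre-process replacement described above:
/// a <div> whose only element child is a single <p> can be replaced by
/// that <p>, letting the paragraph's score propagate up the DOM.
fn should_unwrap_div(child_tags: &[&str]) -> bool {
    matches!(child_tags, ["p"])
}

fn main() {
    assert!(should_unwrap_div(&["p"]));
    assert!(!should_unwrap_div(&["p", "div"]));
    assert!(!should_unwrap_div(&[]));
}
```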
`cargo test` does not work because the integration tests (lib/tests/*) are still using the old implementation.
I'm still going to do another pass on the PR, but added a few comments that can be addressed for now.
#[inline]
Probably we could run benchmarks with and without the inlining? I'm not sure how it impacts performance in this context.
Doubt it makes a difference at all. I just made the smaller, non-recursive functions that get called a lot inline.
I'm all for benchmarking, but let's also make sure the algorithm is fully complete. Once we get great results, then we can do an optimization pass.
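If we do benchmark the inlining later, a quick std-only sanity check could look like the sketch below; the function and workload are illustrative (a real comparison should use a proper harness such as criterion):

```rust
use std::time::Instant;

// A small, hot, non-recursive helper of the kind that was marked #[inline].
#[inline]
fn is_whitespace_run(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_whitespace)
}

fn main() {
    let start = Instant::now();
    let mut hits = 0u32;
    for _ in 0..100_000 {
        if is_whitespace_run("   \t\n") {
            hits += 1;
        }
    }
    // Compile once with and once without #[inline] and compare elapsed times.
    println!("{} hits in {:?}", hits, start.elapsed());
    assert_eq!(hits, 100_000);
}
```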
components/speedreader/rust/lib/src/readability/src/extractor.rs (review thread outdated and resolved)
Force-pushed from 3bcb387 to 5e7f65d.
@gpestana All suggestions made, and fixed up the tests using the markup5ever DOM.
Skipping CI because the Rust tests haven't been hooked up to GN.
I'm marking approve, but please take it with a grain of salt, since my Rust is still very much at a beginner level.
@@ -64,7 +64,7 @@ mod test {
    //test!(engadget);
    //test!(folha);
    //test!(gmw);
    test!(guardian_1);
This test was failing before I made any changes to speedreader. Further, these tests aren't very good: they just have input and expected HTMLs, which will break at any change in the algorithm. I moved unit tests into the extractor.rs file, and those test specific features of the algorithm that shouldn't regress.
The component tests are still failing on my side, e.g.
---- fulltest::folha stdout ----
thread 'fulltest::folha' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', tests/lib.rs:256:5
As far as I can tell, the directory in `static SAMPLES_PATH: &str = "data/tests-samples/";` does not exist.
It seems to me that having these tests would be good. It is true that changing the algorithm will most likely make the tests fail, but that would force the developer to make sure that the end result when parsing DOMs from well-known pages is as expected after the changes.
I feel that this e2e test approach fits this library nicely. Would you still prefer to keep only unit tests?
Some nits here and there and a couple of questions, but in general the code looks good. All things considered, I believe we should keep the E2E tests (even if we cut some of the test cases).
Also, it would be great to run `cargo fmt`.
@@ -64,7 +64,7 @@ mod test {
    //test!(engadget);
    //test!(folha);
    //test!(gmw);
-   test!(guardian_1);
+   //test!(guardian_1);
    //test!(heise);
    //test!(herald_sun_1);
I'd either make sure the tests are passing or remove them altogether. Unless there's a special reason to keep them commented out, e.g. for checking later or for the development process; if that is the case, I'd add some comments explaining why the tests are commented out other than "these tests don't pass".
But in general, I think that removing the commented tests is the best approach.
* Run `cargo fmt` in the speedreader root
* Add a section to the README about fetching the test data, since it is not retrieved automatically.

Force-pushed from 5a65353 to 63ec571.
@gpestana made all your changes, added comments to every function in
Resolves brave/brave-browser#14787
Submitter Checklist:
* Added the appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
* Ran npm run test -- brave_browser_tests, npm run test -- brave_unit_tests, npm run lint, npm run gn_check, npm run tslint
* git rebase master (if needed)

Reviewer Checklist:
* gn

After-merge Checklist:
* changes have landed on

Test Plan: