Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedreader refactor readability backend #8349

Merged
merged 17 commits into from
Mar 30, 2021

Conversation

kkuehlz
Copy link
Contributor

@kkuehlz kkuehlz commented Mar 24, 2021

Resolves brave/brave-browser#14787

Submitter Checklist:

  • I confirm that no security/privacy review is needed, or that I have requested one
  • There is a ticket for my issue
  • Used Github auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally: npm run test -- brave_browser_tests, npm run test -- brave_unit_tests, npm run lint, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

Kevin Kuehler added 13 commits March 21, 2021 15:44
The RcDom is deprecated and not recommended for production use.

Servo provides no guarantees about the stabality and security of the
Rcdom servo/html5ever#385
This only applies to the Rcdom
This breaks the DOM structure and even gets rid of important content
spaces. We have a test case to cover a   being deleted. Found on
rpp.pe

Refactor the test helper functions out of the main module.
This is another step towards algorithmic parity with Readability.js.

In the current implementation, we root the new DOM at top scoring
candidate. With this patch, we keep 1 top candidate and 4 alternative
candidates. If 3/4 of those candidates have a score near the top
candidate, we start iterating backwards starting from the top candidate.
If there is a shared parent with any 3 of the alternative candidates, we
make that the new root.

This is important, because some articles on Vanity Fair score very high
with a deeply nested top candidate, and real content is deleted from the
new DOM.
Fix logic errors in the cleaning algorithm.
We are maintaining a very minimal fork of kuchiki-rs at
https://github.com/keur/kuchiki/tree/speedreader. We expose two new
fields to API users: `score` and `is_candidate`. This removes the need
for both the nodes binary tree and the candidates binary tree. Kuchiki
has nice helper methods for accessing parents, descendants, and
ancestors, further removing the need for the Paths. The trees indexed by
`Path` structs were mostly hacking around the lack of API extensibility.
After the new body node is chosen as the readable root, iterate over
its sibling nodes, and append "related" siblings. This could be article
preambles, content split out by ads that were removed during the
preprocess step, etc.
It is common for extra data to be in the title extracted title. It's
usually metadata after colons, separators (e.g '|', '-', '»', etc). Try
and strip those out, be be mindful of the resulting word length.
Some sites, like Slate and the BBC, have <div> elements enclosing single
<p> nodes. This weakens the paragraph's score from propagating higher in
the DOM during the scoring step. In the pre-process step, make this
replacement.

TODO: We may also want to consider doing the same thing for divs with no
block-level elements. I didn't implement it here because the function
_hasChildBlockElement() in Readability.js checks for non-block elements,
so the function in confusing and may be buggy? Not sure.
Copy link
Contributor

@gpestana gpestana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo test does not work because the integration tests (lib/tests/*) are still using the old implementation.

I'm still going to do another pass on the PR, but added a few comments that can be addressed for now.

}
}
}
}

#[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we could run benchmarks with and without the inlining? I'm not sure how it impacts performance in this context.

Copy link
Contributor Author

@kkuehlz kkuehlz Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubt it makes a difference at all. I just made the smaller, non-recursive functions that get called a lot inline.

Copy link
Contributor Author

@kkuehlz kkuehlz Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for benchmarking, but let's also make sure the algorithm is fully complete. Once we get great results, then we can do an optimization pass.

components/speedreader/rust/lib/src/readability/src/dom.rs Outdated Show resolved Hide resolved
components/speedreader/rust/lib/src/readability/src/dom.rs Outdated Show resolved Hide resolved
@kkuehlz kkuehlz force-pushed the speedreader_refactor_readability_backend branch from 3bcb387 to 5e7f65d Compare March 26, 2021 21:05
@kkuehlz
Copy link
Contributor Author

kkuehlz commented Mar 26, 2021

@gpestana All suggestions made, and fixed up the tests using the markup5ever dom

@kkuehlz kkuehlz requested a review from gpestana March 26, 2021 21:06
@kkuehlz kkuehlz added the CI/skip Do not run CI builds (except noplatform) label Mar 26, 2021
@kkuehlz
Copy link
Contributor Author

kkuehlz commented Mar 26, 2021

Skipping CI because the rust tests haven't been hooked up to GN.

Copy link
Contributor

@pes10k pes10k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm marking approve, but please take with a grain of salt, since my rust is very very baby level

@@ -64,7 +64,7 @@ mod test {
//test!(engadget);
//test!(folha);
//test!(gmw);
test!(guardian_1);
Copy link
Contributor Author

@kkuehlz kkuehlz Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test was failing before i made any changes to speedreader. further, these tests aren't very good. they just have input and expected htmls, which will break at any change in the algorithm. i moved unit tests into the extractor.rs file, and those test specific features of the algorithm that shouldn't regress.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The component tests are still failing on my side, e.g.

---- fulltest::folha stdout ----
thread 'fulltest::folha' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', tests/lib.rs:256:5  

As far as I can tell, the static SAMPLES_PATH: &str = "data/tests-samples/"; does not exist

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that having these tests would be good. It is true that changing the algorithm will most likely make the tests failing, but that would enforce the developer to make sure that the end result when parsing DOMs from well known pages is as expected after the changes.

I feel that this e2e test approach fits this library nicely. Would you still prefer to keep only unit tests?

Copy link
Contributor

@gpestana gpestana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits here and there and a couple of questions, but in general the code looks good. All things considered, I believe we should keep the E2E tests (even if we cut some of the test cases).

Also it would be great to run cargo fmt

@@ -64,7 +64,7 @@ mod test {
//test!(engadget);
//test!(folha);
//test!(gmw);
test!(guardian_1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The component tests are still failing on my side, e.g.

---- fulltest::folha stdout ----
thread 'fulltest::folha' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', tests/lib.rs:256:5  

As far as I can tell, the static SAMPLES_PATH: &str = "data/tests-samples/"; does not exist

@@ -64,7 +64,7 @@ mod test {
//test!(engadget);
//test!(folha);
//test!(gmw);
test!(guardian_1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that having these tests would be good. It is true that changing the algorithm will most likely make the tests failing, but that would enforce the developer to make sure that the end result when parsing DOMs from well known pages is as expected after the changes.

I feel that this e2e test approach fits this library nicely. Would you still prefer to keep only unit tests?

@@ -64,7 +64,7 @@ mod test {
//test!(engadget);
//test!(folha);
//test!(gmw);
test!(guardian_1);
//test!(guardian_1);
//test!(heise);
//test!(herald_sun_1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd either make sure the tests are passing or remove them altogether. Unless there's a special reason to keep them commented out e.g. for checking later or for the development process, perhaps? -- if that is the case, I'd add some comments explaining why the tests are commented out other than "these tests don't pass".

But in general, I think that removing the commented tests is the best approach.

components/speedreader/rust/lib/src/readability/src/dom.rs Outdated Show resolved Hide resolved
Kevin Kuehler added 3 commits March 30, 2021 14:12
* Run `cargo fmt` in the speedreader root
* Add a section to the README about fetching the test data, since it is
  not retrieved automatically.
@kkuehlz kkuehlz force-pushed the speedreader_refactor_readability_backend branch from 5a65353 to 63ec571 Compare March 30, 2021 21:21
@kkuehlz
Copy link
Contributor Author

kkuehlz commented Mar 30, 2021

@gpestana made all your changes, added comments to every function in scorer.rs and dom.rs. If you grab the test data (discussed in DM and added instructions to README), there are still many integration tests that will run. I agree we should have E2E tests in the readability subtree too, but will fix that in future PR since this one is getting huge. The only test that I commented out has been failing since before I made any changes, so we'll refactor everything in there in another pass.

@kkuehlz kkuehlz merged commit c358844 into master Mar 30, 2021
@kkuehlz kkuehlz deleted the speedreader_refactor_readability_backend branch March 30, 2021 21:49
@kkuehlz kkuehlz added this to the 1.24.x - Nightly milestone Mar 30, 2021
kylehickinson pushed a commit that referenced this pull request Jan 4, 2024
* Transaction Activity tab v2 UI

* Resolve `TransactionParserTests`, update `TransactionActivityStoreTests` for date grouping
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/skip Do not run CI builds (except noplatform)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor speedreader readability backend
3 participants