Speedreader refactor readability backend #8349
Conversation
The RcDom is deprecated and not recommended for production use. Servo provides no guarantees about the stability and security of RcDom (servo/html5ever#385).
This only applies to the RcDom.
This breaks the DOM structure and even gets rid of important content spaces. We have a test case to cover an `<a>` being deleted. Found on rpp.pe.
Refactor the test helper functions out of the main module.
This is another step towards algorithmic parity with Readability.js. In the current implementation, we root the new DOM at top scoring candidate. With this patch, we keep 1 top candidate and 4 alternative candidates. If 3/4 of those candidates have a score near the top candidate, we start iterating backwards starting from the top candidate. If there is a shared parent with any 3 of the alternative candidates, we make that the new root. This is important, because some articles on Vanity Fair score very high with a deeply nested top candidate, and real content is deleted from the new DOM.
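As a rough illustration of the alternative-candidate check described above, here is a minimal sketch. The 0.75 "near the top" ratio and the function name are assumptions for illustration, not the patch's actual constants:

```rust
// Hypothetical sketch: decide whether to search for a shared parent among
// the alternative candidates. The SCORE_RATIO cutoff is assumed, not the
// value used in the actual patch.
const SCORE_RATIO: f32 = 0.75;

/// Returns true when at least 3 of the 4 alternative candidates score
/// within SCORE_RATIO of the top candidate, signalling that a shared
/// parent should become the new readable root.
fn should_search_shared_parent(top_score: f32, alternatives: &[f32]) -> bool {
    let near_top = alternatives
        .iter()
        .filter(|&&s| s >= top_score * SCORE_RATIO)
        .count();
    near_top >= 3
}

fn main() {
    // Deeply nested top candidate with strong alternatives: widen the root.
    assert!(should_search_shared_parent(100.0, &[90.0, 85.0, 80.0, 20.0]));
    // Clear winner: keep the top candidate as the root.
    assert!(!should_search_shared_parent(100.0, &[30.0, 20.0, 10.0, 5.0]));
}
```

This kind of check is what prevents the Vanity Fair failure mode: when several candidates cluster near the top score, rooting at only the deepest one discards real content.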
Fix logic errors in the cleaning algorithm.
We are maintaining a very minimal fork of kuchiki-rs at https://github.com/keur/kuchiki/tree/speedreader. We expose two new fields to API users: `score` and `is_candidate`. This removes the need for both the nodes binary tree and the candidates binary tree. Kuchiki has nice helper methods for accessing parents, descendants, and ancestors, further removing the need for the `Path` lookups; the trees indexed by `Path` structs were mostly hacking around the lack of API extensibility.
After the new body node is chosen as the readable root, iterate over its sibling nodes, and append "related" siblings. This could be article preambles, content split out by ads that were removed during the preprocess step, etc.
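A hedged sketch of a sibling-append rule, loosely following the shape of Readability.js, which accepts a sibling whose score exceeds roughly max(10, top_score * 0.2). The constant and helper names here are illustrative, not the patch's actual values:

```rust
// Illustrative sibling-append threshold (constants assumed).
fn sibling_score_threshold(top_score: f32) -> f32 {
    (top_score * 0.2).max(10.0)
}

/// A sibling is "related" (e.g. an article preamble, or content split out
/// by a removed ad) when its score clears the threshold derived from the
/// top candidate's score.
fn is_related_sibling(sibling_score: f32, top_score: f32) -> bool {
    sibling_score >= sibling_score_threshold(top_score)
}

fn main() {
    // An ad-split preamble scoring 25 next to a top candidate of 100 is kept.
    assert!(is_related_sibling(25.0, 100.0));
    // A low-scoring widget is not.
    assert!(!is_related_sibling(5.0, 100.0));
}
```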
It is common for extra data to be in the extracted title, usually metadata after colons or separators (e.g. '|', '-', '»'). Try to strip those out, but be mindful of the resulting word length.
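A minimal sketch of the separator-stripping idea; the separator set and the 3-word minimum are assumptions for illustration, not the patch's actual heuristics:

```rust
// Illustrative separators; the real implementation may handle more,
// including colon-delimited metadata.
const SEPARATORS: [char; 3] = ['|', '-', '»'];

/// Strip trailing metadata after a separator, but only keep the stripped
/// candidate if it still looks like a title (assumed: at least 3 words).
fn clean_title(title: &str) -> String {
    for sep in SEPARATORS {
        if let Some(idx) = title.find(sep) {
            let candidate = title[..idx].trim();
            // Be mindful of the resulting word length: don't strip to a stub.
            if candidate.split_whitespace().count() >= 3 {
                return candidate.to_string();
            }
        }
    }
    title.trim().to_string()
}

fn main() {
    assert_eq!(
        clean_title("The Article Everyone Read | Example News"),
        "The Article Everyone Read"
    );
    // Too short after stripping, so the full title is kept.
    assert_eq!(clean_title("Home - Example News"), "Home - Example News");
}
```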
Some sites, like Slate and the BBC, have <div> elements enclosing single <p> nodes. This prevents the paragraph's score from propagating higher in the DOM during the scoring step. In the pre-process step, replace such a <div> with its single <p> child. TODO: We may also want to consider doing the same thing for divs with no block-level elements. I didn't implement it here because the function _hasChildBlockElement() in Readability.js checks for non-block elements, so the function is confusing and may be buggy? Not sure.
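The replacement condition can be sketched as a check over a node's element children (names assumed; the real implementation operates on DOM nodes, not tag-name slices):

```rust
/// Illustrative check for the pre-process replacement described above:
/// a <div> whose only element child is a single <p> can be replaced by
/// that <p>, letting the paragraph's score propagate up the DOM.
fn should_unwrap_div(child_tags: &[&str]) -> bool {
    matches!(child_tags, ["p"])
}

fn main() {
    assert!(should_unwrap_div(&["p"]));
    assert!(!should_unwrap_div(&["p", "div"]));
    assert!(!should_unwrap_div(&[]));
}
```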
`cargo test` does not work because the integration tests (lib/tests/*) are still using the old implementation.
I'm still going to do another pass on the PR, but added a few comments that can be addressed for now.
#[inline]
Probably we could run benchmarks with and without the inlining? I'm not sure how it impacts performance in this context.
Doubt it makes a difference at all. I just made the smaller, non-recursive functions that get called a lot inline.
I'm all for benchmarking, but let's also make sure the algorithm is fully complete. Once we get great results, then we can do an optimization pass.
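If we do benchmark the inlining later, a quick std-only sanity check could look like the sketch below; the function and workload are illustrative (a real comparison should use a proper harness such as criterion):

```rust
use std::time::Instant;

// A small, hot, non-recursive helper of the kind that was marked #[inline].
#[inline]
fn is_whitespace_run(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_whitespace)
}

fn main() {
    let start = Instant::now();
    let mut hits = 0u32;
    for _ in 0..100_000 {
        if is_whitespace_run("   \t\n") {
            hits += 1;
        }
    }
    // Compile once with and once without #[inline] and compare elapsed times.
    println!("{} hits in {:?}", hits, start.elapsed());
    assert_eq!(hits, 100_000);
}
```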
components/speedreader/rust/lib/src/readability/src/extractor.rs (review thread outdated and resolved)
Force-pushed from 3bcb387 to 5e7f65d.
@gpestana All suggestions made, and fixed up the tests using the markup5ever DOM.
Skipping CI because the Rust tests haven't been hooked up to GN.
I'm marking approve, but please take it with a grain of salt, since my Rust is still very much at a beginner level.
@@ -64,7 +64,7 @@ mod test {
    //test!(engadget);
    //test!(folha);
    //test!(gmw);
    test!(guardian_1);
This test was failing before I made any changes to speedreader. Further, these tests aren't very good: they just have input and expected HTMLs, which will break at any change in the algorithm. I moved unit tests into the extractor.rs file, and those test specific features of the algorithm that shouldn't regress.
The component tests are still failing on my side, e.g.
---- fulltest::folha stdout ----
thread 'fulltest::folha' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', tests/lib.rs:256:5
As far as I can tell, the directory in `static SAMPLES_PATH: &str = "data/tests-samples/";` does not exist.
It seems to me that having these tests would be good. It is true that changing the algorithm will most likely make the tests fail, but that would force the developer to make sure that the end result when parsing DOMs from well-known pages is as expected after the changes.
I feel that this e2e test approach fits this library nicely. Would you still prefer to keep only unit tests?
Some nits here and there and a couple of questions, but in general the code looks good. All things considered, I believe we should keep the E2E tests (even if we cut some of the test cases).
Also, it would be great to run `cargo fmt`.
@@ -64,7 +64,7 @@ mod test {
    //test!(engadget);
    //test!(folha);
    //test!(gmw);
-   test!(guardian_1);
+   //test!(guardian_1);
    //test!(heise);
    //test!(herald_sun_1);
I'd either make sure the tests are passing or remove them altogether. Unless there's a special reason to keep them commented out, e.g. for checking later or for the development process; if that is the case, I'd add some comments explaining why the tests are commented out other than "these tests don't pass".
But in general, I think that removing the commented tests is the best approach.
* Run `cargo fmt` in the speedreader root
* Add a section to the README about fetching the test data, since it is not retrieved automatically.

Force-pushed from 5a65353 to 63ec571.
@gpestana made all your changes, added comments to every function in
Resolves brave/brave-browser#14787
Submitter Checklist:
* Added the appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
* Ran npm run test -- brave_browser_tests, npm run test -- brave_unit_tests, npm run lint, npm run gn_check, npm run tslint
* git rebase master (if needed)

Reviewer Checklist:
* gn

After-merge Checklist:
* changes have landed on

Test Plan: