The pagination side of Diffbot is buggy at best. It often fails to recognize multi-page articles and does not merge them. What's more, it tops out at 20 pages, so anything beyond that gets ignored.
The feature suggestion for the client is as follows:
Add a new method to the Article API: paginateBy. This method takes two arguments: $identifier and $maxPages. The former is a way to identify the nextPage link element on the page. This element would be auto-processed to find all the next pages programmatically. The latter is the maximum number of pages to concatenate.
This method would, in order:
1. Make an Article API request to the original URL.
2. Find the nextPage element and derive from it the URL pattern to which incrementing page numbers are attached, thus generating the URLs of the next pages.
3. Make an additional Article API request to each page, up to $maxPages pages.
4. Concatenate the HTML content of all pages.
5. Send the merged HTML content as a POST request to the Article API, for a final analysis of the entire post.
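The steps above can be sketched roughly as follows. The client itself is PHP; this is a language-neutral illustration in Python, not the client's actual API. Every name here (`derive_page_urls`, `paginate_by`, `find_next_href`, `analyze_url`, `analyze_html`) is hypothetical, and the Article API calls are passed in as plain callables rather than implemented. The sketch assumes the page number is the last run of digits in the nextPage link and that numbers are not zero-padded.

```python
import re
from typing import Callable, List, Optional
from urllib.parse import urljoin


def derive_page_urls(base_url: str, next_href: str, max_pages: int) -> List[str]:
    """Generate URLs for the pages after the first one, up to max_pages total.

    Assumption: the page number is the last run of digits in the nextPage link.
    """
    next_url = urljoin(base_url, next_href)
    matches = list(re.finditer(r"\d+", next_url))
    if not matches:
        return []  # no numeric pattern to increment
    last = matches[-1]
    first_next = int(last.group())
    prefix, suffix = next_url[:last.start()], next_url[last.end():]
    # The original URL counts as page 1, so at most max_pages - 1 more pages.
    return [f"{prefix}{n}{suffix}" for n in range(first_next, first_next + max_pages - 1)]


def paginate_by(
    url: str,
    find_next_href: Callable[[str], Optional[str]],   # hypothetical: locates the nextPage link
    max_pages: int,
    analyze_url: Callable[[str], Optional[str]],      # hypothetical: Article API request by URL
    analyze_html: Callable[[str, str], str],          # hypothetical: Article API POST of HTML
) -> str:
    parts = [analyze_url(url)]                        # step 1
    next_href = find_next_href(url)                   # step 2
    if next_href:
        for page_url in derive_page_urls(url, next_href, max_pages):
            html = analyze_url(page_url)              # step 3
            if html is None:                          # ran out of pages early
                break
            parts.append(html)
    merged = "\n".join(p for p in parts if p)         # step 4
    return analyze_html(url, merged)                  # step 5
```

Stopping at the first missing page is a design choice of this sketch: it keeps $maxPages as an upper bound rather than a required count.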
Alternatively, in order to save Article API requests and use up only one, the client could just fetch the raw HTML of all the pages with Guzzle, extract the content HTML, merge it, and send the result as a POST. This, however, is less reliable, as Diffbot is much better at figuring out what is content on the page and what isn't (headers, ads, comments, etc.).
Maybe make it a switch of some kind, with an additional setter?
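One way such a switch might look, again as a language-neutral Python sketch rather than the PHP client's real interface. The class `PageMerger`, its setter `set_api_per_page`, and the injected callables `analyze_url`, `fetch_raw`, and `extract_content` are all hypothetical names introduced for illustration. The raw-fetch path stands in for what the client would do with Guzzle.

```python
from typing import Callable, Iterable


class PageMerger:
    """Merges page content using one of two strategies, selected by a setter."""

    def __init__(
        self,
        analyze_url: Callable[[str], str],      # hypothetical: one Article API request per page
        fetch_raw: Callable[[str], str],        # hypothetical: plain HTTP fetch of a page
        extract_content: Callable[[str], str],  # hypothetical: local content extraction
    ) -> None:
        self._analyze_url = analyze_url
        self._fetch_raw = fetch_raw
        self._extract_content = extract_content
        self._api_per_page = True  # default: the more reliable, API-heavy path

    def set_api_per_page(self, enabled: bool) -> "PageMerger":
        """Toggle between per-page API requests and raw fetching."""
        self._api_per_page = enabled
        return self  # allow chaining

    def merged_html(self, urls: Iterable[str]) -> str:
        if self._api_per_page:
            parts = [self._analyze_url(u) for u in urls]
        else:
            parts = [self._extract_content(self._fetch_raw(u)) for u in urls]
        return "\n".join(parts)
```

With the switch off, no per-page Article API requests are made, so only the final POST of the merged HTML would count against the API quota.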