I have set up a test page at https://website.test.openzim.org/form-get.html
This is a simplified version of something we have encountered in the wild on two occasions.
First on https://chopin.lib.uchicago.edu/. If you open any title and its scores, you will see a combobox in the top right corner, which is a form containing a select element. This website also has prev/next links, so the combobox is not the only navigation option and all pages are crawled.
Second on https://medecine-integree.com/ (we have only been approached by a user of this website; we do not have the right to copy it ... yet, at least). This website has a "Tous nos articles" combobox, which is the only means of reaching the pages behind it.
These comboboxes are simply used to "easily" generate a GET request with a given query parameter to load the proper page.
This is what has been reproduced at https://website.test.openzim.org/form-get.html
Currently, Browsertrix Crawler does not extract links from this kind of form / combobox, as can be seen by running the following command:
It would help to be able to automatically crawl these.
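To make the request concrete, here is a minimal sketch (in Python, not Browsertrix Crawler code) of how the target URLs could be derived from such a form. The markup it expects is an assumption: a form submitted with GET that contains a named select whose options carry explicit `value` attributes. The class and function names are hypothetical, and other form fields (hidden inputs, for instance) are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit


class GetFormLinkParser(HTMLParser):
    """Collect the URLs that a GET form with a <select> would navigate to."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.action = None  # resolved action URL while inside a GET form
        self.field = None   # name of the <select> currently being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A form without a method attribute is submitted with GET.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                # An empty or missing action targets the current page.
                self.action = urljoin(self.page_url, attrs.get("action") or "")
            else:
                self.action = None
        elif tag == "select" and self.action is not None:
            self.field = attrs.get("name")
        elif tag == "option" and self.action and self.field:
            value = attrs.get("value")
            if value:  # skip placeholder options with no or empty value
                query = urlencode({self.field: value})
                # Submitting a GET form replaces the query of the action URL.
                url = urlsplit(self.action)._replace(query=query).geturl()
                self.links.append(url)

    def handle_endtag(self, tag):
        if tag == "form":
            self.action = None
            self.field = None
        elif tag == "select":
            self.field = None


def extract_get_form_links(page_url, html):
    """Return one URL per selectable option of the GET forms in `html`."""
    parser = GetFormLinkParser(page_url)
    parser.feed(html)
    return parser.links
```

With hypothetical markup such as `<form action="/show.php"><select name="id"><option value="2">Two</option></select></form>` served from `https://example.org/`, this yields `https://example.org/show.php?id=2`, which is the URL the browser would request when that option is submitted.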
Or is there a way I missed to customize Browsertrix Crawler with some JS code that customizes link extraction? Something like custom behaviors, although those are only aimed at loading more resources of a given page, not at extracting new URLs to crawl, if I'm not mistaken.
For now, I plan to build my own "fake" sitemap to pass to Browsertrix Crawler so that it discovers the proper links.
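A minimal sketch of that workaround, assuming the list of page URLs has already been collected by some other means: it writes them into a standard sitemap document using only the Python standard library. The function name and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as UTF-8 bytes) listing `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes "&" in query strings when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical URLs standing in for the pages behind the combobox.
    pages = [
        "https://example.org/show.php?id=1",
        "https://example.org/show.php?id=2",
    ]
    with open("sitemap.xml", "wb") as fh:
        fh.write(build_sitemap(pages))
```

The generated file then has to be hosted somewhere the crawler can fetch it, which is the main drawback of this approach compared with extracting the links during the crawl.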