Scrapix V2 #101

qdequele · 2024-10-13T14:43:03Z

Pull Request

Add new crawlers:

Puppeteer: Launch Chrome to load JavaScript. Useful for websites that require JavaScript to display content.
Cheerio (default): Load only the raw HTML without JavaScript. This method is 18 times faster than the previous default.
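
As a rough illustration of the Cheerio default, the sketch below picks a crawler from a config object and falls back to Cheerio when none is given. The `crawler_type` field name and the `CrawlConfig` shape are assumptions made for this example; the PR description does not show the actual config keys.

```typescript
// Hypothetical config shape: the `crawler_type` field name is an assumption,
// not something confirmed by this PR.
type CrawlerType = "cheerio" | "puppeteer";

interface CrawlConfig {
  start_urls: string[];
  crawler_type?: CrawlerType;
}

// Fall back to Cheerio (raw HTML, no JavaScript) when no crawler is specified.
function resolveCrawlerType(config: CrawlConfig): CrawlerType {
  return config.crawler_type ?? "cheerio";
}

// A static site can rely on the default.
const staticSite: CrawlConfig = { start_urls: ["https://example.com"] };

// A site that needs JavaScript to render content opts into Puppeteer.
const jsSite: CrawlConfig = {
  start_urls: ["https://example.com/app"],
  crawler_type: "puppeteer",
};

console.log(resolveCrawlerType(staticSite));
console.log(resolveCrawlerType(jsSite));
```

Keeping the faster crawler as the fallback means only sites that need rendered JavaScript pay the cost of launching Chrome.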

Add new scrapers:

Default: Extract page blocks one by one. Useful for content websites.
Schema: Extract schema data from the meta description. Useful for e-commerce and media sites.
Custom: Use selectors to extract data. Useful for highly personalized websites.
Docsearch: Export a compatible version for Docsearch. Useful for older documents.
Markdown: Export all page content as a single markdown file. Useful for LLMs.
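
The PR description does not show how the Schema scraper locates its data. One common convention for structured data is schema.org JSON-LD inside `<script type="application/ld+json">` tags. Assuming that convention, a minimal sketch of the extraction step might look like the following; `extractJsonLd` is a hypothetical helper, not a function from this PR, and a real scraper would more likely use an HTML parser than a regular expression.

```typescript
// Hypothetical helper: collects JSON-LD objects from raw HTML.
// Assumes structured data lives in <script type="application/ld+json"> tags.
function extractJsonLd(html: string): unknown[] {
  const scriptPattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const items: unknown[] = [];
  let match: RegExpExecArray | null;

  while ((match = scriptPattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1]);
      // A single script tag may hold one object or an array of objects.
      if (Array.isArray(parsed)) {
        items.push(...parsed);
      } else {
        items.push(parsed);
      }
    } catch {
      // Skip blocks that are not valid JSON instead of failing the whole page.
    }
  }
  return items;
}

const sampleHtml = `
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example product"}
</script>
</head><body></body></html>`;

console.log(extractJsonLd(sampleHtml));
```

Skipping invalid blocks keeps a single malformed tag from discarding the rest of a page's structured data.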

Add tests and benchmarks: yarn run tests

Add full documentation

This is the list of features and fixes it should include.

Features

P1

P2

P3

Breaking

Fix

Throw error when redis server is not answering #56


qdequele added 11 commits September 22, 2024 16:47

change to cheerio for scraping, keep puppeteer for crawling.

e49c556

big update

5f461e7

merge the maximum code of the crawlers together

aca8cb1

another big commit.

896e6c1

add markdown scraper

45a203f

add custom scraper

6195809

remove startCrawl; comment playwright

60f2416

fix #99

b3db1bb

update packages

550a4ab

fix #56: Throw error when redis server is not answering

6e17d28

fix #48: add the automatic detection of 404 pages to skip with the possibility to add custom selector for 404 pages.

a79061b

qdequele changed the title ~~New crawlers and scrapers~~ Scrapix V2 Nov 9, 2024

qdequele mentioned this pull request Nov 9, 2024

Scrapix V2 #111

Open

19 tasks

qdequele linked an issue Nov 9, 2024 that may be closed by this pull request

Scrapix V2 #111

Open

19 tasks

qdequele added 4 commits November 9, 2024 18:58

By default use cheerio instead of Puppeteer #113

536326a

fix #112: Remove the useless headless option

ad970e6

remove launcher_option and launcher

49324fc

Update Documentation

c2ba9bf

qdequele self-assigned this Nov 9, 2024

qdequele added the epic label Nov 9, 2024

qdequele added 6 commits November 10, 2024 13:20

fix #103: Keep the previous settings

a43ee93

fix #102: Load the sitemap as starter point for crawling.

f5e9944

add a new playground

8e1adef

extract sitemap

62b8dce

add pdf scraper

2105cc7

Update testing

81be6b4
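
The sitemap-related commits above (fix #102, extract sitemap) use the sitemap as the starting point for crawling. As a sketch of that idea only, the helper below pulls the `<loc>` URLs out of a sitemap XML string so they can seed a crawl queue. `extractSitemapUrls` is a hypothetical name, and the regex-based parsing is an assumption made to keep the example dependency-free; the actual implementation in this PR may differ.

```typescript
// Hypothetical helper: returns every <loc> URL found in a sitemap XML string.
// Note: XML entities such as &amp; are not decoded in this simplified sketch.
function extractSitemapUrls(xml: string): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  const urls: string[] = [];
  let match: RegExpExecArray | null;

  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>`;

// The extracted URLs would act as the initial crawl queue.
const startUrls = extractSitemapUrls(sampleSitemap);
console.log(startUrls);
```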
