Off to a good start writing a fairly complicated crawler #226
dogweather started this conversation in Show and tell · 1 comment · 1 reply
Hi everyone,

The crawl target is a pretty dreadful government website built from awful dynamic JS and MS Word documents. :-)

I'm pretty excited to be giving Crawly a real try. I've already got a lot of Scrapy code, but I'm OK with switching if this Elixir alternative really pans out, or possibly with using both systems for different tasks, since they're so similar from the programmer's point of view.

It's been an easy transition, since Crawly follows many of Scrapy's conventions. I write my spiders a little unconventionally: TDD with tests, a large parser module, and a small spider.
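For anyone curious what that split can look like in Crawly, here is a minimal sketch. The module names, CSS selectors, and URLs (`MyCrawler.GlossaryParser`, `.glossary-entry`, `glossary.example.gov`) are made up for illustration; it assumes Crawly's `Crawly.Spider` behaviour (`base_url/0`, `init/0`, `parse_item/1`) and Floki for HTML parsing.

```elixir
# Hypothetical layout for the "large parser module, small spider" split.
# The parser is pure (HTML string in, plain maps out), so it can be
# unit-tested against saved fixtures without running a crawl.
defmodule MyCrawler.GlossaryParser do
  def parse(html) do
    html
    |> Floki.parse_document!()
    |> Floki.find(".glossary-entry")
    |> Enum.map(fn entry ->
      %{
        term: entry |> Floki.find(".term") |> Floki.text() |> String.trim(),
        definition: entry |> Floki.find(".definition") |> Floki.text() |> String.trim()
      }
    end)
  end
end

# The spider stays small: fetch, delegate to the parser, wrap the result.
defmodule MyCrawler.GlossarySpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://glossary.example.gov"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://glossary.example.gov/terms"]]

  @impl Crawly.Spider
  def parse_item(response) do
    %Crawly.ParsedItem{
      items: MyCrawler.GlossaryParser.parse(response.body),
      requests: []
    }
  end
end
```

Because the parser never touches the network, the TDD loop is an ordinary ExUnit test:

```elixir
defmodule MyCrawler.GlossaryParserTest do
  use ExUnit.Case, async: true

  test "extracts term/definition pairs from an HTML fixture" do
    html = ~S(<div class="glossary-entry">
      <span class="term">Writ</span>
      <span class="definition">A formal written court order.</span>
    </div>)

    assert MyCrawler.GlossaryParser.parse(html) ==
             [%{term: "Writ", definition: "A formal written court order."}]
  end
end
```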
Now, in my Python Scrapy code, I rely on strict type checking to validate the output, e.g. glossary models for parsing online glossaries. I don't yet know how I'll do that in a natural Elixir way.
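On the type-checking question, one lightweight and idiomatic option is a struct with `@enforce_keys` plus a validating constructor. The `GlossaryEntry` module below is a hypothetical sketch meant to mirror what a Scrapy item model provides:

```elixir
# Hypothetical item model: @enforce_keys catches missing fields at
# construction time, and new/1 does the runtime type checks.
defmodule MyCrawler.GlossaryEntry do
  @enforce_keys [:term, :definition]
  defstruct [:term, :definition, source_url: nil]

  @type t :: %__MODULE__{
          term: String.t(),
          definition: String.t(),
          source_url: String.t() | nil
        }

  @spec new(map()) :: {:ok, t()} | {:error, String.t()}
  def new(%{term: term, definition: definition} = attrs)
      when is_binary(term) and is_binary(definition) do
    {:ok,
     %__MODULE__{term: term, definition: definition, source_url: attrs[:source_url]}}
  end

  def new(other), do: {:error, "invalid glossary entry: #{inspect(other)}"}
end
```

For richer coercion and error reporting, Ecto's embedded schemas and changesets also work as a pure validation layer with no database involved, and Dialyzer can check the `@spec` annotations statically.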
Reply:

Maybe it's a task for an item pipeline? E.g., if you could create a pipeline responsible for ensuring the data types, it would solve the problem.
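To make that suggestion concrete, here is a sketch of a custom pipeline, assuming Crawly's `Crawly.Pipeline` behaviour, where `run/3` returns the (possibly transformed) item with the state, or `false` as the item to drop it. It reuses the hypothetical `GlossaryEntry` model sketched above:

```elixir
# Hypothetical validation pipeline: drops any scraped item that fails
# the GlossaryEntry type checks, so only well-typed data is output.
defmodule MyCrawler.Pipelines.ValidateTypes do
  @behaviour Crawly.Pipeline
  require Logger

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    case MyCrawler.GlossaryEntry.new(item) do
      {:ok, entry} ->
        {Map.from_struct(entry), state}

      {:error, reason} ->
        Logger.warning("Dropping item: #{reason}")
        {false, state}
    end
  end
end
```

The pipeline is then registered in the application config alongside Crawly's built-ins:

```elixir
# config/config.exs
config :crawly,
  pipelines: [
    MyCrawler.Pipelines.ValidateTypes,
    Crawly.Pipelines.JSONEncoder
  ]
```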