Crawler skip_uri_patterns is incomplete #35

Open
rmacklin opened this issue Oct 24, 2013 · 1 comment

Comments

@rmacklin

I was using tarantula and noticed that it tried to crawl a "tel:" link and ultimately failed because the path of the parsed URI was nil. A simple fix would be to add tel to the skip_uri_patterns list in Crawler's initialize method. However, the crawler would have the same problem with any other URI scheme that isn't listed in skip_uri_patterns, so a more general approach may be better. Would it make more sense to skip URIs that start with any scheme name, or is there a reason you specifically chose to skip only the javascript, mailto, and http schemes?
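
To illustrate the "skip any scheme" idea, here is a minimal sketch of a check that treats any link beginning with a URI scheme as skippable. The helper name `scheme_prefixed?` and the constant `URI_SCHEME_PATTERN` are hypothetical and not part of tarantula; the pattern follows the scheme syntax in RFC 3986 (a letter, then letters, digits, "+", "-", or ".", then ":"). How such a check would be wired into skip_uri_patterns is left open, since that depends on how Crawler consumes the list.

```ruby
# Hypothetical helper, not tarantula's actual API.
# RFC 3986 scheme syntax: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":".
URI_SCHEME_PATTERN = /\A[a-z][a-z0-9+.\-]*:/i

# Returns true when the link starts with a URI scheme (tel:, mailto:, http:, ...),
# false for relative paths, query-only links, fragments, and nil.
def scheme_prefixed?(href)
  !!(href.to_s.strip =~ URI_SCHEME_PATTERN)
end

links = [
  "tel:+15551234567",
  "mailto:someone@example.com",
  "javascript:void(0)",
  "http://example.com/",
  "/users/1",
  "users?page=2",
  "#top"
]

links.each do |href|
  action = scheme_prefixed?(href) ? "skip" : "crawl"
  puts "#{action}: #{href}"
end
```

One consequence of this approach is that every absolute link is skipped, including absolute links back to the application under test, so it only suits a crawler that follows relative links exclusively.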
