Feature request: Follow Subdomains #45

Open
caigner opened this issue Nov 29, 2018 · 5 comments
Comments

caigner commented Nov 29, 2018

I noticed that CeWL doesn't follow subdomains.

cewl http://www.domain.com

does not traverse into http://sub.domain.com

cewl http://domain.com

does not work. Neither does

cewl http://*.domain.com

It would be nice to have that as an additional feature.

Thanks!
Christian

Owner

digininja commented Nov 29, 2018 via email

Author

caigner commented Nov 29, 2018

I think it would be ok if CeWL followed links which lead to subdomains.
Nothing wrong in spidering
www.mydomain.com
summer.mydomain.com
winter.mydomain.com.

It would be just something in between staying within the domain and going wild by using the -o option. And if you think about it: subdomains are just like subdirectories.

Owner

digininja commented Nov 29, 2018 via email

Owner

digininja commented Dec 3, 2018 via email

Mayyhem commented Sep 28, 2021

@digininja, what if the user were permitted to enable crawling of subdomains by changing the --allowed option so that its regular expression is matched against the full URL rather than only the path?

diff --git a/cewl.rb b/cewl.rb
index f9dfe02..7fd7f78 100755
--- a/cewl.rb
+++ b/cewl.rb
@@ -811,8 +811,8 @@ catch :ctrl_c do
                                                        allow = false
                                                end
 
-                                               if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)
-                                                       puts "Excluding path: #{a_url_parsed.path} based on allowed pattern" if verbose
+                                                if allowed_pattern && !a_url_parsed.to_s.match(allowed_pattern)
+                                                       puts "Excluding URL: #{a_url_parsed.to_s} based on allowed pattern" if verbose
                                                        allow = false
                                                end
                                        end

Then the user could set the -o option and something like "--allowed='(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*'" to allow crawling of other subdomains from the original URL as well as relative paths. It's easy to mess up a regex like this and visit an unintended site, I'll admit, but the user should understand and accept responsibility that they're explicitly allowing offsite spidering when enabling the -o option.
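To illustrate how easily such a pattern can over-match: the `.*\.domain.com` alternative above is not anchored, so a URL on another host that merely contains `.domain.com` in its path or query string could still pass. The following is only a sketch of two stricter checks, not code from cewl.rb. The helper names `allowed_url?` and `same_site?` and the example hosts are hypothetical, and `domain.com` is the placeholder from this thread.

```ruby
require 'uri'

# Hypothetical helper, not part of cewl.rb: a full-URL pattern anchored at
# both ends, with the dots escaped, that accepts domain.com and any
# subdomain of it (optionally with a port and a path).
def allowed_url?(url)
  pattern = %r{\Ahttps?://([a-z0-9-]+\.)*domain\.com(:\d+)?(/.*)?\z}i
  pattern.match?(url)
end

# Hypothetical alternative that avoids a regex: parse the URL and compare
# the host against the base domain.
def same_site?(url, base_domain)
  host = URI.parse(url).host
  return false if host.nil?

  host = host.downcase
  base = base_domain.downcase
  host == base || host.end_with?(".#{base}")
rescue URI::InvalidURIError
  false
end

urls = %w[
  http://www.domain.com/
  https://sub.domain.com/page
  http://domain.com
  http://evildomain.com/
  http://domain.com.evil.org/
]

urls.each do |u|
  puts "#{u} regex=#{allowed_url?(u)} host=#{same_site?(u, 'domain.com')}"
end
```

Comparing the parsed host is arguably harder to get wrong than a hand-written regex, but it would need a dedicated option rather than reusing --allowed, so treat it as one possible design rather than a proposal for the actual patch.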
