Feature request: Follow Subdomains #45

caigner · 2018-11-29T14:09:30Z

I noticed that CeWL doesn't follow subdomains.

cewl http://www.domain.com

does not traverse into http://sub.domain.com

cewl http://domain.com

does not work. Neither does

cewl http://*.domain.com

It would be nice to have that as an additional feature.

Thanks!
Christian

digininja · 2018-11-29T14:11:55Z

That is deliberate. CeWL sticks to the domain it has been asked to spider unless you set the flag to let it go off site. This is to stop it going mad and spidering the whole internet. It may be possible to work out subdomains, but it could get messy and very wide very quickly, so it is not something I'm likely to implement.

caigner · 2018-11-29T14:43:42Z

I think it would be OK if CeWL followed links that lead to subdomains.
There is nothing wrong with spidering
www.mydomain.com
summer.mydomain.com
winter.mydomain.com

It would be something in between staying within the domain and going wild with the -o option. And if you think about it, subdomains are just like subdirectories.
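
To make the "in between" idea concrete, here is a minimal sketch of such a check. It is not CeWL code: `within_domain?` is a hypothetical helper, and it assumes the user states the base domain explicitly rather than having the spider guess it.

```ruby
require 'uri'

# Hypothetical helper, not part of CeWL: true when the link's host is the
# base domain itself or any subdomain of it.
def within_domain?(link, base_domain)
  host = URI.parse(link).host
  return false if host.nil? # relative links carry no host of their own

  host = host.downcase
  base = base_domain.downcase
  # The leading dot stops look-alikes such as notmydomain.com from matching.
  host == base || host.end_with?(".#{base}")
rescue URI::InvalidURIError
  false
end

puts within_domain?('http://summer.mydomain.com/page', 'mydomain.com') # => true
puts within_domain?('http://notmydomain.com/', 'mydomain.com')         # => false
```

A real spider would first resolve relative links against the page they were found on; this sketch simply rejects them.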

digininja · 2018-11-29T14:45:39Z

On some sites they are like subdirectories, but on others they are completely different sites. I'll have a think about it; part of it depends on how easy it is to manipulate the spider into understanding subdomains.

digininja · 2018-12-03T13:24:37Z

I've been thinking about this, and trying to work out parentage is probably going to be too hard. Trying to work out where the domain ends and the TLD starts could get messy and result in scans either exploding or being a lot shorter than expected.
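
A rough illustration of the problem, using a deliberately naive guess that treats the last two labels of a host name as the registered domain. `naive_base_domain` is a hypothetical name for this sketch, not anything in CeWL.

```ruby
# Naive guess: the last two dot-separated labels are "the domain".
def naive_base_domain(host)
  host.downcase.split('.').last(2).join('.')
end

puts naive_base_domain('www.example.com')   # => example.com
puts naive_base_domain('www.example.co.uk') # => co.uk
```

With `co.uk` as the base, a subdomain check would admit every site under `.co.uk`, so the scan explodes. Always taking three labels instead would pin `www.example.com` to itself and cut the scan short. Getting this right in general needs a maintained list of suffixes, such as the Public Suffix List, rather than a rule of thumb.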

Mayyhem · 2021-09-28T16:34:48Z

@digininja, what if the user were permitted to enable crawling of subdomains by modifying the code for the --allowed option so that it matches a regular expression against the full URL rather than only the path?

diff --git a/cewl.rb b/cewl.rb
index f9dfe02..7fd7f78 100755
--- a/cewl.rb
+++ b/cewl.rb
@@ -811,8 +811,8 @@ catch :ctrl_c do
                                                        allow = false
                                                end
 
-                                               if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)
-                                                       puts "Excluding path: #{a_url_parsed.path} based on allowed pattern" if verbose
+                                                if allowed_pattern && !a_url_parsed.to_s.match(allowed_pattern)
+                                                       puts "Excluding URL: #{a_url_parsed.to_s} based on allowed pattern" if verbose
                                                        allow = false
                                                end
                                        end

Then the user could set the -o option and something like "--allowed='(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*'" to allow crawling of other subdomains from the original URL as well as relative paths. I'll admit it's easy to mess up a regex like this and visit an unintended site, but the user should understand and accept responsibility for explicitly allowing offsite spidering when they enable the -o option.
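
As a quick check of how easy it is to slip, the sketch below runs the pattern above against a few URLs, next to a more tightly anchored variant. It assumes the option value ends up in `Regexp.new`; `stricter` is only an illustrative alternative, and unlike the original it does not cover relative paths.

```ruby
# The pattern from the comment, compiled from a string as a command-line
# argument would be (assumption: the value is handed to Regexp.new).
proposed = Regexp.new('(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*')

# Illustrative tighter variant: anchored to the whole string, dots escaped,
# and subdomain labels limited to host-name characters.
stricter = %r{\Ahttps?://(?:[a-z0-9-]+\.)*domain\.com(?::\d+)?(?:/|\z)}i

urls = [
  'http://domain.com/page',            # the original site
  'http://sub.domain.com/page',        # an intended subdomain
  'http://evil.example/x.domain.com/', # "domain.com" appears only in the path
  'http://domainXcom/'                 # slips past an unescaped dot
]

urls.each do |url|
  puts format('%-36s proposed=%-5s stricter=%s',
              url, proposed.match?(url), stricter.match?(url))
end
```

The last two URLs are on other hosts, yet the proposed pattern accepts both, because `.*` can swallow the real host and an unescaped `.` matches any character.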
