Feature request: Follow Subdomains #45
Comments
That is deliberate. CeWL sticks to the domain it has been asked to spider unless you set the -o flag to let it go off-site. This is to stop it going mad and spidering the whole internet.
It may be possible to work out subdomains, but it could get messy and very wide very quickly, so it's not something I'm likely to implement.
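For anyone hitting this later, the behaviour described above looks roughly like this on the command line (the domain names are just placeholders):

```
cewl http://www.domain.com        # default: the spider stays on www.domain.com
cewl -o http://www.domain.com     # -o (offsite) lets the spider follow links to any other site
```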
I think it would be OK if CeWL followed links which lead to subdomains. It would just be something in between staying within the domain and going wild by using the -o option. And if you think about it, subdomains are just like subdirectories.
On some sites they are like subdirectories, but on others they are completely different sites.
I'll have a think about it; part of it depends on how easy the spider is to manipulate to get it to understand subdomains.
I've been thinking about this, and trying to work out parentage is probably going to be too hard. Working out where the domain ends and the TLD starts could get messy and result in scans either exploding or being a lot shorter than expected.
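For reference, the usual way to answer "where does the TLD start" is the Public Suffix List. A rough sketch of what a same-registrable-domain check could look like, assuming the public_suffix gem were added as a dependency (purely an illustration, not anything CeWL does today):

```ruby
# Illustration only: uses the public_suffix gem to find where the
# registrable domain ends and the public suffix ("TLD") begins.
require "public_suffix"

def same_registrable_domain?(host_a, host_b)
  PublicSuffix.domain(host_a) == PublicSuffix.domain(host_b)
end

puts PublicSuffix.domain("summer.mydomain.com")   # registrable domain: mydomain.com
puts PublicSuffix.domain("www.example.co.uk")     # registrable domain: example.co.uk

puts same_registrable_domain?("summer.mydomain.com", "winter.mydomain.com") # true
puts same_registrable_domain?("mydomain.com", "mydomain.org")               # false
```

That still leaves the blow-up problem mentioned above: sites with lots of subdomains would make scans much wider than staying on the one host.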
@digininja, what if the user was permitted to enable crawling of subdomains by modifying the code for the --allowed option argument to match a regular expression for the full URL rather than only the path?

```diff
diff --git a/cewl.rb b/cewl.rb
index f9dfe02..7fd7f78 100755
--- a/cewl.rb
+++ b/cewl.rb
@@ -811,8 +811,8 @@ catch :ctrl_c do
         allow = false
       end
 
-      if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)
-        puts "Excluding path: #{a_url_parsed.path} based on allowed pattern" if verbose
+      if allowed_pattern && !a_url_parsed.to_s.match(allowed_pattern)
+        puts "Excluding URL: #{a_url_parsed.to_s} based on allowed pattern" if verbose
         allow = false
       end
     end
```

Then the user could set the -o option and something like "--allowed='(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*'" to allow crawling of other subdomains from the original URL as well as relative paths. It's easy to mess up a regex like this and visit an unintended site, I'll admit, but the user should understand and accept responsibility that they're explicitly allowing offsite spidering when enabling the -o option.
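As a quick way to sanity-check a pattern like that before pointing CeWL at it, here is a small standalone Ruby snippet (the URLs are made up for illustration; this is not CeWL's own matching code) that shows what the suggested --allowed expression lets through:

```ruby
# Standalone check: which URLs would the suggested --allowed pattern accept?
allowed_pattern = Regexp.new('(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*')

[
  "http://domain.com/",
  "https://sub.domain.com/page",
  "http://www.domain.com/about",
  "/relative/path",
  "http://other-site.com/"
].each do |url|
  verdict = url.match(allowed_pattern) ? "allowed" : "excluded"
  puts "#{verdict}: #{url}"
end
```

The first four should come back as allowed and the last one as excluded, which matches the intent of combining -o with --allowed (the unescaped dots in domain.com are one example of how a pattern like this can match more than intended).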
I noticed that CeWL doesn't follow subdomains.
cewl http://www.domain.com
does not traverse into http://sub.domain.com
cewl http://domain.com
does not work. Neither does
cewl http://*.domain.com
It would be nice to have that as an additional feature.
Thanks!
Christian