Headless crawler #310
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
- Coverage   76.44%   74.82%    -1.63%
==========================================
  Files          92       94        +2
  Lines        8938     9268      +330
==========================================
+ Hits         6833     6935      +102
- Misses       2105     2333      +228

☔ View full report in Codecov by Sentry.
One difficult point is that in headless mode it is hard to tell whether the page has completely loaded or not (https://stackoverflow.com/questions/15122864/selenium-wait-until-document-is-ready).
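For illustration, the usual workaround from that StackOverflow thread is to poll `document.readyState`. A minimal sketch of the idea (shown with Selenium for clarity; this is an assumed approach, not the PR's actual arsenic-based code):

```python
# Wait for the initial document to finish loading.
# Caveat: readyState == "complete" says nothing about XHR/fetch calls the
# page fires afterwards, which is exactly why "fully loaded" is hard to define.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/")
WebDriverWait(driver, timeout=30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
```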
To run the tests:
@bretfourbe could you take a look please?
Hi @devl00p, I tried to build from your branch but I got an error.
Yes, mitmproxy uses urwid as a required dependency; unfortunately it needs gcc.
I also have an issue with
OK, I will do that. I have the following versions from my
@bretfourbe If it looks good to you, I will merge.
Hi @devl00p, I also noticed we need to pin aiohttp to <4 since there is a new pre-release.
@bretfourbe can you run a test on http://angular.testsparker.com/ please? @devl00p got some nice results for this target and I would like your opinion too. Also, @devl00p added extra features to better browse JuiceShop, so please retest if you can. Thanks!
Hi @devl00p, I tried the new mod_wapp and it looks great! Regarding file downloads, it seems that files are still downloaded locally, and the confirmation pop-up holds the focus, so the crawl never ends 😕 Also, can you lock aiohttp<4 since it does not build with aiohttp 4 (pre-release)?
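For reference, locking a dependency below a major version is just an upper-bound version specifier; a sketch of what such a pin could look like in a setuptools-based project (hypothetical excerpt, Wapiti's actual packaging metadata may differ):

```python
# Hypothetical setup.py excerpt; the real dependency list is much longer.
from setuptools import setup

setup(
    name="example",
    install_requires=[
        "aiohttp<4",  # stay below the 4.x pre-releases, which fail to build
    ],
)
```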
…less is activated / completely hide arsenic & marionette logs
@bretfourbe I made some changes. On which website do you have the download issue?
@devl00p it was on a Juice Shop instance; there is a legal.md file which gets downloaded.
You should test with the latest version; some changes have been made to replace download responses with a simple HTML placeholder.
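To illustrate the idea, here is a hedged sketch of how such a replacement could be done with a mitmproxy addon (an assumption about the mechanism, not Wapiti's actual code; the class name and placeholder text are invented):

```python
# Sketch: swap file-download responses for a small HTML placeholder so the
# headless browser never opens a download confirmation pop-up.
from mitmproxy import http

PLACEHOLDER = "<html><body>download removed by crawler</body></html>"

class DropDownloads:
    def response(self, flow: http.HTTPFlow) -> None:
        disposition = flow.response.headers.get("content-disposition", "")
        if disposition.lower().startswith("attachment"):
            del flow.response.headers["content-disposition"]
            flow.response.headers["content-type"] = "text/html"
            flow.response.text = PLACEHOLDER

addons = [DropDownloads()]
```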
@bretfourbe The headless crawler was not using the intercepting proxy for 127.0.0.1 / localhost; this is the default Firefox behavior. I pushed a fix to change the browser configuration.
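Firefox controls this bypass with a preference; a sketch of the kind of configuration change involved (the preference shown and the arsenic capability layout are assumptions about the fix, for illustration only):

```python
# Sketch: pass Firefox prefs through geckodriver capabilities so that
# requests to 127.0.0.1/localhost also go through the intercepting proxy.
from arsenic.browsers import Firefox

browser = Firefox(**{
    "moz:firefoxOptions": {
        "args": ["-headless"],
        "prefs": {
            # do not bypass the proxy for loopback addresses
            "network.proxy.allow_hijacking_localhost": True,
        },
    }
})
```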
@devl00p except for the JSON behavior, it is working pretty well now! I tried it on several apps; it is a big step for SPA application scanning :)
I pushed a fix for the JSON files. Firefox has a default setting that adds a small bar at the top of the window when displaying JSON files. As Wapiti clicks on buttons, it would click on the "Download" button that Firefox added. I was able to deactivate the bar with Firefox settings.
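The bar comes from Firefox's built-in JSON viewer; presumably a preference along these lines was used (an assumption, the actual setting in the fix may differ):

```python
# Hypothetical pref added to the headless Firefox profile:
prefs = {
    "devtools.jsonview.enabled": False,  # raw JSON: no viewer bar, no "Download" button
}
```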
Looks good now :)
Added one more exception catch (asyncio.TimeoutError) because the exception is not caught in Arsenic: HENNGE/arsenic#152 (comment)
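A minimal sketch of that kind of workaround (the helper name is hypothetical; the actual code in the PR may differ):

```python
import asyncio
import logging

log = logging.getLogger(__name__)

async def safe_load(session, url: str) -> None:
    """Navigate to a URL, tolerating the timeout arsenic lets bubble up."""
    try:
        await session.get(url)  # arsenic Session.get drives the browser
    except asyncio.TimeoutError:
        # See HENNGE/arsenic#152: the underlying asyncio timeout is not
        # converted by arsenic, so it must be caught by the caller.
        log.warning("Timeout while loading %s", url)
```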
@bretfourbe I let the crawler explore microsoft.com for an hour and it didn't bump into any errors. Do you need more time for testing or do you think I should merge?
@@ -17,7 +16,7 @@ class CrawlerConfiguration:
    proxy: str = None
    auth_credentials: tuple = tuple()
Wow! There are a lot of improvements!
I think a more robust data structure would be appropriate for this variable. A plain tuple does not offer any guarantee on the number of elements, so explicit checks have to be added. A NamedTuple might be more appropriate here (see the sketch below).
This is a minor change and the PR has a bigger scope. Feel free to ignore this suggestion 😄
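For illustration, a sketch of the reviewer's suggestion (the field names are assumptions, not taken from the PR):

```python
from typing import NamedTuple

class AuthCredentials(NamedTuple):
    username: str
    password: str

# A bare tuple forces arity checks before unpacking:
#     if len(auth_credentials) == 2:
#         username, password = auth_credentials
# The NamedTuple fixes the number of fields and names them:
credentials = AuthCredentials(username="admin", password="secret")
print(credentials.username)  # fields are self-documenting
```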
I don't like that auth_credentials tuple either :D I will work on it later.
Quick question: can we use this headless crawler for authenticated scans? Would this be a "second" step?
Hi @devl00p, I tried an authenticated scan with the following command
Yes. Also, a website may have a login form AND be behind HTTP basic auth, so the option should act differently, but this is out of the scope of this MR. What I can do soon is make sure
…e=post is used without -s
The headless crawler is used for the auth step if the headless option is set. The login URL will be loaded into the crawler, so if the login form is generated dynamically it will be found. However, if the login mechanism works with some custom JS without classic input fields, sending data as JSON/XML, then Wapiti won't be able to log in. That's why I plan to work on a plugin system to allow users to write custom scripts for specific websites.
@bretfourbe pushed some code to fix the missing
I thought the hardest part would be cookie management. And it was 😅
Dev made with: