
Headless crawler #310

Merged
merged 40 commits into from
Sep 20, 2022

Conversation

@devl00p (Contributor) commented Jul 26, 2022

I thought the hardest part would be cookie management. And it was 😅

Dev made with:

  • geckodriver 0.26.0 (e9783a644016 2019-10-10 13:38 +0000)
  • Firefox 102.0.1
  • Python 3.8.13
  • arsenic 21.8
  • mitmproxy 8.0.0

@codecov-commenter commented Jul 26, 2022

Codecov Report

Attention: Patch coverage is 52.31417% with 340 lines in your changes missing coverage. Please review.

Project coverage is 74.82%. Comparing base (9a69242) to head (dfb8df7).
Report is 360 commits behind head on master.

Files with missing lines Patch % Lines
wapitiCore/net/intercepting_explorer.py 17.58% 150 Missing ⚠️
wapitiCore/main/wapiti.py 43.79% 77 Missing ⚠️
wapitiCore/attack/mod_wapp.py 33.80% 47 Missing ⚠️
wapitiCore/net/auth.py 64.36% 31 Missing ⚠️
wapitiCore/net/cookies.py 23.52% 13 Missing ⚠️
wapitiCore/wappalyzer/wappalyzer.py 93.61% 6 Missing ⚠️
wapitiCore/net/async_stickycookie.py 33.33% 4 Missing ⚠️
wapitiCore/net/web.py 69.23% 4 Missing ⚠️
wapitiCore/net/jsoncookie.py 70.00% 3 Missing ⚠️
wapitiCore/main/getcookie.py 50.00% 2 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
- Coverage   76.44%   74.82%   -1.63%     
==========================================
  Files          92       94       +2     
  Lines        8938     9268     +330     
==========================================
+ Hits         6833     6935     +102     
- Misses       2105     2333     +228     


@devl00p (Contributor, Author) commented Jul 28, 2022

One difficult point is that in headless mode it is hard to tell whether the page has completely loaded ( https://stackoverflow.com/questions/15122864/selenium-wait-until-document-is-ready ).
I increased the sleep time between fetching the page and reading the source code to 1 second, but this should be configurable through a command-line option.
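A common alternative to a fixed sleep is to poll `document.readyState` until it reports "complete". The sketch below is not Wapiti's actual implementation, only an illustration of the idea; note that `readyState` covers the initial load but not later XHR activity in SPAs, which is why a sleep can still be needed. The `get_ready_state` callable is an assumption: with arsenic it could wrap something like executing `return document.readyState` in the session.

```python
import asyncio


async def wait_for_page_ready(get_ready_state, timeout: float = 10.0,
                              interval: float = 0.25) -> bool:
    """Poll the page state until it reports "complete" or the timeout expires.

    get_ready_state: async callable returning the current document.readyState.
    Returns True if the page reached "complete" before the deadline.
    """
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        if await get_ready_state() == "complete":
            return True
        await asyncio.sleep(interval)
    return False
```

This bounds the wait instead of always paying the full sleep, while the timeout keeps slow pages from blocking the crawl forever.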

@tarraschk (Member)

For testing:

@bretfourbe could you take a look, please?

@bretfourbe (Collaborator)

Hi @devl00p, I tried to build from your branch but I got an error.
(screenshot: Capture d'écran 2022-08-10 160037)
Do we need the urwid dependency? Do we need to install gcc to build Wapiti now?

@devl00p (Contributor, Author) commented Aug 10, 2022

> Hi @devl00p, I tried to build from your branch but I got an error. Do we need the urwid dependency? Do we need to install gcc to build Wapiti now?

Yes, mitmproxy uses urwid as a required dependency; unfortunately it needs gcc to build.

@bretfourbe (Collaborator)

I also have an issue with pyasn1 0.5.0 (an ldap3 dependency): ImportError: cannot import name 'tagMap' from 'pyasn1.codec.ber.encoder'. We should pin pyasn1 to 0.4.8 since there is no tagMap in version 0.5.0 at the moment: cannatag/ldap3#981

@devl00p (Contributor, Author) commented Aug 11, 2022

> I also have an issue with pyasn1 0.5.0 (an ldap3 dependency): ImportError: cannot import name 'tagMap' from 'pyasn1.codec.ber.encoder'. We should pin pyasn1 to 0.4.8 since there is no tagMap in version 0.5.0 at the moment: cannatag/ldap3#981

Ok, I will do that.

I have the following versions from my pip freeze:

aiocache==0.11.1
aiohttp==3.8.1
aiosignal==1.2.0
aiosqlite==0.17.0
anyio==3.6.1
appdirs==1.4.4
arsenic==21.8
asgiref==3.5.2
astroid==2.11.6
async-timeout==4.0.2
attrs==21.4.0
beautifulsoup4==4.11.1
blinker==1.4
Brotli==1.0.9
browser-cookie3==0.11.4
bs4==0.0.1
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.1.0
click==8.1.3
coverage==6.4
cryptography==36.0.2
cssselect==1.1.0
dill==0.3.5.1
dnspython==2.1.0
fake-useragent==0.1.11
Flask==2.0.3
frozenlist==1.3.0
greenlet==1.1.2
h11==0.12.0
h2==4.1.0
hpack==4.0.0
httpcore==0.15.0
httpx==0.23.0
humanize==3.13.1
hyperframe==6.0.1
idna==3.3
importlib-metadata==3.7.2
iniconfig==1.1.1
isort==5.10.1
itsdangerous==2.1.2
jeepney==0.8.0
Jinja2==3.1.2
kaitaistruct==0.9
keyring==23.7.0
lazy-object-proxy==1.7.1
ldap3==2.9.1
loguru==0.6.0
lxml==4.9.1
lz4==4.0.1
Mako==1.2.1
MarkupSafe==2.1.1
mccabe==0.7.0
mitmproxy==8.0.0
msgpack==1.0.4
multidict==6.0.2
nassl==4.0.2
packaging==21.3
parse==1.19.0
passlib==1.7.4
pbkdf2==1.3
platformdirs==2.5.2
pluggy==1.0.0
protobuf==3.19.4
publicsuffix2==2.20191221
py==1.11.0
pyaes==1.6.1
pyasn1==0.4.8
pycparser==2.21
pycryptodome==3.15.0
pydantic==1.8.2
pyee==8.2.2
pylint==2.14.3
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyperclip==1.8.2
pyppeteer==1.0.2
pyquery==1.4.3
pytest==7.1.2
pytest-asyncio==0.14.0
pytest-cov==3.0.0
requests==2.28.1
requests-html==0.10.0
respx==0.19.2
rfc3986==1.5.0
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.6
SecretStorage==3.3.2
six==1.16.0
sniffio==1.2.0
socksio==1.0.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
SQLAlchemy==1.4.39
sslyze==5.0.1
structlog==20.2.0
tld==0.12.6
tls-parser==1.2.2
tomli==2.0.1
tomlkit==0.11.0
tornado==6.2
tqdm==4.64.0
typing_extensions==4.3.0
urllib3==1.26.10
urwid==2.1.2
w3lib==1.22.0
websockets==10.3
Werkzeug==2.1.2
wrapt==1.14.1
wsproto==1.1.0
yarl==1.7.2
yaswfp==0.9.3
zipp==3.8.1
zstandard==0.17.0

@devl00p (Contributor, Author) commented Aug 16, 2022

@bretfourbe If it looks good to you, I will merge.

@bretfourbe (Collaborator)

Hi @devl00p, I also noticed we need to pin aiohttp to <4 since there is a new pre-release.

@tarraschk (Member) commented Sep 2, 2022

@bretfourbe can you run a test on http://angular.testsparker.com/ please? @devl00p got some nice results on this target and I would also like your opinion.

Also, @devl00p added some extra features to better browse JuiceShop, if you can retest.

Thanks!

@bretfourbe (Collaborator)

Hi @devl00p, I tried the new mod_wapp and it looks great!
As far as I can see, headless mode is now always on for this module. Do you know if it is possible to activate headless mode (and thus JS pattern detection) only when --headless is used on the command line?

Regarding file downloads, it seems that files are still downloaded locally, and the confirmation pop-up holds the focus, so the crawl never ends 😕

Also, can you lock aiohttp to <4 since the project does not build with aiohttp 4 (pre-release)?

@devl00p (Contributor, Author) commented Sep 5, 2022

@bretfourbe I made some changes. On which website do you get the download issue?

@bretfourbe (Collaborator)

@devl00p it was on a Juice Shop instance; there is a legal.md file which gets downloaded.

@devl00p (Contributor, Author) commented Sep 5, 2022

> @devl00p it was on a Juice Shop instance; there is a legal.md file which gets downloaded.

You should test with the latest version: some changes have been made to replace download responses with a simple HTML placeholder.
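The idea of swapping download responses for a placeholder can be sketched as follows. This is not Wapiti's exact code, only an illustration under the assumption that an intercepting mitmproxy addon inspects each response: a `response(flow)` hook could set `flow.response.text = PLACEHOLDER_HTML` whenever the helper below returns True. The helper itself is pure and stdlib-only.

```python
# Placeholder served to the browser instead of a file it would try to download.
PLACEHOLDER_HTML = "<html><body>Download intercepted by the crawler</body></html>"

# Content types a browser typically offers as downloads rather than rendering.
# This list is illustrative, not exhaustive.
DOWNLOAD_CONTENT_TYPES = {
    "application/octet-stream",
    "application/zip",
    "application/pdf",
}


def is_download_response(headers: dict) -> bool:
    """Return True if the response headers indicate a file download.

    headers: lowercase header-name -> value mapping.
    """
    disposition = headers.get("content-disposition", "")
    if "attachment" in disposition.lower():
        return True
    content_type = headers.get("content-type", "")
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in DOWNLOAD_CONTENT_TYPES
```

Replacing the body this way keeps the confirmation pop-up from ever appearing, so the crawler's focus is never stolen.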

@bretfourbe (Collaborator)

Even using the latest version with the following command ./bin/wapiti -u http://localhost:3000/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color --headless visible, I still get legal.md downloaded, blocking the crawler.
(screenshot)
I also noticed a strange behavior with JSON responses: there is a pop-up asking where to download the answer as a JSON file, and the clipboard contains the server response ^^
(screenshot)

@devl00p (Contributor, Author) commented Sep 12, 2022

@bretfourbe The headless crawler was not using the intercepting proxy for 127.0.0.1/localhost; this is Firefox's default behavior.

I pushed a fix to change the browser configuration.
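Firefox exposes preferences to control this; `network.proxy.allow_hijacking_localhost` (Firefox 67+) disables the default rule that localhost traffic bypasses any configured proxy. The sketch below builds such a prefs dictionary; the function name is hypothetical and the PR's actual fix may differ, but the preference names are standard Firefox prefs that geckodriver accepts via the `prefs` key of `moz:firefoxOptions`.

```python
def firefox_proxy_prefs(proxy_host: str, proxy_port: int) -> dict:
    """Firefox preferences forcing all traffic, including localhost,
    through the intercepting proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": proxy_host,
        "network.proxy.http_port": proxy_port,
        "network.proxy.ssl": proxy_host,
        "network.proxy.ssl_port": proxy_port,
        # Without this, Firefox silently skips the proxy for 127.0.0.1/localhost.
        "network.proxy.allow_hijacking_localhost": True,
    }
```

With arsenic, such a dictionary would be passed when building the Firefox browser capabilities so every request is visible to mitmproxy.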

@bretfourbe (Collaborator)

@devl00p except for the JSON behavior, it is working pretty well now! I tried it on several apps; it is a big step for scanning SPA applications :)

@devl00p (Contributor, Author) commented Sep 13, 2022

> @devl00p except for the JSON behavior, it is working pretty well now! I tried it on several apps; it is a big step for scanning SPA applications :)

I pushed a fix for the JSON files. Firefox has a default setting that adds a small bar at the top of the window when displaying JSON files.

As Wapiti clicks on buttons, it would also click the "Download" button that Firefox adds.

I was able to deactivate that bar with Firefox settings.
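The relevant setting is most likely Firefox's built-in JSON viewer preference; disabling it makes JSON responses render as plain text, with no "Download"/"Copy" bar for the crawler to click. This is an assumption about the fix, not a quote from the PR:

```python
# Disable Firefox's built-in JSON viewer (the source of the extra bar with
# the "Download" button); JSON responses then render as plain text.
JSON_VIEWER_PREFS = {"devtools.jsonview.enabled": False}
```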

@bretfourbe (Collaborator)

> I pushed a fix for the JSON files. Firefox has a default setting that adds a small bar at the top of the window when displaying JSON files.
> As Wapiti clicks on buttons, it would also click the "Download" button that Firefox adds.
> I was able to deactivate that bar with Firefox settings.

Looks good now :)

@devl00p (Contributor, Author) commented Sep 14, 2022

Added one more exception catch (asyncio.TimeoutError) because the exception is not caught in Arsenic: HENNGE/arsenic#152 (comment)
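Because arsenic can let a raw asyncio.TimeoutError escape instead of (or alongside) its own exception type, callers need to catch both. A minimal sketch of the pattern, with a hypothetical `safe_get` wrapper rather than Wapiti's actual code:

```python
import asyncio


async def safe_get(session, url: str) -> bool:
    """Navigate to url, treating a WebDriver timeout as a soft failure.

    session is assumed to expose an async get(url) method, as arsenic's
    Session does. Returns False instead of propagating the timeout.
    """
    try:
        await session.get(url)
        return True
    except asyncio.TimeoutError:
        # Not caught inside arsenic itself (HENNGE/arsenic#152), so the
        # crawler must handle it to keep exploring other pages.
        return False
```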

@devl00p (Contributor, Author) commented Sep 14, 2022

@bretfourbe I let the crawler explore microsoft.com for an hour and it didn't run into any errors. Do you need more time for testing, or do you think I should merge?

@@ -17,7 +16,7 @@ class CrawlerConfiguration:
proxy: str = None
auth_credentials: tuple = tuple()
Review comment (Contributor):

Wow! There are a lot of improvements!

I think a more complex data structure would be appropriate for this variable. A plain tuple offers no guarantee about the number of elements, so explicit checks become necessary:

https://github.com/wapiti-scanner/wapiti/pull/310/files#diff-f5a9d807270fdc42bf1ef8d5ed535674fb8467254f7fe2eadbba7c5bb9c68a59R27-R29

A NamedTuple might be more appropriate here.

This is a minor change and the PR has a bigger scope. Feel free to ignore this suggestion 😄
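The reviewer's suggestion could look like the following sketch (field names are illustrative, not taken from the PR): a NamedTuple fixes the arity and names the elements, while still behaving as a tuple for any existing unpacking code.

```python
from typing import NamedTuple


class AuthCredentials(NamedTuple):
    """Named replacement for the bare (username, password) tuple:
    the number of fields is guaranteed, so no length checks are needed."""
    username: str
    password: str


creds = AuthCredentials("admin", "s3cret")
# Access by name instead of positional index:
assert creds.username == "admin"
# Still iterable/unpackable like the original tuple:
user, password = creds
```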

Reply (Contributor, Author):

I don't like that auth_credentials tuple either :D

I will work on it later

@tarraschk (Member)

Quick question: can we use this headless crawler for authenticated scans? Would this be a "second" step?

@bretfourbe (Collaborator)

Hi @devl00p, I tried an authenticated scan with the following command ./bin/wapiti -u http://testphp.vulweb.com/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color -a test%test --auth-type=post and it seems that the -s option is now required.
If I do not specify it, I get the following error:
(screenshot)

@devl00p (Contributor, Author) commented Sep 17, 2022

> Hi @devl00p, I tried an authenticated scan with the following command ./bin/wapiti -u http://testphp.vulweb.com/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color -a test%test --auth-type=post and it seems that the -s option is now required. If I do not specify it, I get the following error: (screenshot)

Yes, -s is required to specify the URL with the --auth-type=post option. Maybe I should add another, more explicit option for that URL, then display an error message if it is missing.

Also, a website may have a login form AND be behind HTTP Basic auth, so the option should behave differently, but this is out of the scope of this PR.

What I can do soon is make sure -s is not missing when --auth-type=post is used and raise a usage error otherwise.
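The planned check can be sketched as a small validation helper that fails early with a usage message. Function and parameter names here are hypothetical and do not match Wapiti's actual CLI code:

```python
from typing import List, Optional


def check_auth_options(auth_type: Optional[str], scope_urls: List[str]) -> None:
    """Raise a usage error when --auth-type=post is given without -s.

    auth_type: value of --auth-type (or None if not given).
    scope_urls: URLs passed via -s, used as the login page for form auth.
    """
    if auth_type == "post" and not scope_urls:
        raise SystemExit(
            "Usage error: --auth-type=post requires -s to specify the login URL"
        )
```

Running this right after argument parsing turns the unhandled traceback into an explicit, actionable error message.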

@devl00p (Contributor, Author) commented Sep 18, 2022

> Quick question: can we use this headless crawler for authenticated scans? Would this be a "second" step?

The headless crawler is used for the auth step if the headless option is set. The login URL will be loaded into the crawler, so if the login form is generated dynamically it will still be found.

However, if the login mechanism works with custom JS, without classic input fields, and sends data as JSON/XML, then Wapiti won't be able to log in. That's why I plan to work on a plugin system to allow users to write custom scripts for specific websites.
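A plugin system like the one described could expose a hook along these lines. This interface is entirely hypothetical (the PR does not include it) and is shown only to illustrate what a site-specific login script might plug into:

```python
from abc import ABC, abstractmethod


class AuthPlugin(ABC):
    """Hypothetical interface for a user-supplied, site-specific login script."""

    @abstractmethod
    async def login(self, session, url: str, username: str, password: str) -> bool:
        """Drive the headless browser session through the site's custom login
        flow (e.g. JS-built forms, JSON/XML payloads); return True on success."""
```

A user would subclass this for a given target, and the crawler would call `login()` instead of its generic form-filling logic.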

@devl00p (Contributor, Author) commented Sep 18, 2022

@bretfourbe I pushed some code to fix the missing -s bug.

@devl00p devl00p merged commit 5c77939 into wapiti-scanner:master Sep 20, 2022
@devl00p devl00p deleted the headless_crawler branch July 4, 2023 15:57