
Headless crawler #310

Merged
merged 40 commits into from
Sep 20, 2022

Conversation

@devl00p (Contributor) commented Jul 26, 2022

I thought the hardest part would be cookie management. And it was 😅

Dev made with:

  • geckodriver 0.26.0 (e9783a644016 2019-10-10 13:38 +0000)
  • Firefox 102.0.1
  • Python 3.8.13
  • arsenic 21.8
  • mitmproxy 8.0.0

@codecov-commenter commented Jul 26, 2022

Codecov Report

Attention: Patch coverage is 52.31417% with 340 lines in your changes missing coverage. Please review.

Project coverage is 74.82%. Comparing base (9a69242) to head (dfb8df7).
Report is 360 commits behind head on master.

Files with missing lines Patch % Lines
wapitiCore/net/intercepting_explorer.py 17.58% 150 Missing ⚠️
wapitiCore/main/wapiti.py 43.79% 77 Missing ⚠️
wapitiCore/attack/mod_wapp.py 33.80% 47 Missing ⚠️
wapitiCore/net/auth.py 64.36% 31 Missing ⚠️
wapitiCore/net/cookies.py 23.52% 13 Missing ⚠️
wapitiCore/wappalyzer/wappalyzer.py 93.61% 6 Missing ⚠️
wapitiCore/net/async_stickycookie.py 33.33% 4 Missing ⚠️
wapitiCore/net/web.py 69.23% 4 Missing ⚠️
wapitiCore/net/jsoncookie.py 70.00% 3 Missing ⚠️
wapitiCore/main/getcookie.py 50.00% 2 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
- Coverage   76.44%   74.82%   -1.63%     
==========================================
  Files          92       94       +2     
  Lines        8938     9268     +330     
==========================================
+ Hits         6833     6935     +102     
- Misses       2105     2333     +228     


@devl00p (Contributor, Author) commented Jul 28, 2022

One difficult point is that in headless mode it is hard to tell whether the page has completely loaded ( https://stackoverflow.com/questions/15122864/selenium-wait-until-document-is-ready ).
I increased the sleep time between fetching the page and reading the source code to 1 second, but this should be configurable through a command-line option.
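A common alternative to a fixed sleep is to poll `document.readyState` until it reports "complete". The sketch below is not Wapiti's actual implementation, only an illustration of the idea; note that `readyState` covers the initial load but not later XHR activity in SPAs, which is why a sleep can still be needed. The `get_ready_state` callable is an assumption: with arsenic it could wrap something like executing `return document.readyState` in the session.

```python
import asyncio


async def wait_for_page_ready(get_ready_state, timeout: float = 10.0,
                              interval: float = 0.25) -> bool:
    """Poll the page state until it reports "complete" or the timeout expires.

    get_ready_state: async callable returning the current document.readyState.
    Returns True if the page reached "complete" before the deadline.
    """
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        if await get_ready_state() == "complete":
            return True
        await asyncio.sleep(interval)
    return False
```

This bounds the wait instead of always paying the full sleep, while the timeout keeps slow pages from blocking the crawl forever.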

@tarraschk (Member)

For testing:

@bretfourbe could you take a look, please?

@bretfourbe (Collaborator)

Hi @devl00p, I tried to build from your branch but I got an error.
(screenshot: Capture d'écran 2022-08-10 160037)
Do we need the urwid dependency? Do we need to install gcc to build Wapiti now?

@devl00p (Contributor, Author) commented Aug 10, 2022

> Hi @devl00p, I tried to build from your branch but I got an error. Do we need the urwid dependency? Do we need to install gcc to build Wapiti now?

Yes, mitmproxy uses urwid as a required dependency; unfortunately it needs gcc to build.

@bretfourbe (Collaborator)

I also have an issue with pyasn1 0.5.0 (an ldap3 dependency): ImportError: cannot import name 'tagMap' from 'pyasn1.codec.ber.encoder'. We should pin pyasn1 to 0.4.8 since there is no tagMap in version 0.5.0 at the moment: cannatag/ldap3#981

@devl00p (Contributor, Author) commented Aug 11, 2022

> I also have an issue with pyasn1 0.5.0 (an ldap3 dependency): ImportError: cannot import name 'tagMap' from 'pyasn1.codec.ber.encoder'. We should pin pyasn1 to 0.4.8 since there is no tagMap in version 0.5.0 at the moment: cannatag/ldap3#981

Ok, I will do that.

I have the following versions from my pip freeze:

aiocache==0.11.1
aiohttp==3.8.1
aiosignal==1.2.0
aiosqlite==0.17.0
anyio==3.6.1
appdirs==1.4.4
arsenic==21.8
asgiref==3.5.2
astroid==2.11.6
async-timeout==4.0.2
attrs==21.4.0
beautifulsoup4==4.11.1
blinker==1.4
Brotli==1.0.9
browser-cookie3==0.11.4
bs4==0.0.1
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.1.0
click==8.1.3
coverage==6.4
cryptography==36.0.2
cssselect==1.1.0
dill==0.3.5.1
dnspython==2.1.0
fake-useragent==0.1.11
Flask==2.0.3
frozenlist==1.3.0
greenlet==1.1.2
h11==0.12.0
h2==4.1.0
hpack==4.0.0
httpcore==0.15.0
httpx==0.23.0
humanize==3.13.1
hyperframe==6.0.1
idna==3.3
importlib-metadata==3.7.2
iniconfig==1.1.1
isort==5.10.1
itsdangerous==2.1.2
jeepney==0.8.0
Jinja2==3.1.2
kaitaistruct==0.9
keyring==23.7.0
lazy-object-proxy==1.7.1
ldap3==2.9.1
loguru==0.6.0
lxml==4.9.1
lz4==4.0.1
Mako==1.2.1
MarkupSafe==2.1.1
mccabe==0.7.0
mitmproxy==8.0.0
msgpack==1.0.4
multidict==6.0.2
nassl==4.0.2
packaging==21.3
parse==1.19.0
passlib==1.7.4
pbkdf2==1.3
platformdirs==2.5.2
pluggy==1.0.0
protobuf==3.19.4
publicsuffix2==2.20191221
py==1.11.0
pyaes==1.6.1
pyasn1==0.4.8
pycparser==2.21
pycryptodome==3.15.0
pydantic==1.8.2
pyee==8.2.2
pylint==2.14.3
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyperclip==1.8.2
pyppeteer==1.0.2
pyquery==1.4.3
pytest==7.1.2
pytest-asyncio==0.14.0
pytest-cov==3.0.0
requests==2.28.1
requests-html==0.10.0
respx==0.19.2
rfc3986==1.5.0
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.6
SecretStorage==3.3.2
six==1.16.0
sniffio==1.2.0
socksio==1.0.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
SQLAlchemy==1.4.39
sslyze==5.0.1
structlog==20.2.0
tld==0.12.6
tls-parser==1.2.2
tomli==2.0.1
tomlkit==0.11.0
tornado==6.2
tqdm==4.64.0
typing_extensions==4.3.0
urllib3==1.26.10
urwid==2.1.2
w3lib==1.22.0
websockets==10.3
Werkzeug==2.1.2
wrapt==1.14.1
wsproto==1.1.0
yarl==1.7.2
yaswfp==0.9.3
zipp==3.8.1
zstandard==0.17.0

@devl00p (Contributor, Author) commented Aug 16, 2022

@bretfourbe If it looks good to you, I will merge.

@bretfourbe (Collaborator)

Hi @devl00p, I also noticed we need to pin aiohttp to <4 since there is a new pre-release.

@tarraschk (Member) commented Sep 2, 2022

@bretfourbe can you run a test on http://angular.testsparker.com/ please? @devl00p got some nice results on this target and I would also like your opinion.

Also, @devl00p added some extra features to better browse JuiceShop, if you can retest.

Thanks!

@bretfourbe (Collaborator)

Hi @devl00p, I tried the new mod_wapp and it looks great!
As far as I can see, headless mode is now always on for this module. Do you know if it is possible to activate headless mode (and thus JS pattern detection) only when --headless is used on the command line?

Regarding file downloads, it seems that files are still downloaded locally, and the confirmation pop-up holds the focus, so the crawl never ends 😕

Also, can you lock aiohttp to <4 since the project does not build with aiohttp 4 (pre-release)?

@devl00p (Contributor, Author) commented Sep 5, 2022

@bretfourbe I made some changes. On which website do you get the download issue?

@bretfourbe (Collaborator)

@devl00p it was on a Juice Shop instance; there is a legal.md file which gets downloaded.

@devl00p (Contributor, Author) commented Sep 5, 2022

> @devl00p it was on a Juice Shop instance; there is a legal.md file which gets downloaded.

You should test with the latest version: some changes have been made to replace download responses with a simple HTML placeholder.
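The idea of swapping download responses for a placeholder can be sketched as follows. This is not Wapiti's exact code, only an illustration under the assumption that an intercepting mitmproxy addon inspects each response: a `response(flow)` hook could set `flow.response.text = PLACEHOLDER_HTML` whenever the helper below returns True. The helper itself is pure and stdlib-only.

```python
# Placeholder served to the browser instead of a file it would try to download.
PLACEHOLDER_HTML = "<html><body>Download intercepted by the crawler</body></html>"

# Content types a browser typically offers as downloads rather than rendering.
# This list is illustrative, not exhaustive.
DOWNLOAD_CONTENT_TYPES = {
    "application/octet-stream",
    "application/zip",
    "application/pdf",
}


def is_download_response(headers: dict) -> bool:
    """Return True if the response headers indicate a file download.

    headers: lowercase header-name -> value mapping.
    """
    disposition = headers.get("content-disposition", "")
    if "attachment" in disposition.lower():
        return True
    content_type = headers.get("content-type", "")
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in DOWNLOAD_CONTENT_TYPES
```

Replacing the body this way keeps the confirmation pop-up from ever appearing, so the crawler's focus is never stolen.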

@bretfourbe (Collaborator)

Even using the latest version with the following command ./bin/wapiti -u http://localhost:3000/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color --headless visible, I still get legal.md downloaded, blocking the crawler.
(screenshot)
I also noticed a strange behavior with JSON responses: there is a pop-up asking where to download the answer as a JSON file, and the clipboard contains the server response ^^
(screenshot)

@devl00p (Contributor, Author) commented Sep 12, 2022

@bretfourbe The headless crawler was not using the intercepting proxy for 127.0.0.1/localhost; this is Firefox's default behavior.

I pushed a fix to change the browser configuration.
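Firefox exposes preferences to control this; `network.proxy.allow_hijacking_localhost` (Firefox 67+) disables the default rule that localhost traffic bypasses any configured proxy. The sketch below builds such a prefs dictionary; the function name is hypothetical and the PR's actual fix may differ, but the preference names are standard Firefox prefs that geckodriver accepts via the `prefs` key of `moz:firefoxOptions`.

```python
def firefox_proxy_prefs(proxy_host: str, proxy_port: int) -> dict:
    """Firefox preferences forcing all traffic, including localhost,
    through the intercepting proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": proxy_host,
        "network.proxy.http_port": proxy_port,
        "network.proxy.ssl": proxy_host,
        "network.proxy.ssl_port": proxy_port,
        # Without this, Firefox silently skips the proxy for 127.0.0.1/localhost.
        "network.proxy.allow_hijacking_localhost": True,
    }
```

With arsenic, such a dictionary would be passed when building the Firefox browser capabilities so every request is visible to mitmproxy.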

@bretfourbe (Collaborator)

@devl00p except for the JSON behavior, it is working pretty well now! I tried it on several apps; it is a big step for scanning SPA applications :)

@devl00p (Contributor, Author) commented Sep 13, 2022

> @devl00p except for the JSON behavior, it is working pretty well now! I tried it on several apps; it is a big step for scanning SPA applications :)

I pushed a fix for the JSON files. Firefox has a default setting that adds a small bar at the top of the window when displaying JSON files.

As Wapiti clicks on buttons, it would also click the "Download" button that Firefox adds.

I was able to deactivate that bar with Firefox settings.
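The relevant setting is most likely Firefox's built-in JSON viewer preference; disabling it makes JSON responses render as plain text, with no "Download"/"Copy" bar for the crawler to click. This is an assumption about the fix, not a quote from the PR:

```python
# Disable Firefox's built-in JSON viewer (the source of the extra bar with
# the "Download" button); JSON responses then render as plain text.
JSON_VIEWER_PREFS = {"devtools.jsonview.enabled": False}
```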

@bretfourbe (Collaborator)

> I pushed a fix for the JSON files. Firefox has a default setting that adds a small bar at the top of the window when displaying JSON files.
> As Wapiti clicks on buttons, it would also click the "Download" button that Firefox adds.
> I was able to deactivate that bar with Firefox settings.

Looks good now :)

@devl00p (Contributor, Author) commented Sep 14, 2022

Added one more exception catch (asyncio.TimeoutError) because the exception is not caught in Arsenic: HENNGE/arsenic#152 (comment)
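Because arsenic can let a raw asyncio.TimeoutError escape instead of (or alongside) its own exception type, callers need to catch both. A minimal sketch of the pattern, with a hypothetical `safe_get` wrapper rather than Wapiti's actual code:

```python
import asyncio


async def safe_get(session, url: str) -> bool:
    """Navigate to url, treating a WebDriver timeout as a soft failure.

    session is assumed to expose an async get(url) method, as arsenic's
    Session does. Returns False instead of propagating the timeout.
    """
    try:
        await session.get(url)
        return True
    except asyncio.TimeoutError:
        # Not caught inside arsenic itself (HENNGE/arsenic#152), so the
        # crawler must handle it to keep exploring other pages.
        return False
```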

@devl00p (Contributor, Author) commented Sep 14, 2022

@bretfourbe I let the crawler explore microsoft.com for an hour and it didn't run into any errors. Do you need more time for testing, or do you think I should merge?

@@ -17,7 +16,7 @@ class CrawlerConfiguration:
proxy: str = None
auth_credentials: tuple = tuple()
Review comment (Contributor):

Wow! There are a lot of improvements!

I think a more complex data structure would be appropriate for this variable. A plain tuple offers no guarantee about the number of elements, so explicit checks become necessary:

https://github.com/wapiti-scanner/wapiti/pull/310/files#diff-f5a9d807270fdc42bf1ef8d5ed535674fb8467254f7fe2eadbba7c5bb9c68a59R27-R29

A NamedTuple might be more appropriate here.

This is a minor change and the PR has a bigger scope. Feel free to ignore this suggestion 😄
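The reviewer's suggestion could look like the following sketch (field names are illustrative, not taken from the PR): a NamedTuple fixes the arity and names the elements, while still behaving as a tuple for any existing unpacking code.

```python
from typing import NamedTuple


class AuthCredentials(NamedTuple):
    """Named replacement for the bare (username, password) tuple:
    the number of fields is guaranteed, so no length checks are needed."""
    username: str
    password: str


creds = AuthCredentials("admin", "s3cret")
# Access by name instead of positional index:
assert creds.username == "admin"
# Still iterable/unpackable like the original tuple:
user, password = creds
```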

Reply (Contributor, Author):

I don't like that auth_credentials tuple either :D

I will work on it later

@tarraschk (Member)

Quick question: can we use this headless crawler for authenticated scans? Would this be a "second" step?

@bretfourbe (Collaborator)

Hi @devl00p, I tried an authenticated scan with the following command ./bin/wapiti -u http://testphp.vulweb.com/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color -a test%test --auth-type=post and it seems that the -s option is now required.
If I do not specify it, I get the following error:
(screenshot)

@devl00p (Contributor, Author) commented Sep 17, 2022

> Hi @devl00p, I tried an authenticated scan with the following command ./bin/wapiti -u http://testphp.vulweb.com/ -m "" -v 2 --store-session=sessions/ --store-config=sessions/ --flush-session --color -a test%test --auth-type=post and it seems that the -s option is now required. If I do not specify it, I get the following error: (screenshot)

Yes, -s is required to specify the URL with the --auth-type=post option. Maybe I should add another, more explicit option for that URL, then display an error message if it is missing.

Also, a website may have a login form AND be behind HTTP Basic auth, so the option should behave differently, but this is out of the scope of this PR.

What I can do soon is make sure -s is not missing when --auth-type=post is used and raise a usage error otherwise.
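The planned check can be sketched as a small validation helper that fails early with a usage message. Function and parameter names here are hypothetical and do not match Wapiti's actual CLI code:

```python
from typing import List, Optional


def check_auth_options(auth_type: Optional[str], scope_urls: List[str]) -> None:
    """Raise a usage error when --auth-type=post is given without -s.

    auth_type: value of --auth-type (or None if not given).
    scope_urls: URLs passed via -s, used as the login page for form auth.
    """
    if auth_type == "post" and not scope_urls:
        raise SystemExit(
            "Usage error: --auth-type=post requires -s to specify the login URL"
        )
```

Running this right after argument parsing turns the unhandled traceback into an explicit, actionable error message.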

@devl00p (Contributor, Author) commented Sep 18, 2022

> Quick question: can we use this headless crawler for authenticated scans? Would this be a "second" step?

The headless crawler is used for the auth step if the headless option is set. The login URL will be loaded into the crawler, so if the login form is generated dynamically it will still be found.

However, if the login mechanism works with custom JS, without classic input fields, and sends data as JSON/XML, then Wapiti won't be able to log in. That's why I plan to work on a plugin system to allow users to write custom scripts for specific websites.
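A plugin system like the one described could expose a hook along these lines. This interface is entirely hypothetical (the PR does not include it) and is shown only to illustrate what a site-specific login script might plug into:

```python
from abc import ABC, abstractmethod


class AuthPlugin(ABC):
    """Hypothetical interface for a user-supplied, site-specific login script."""

    @abstractmethod
    async def login(self, session, url: str, username: str, password: str) -> bool:
        """Drive the headless browser session through the site's custom login
        flow (e.g. JS-built forms, JSON/XML payloads); return True on success."""
```

A user would subclass this for a given target, and the crawler would call `login()` instead of its generic form-filling logic.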

@devl00p (Contributor, Author) commented Sep 18, 2022

@bretfourbe I pushed some code to fix the missing -s bug.

@devl00p devl00p merged commit 5c77939 into wapiti-scanner:master Sep 20, 2022
@devl00p devl00p deleted the headless_crawler branch July 4, 2023 15:57