Things that I don't necessarily want to go into the README.md but I don't mind being included with the project.
Research, concepts, and ideas for hiding the use of explicitly Webdriver ideally; using every strategy possible, broken up into three primary categories (once we determine what existing higher-level "crawler" tool like "Rod", or "Playwright", or "Colly".
Our development should focus on what makes this project unique, to maximize our time volunteering towards this common goal: unhindered browser automation.
-
Don't use the standard port 4444, this could be checked with JavaScript running in the browser. Ideally, switch to using a socket connection, or some other IPC methodology.
-
Conceal CDC JavaScript added to each page better than just renaming it, especially concealing it in the exact format (for now, while we are binary patching, we can use 0's because we have fixed allocated memory space to get a variable length name).
Ideally, if we can, use another technique entirely to introduce JavaScript actions to a loaded page, because a large portion of the strategy to identify automated browsers is by the very loud, and obvious tools used to control the browser that are standardized.
If we have to maybe even input JavaScript via the console since we are intentionally working with full browser implementations.
Alternatively, an extension could be used to introduce JavaScript, taking a popular extension and modifying it to introduce our JavaScript steathily.
-
Switch off Chromedriver, or at work on patching the source code and compiling it, enabling much fine grain control over the tool.
However, eventually we want to be supporting Webdriver, which is more generic and works on the top 5 browsers.
A single correct implementation can interact with Firefox, and Chrome; and this should be our aim.
-
Transparent proxy around our browser to modify incoming JavaScript, and outgoing data. In addition, increasing the speed of our implementation by caching common libraries, frameworks, or other assets. Also, simply reject connections to certain hosts, and monitor exfiltrated data which will provide us with a way to determine exactly what the third-party companies offering "bot fingerprinting" are sending home.
Providing a cheatsheet on what is expected. Possibly even by modifying outgoing data to fit expectations rather than changing the browser.
-
Most likely will need to patch the Browsers directly, either binary or source patching to conceal automation, and hide whatever clues browser developers may have seeded so they can determine if the browser is automated.
Ideally, we bypass needing to do this; but it may end up being an inevitability, as it would give us the control we need to "blend-in" into the rest of internet traffic or at least get close enough that blocking us, will inevitably block regular users, which will result in: gg.
-
We want extensions, ability to say what fonts we have, and other things that will give us a unique but not too unique of a fingerprint.
-
Possibly use Container or MicroVM with (Micro/Uni)-kernel to store one or more browsers within a single binary.
But this may not be advantageous.
-
Finally, the last component will be input, or behavior-based. Meaning we need to move the mouse realistically, not in straight lines towards the target, with S-Curves instead of a fixed velocity, and variable input of keyboard keys.
And this will be basically determined by "rolling-a-character" which will have a specific "proficiency".
Typing speed and other things like, typos (yes, we will need to have typos, and backspacing, we want to blend in to the point its too difficult to detangle us from real users.
It may be necessary to use libinput, or uinput, to create virtual devices to input keyboard keys and mouse movement to move behavior beyond the built-in tools that may have fingerpints since the browser developers have incentive to install concealed flags that are raised silently
Simply need to get close, we don't have to fully mimic a user.
This should be enough to setup a roadmap, we at least have a clear order of operations.
This will be refined further, and Ill generate flow charts and better explanations of what needs to be done to maximize time of anyone who decides to volunteer towards this goal, that is clearly both important and necessary to a lot of people.
And with the python project "undetected chromedriver" essentially abandoned and the developer working on direct CDP which does not naturally give one any advantages to avoiding detection. And actually guarantees the focus will remain around Chrome instead of taking advantage of the more advanced "Webdriver" which is a W3 spec protocol, implemented in the 5 most popular browsers.
Ultimately, the transparent MITM proxy, is most likely the most effective strategy at long-term solution, solving the cat-n-mouse games.