Deep packet inspection to classify V2Ray traffic in Dec, 2020 #557
Comments
go tls is the same |
Not exactly. See v2ray/v2ray-core#2522 and v2ray/v2ray-core#2521. According to the commit author, v2ray has the same TLS fingerprint as the Golang default only when transport = h2 and disableSessionResumption = true. |
TL;DR
Problem
Solution
Well, we know that WebSocket is poorly supported on HTTP/2. The Go standard library uses HTTP/2 by default, so it's difficult to reach 100% anonymity in VMess / WebSocket / TLS mode (you have to avoid HTTP/2). The main reason to use VMess / WebSocket / TLS instead of VMess / TLS is that the former can easily be transported over a CDN (like CloudFlare). Otherwise, a more direct way like VMess / TLS or even VMess / TCP is preferred. It's the CDN that dictates the use of WebSocket. However, Golang programs are known to have And that's why I started https://github.com/Qv2ray/gun, an attempt to use gRPC (Protobuf over HTTP/2) as the transport layer. This will make it a lot easier to blend into normal Golang program traffic. Maybe it's time to let WebSocket die. Let's embrace gRPC and forget about HTTP/1.1. |
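Currently, apart from WSS, nothing explicitly sets the ALPN of the request. As for SessionTicketsDisabled, false is currently passed by default; I need to check what Go's default value is. |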
@rickyzhang82 would you mind doing an activation visualization or something similar, to help us locate the ROI? |
It is not news that v2ray's ClientHello is unique when using the wss transport. And it will remain so even after adopting uTLS, because none of common browsers still use
|
Currently, though Note that TLSv1.3 handshake is always 1-RTT in Golang, as Golang doesn't support early data yet. |
It seems you are using a very imbalanced dataset. What about your PR-curves? |
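For readers unfamiliar with the request above: a precision-recall curve is built by sweeping the decision threshold over the classifier's scores and recording precision and recall at each step. The sketch below is a minimal, dependency-free illustration; the function name `precision_recall_points` and the sample labels and scores are made up for this example and are not taken from the author's experiment.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) triples, one per distinct score.

    labels: 1 for the positive class (e.g. V2Ray), 0 for the negative class.
    scores: classifier confidence that each sample is positive.
    """
    total_pos = sum(labels)
    # Rank samples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample that shares this score.
        is_last_of_score = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_score:
            precision = tp / (tp + fp)
            recall = tp / total_pos if total_pos else 0.0
            points.append((score, precision, recall))
    return points


# Made-up data: 2 positives among 8 samples, i.e. an imbalanced set.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05]
for threshold, precision, recall in precision_recall_points(labels, scores):
    print(f"score >= {threshold:.2f}: precision={precision:.2f} recall={recall:.2f}")
```

Unlike a ROC curve, precision depends on the class ratio, which is why it is the more informative plot when positives are rare.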
I repeated what I did last year. See the steps here. The only difference this year is that I added a filter to the non-V2Ray traffic so that only TLS traffic was retained. I still kept the V2Ray and non-V2Ray traffic equal in size in the data generator. @DuckSoft, the two binary classification categories are the same size in both the training data set and the inference data set. The change is too trivial to publish the whole thing again, but anyone can repeat the whole thing at home. @xiaokangwang confirmed he could replicate it. See:

import math
import numpy as np

# rglob, binary_classification, DATA_ROOT, TRAINING_DATA_PERCENTAGE and
# PACKET_FILE_EXT are defined elsewhere in the project.
def generate_train_validation_packet_path_list(data_root=DATA_ROOT, training_pct=TRAINING_DATA_PERCENTAGE, eqaul_size=True,
                                               non_v2ray_file_filter_func=None):
    # All file list
    file_list = rglob(data_root, PACKET_FILE_EXT)
    # V2ray file list
    v2ray_file_list = [file_path for file_path in file_list if binary_classification(file_path) == 1]
    # None V2ray file list
    if non_v2ray_file_filter_func is None:
        non_v2ray_file_list = [file_path for file_path in file_list if binary_classification(file_path) == 0]
    else:
        non_v2ray_file_list = [file_path for file_path in file_list if binary_classification(file_path) == 0 and non_v2ray_file_filter_func(file_path)]
    v2ray_file_list.sort()
    non_v2ray_file_list.sort()
    # With eqaul_size, both classes are cut to the size of the smaller one
    if eqaul_size:
        cut_off_count = min(len(v2ray_file_list), len(non_v2ray_file_list))
        v2ray_file_size = cut_off_count
        non_v2ray_file_size = cut_off_count
    else:
        v2ray_file_size = len(v2ray_file_list)
        non_v2ray_file_size = len(non_v2ray_file_list)
    # Shuffle indexes so the split is random within each class
    v2ray_indexes = np.arange(len(v2ray_file_list))
    np.random.shuffle(v2ray_indexes)
    non_v2ray_indexes = np.arange(len(non_v2ray_file_list))
    np.random.shuffle(non_v2ray_indexes)
    training_file_list = [v2ray_file_list[index]
                          for index in v2ray_indexes[:math.ceil(v2ray_file_size * training_pct)]] + \
                         [non_v2ray_file_list[index]
                          for index in non_v2ray_indexes[:math.ceil(non_v2ray_file_size * training_pct)]]
    validation_file_list = [v2ray_file_list[index]
                            for index in v2ray_indexes[math.ceil(v2ray_file_size * training_pct): v2ray_file_size]] + \
                           [non_v2ray_file_list[index]
                            for index in non_v2ray_indexes[math.ceil(non_v2ray_file_size * training_pct): non_v2ray_file_size]]
    print("Statistics: ")
    print("Total V2ray traffic %d, Total non-V2ray traffic %d" % (len(v2ray_file_list), len(non_v2ray_file_list)))
    print("Output train traffic %d, Total validation traffic %d" % (len(training_file_list), len(validation_file_list)))
    return training_file_list, validation_file_list
|
Golang programs are rare and I'm not surprised that it's accurately picked out. I thought V2Ray should at least blend with normal Golang traffic. Currently WS + TLS is easily distinguished due to the ALPN and |
@rickyzhang82 The above code undersamples both the training set and the validation set. It's OK to undersample the training set, but undersampling the validation/test set is not a good choice. Dividing the data set this way allows the classifier to cheat, because the distribution of positive and negative examples in the test set has been artificially adjusted. For machine learning, what is important is the independence of the training set and the test set. |
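One way to act on this objection is to split each class first and balance only the training portion, so the validation set keeps the ratio that was actually collected. The following is a minimal sketch of that idea, not the author's code: the function name `split_then_undersample`, the `seed` parameter, and the file names in the usage example are all hypothetical.

```python
import random


def split_then_undersample(pos_files, neg_files, training_pct=0.8, seed=0):
    """Split each class into train/validation, then balance the training part only."""
    rng = random.Random(seed)
    pos = sorted(pos_files)
    neg = sorted(neg_files)
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Split every class at the same percentage BEFORE any balancing.
    pos_cut = int(len(pos) * training_pct)
    neg_cut = int(len(neg) * training_pct)
    train_pos, val_pos = pos[:pos_cut], pos[pos_cut:]
    train_neg, val_neg = neg[:neg_cut], neg[neg_cut:]

    # Undersample the larger class in the training portion only.
    n = min(len(train_pos), len(train_neg))
    train = train_pos[:n] + train_neg[:n]

    # The validation set keeps the collected class ratio.
    validation = val_pos + val_neg
    return train, validation


# Hypothetical file names: 10 positive and 90 negative captures.
pos = [f"v2ray_{i}.pkt" for i in range(10)]
neg = [f"other_{i}.pkt" for i in range(90)]
train, validation = split_then_undersample(pos, neg)
print(len(train), len(validation))
```

With this arrangement, metrics computed on `validation` reflect the class ratio of the collected traffic rather than an artificial 1:1 mix.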
@rickyzhang82 Can you please apply the iptables rules I posted on v2ray/discussion#704 (comment) to the same datasets? They should still work for wss mode by now. I wonder what their accuracy is. |
Hello
The biggest problem right now is v2rayNG and v2ray itself because they are using the old version of v2ray which does not work at all in Iran. |
@HirbodBehnam That's no surprise. In 4.34 Release Notes
Also: #557 (comment) You may validate my conclusion by setting |
@HirbodBehnam Please test (at client side): v2ray-core v4.32.1/v4.33.0 with v2ray-core v4.34.0 with |
@DuckSoft @RPRX |
Hello again |
@HirbodBehnam try v2ray-core v4.24.2 (as client) |
@HirbodBehnam and v2ray-core v4.23.1 |
@RPRX Still nothing... |
@HirbodBehnam |
Test with a clean Linux OS and client v4.34 and see. A native socks5 or http proxy port could possibly get detected by spyware; the same goes for smartphones. |
@NathanIceSea
And you can see the Wireshark screenshot which I have tweeted. Same ID, but different TTL. Plus, it's good to note that on my phone, using v2ray with Termux or HTTP Injector, I can connect to my server without any problems. (With the port 10808, which is the default port in my config.) Also, a small status report: VMess + Raw WS and Trojan are working fine for now. Probably Iran doesn't DPI the WS connections? Or doesn't have the tools/knowledge to identify VMess. |
Hi, this adversarial work is important, and I know it takes a lot of effort, so keep doing it, if you can. It would be more helpful if you can provide more microscopic or ablation analysis of the most informative features, which will then be actionable, otherwise it's just one big conclusion. |
What is your configuration for VMess? |
Yeah I'm not using cloudflare at all. |
Thank you. |
Hello again So on my own computer I edited that But anyway; I really want to thank everyone who has contributed to this project and made v2ray. Thanks everyone :) PS 1: It's worth noting that v2rayNG used ChaCha as its top cipher suite, while HTTP Injector used AES. Thanks again everyone :) Edit: I also forgot to apologize to everyone; at first I thought this had something to do with DPI, but from what it looks like, it's just a simple matter of fingerprint + cipher suite combinations triggering Iran's firewall. |
@HirbodBehnam That's a very fruitful discovery! Thank you! |
So apparently that was the case! Here is what I did to compile this binary
This client should force all TLS connections to prefer AES to ChaCha. It's also worth noting that I tried forcing ChaCha as well, and then I couldn't connect from devices which had been working fine. I really doubt there is anything v2ray can do to fix this, unless custom ciphers are added, which it looks like there is no plan for. |
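To make "prefer AES to ChaCha" concrete: the preference is expressed by the order in which the client lists its cipher suites in the ClientHello. The sketch below only illustrates that reordering on a list of suite names. It is not the patch applied to the binary above, and the helper name `prefer_aes` is hypothetical; the three suite names are the standard TLS 1.3 ones.

```python
def prefer_aes(cipher_suites):
    """Return the suites with every ChaCha20 suite moved behind the others.

    sorted() is stable, so the relative order inside each group is preserved.
    """
    def is_chacha(name):
        return "CHACHA20" in name.upper()

    # False sorts before True, so non-ChaCha suites come first.
    return sorted(cipher_suites, key=is_chacha)


# A ChaCha-first ordering, as reported for v2rayNG in this thread.
chacha_first = [
    "TLS_CHACHA20_POLY1305_SHA256",
    "TLS_AES_128_GCM_SHA256",
    "TLS_AES_256_GCM_SHA384",
]
print(prefer_aes(chacha_first))
```

Because the cipher suite order is part of what a passive observer sees, two clients that differ only in this ordering still produce different ClientHello fingerprints.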
Removed in a previous security update, you can manually patch 9321210#diff-be836badf579ea512702a700ff7bb7f654b6f6ccced8a38c031fb331a1b491cdR144 |
While checking naiveproxy's fingerprints I started to question the decision to disable session resumption by default. The reason given was simply that with session resumption the TLS fingerprint (manifested as the pre_shared_key extension in TLS 1.3) differs from the top ranking one and in order to minimize exposure let's just use the top ranking fingerprint all the time. Now I believe this reasoning is flawed, and it's fairly easy to see why by looking into usage of session resumption in the real world. Session resumption does happen organically in the wild all the time, though less frequently than not. Per tlsfingerprint.io, the top fingerprint 9c673fd64a32c8dc has two neighbors 5408690af1e08199 (past week 5.88%, large session tickets that preclude padding), e360886acbf4f415 (past week 0.53%, smaller session tickets with padding). Do we look at this number of 0.53% and say it makes this fingerprint more classifiable as a circumvention technology? No. The fact that it appears less frequently than not does not imply it is not an organic fingerprint factor. Session resumption being less likely to happen is a result of its requirements of multiple conditions being present:
I took a real tcpdump capture of Chrome browsing and found pre_shared_key with these SNIs: www.google-analytics.com, adservice.google.com, fonts.googleapis.com, update.googleapis.com, ajax.cloudflare.com. These examples can explain the unlikeliness of the third condition: If the usage pattern is of frequent requests to the same host, a carefully designed network stack will reuse the same TLS session without needing to restart and resume it. If the usage pattern is too infrequent, or the browser itself is restarted or drops the session tickets, it will also not resume previous TLS sessions. And these domains are exactly the type of hosts that will serve infrequent but not too infrequent requests over time. The unlikeliness of session resumption also once again reveals it has fallen out of favor in the task of reducing connection setup overhead (TLS RTT) compared to the strategy of connection pooling and reuse, but nevertheless it remains an organic usage pattern emitted by the most common browsers and servers. Now back to the decision to disable session resumption by default, I believe it is not only based on flawed logic, but it is actively harmful to the goal of minimizing classification exposure. The act of disabling a commonly supported and less commonly used feature altogether itself creates a less common/more unique configuration, and the total lack of any session resumption also constitutes a passive feature. |
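The point that session resumption shows up as a distinct fingerprint can be illustrated with a JA3-style hash, which is the MD5 of the ClientHello version, cipher suites, extensions, supported groups and point formats written as decimal numbers. This is a sketch under stated assumptions: tlsfingerprint.io uses its own ID scheme, not JA3, and the ClientHello values below are illustrative rather than taken from a real capture. Extension 41 is pre_shared_key, which TLS 1.3 requires to be the last extension.

```python
import hashlib


def ja3_style_fingerprint(version, ciphers, extensions, groups, point_formats):
    """MD5 over the JA3 string: comma-separated fields, dash-separated values."""
    fields = [
        str(version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in groups),
        "-".join(str(v) for v in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only (assumption, not a captured ClientHello).
VERSION = 771                      # 0x0303, the legacy version field
CIPHERS = [4865, 4866, 4867]       # the three standard TLS 1.3 suites
EXTENSIONS = [0, 23, 65281, 10, 11, 35, 16, 5, 13, 18, 51, 45, 43]
GROUPS = [29, 23, 24]              # x25519, secp256r1, secp384r1
POINT_FORMATS = [0]

fresh = ja3_style_fingerprint(VERSION, CIPHERS, EXTENSIONS, GROUPS, POINT_FORMATS)
# A resumed session appends pre_shared_key (41) as the final extension.
resumed = ja3_style_fingerprint(VERSION, CIPHERS, EXTENSIONS + [41], GROUPS, POINT_FORMATS)
print(fresh)
print(resumed)
```

A client that never resumes will only ever emit the first hash, while a browser emits both over time, which is the passive feature described above.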
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days |
I believe the GFW of Iran is trying to block everything while causing the least damage to ordinary web browsing traffic (TLS traffic on port 443 and cleartext on 80) that affect social or economical aspects, because as we all know banks and government/social services rely on these technologies to provide basic services to citizens. And here's the interesting part: Most websites tend to use |
I collected V2Ray traffic data and reran my deep packet inspection test, as described in the issue here. I compiled V2Ray (commit hash 5dffca842) with Go 1.15 and used TLS + WebSocket. In 10 days, I collected 17,998 V2Ray connections and 136,981 non-V2Ray TLS connections. I trained the new CNN model. It still reached a traffic classification accuracy of 0.9959. It shows a perfect ROC curve. Has the V2Ray dev team scheduled any road map to blend V2Ray in with other non-V2Ray TLS traffic yet?
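As a rough aid for reading the 0.9959 figure against the connection counts above, the arithmetic below shows how precision would shift between a balanced evaluation set and the collected class ratio. The post reports only overall accuracy, so the per-class rates `TPR` and `FPR` are assumptions chosen to be consistent with 0.9959 accuracy on a balanced set; they are not measured values.

```python
def precision_at_ratio(tpr, fpr, n_pos, n_neg):
    """Expected precision for given true/false positive rates and class counts."""
    true_positives = tpr * n_pos
    false_positives = fpr * n_neg
    return true_positives / (true_positives + false_positives)


# Assumed rates (not reported in the post).
TPR, FPR = 0.9959, 0.0041

# Connection counts from the post.
N_V2RAY, N_OTHER = 17_998, 136_981

balanced = precision_at_ratio(TPR, FPR, N_V2RAY, N_V2RAY)
collected = precision_at_ratio(TPR, FPR, N_V2RAY, N_OTHER)
# Accuracy of always answering "not V2Ray" on the collected data.
majority_baseline = N_OTHER / (N_V2RAY + N_OTHER)

print(f"precision, balanced set:    {balanced:.4f}")
print(f"precision, collected ratio: {collected:.4f}")
print(f"majority-class accuracy:    {majority_baseline:.4f}")
```

Under these assumed rates the classifier stays far above the majority-class baseline, but the precision a censor would observe depends on the real share of V2Ray traffic, which is much lower than in either data set.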