add alexa list #8
Comments
Commands
curl -kLs https://s3.amazonaws.com/alexa-static/top-1m.csv.zip | gunzip > top-1m.csv
# top 1m in world
awk -F ',' '{print $2}' top-1m.csv > top-1m.txt
# top list in china
grep -Fx -f chinalist.txt top-1m.txt > top-cn.txt
# top 5k in china
head -n 5000 top-cn.txt > top-cn-5000.txt
# top 1k in china
head -n 1000 top-cn.txt > top-cn-1000.txt
# top 500 in china
head -n 500 top-cn.txt > top-cn-500.txt
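The filter-then-truncate steps above could be wrapped in a small function. This is only a sketch: `top_matches` and its argument order are hypothetical names chosen for illustration, not part of this project.

```shell
# Hypothetical helper: print the first N entries of a rank-ordered list
# that also appear, as exact whole lines, in a second list.
# Usage: top_matches N ranked.txt wanted.txt
top_matches() {
    # -F: treat patterns as fixed strings, -x: match whole lines only,
    # -f: read the patterns from a file.
    # grep keeps the order of ranked.txt, so head returns the top N matches.
    grep -Fx -f "$3" "$2" | head -n "$1"
}
```

For example, `top_matches 5000 top-1m.txt chinalist.txt > top-cn-5000.txt` should give the same result as the `grep` and `head` steps above, without the intermediate top-cn.txt file.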
It seems that the Alexa top list is now restricted to paid customers, and the public link is no longer updated and is incomplete (source). The Cisco Umbrella Popularity List is based on DNS queries, so it reflects the most commonly connected domains (including domains that serve website resources) rather than only the "main domains". I ran some tests and found that the Umbrella list has more items in common with both gfwlist and chinalist than the Alexa list does.
In addition, domains that appear only in the Umbrella list look much more common than those that appear only in the Alexa list. Therefore, I would suggest replacing the Alexa list with the Umbrella list. This could better serve the purpose of generating a shorter chinalist and/or gfwlist.
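One way to reproduce this kind of comparison is to count how many toplist entries also appear in a rule list. The sketch below assumes both files hold one domain per line; `overlap_count` and the file name `umbrella-1m.txt` are hypothetical, and this is not necessarily how the numbers in this thread were produced.

```shell
# Hypothetical helper: count the lines of a toplist that also appear,
# as exact whole lines, in a rule list.
# Usage: overlap_count toplist.txt rules.txt
overlap_count() {
    # -c prints the number of matching lines instead of the lines themselves.
    grep -cFx -f "$2" "$1"
}
```

Running `overlap_count top-1m.txt chinalist.txt` and `overlap_count umbrella-1m.txt chinalist.txt` would then give the two counts to compare.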
https://s3.amazonaws.com/alexa-static/top-1m.csv.zip has never stopped updating, and the Umbrella list contains too many subdomains, so it is not applicable here.
Actually that is one advantage of the Umbrella list, because both the gfwlist and the
The other thing is that the Alexa list mainly focuses on the main domain of a website and ignores (or puts much less weight on) domains serving static files for that website.
Here are some test chinalists: sorted by the Umbrella toplist, sorted by the Alexa toplist, and sorted by both, all generated today. As you can see, the Umbrella list has 3231 matches with the
Personally, I'm using only the Umbrella list, and the generated gfwlist and chinalist work well (BTW, the matches can be used to generate a short chinalist for mobile apps). But maybe it is worth using both Umbrella and Alexa? Just for your consideration.
I'm sorry, I haven't touched this project in a long time, and I know almost nothing about its logic now.
Reopen.
Origin: https://gist.github.com/chilts/7229605
Data sources
or
https://umbrella-static.s3.amazonaws.com/top-1m.csv.zip
EDIT:
The umbrella-static data source seems to include subdomains, so it is not recommended.