add alexa list #8

Open
pexcn opened this issue Nov 9, 2018 · 6 comments
Labels
enhancement New feature or request

Comments

@pexcn
Owner

pexcn commented Nov 9, 2018

Origin: https://gist.github.com/chilts/7229605

Data sources

EDIT:
The umbrella-static data source seems to include subdomains, so it is not recommended.

@pexcn
Owner Author

pexcn commented Nov 9, 2018

Commands

curl -kLs https://s3.amazonaws.com/alexa-static/top-1m.csv.zip | gunzip > top-1m.csv

# top 1m in world
awk -F ',' '{print $2}' top-1m.csv > top-1m.txt

# top list in china
grep -Fx -f chinalist.txt top-1m.txt > top-cn.txt

# top 5k in china
head -5000 top-cn.txt > top-cn-5000.txt

# top 1k in china
head -1000 top-cn.txt > top-cn-1000.txt

# top 500 in china
head -500 top-cn.txt > top-cn-500.txt

@pexcn pexcn closed this as completed in 5530d18 Nov 9, 2018
@pexcn pexcn added the enhancement New feature or request label Nov 20, 2018
pexcn added a commit that referenced this issue Apr 21, 2020
@wyf88

wyf88 commented Nov 29, 2020

It seems that the Alexa top list is now restricted to paid customers, and that the link is incomplete and no longer updated (source).

The Cisco Umbrella Popularity List is based on DNS queries, so it reflects the most commonly connected domains (including those serving website resources) and is not restricted to "main domains".

I ran some tests and found that the Umbrella list has more entries in common with both gfwlist and chinalist than the Alexa list does.

  • Umbrella: 1661 with gfwlist, 8562 with chinalist
  • Alexa: 1444 with gfwlist, 3029 with chinalist
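Counts like these can be obtained with an exact-line match between two lists. The following is only a minimal sketch, not necessarily how the figures above were produced: it assumes each list is a plain text file with one domain per line, the `count_common` helper is hypothetical, and the tiny inline files merely stand in for the real lists.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of the toplist ($2) that exactly
# match an entry in the reference list ($1).
count_common() {
  grep -Fxc -f "$1" "$2"
}

workdir=$(mktemp -d)

# Sample data standing in for chinalist and a toplist (assumption: one domain per line).
printf '%s\n' baidu.com qq.com taobao.com > "$workdir/chinalist.txt"
printf '%s\n' google.com baidu.com example.org qq.com > "$workdir/toplist.txt"

common=$(count_common "$workdir/chinalist.txt" "$workdir/toplist.txt")
echo "entries in common: $common"

rm -rf "$workdir"
```

`-F` treats each domain as a fixed string (so dots are not regex wildcards) and `-x` requires the whole line to match, which is the same matching used by the `grep -Fx -f` command earlier in this thread.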

In addition, the domains that appear only in the Umbrella list look much more common than those that appear only in the Alexa list. Therefore, I would suggest replacing the Alexa list with the Umbrella list. This would better serve the purpose of generating a shorter chinalist and/or gfwlist.

@pexcn
Owner Author

pexcn commented Nov 29, 2020

It seems that the Alexa top list is now restricted to paid customers, and that the link is incomplete and no longer updated (source).

https://s3.amazonaws.com/alexa-static/top-1m.csv.zip has never stopped updating, and the Umbrella list contains too many subdomains, so it is not applicable here.

@wyf88

wyf88 commented Jun 24, 2021

Umbrella list contains too many subdomains, so it is not applicable here.

Actually, that is one advantage of the Umbrella list, because both gfwlist and accelerated-domains.china.conf contain a large number of subdomains, e.g. cases where only a subdomain is blocked, or only a subdomain has a server in CN. Therefore, compared with the Alexa toplist, the Umbrella list can match more domains in gfwlist and accelerated-domains.china.conf.
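To illustrate the point, here is a small sketch under stated assumptions: it assumes accelerated-domains.china.conf uses dnsmasq-style `server=/domain/ip` lines, and the sample entries, upstream address, and file names (`alexa-like.txt`, `umbrella-like.txt`) are made up for the example. With exact matching, a subdomain entry in the conf can only be hit by a toplist that itself lists subdomains.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Sample standing in for accelerated-domains.china.conf
# (assumed format: server=/<domain>/<upstream DNS address>).
cat > "$workdir/accelerated-domains.china.conf" <<'EOF'
server=/baidu.com/114.114.114.114
server=/img.example.cn/114.114.114.114
EOF

# A toplist with main domains only, and one that also lists subdomains.
printf '%s\n' baidu.com example.cn > "$workdir/alexa-like.txt"
printf '%s\n' baidu.com img.example.cn www.baidu.com > "$workdir/umbrella-like.txt"

# Extract the domain field (second field when splitting on "/").
awk -F '/' '/^server=/ {print $2}' \
  "$workdir/accelerated-domains.china.conf" > "$workdir/china-domains.txt"

# Count exact matches for each toplist.
alexa_matches=$(grep -Fxc -f "$workdir/china-domains.txt" "$workdir/alexa-like.txt")
umbrella_matches=$(grep -Fxc -f "$workdir/china-domains.txt" "$workdir/umbrella-like.txt")

echo "alexa-like: $alexa_matches, umbrella-like: $umbrella_matches"

rm -rf "$workdir"
```

In this toy data only the subdomain-aware toplist matches `img.example.cn`, which is the effect described above.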

The other thing is that the Alexa list mainly focuses on the main domain of a website and ignores (or puts much less weight on) domains serving static files for that website.

Here are some test chinalists, generated today: one sorted by the Umbrella toplist, one sorted by the Alexa toplist, and one sorted by both. As you can see, the Umbrella list has 3231 matches with accelerated-domains.china.conf, whereas the Alexa list has 2642 matches, and there are 5038 matches if both Umbrella and Alexa are used.
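If "both" means the union of the two match lists, the figures imply roughly 3231 + 2642 - 5038 = 835 domains matched by both toplists. One possible way to build such a combined list is sketched below; the ordering rule (Umbrella matches first, then Alexa-only matches) and the file names are assumptions for illustration, not necessarily how the attached lists were generated.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Sample match lists standing in for the real per-toplist results.
printf '%s\n' a.cn b.cn c.cn > "$workdir/umbrella-matches.txt"
printf '%s\n' b.cn d.cn > "$workdir/alexa-matches.txt"

# Concatenate both lists and keep only the first occurrence of each domain,
# preserving the original order (awk prints a line only the first time it is seen).
cat "$workdir/umbrella-matches.txt" "$workdir/alexa-matches.txt" \
  | awk '!seen[$0]++' > "$workdir/both-matches.txt"

# Strip the padding some wc implementations add around the count.
both_count=$(wc -l < "$workdir/both-matches.txt" | tr -d ' ')
echo "combined matches: $both_count"

rm -rf "$workdir"
```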

Personally, I'm using only the Umbrella list, and the generated gfwlist and chinalist work well (BTW, the matches can be used to generate a short chinalist for mobile apps). But maybe it is worth using both Umbrella and Alexa? Just for your consideration.

@pexcn
Owner Author

pexcn commented Jun 24, 2021

I'm sorry, I haven't touched this project for a long time, and I know almost nothing about its logic now.
Maybe I will deal with it later if I have time.

@pexcn
Owner Author

pexcn commented Jun 24, 2021

Reopen.

@pexcn pexcn reopened this Jun 24, 2021