
Alexa sunset. Priority rethinking. #3656

Closed
karlcow opened this issue Dec 8, 2021 · 6 comments

Comments

@karlcow
Member

karlcow commented Dec 8, 2021

On May 1st, 2022, Amazon will sunset Alexa.
https://support.alexa.com/hc/en-us/articles/4410503838999

The priority flag for our bug is defined according to Alexa ranking. We need to rethink this strategy.

@karlcow
Member Author

karlcow commented Dec 8, 2021

It would be worth adding this to 2022H1.

@ksy36
Contributor

ksy36 commented Jan 14, 2022

One thing to add: since we'll need to rethink this, I wonder if it makes sense to change the priority label from a single global importance to importance per country (it could be a different label too). This could help us prioritize diagnosis.

Two examples:
webcompat/web-bugs#97492 has the priority-normal label, but the site is ranked #41 in Korea. So it's priority-critical for Korea.
webcompat/web-bugs#83322 has priority-normal, but the site is ranked #674 in the US and #846 in Canada. So it's likely priority-important for these 2 countries.

@karlcow
Member Author

karlcow commented Jan 16, 2022

Yes, it's a good opportunity to revise and improve the script here, which was trying to handle locales too.

    if url not in topsites:
        # URL not cached, create Site object and put in topsites
        site_row = Site(url, priority, country_code, rank)
        topsites[url] = site_row
        session.add(site_row)
    else:
        site_row = topsites[url]
        # If priority of the URL is higher than cached one,
        # update new priority, country_code and ranking in cache
        if site_row.priority > priority:
            site_row.priority = priority
            site_row.country_code = country_code
            site_row.ranking = rank

@miketaylr
Member

https://tranco-list.eu/ can be used - and it's free and has a nice Python API (https://pypi.org/project/tranco/) - there's also an HTTP API if that's preferred.

Attribution: We use the lists from three providers: Alexa, Cisco Umbrella (available free of charge), and Majestic (available under a Creative Commons Attribution 3.0 Unported License). Tranco is not affiliated with any of these providers.

I guess it will lose Alexa data eventually, but the data will still provide some signal.

ksy36 self-assigned this Feb 27, 2022
@ksy36
Contributor

ksy36 commented Mar 21, 2022

Thanks for the suggestion, Mike :)

I've looked at 2 Tranco lists (with and without Alexa) and am going to document my findings here. There are two things I've noticed so far that are worth considering when making the switch to Tranco.

I'm also keeping in mind this rule from @MDTsai in #1533 (comment):

Critical: alexa top 100 in worldwide
Important: alexa top 101-1000 in worldwide or alexa top 100 in tier 1 countries/regions
Normal: alexa top 1001-10000 or alexa top 101-1000 in tier 1 countries/regions
Others: others
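
The rule above can be sketched as a small function. This is only an illustration of the quoted thresholds, not the script's actual code: the function name, the bare label strings, and the assumption that `country_rank` is the site's best rank in a tier 1 country/region are all hypothetical.

```python
from typing import Optional


def _within(rank: Optional[int], limit: int) -> bool:
    """True if a rank is known and falls inside the top `limit`."""
    return rank is not None and 1 <= rank <= limit


def priority_for(global_rank: Optional[int],
                 country_rank: Optional[int] = None) -> str:
    """Map ranks to a priority name using the thresholds quoted above.

    global_rank: worldwide rank, or None if the site is unranked.
    country_rank: best rank in a tier 1 country/region (assumption), or None.
    """
    if _within(global_rank, 100):
        return "critical"
    if _within(global_rank, 1000) or _within(country_rank, 100):
        return "important"
    if _within(global_rank, 10000) or _within(country_rank, 1000):
        return "normal"
    return "others"
```

Note that under this rule a top-100 rank in a single country maps to "important", not "critical"; a per-country label as proposed earlier in the thread would need its own mapping.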

1) Ranking per country
Tranco doesn't seem to have a mechanism to get top domains per country. So it may be worth preserving the per-country Alexa data that we currently have and no longer updating it once we make the switch (...after fetching the updates one last time before Alexa is deprecated). This would only apply to domains that have a non-empty country_code:

[Screenshot, 2022-03-21 3:10:44 PM]

Countries list that we fetch ranking for: 'US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL', 'GB', 'RU'.

It's worth mentioning that there is a checkbox "Only include domains included in the Chrome User Experience Report of February 2022" on https://tranco-list.eu/configure, which allows filtering by country. It doesn't seem to be accurate though: the result is weighted towards global sites rather than local ones.

[Screenshot, 2022-03-21 2:28:56 PM]

2) For some sites, the Tranco ranking is lower than the current Alexa ranking and the "perceived" ranking

For certain sites, the ranking is lower on the global level (especially in the list without Alexa).

[Screenshot, 2022-03-21 3:36:46 PM]

This matters once a site drops out of the top 100 / top 1000. In this screenshot, all the sites that are in the Alexa top 100 are ranked lower in the Tranco list without Alexa. In the Tranco list with Alexa, some are still in the top 100, but etsy.com drops out.
I'm not sure how big of a concern this would be going forward. If Tranco keeps Alexa's data for some time, then it's probably fine. We could also change the rule a bit so that sites within the top 150-200 of the Tranco ranking are priority-critical (instead of the current top 100).

Also just saw a message on Tranco's website (so my observations may not be relevant soon 🙂 ):

We plan to continue maintaining Tranco, even after the deprecation of the Alexa ranking on 1 May. We are currently working on the next steps, including adding new data sources. We will announce more details in due time.

@ksy36
Contributor

ksy36 commented Mar 28, 2022

An update here:

I've looked at https://pypi.org/project/tranco/ and it downloads a csv with 1 million domains, which seems a bit too much for our needs, as we only require 10000 at most.

So I wrote a script that calls Tranco's API to get a recent list by date (for example https://tranco-list.eu/api/lists/date/2022-03-26) and then downloads a csv with the top 10k domains (https://tranco-list.eu/download/GZ6NK/10000).
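
A minimal sketch of that flow, using only the two URL patterns given above. This is not the actual script: the function names are hypothetical, the `list_id` field in the API's JSON response is an assumption, and so is the csv layout (one `rank,domain` pair per line, no header).

```python
import csv
import io
import json
from typing import Dict
from urllib.request import urlopen

TRANCO_BASE = "https://tranco-list.eu"


def list_url_for_date(date: str) -> str:
    """URL of the API endpoint describing the list for a YYYY-MM-DD date."""
    return f"{TRANCO_BASE}/api/lists/date/{date}"


def download_url(list_id: str, size: int = 10000) -> str:
    """URL of the csv holding the top `size` domains of a given list."""
    return f"{TRANCO_BASE}/download/{list_id}/{size}"


def parse_ranking(csv_text: str) -> Dict[str, int]:
    """Parse 'rank,domain' lines (assumed layout) into {domain: rank}."""
    ranking = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and anything that is not a 'rank,domain' pair.
        if len(row) != 2 or not row[0].strip().isdigit():
            continue
        ranking[row[1].strip()] = int(row[0])
    return ranking


def fetch_top_sites(date: str, size: int = 10000) -> Dict[str, int]:
    """Look up the list for `date`, then download and parse its top `size`."""
    with urlopen(list_url_for_date(date)) as response:
        metadata = json.load(response)
    list_id = metadata["list_id"]  # assumed field name in the API response
    with urlopen(download_url(list_id, size)) as response:
        return parse_ranking(response.read().decode("utf-8"))
```

Calling `fetch_top_sites("2022-03-26")` would perform the two requests; the URL builders and the parser can be exercised without network access.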

As for storing it, I'm thinking of creating 2 tables: one would contain data from this csv (domain and ranking) and the second one would contain top domains per country from Alexa (domain, ranking, country). There is probably no need to store priority, as it can be determined in the code, since it primarily depends on the rank.

So the ranking would be determined as follows (will have to join two tables to get the rank):

  • if domain only exists in the main table, use ranking from the main table
  • if domain exists in both tables, use ranking from the second table (if ranking per country is more important than in the global table)

I'm thinking of going the 2 tables route because, when fetching updates from Tranco, it would be easy to archive the old table and just create a new one (in the same way it's done right now) without the need to search for and update the rank of each domain. And the per-country rank table will not be updated once Alexa turns off their API. Regions that we're fetching at the moment: 'US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL', 'GB', 'RU'.
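
The two-table idea could look roughly like the sketch below, written against `sqlite3` so it is self-contained. The table names, column names, and function names are hypothetical, and it uses two simple lookups instead of a join for brevity. How to weigh a country rank against a global rank is left to the priority rule, so the lookup just returns both.

```python
import sqlite3
from typing import Dict, Optional, Tuple


def create_global_table(conn: sqlite3.Connection) -> None:
    """Table filled from the Tranco csv (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE global_rank ("
        "domain TEXT PRIMARY KEY, ranking INTEGER NOT NULL)")


def create_country_table(conn: sqlite3.Connection) -> None:
    """Frozen per-country Alexa data (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE country_rank ("
        "domain TEXT NOT NULL, ranking INTEGER NOT NULL, "
        "country_code TEXT NOT NULL, PRIMARY KEY (domain, country_code))")


def refresh_global(conn: sqlite3.Connection, ranking: Dict[str, int],
                   archive_suffix: str) -> None:
    """Archive the current global table and build a fresh one."""
    if not archive_suffix.isalnum():
        raise ValueError("archive_suffix must be alphanumeric")
    conn.execute(
        f'ALTER TABLE global_rank RENAME TO "global_rank_{archive_suffix}"')
    create_global_table(conn)
    conn.executemany("INSERT INTO global_rank VALUES (?, ?)", ranking.items())


def lookup_ranks(conn: sqlite3.Connection, domain: str
                 ) -> Tuple[Optional[int], Optional[int], Optional[str]]:
    """Return (global rank, best country rank, that country's code)."""
    global_row = conn.execute(
        "SELECT ranking FROM global_rank WHERE domain = ?",
        (domain,)).fetchone()
    country_row = conn.execute(
        "SELECT ranking, country_code FROM country_rank "
        "WHERE domain = ? ORDER BY ranking LIMIT 1", (domain,)).fetchone()
    global_rank = global_row[0] if global_row else None
    if country_row is None:
        return (global_rank, None, None)
    return (global_rank, country_row[0], country_row[1])
```

With this layout a Tranco update never touches `country_rank`, which matches the plan of freezing the per-country data once Alexa's API is gone.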

I'm probably missing something, so any insight or correction is appreciated :)
