-
-
Notifications
You must be signed in to change notification settings - Fork 183
Analysts' Guide
Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- Table of Contents
- Commitment summary
- Expensing queries
- Analysis workflow
- Dates to use
- Query conventions
- Formatting results
- A warning on totals
- Saving results to the sheets
- Almanac custom metrics and UDFs
- Almanac dataset
- Rank
- Appendix
- Content planning: about 4 hours working with the content team to identify metrics from the HTTP Archive that could be used to support the chapter. This time is also spent preparing queries and augmenting the testing infrastructure as needed (custom metrics). This phase typically occurs in June-July.
- Data validation: about 6 hours finalizing queries, running them against the dataset, and building a spreadsheet of results for the rest of the content team to review. This phase typically occurs in July-September.
- Content writing: about 2 hours reviewing the drafts to ensure that the results are being interpreted correctly. This phase typically occurs in September-October.
The total commitment is approximately 30 hours of work per chapter, spread over 6 months.
The 2024 project is underway and we're actively looking for analysts! Browse the list of open chapters and comment in any chapter-specific issues that interest you.
The Analysts team will be given BigQuery coupon codes to offset the cost of analysis. To get your code, join the #web-almanac-analysts
channel on the HTTP Archive Slack and ask @Paul Calvano
.
There could be dozens of queries per chapter, so to help stay organized here is a recommended workflow for analysts to get the queries written and the results shared.
- When the chapter is outlined, work with the content team to identify the quantifiable metrics that would be needed.
- Add each metric to the Metrics section using data-oriented descriptions of what it's measuring, for example "% of mobile pages that have images larger than 10 MB".
- Triage the feasibility of each metric based on the capabilities of the HTTP Archive dataset. Feel free to ask around in
#web-almanac-analysts
if you're not sure whether a metric is feasible. - Create a git branch off
main
named<chapter>-sql-<yyyy>
(or similar) and create a subdirectory for your chapter under thesql
directory named<yyyy>/<chapter-name>
. For example,2022/css
. - Open up a pull request to merge with the
main
branch, but leave it in draft mode for now. Edit the PR description to contain a checklist of all of the metrics you triaged as feasible in the doc. See #1087, for example. - As you write the queries for each metric, add a comment to the top of the query using the description from step 2, and give the file a short and descriptive name, like
pct_large_images.sql
. (Note that in 2019 we used a system of metric IDs like01_22.sql
, which made it hard to understand what a query did by its name.) - When all of the queries have been written, request a review from the @HTTPArchive/analysts team.
- When the 2020_08_01 HTTP Archive crawl is complete at the end of August, run each query and save its results to your designated spreadsheet linked from the chapter issue. Each metric's results should be in its own tab, which is named using the file name (eg
pct large images
). See visualizations examples. - To help others interpret the results, leave a comment on the top row of the results with the correct interpretation. For example: "Read as: 4.28% of mobile pages contain images larger than 10 MB".
- Authors and reviewers should check the results to make sure they're within expected limits, otherwise the query may need to be rewritten.
- When the authors have drafted the chapter, they should leave a placeholder for the data visualization. Generate a chart using the templates provided, publish it to get its embeddable HTML, and replace the placeholder.
The Web Almanac is an annual publication and we try to enforce a common HTTP Archive run to query for consistency across the Almanac. There has been some suggestions and discussion about making it less annual, or publishing different chapters at different time of years but, for now, it's annual primarily based around the one month's dataset.
We try to use the July dataset for each year, but there have been exceptions, either due to issues with that particular dataset making it less reliable (2020), or when we decided to publish earlier than usual (2022). Even within years, there are some exceptions. Below are the dates for each "Almanac Year" and the exceptions.
Almanac Edition | Dataset to use | Exceptions |
---|---|---|
2019 | 2019_07_01 | |
2020 | 2020_08_01 | HTTP Archive performance stats were incorrect for Aug and 2020-09-01 should be used for this (CrUX data was fine, and should really be used in preference to lab-based results anyway). |
2021 | 2021_07_01 | WASM used 2021_09_01 |
2022 | 2022_06_01 | Parsed CSS data was not available for June and July data should be used. Also 2022_06_09 contains home AND secondary pages for those chapters wanting to use that (the home page match the data fro 2022_06_01 as the official run kicked off later in the month and we just marked them as 2022_06_01). |
We have a linter to attempt to block incorrect usage when copying queries from year to year. You can prevent the linter from flagging in these edge cases by adding --noqa: L063
to the end of the SQL line with the "incorrect" date.
#standardSQL
# The number of unique websites in the July 2019 crawl.
SELECT
_TABLE_SUFFIX AS client,
COUNT(DISTINCT url) AS num_pages
FROM
`httparchive.summary_pages.2019_07_01_*`
GROUP BY
client
- The first line should be
#standardSQL
so the query is interpreted as Standard SQL syntax - The next line(s) should be a comment describing what the query is analyzing and the metric ID(s) if relevant
- Use uppercase for all SQL keywords (SELECT, COUNT, AS, FROM, etc)
- Fields, and tables, are descriptively named in
snake_case
(lowercase and underscore-delimited) - Function names are descriptively named in
camelCase
(lowercase in general with uppercase letter indicating new word) to differentiate them from fields and tables. Do not reuse the exact same function name for different purposes in different queries. See below for more information on user defined functions. - Table names should always specify the project (
httparchive
) and include both desktop and mobile when possible by using a wildcard (*
) - Separate the desktop and mobile results by grouping by client. When using a wildcard, this will be the
_TABLE_SUFFIX
special value. - Filenames should have a
.sql
extension for syntax highlighting in GitHub and other editors, and also allow them to be recognised by the linter (see below). - Filenames should not have spaces in them to make them easier to reference from the web.
In general we would like both data sets queried. It is tempting for some stats to only run one, whether there should not be any difference between the two (e.g. HTTP version supported, or looking at robots.txt), but that makes the assumption that we tests the same sites on both! There is a huge overlap alright, but not exactly the same because we base our crawl on CrUX's list which is based on usage. So https://m.facebook.com might appear on mobile but not on desktop. And maybe facebook forgot to add a robots.txt (or added a really weird one!) to their mobile site?
Most stats are available to both, the big exception being Lighthouse which currently is only available for the mobile crawl (due primarily to capacity constraints but something we are working on!).
We've also been surprised before when we wouldn't expect there to be much difference, so unless there's a very good reason (e.g. Lighthouse tests are only run on mobile, or it's for the mobile chapter), we recommend to include both - you might be surprised in the differences which makes interesting information to report on! That's not to say that the author needs to include both stats if they are repetitive, but let's at least run the data so they can make that call.
If we really do only need to run (or report on) only one, we generally prefer mobile over desktop because 1) We only have certain stats (Lighthouse) for mobile, 2) we've a bigger dataset for mobile and 3) we live in a mobile first world!
Most of the above standards will be enforced by the linter (SQLFluff) when you open a Pull Request and the linting will fail if you deviate from these standards.
You can run the linter locally if you have a python environment set up, which can be set up with the following commands if using Virtual Env:
cd src
virtualenv --python python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt
Or for those on Windows:
cd src
virtualenv --python python3 .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt
The environment will need to be activated each time but the virtual env should only need set up once and the requirements only need to be installed again if you want new versions.
To lint the 2020 resource-hints SQL files, for example, install the python environment as per above, and then issue the following command:
sqlfluff lint ../sql/2020/resource-hints
This will return errors like the following:
% sqlfluff lint ../sql/2020/resource-hints
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
All Finished 📜 🎉!
This states that:
- On line 25, in position 26 you are using lowercase for keywords (e.g.
as
instead ofAS
) so failed rule L010. - Similarly on line 26, position 37.
- And finally on line 34, position 63 you have an unnecessary comma (e.g.
SELECT a,b, FROM table
) and so failed rule L038. Remove the extra comma.
The list of rules can be found in the SQLFLuff documentation though we have turned a few of them off and configured others for our style (see the our .sqlfluff file if curious).
If you see any "unparseable" or PRS errors, then this is either an error in your code, or perhaps you've discovered a bug. Reach out to Barry (@tunetheweb) for help if stuck.
To attempt to autofix the errors you can use the fix
command, instead of lint
:
sqlfluff fix ../sql/2020/resource-hints
Which will produce similar output but with an offer to fix the issues it thinks it can fix:
% sqlfluff fix ../sql/2020/resource-hints
==== finding fixable violations ====
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
==== fixing violations ====
3 fixable linting violations found
Are you sure you wish to attempt to fix these? [Y/n]
If you lint again you should see most of the errors are fixed. Note that not all errors can be autofixed and some will require manual intervention but autofixing is useful for the simple errors. So while it's generally OK to run the fix
command, do run the lint
command when all clean to make sure the fix
command didn't miss any issues.
When returning a percent value, ensure that the number is between 0 and 1, not 0 and 100 (we used 0 to 100 in 2019 but switched in 2020 to allow easier formatting in Sheets). We also stopped rounding in BigQuery from 2020 onwards and present the full decimal value and round in Sheets.
#standardSQL
# Take the average percent of external stylesheets over all requests.
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(type = 'css') / COUNT(0) AS pct_stylesheets
FROM
`httparchive.summary_requests.2020_08_01_*`
GROUP BY
client
Usually you would sort by pct
in DESC
ing order. Try to reuse the column names rather than reimplement the percentage formula in the ORDER BY
clause too.
SELECT
_TABLE_SUFFIX AS client,
element,
COUNT(DISTINCT url) AS pages,
total,
COUNT(DISTINCT url) / total, 2 AS pct
FROM
`some_dataset`
GROUP BY
client,
total,
element
ORDER BY
pct DESC,
client
We generally want to present stats as a percentage of pages. The HTTP Archive data set changes month by month (based on the latest available CrUX data) and has been growing over time. Therefore presenting absolute values can make it difficult to understand the scale, and to do year on year comparison. So as much as possible we want data presented as percentages.
To do this we need to take the absolute frequency of stats and divide by the total number. This means you need a total
number. There are a number of ways of getting this.
You can use CTE (Common Table Expression):
WITH totals AS (
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
SELECT
_TABLE_SUFFIX AS client
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
totals
USING (_TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
You can use a subquery:
SELECT
_TABLE_SUFFIX AS client,
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
(
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
USING (__TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
Or you can use a windowing clause using OVER
:
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(...) AS freq,
SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS total
COUNT(0) / SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS pct
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
client
ORDER BY
pct DESC
The last looks the simplest and is a very handy way of not having to do another join. However care must be taken with this, as it's easy to get the wrong total with this. In particular:
-
UNNEST
queries change the totals and even usingDISTINCT url
or similar will not remove this as you are doing aSUM
of thoseDISTINCT url
's rather than aDISTINCT
of theSUM
. This often leads to a total much LARGER than expected. -
WHERE
clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected. -
HAVING
clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected. You can use theQUALIFY
option instead which is basically the same asHAVING
but is applied after.
So for simple functions with no joins, and no filters (using WHERE
or HAVING
) by all means use the windowing option, but be aware of its limitations.
It is always best to include the total in your exports so you can validate the numbers - why is this out of 500 million URLs? I thought our crawl was 7 million URLs?
Once all the queries are ready you can automate this step using this Colab Notebook.
To get all your queries results uploaded to the Google Sheets follow the steps:
-
Run 'Download repo' cell
This will create a local copy of the Almanac repository. It uses the original repo https://github.com/HTTPArchive/almanac.httparchive.org by default.
-
Adjust parameters and run 'Configure the chapter to process' cell
-
GCP_PROJECT
- the Google Cloud project ID that is used for billing of BigQuery queries (querying permissions required). -
year
- the year of the Almanac edition. -
chapter
- the name of the chapter. It's used to find branch and sql files. See naming conventions in Analysis Workflow. -
spreadsheet_url
- the spreadsheet designated for your chapter (edit permission required). You can find it in the chapter issue.
-
-
Run 'Update local branch' cell
It switches to the designated chapter branch and fetches any new commits.
-
Run 'Authenticate' cell
It authenticates your Google account in Google Colab to gain access to BigQuery and Google Sheets.
-
Adjust parameters and run 'Upload query results' cell
-
filename_match
- regular expression to match the queries to run. Any query that doesn't match will not be run. Wrap the regex in quotes. -
filename_match_exclude
- regular expression to skip the queries by the filename. No queries will be skipped if this left empty. Wrap the regex in quotes. -
dry_run
- check this to run queries as dry run. It will give an overview of the queries to be processed and the estimated data processing volumes. -
overwrite_sheets
- check this if the query results must overwrite the sheets that already exist. -
maximum_tb_billed
- use the maximum bytes billed setting to limit query costs.
-
In the pages
dataset, there is a payload property of _almanac
containing an object with several custom metrics. See https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js for the definitions of each metric.
Sometimes the values in a custom metric are too complex to process using the JSON_EXTRACT
functions in BigQuery, so you can do more complex analysis with a user-defined function (UDF). In the example below, the UDF parses both the payload JSON and the JSON-encoded _almanac
custom metric:
#standardSQL
# 01_15: Percent of pages that include link[rel=modulepreload]
CREATE TEMP FUNCTION hasModulePreload(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
var $ = JSON.parse(payload);
var almanac = JSON.parse($._almanac);
return !!almanac['link-nodes'].find(e => e.rel.toLowerCase() == 'modulepreload');
} catch (e) {
return false;
}
''';
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(hasModulePreload(payload)) AS num_pages,
COUNTIF(hasModulePreload(payload)) / COUNT(0) AS pct_modulepreload
FROM
`httparchive.pages.2019_07_01_*`
GROUP BY
client
As per the example function names should be written in camelCase (hasModulePreload
) rather than snake_case (has_module_preload
) to help differentiate them from table names.
Additionally we have had reports that reusing the same function name in different queries being run at the same time (e.g. two tabs in your browser) in the same session, can mix up the results! Sounds like a BigQuery bug if this is the case, but either way it's easily avoided but not reusing the same function name for different functions across queries within the same set of chapter queries.
httparchive.almanac
is a dataset containing 1k subset tables for all standard datasets (lighthouse, pages, requests, response_bodies, summary_pages, summary_requests, technologies)
There are also several preprocessed tables to make querying more convenient and cost-effective:
-
response_bodies: combination of desktop/mobile response bodies clustered by
client
,page
, andurl
fields -
summary_requests: combination of desktop/mobile summary requests clustered by
client
,page
, andurl
fields. Note thatpage
is the URL of the correspondingpageid
in the summary_pages table. -
summary_response_bodies: combination of desktop/mobile response bodies and summary requests data joined by
page
andurl
fields, clustered byclient
,firstHtml
, andtype
fields. This is the preferred table for querying response bodies of a certain type, for example:
#standardSQL
# Calculates the percent of pages that reference the word "almanac" (case-insensitive).
SELECT
client,
COUNTIF(REGEXP_CONTAINS(body, '(?i)almanac')) / COUNT(0) AS pct_almanac
FROM
`httparchive.almanac.summary_response_bodies`
WHERE
firstHtml
GROUP BY
client
Note that summary_response_bodies
is an 18 TB table, but thanks to the clustering, this query is limited to only the response bodies that are marked as firstHtml
. So the query actually completes in ~13 seconds and consumes only 850 GB.
Since the June 2021 crawl, rank data is taken from the CrUX report allowing queries on the top 1k origins, top 10k, top 100k, top 1M...etc. The rank
field is available in the httparchive.summary_pages.*
tables and will contain the values 1000
, 10000
, 100000
, 1000000
and 10000000
representing the rank for each url
crawled.
Note that the ranks are exclusive, meaning that a URL may only have one rank. For example, the 100000
rank contains 99,000 URLs because the top 1,000 is its own exclusive segment. To include all URLs you may combine the different ranks as follows:
SELECT
client,
rank_grouping,
COUNT(0) AS total
FROM (
SELECT
_TABLE_SUFFIX AS client,
rank,
FROM
`httparchive.summary_pages.2021_07_01_*`
),
UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank_grouping
WHERE
rank <= rank_grouping
GROUP BY
client,
rank_grouping
ORDER BY
client,
rank_grouping
When running the above query, you may notice that the total
does not perfectly match the number of expected URLs, i.e. less than 1,000 URLs for the top 1000
rank. Since the CrUX rank for each origin is shared between desktop & mobile, it may be possible that an origin which features on CrUX for one device type, does not feature for the other (e.g. a mobile version of a site like https://m.example.com
may not appear in desktop CrUX data if it doesn't get a lot of visits from desktop devices). As a result, for the 1000
rank, the list of URLs to crawl may contain a maximum of 1,000 URLs - if all URLs feature on both desktop and mobile - but likely to contain a subset.
Scripts to analyze number of queries and lines of SQL for a given year.
Number of queries:
ls sql/2021/**/*.sql | wc -l
Total lines of SQL:
wc -l sql/2021/**/*.sql | tail -n 1