Privacy 2024 queries #3653

Merged
merged 114 commits into from
Nov 3, 2024
Changes from 58 commits
251654f
readme
max-ostapenko May 3, 2024
537aa53
copied 2022 SQLs over to update/review
max-ostapenko May 3, 2024
237a280
fixed link
max-ostapenko May 3, 2024
66cccb2
origin trials
max-ostapenko Jun 9, 2024
736a7ab
Bump puppeteer from 22.7.1 to 22.8.0 in /src (#3655)
dependabot[bot] May 7, 2024
86506a3
notebook + readme (#3652)
max-ostapenko May 7, 2024
99aee63
Bump pytest from 8.1.1 to 8.2.0 in /src (#3651)
dependabot[bot] May 7, 2024
f76f679
Translation of privacy chapter to Japanese (#3654)
ksakae1216 May 7, 2024
c29735a
Update Timestamps (#3657)
github-actions[bot] May 7, 2024
47811ca
2023 Performance (#3525)
rviscomi May 15, 2024
46dabdc
Bump puppeteer from 22.8.0 to 22.9.0 in /src (#3662)
dependabot[bot] May 19, 2024
743f560
Upgrade to web-vitals v4 (#3661)
rviscomi May 20, 2024
0983d6c
Bump pytest from 8.2.0 to 8.2.1 in /src (#3664)
dependabot[bot] May 21, 2024
42e4599
--- (#3665)
dependabot[bot] May 22, 2024
22ec347
Bump puppeteer from 22.9.0 to 22.10.0 in /src (#3668)
dependabot[bot] May 24, 2024
1eb5873
Bump jsdom from 24.0.0 to 24.1.0 in /src (#3669)
dependabot[bot] May 27, 2024
c566929
Typofix (#3670)
borisschapira May 28, 2024
2d4bceb
SQL and MD folders the 2024 Web Almanac (#3666)
ChrisBeeti May 29, 2024
2c87cf2
Bump prettier from 3.2.5 to 3.3.0 in /src (#3672)
dependabot[bot] Jun 4, 2024
d60de8c
Bump pytest from 8.2.1 to 8.2.2 in /src (#3673)
dependabot[bot] Jun 5, 2024
4f925b8
Bump prettier from 3.3.0 to 3.3.1 in /src (#3674)
dependabot[bot] Jun 5, 2024
846c710
Fix loaf monitoring bug (#3675)
tunetheweb Jun 5, 2024
524c51c
Update Timestamps (#3677)
github-actions[bot] Jun 5, 2024
6f441ff
Bump web-vitals from 4.0.1 to 4.1.0 in /src (#3678)
dependabot[bot] Jun 7, 2024
689d2ef
fixed link
max-ostapenko May 3, 2024
dd21cf0
remove unreviewed sql
max-ostapenko Jun 9, 2024
be50c36
Merge branch 'main' into privacy-sql-2024
max-ostapenko Jun 9, 2024
320ebbe
lint test
max-ostapenko Jun 9, 2024
21f612e
lint
max-ostapenko Jun 9, 2024
d0c2c35
ads supply graph
max-ostapenko Jul 14, 2024
27511d0
lint
max-ostapenko Jul 14, 2024
8cd1e83
close file
max-ostapenko Jul 14, 2024
da015db
lint
max-ostapenko Jul 14, 2024
b7179e4
top_direct_sellers
max-ostapenko Jul 20, 2024
4de4c61
ads_txt_lines_histogram
max-ostapenko Jul 20, 2024
de249eb
ads_txt_seller_accounts_by_type
max-ostapenko Jul 20, 2024
ddf2ba8
top_ads_variables
max-ostapenko Jul 20, 2024
4e24a59
format
max-ostapenko Jul 20, 2024
3552776
tcf2
max-ostapenko Jul 21, 2024
5cc3695
rename
max-ostapenko Jul 21, 2024
4796653
lint
max-ostapenko Jul 21, 2024
cd6cac0
using custom_metrics
max-ostapenko Jul 25, 2024
9d11fcc
most_common_cname_domains
max-ostapenko Jul 25, 2024
ab54d6a
adguard list
max-ostapenko Aug 4, 2024
d9242dd
gpc
max-ostapenko Aug 4, 2024
17c4455
referrer policy
max-ostapenko Aug 4, 2024
52d57b5
usp
max-ostapenko Aug 4, 2024
234ef27
iab frameworks
max-ostapenko Aug 5, 2024
0bce587
lint
max-ostapenko Aug 5, 2024
8c60240
bounce trackers
max-ostapenko Aug 5, 2024
b1b47bc
Added privacy sandbox related queries
Yash-Vekaria Aug 13, 2024
14136ae
lint
Yash-Vekaria Aug 13, 2024
d6b1db4
missed lint
Yash-Vekaria Aug 13, 2024
a83f88d
dnt
max-ostapenko Aug 14, 2024
cf99788
client hints
max-ostapenko Aug 14, 2024
7fc52f4
whotracksme update
max-ostapenko Aug 14, 2024
95dd276
lint
max-ostapenko Aug 14, 2024
23ce85b
referrer policy
max-ostapenko Aug 14, 2024
27e4d43
rank filter removed
max-ostapenko Aug 14, 2024
109e807
trackers
max-ostapenko Aug 15, 2024
d41de3d
util deps
max-ostapenko Aug 15, 2024
266fa78
limits
max-ostapenko Aug 15, 2024
b90332f
Privacy 2024 queries - CCPA, fingerprinting, cookies (#3720)
bstandaert-wustl Aug 15, 2024
29cccaf
bq to sheets updates
max-ostapenko Aug 15, 2024
b35e6de
query optimisation
max-ostapenko Aug 15, 2024
22a21ef
downgrade for python 3.8
max-ostapenko Aug 15, 2024
7ea017b
more categories
max-ostapenko Aug 15, 2024
ff429ff
more categories and columns reordered
max-ostapenko Aug 15, 2024
5afff7c
forms and formatted logs
max-ostapenko Aug 15, 2024
37c42d3
Refactoring queries to produce output for queries only
Yash-Vekaria Aug 15, 2024
0d39f6b
lint
max-ostapenko Aug 16, 2024
1c4e468
Merge branch 'main' into privacy-sql-2024
max-ostapenko Aug 16, 2024
a239c25
lint
max-ostapenko Aug 16, 2024
baf490d
Privacy Sql Tracking Detection Using Easylist Adservers (#3730)
hadiamjad Aug 16, 2024
4ab293f
log query errors
max-ostapenko Aug 17, 2024
3fb692e
Fixed privacy sandbox attestation query bug
Yash-Vekaria Aug 17, 2024
58dac23
maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
6f99ae6
moved to chapter root
max-ostapenko Aug 17, 2024
0b4898d
postpone dryrun check
max-ostapenko Aug 17, 2024
5445b92
fingerprinting_most_common_apis: improve resilience to malformed JSON…
bstandaert-wustl Aug 17, 2024
dac1167
optional maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
3d8cb6d
formatting
max-ostapenko Aug 18, 2024
e8a032a
queries and notebook updates
max-ostapenko Aug 18, 2024
82c084e
queries to rerun
max-ostapenko Aug 18, 2024
ed8944c
origin trials function fix
max-ostapenko Aug 19, 2024
bc6a045
optimised sellers count
max-ostapenko Aug 19, 2024
a917161
apps included in ads.txt lines
max-ostapenko Aug 19, 2024
c51a3e7
another rerun
max-ostapenko Aug 19, 2024
2792d67
lint
max-ostapenko Aug 19, 2024
b2a7f4f
no origins
max-ostapenko Aug 20, 2024
51a71f0
optimized perf
max-ostapenko Aug 20, 2024
23a72c7
more optimized perf
max-ostapenko Aug 20, 2024
c8450a0
graph optimization and OT expiration
max-ostapenko Aug 21, 2024
17ded3e
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Aug 21, 2024
e29a3eb
earlier grouping for performance
max-ostapenko Aug 21, 2024
975f7c8
graph fixes
max-ostapenko Aug 21, 2024
fda33dd
cookies, ccpa, fingerprinting: calculate percent of total pages
bstandaert-wustl Aug 22, 2024
0c30a7a
query for top third-party cookie names
bstandaert-wustl Aug 24, 2024
cfde873
bq writer module
max-ostapenko Sep 18, 2024
ac6e895
add grouping
max-ostapenko Oct 1, 2024
fe31518
domain suffixes and regexes removed
max-ostapenko Oct 1, 2024
741b655
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Oct 28, 2024
760ebed
add comments
max-ostapenko Oct 30, 2024
522ab70
review
max-ostapenko Oct 30, 2024
3d5a9cb
add PR link
max-ostapenko Oct 30, 2024
46390a5
lint
max-ostapenko Oct 30, 2024
b98454b
remove mobile filter
max-ostapenko Oct 30, 2024
858324e
lint
max-ostapenko Oct 30, 2024
129e36f
lint
max-ostapenko Oct 30, 2024
9bd5ea4
disable import-error rule
max-ostapenko Oct 30, 2024
dd0357a
adguard not used
max-ostapenko Oct 31, 2024
c374995
linting
max-ostapenko Oct 31, 2024
a81781c
pages_pct in query
max-ostapenko Oct 31, 2024
a646f8e
lint
max-ostapenko Oct 31, 2024
106 changes: 106 additions & 0 deletions sql/2024/privacy/ads_and_sellers_graph.sql
@@ -0,0 +1,106 @@
WITH RECURSIVE pages AS (
SELECT
CASE page -- publisher websites may redirect to an SSP domain; in that case use the redirected domain instead of the page domain
WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
ELSE NET.REG_DOMAIN(page)
END AS page,
custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 10000
), ads AS (
SELECT
page,
JSON_QUERY(custom_metrics, '$.ads.ads.account_types') AS ad_accounts
FROM pages
WHERE
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0
), sellers AS (
SELECT
page,
JSON_QUERY(custom_metrics, '$.ads.sellers.seller_types') AS ad_sellers
FROM pages
WHERE
CAST(JSON_VALUE(custom_metrics, '$.ads.sellers.seller_count') AS INT64) > 0
), relationships AS (
SELECT
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
'Web' AS supply,
'direct' AS relationship,
page AS publisher
FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.direct.domains')) AS domain
UNION ALL
SELECT
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
'Web' AS supply,
'indirect' AS relationship,
page AS publisher
FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.reseller.domains')) AS domain
UNION ALL
SELECT
page AS demand,
'Web' AS supply,
'direct' AS relationship,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.publisher.domains')) AS domain
UNION ALL
SELECT
page AS demand,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
'indirect' AS relationship,
NULL AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.intermediary.domains')) AS domain
UNION ALL
SELECT
page AS demand,
'Web' AS supply,
'direct' AS relationship,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
UNION ALL
SELECT
page AS demand,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
'indirect' AS relationship,
NULL AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
), nodes AS (
(
SELECT
demand,
supply,
CONCAT(demand, '-', supply) AS path_history,
relationship,
HLL_COUNT.INIT(publisher) AS supply_sketch
FROM relationships
WHERE supply = 'Web'
GROUP BY demand, supply, relationship
)
UNION ALL
(
SELECT
relationships.demand AS demand,
relationships.supply AS supply,
CONCAT(relationships.demand, '-', nodes.path_history) AS path_history,
relationships.relationship AS relationship,
nodes.supply_sketch AS supply_sketch
FROM relationships
INNER JOIN nodes
ON relationships.supply = nodes.demand AND
nodes.supply_sketch IS NOT NULL AND
nodes.relationship = 'indirect' AND
STRPOS(nodes.path_history, CONCAT(relationships.demand, '-', relationships.supply)) = 0
)
)

SELECT
demand,
supply,
path_history,
relationship,
HLL_COUNT.MERGE(supply_sketch) AS publishers_count
FROM nodes
GROUP BY demand, supply, relationship, path_history
ORDER BY publishers_count DESC
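The recursive `nodes` CTE above can be hard to follow in SQL form. The sketch below models the same traversal in plain JavaScript under two assumptions: an exact `Set` of publishers stands in for the HLL sketch, and the final `GROUP BY` merge is replaced by reporting each row's set size. The function name `expandSupplyPaths` and the sample domains are hypothetical.

```javascript
// Simplified model of the recursive `nodes` CTE.
// Assumption: an exact Set of publishers stands in for the HLL sketch.
function expandSupplyPaths(relationships) {
  // Base step: rows that sell straight to the 'Web', grouped by demand and relationship.
  const base = new Map();
  for (const r of relationships) {
    if (r.supply !== 'Web') continue;
    const key = `${r.demand}|${r.relationship}`;
    if (!base.has(key)) {
      base.set(key, {
        demand: r.demand,
        supply: r.supply,
        pathHistory: `${r.demand}-${r.supply}`,
        relationship: r.relationship,
        publishers: new Set(),
      });
    }
    // Like HLL_COUNT.INIT, NULL publishers are ignored.
    if (r.publisher !== null) base.get(key).publishers.add(r.publisher);
  }

  const nodes = [...base.values()];
  let frontier = nodes.slice();
  // Recursive step: repeat until an iteration produces no new rows.
  while (frontier.length > 0) {
    const next = [];
    for (const n of frontier) {
      // Only indirect nodes with a non-empty publisher set are extended.
      if (n.relationship !== 'indirect' || n.publishers.size === 0) continue;
      for (const r of relationships) {
        if (r.supply !== n.demand) continue;
        // Cycle guard, mirroring STRPOS(path_history, 'demand-supply') = 0.
        if (n.pathHistory.includes(`${r.demand}-${r.supply}`)) continue;
        next.push({
          demand: r.demand,
          supply: r.supply,
          pathHistory: `${r.demand}-${n.pathHistory}`,
          relationship: r.relationship,
          publishers: n.publishers, // inherited unchanged from the downstream node
        });
      }
    }
    nodes.push(...next);
    frontier = next;
  }
  return nodes.map(({ publishers, ...row }) => ({ ...row, publishersCount: publishers.size }));
}

// Hypothetical sample: one direct seller, one reseller, one SSP behind the reseller.
const sampleRelationships = [
  { demand: 'ssp-a.example', supply: 'Web', relationship: 'direct', publisher: 'pub1.example' },
  { demand: 'reseller.example', supply: 'Web', relationship: 'indirect', publisher: 'pub1.example' },
  { demand: 'reseller.example', supply: 'Web', relationship: 'indirect', publisher: 'pub2.example' },
  { demand: 'ssp-b.example', supply: 'reseller.example', relationship: 'indirect', publisher: null },
];
const samplePaths = expandSupplyPaths(sampleRelationships);
console.log(samplePaths);
```

Each hop prepends the new demand domain to `pathHistory`, and the substring check stops a path from reusing an edge it already contains, which is what keeps the recursion finite when two intermediaries list each other.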
27 changes: 27 additions & 0 deletions sql/2024/privacy/ads_lines_amount.sql
@@ -0,0 +1,27 @@
WITH pages AS (
SELECT
CASE page -- publisher websites may redirect to an SSP domain; in that case use the redirected domain instead of the page domain
WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
ELSE NET.REG_DOMAIN(page)
END AS page,
custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 10000
), ads AS (
SELECT
page,
CEIL(CAST(JSON_VALUE(custom_metrics, '$.ads.ads.line_count') AS INT64) / 100) * 100 AS line_count_bucket
FROM pages
WHERE
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.line_count') AS INT64) > 0
)

SELECT
line_count_bucket,
COUNT(DISTINCT page) AS page_count
FROM ads
GROUP BY line_count_bucket
ORDER BY line_count_bucket ASC
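The histogram query above rounds each ads.txt line count up to the next multiple of 100 before counting distinct pages. A minimal JavaScript sketch of that bucketing, with hypothetical function names and sample rows, may make the bucket edges easier to see:

```javascript
// Round a positive line count up to the next multiple of 100,
// as CEIL(line_count / 100) * 100 does in the query.
function lineCountBucket(lineCount) {
  return Math.ceil(lineCount / 100) * 100;
}

// Count distinct pages per bucket; rows with line_count <= 0 are filtered out first.
function lineCountHistogram(rows) {
  const pagesByBucket = new Map();
  for (const { page, lineCount } of rows) {
    if (!(lineCount > 0)) continue;
    const bucket = lineCountBucket(lineCount);
    if (!pagesByBucket.has(bucket)) pagesByBucket.set(bucket, new Set());
    pagesByBucket.get(bucket).add(page);
  }
  return [...pagesByBucket.entries()]
    .map(([bucket, pages]) => ({ lineCountBucket: bucket, pageCount: pages.size }))
    .sort((a, b) => a.lineCountBucket - b.lineCountBucket);
}

// Hypothetical sample rows.
const histogram = lineCountHistogram([
  { page: 'a.example', lineCount: 1 },
  { page: 'b.example', lineCount: 100 },
  { page: 'c.example', lineCount: 101 },
  { page: 'd.example', lineCount: 0 },
]);
console.log(histogram);
```

With this rounding, bucket 100 covers counts 1 to 100 and bucket 200 covers 101 to 200, so each label is the inclusive upper edge of its range.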
33 changes: 33 additions & 0 deletions sql/2024/privacy/ads_seller_accounts_by_type.sql
@@ -0,0 +1,33 @@
WITH pages AS (
SELECT
NET.REG_DOMAIN(page) AS page,
custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 10000
), ads AS (
SELECT
CEIL(CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_types.direct.account_count') AS INT64) / 100) * 100 AS direct_account_count_bucket,
CEIL(CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_types.reseller.account_count') AS INT64) / 100) * 100 AS reseller_account_count_bucket
FROM pages
WHERE
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0
)

SELECT
'direct' AS account_type,
direct_account_count_bucket AS account_count_bucket,
COUNT(0) AS page_count
FROM ads
GROUP BY direct_account_count_bucket
UNION ALL
SELECT
'reseller' AS account_type,
reseller_account_count_bucket AS account_count_bucket,
COUNT(0) AS page_count
FROM ads
GROUP BY reseller_account_count_bucket
ORDER BY account_count_bucket ASC
LIMIT 200
@@ -0,0 +1,78 @@
-- ara-trigger-registrations-for-different-destinations-by-destinations.sql
-- Analysis of Attribution Reporting API (ARA) triggers registered for different destinations, grouped by destination domain (i.e., advertiser domain):
-- 1. Number of third parties that register a trigger for the given destination
-- 2. Min. epsilon -- MIN(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS min_epsilon
-- 3. Avg. epsilon -- AVG(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS avg_epsilon
-- 4. Max. epsilon -- MAX(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS max_epsilon
-- [The lower the epsilon, the stronger the privacy protection] [Epsilon is always undefined, so the last 3 columns are omitted for this year at least]
-- Output comprises 9.5K rows and 2 columns (destination, third_party_count).

-- Extracting third-parties observed using ARA API on a publisher
CREATE TEMP FUNCTION jsonObjectKeys(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (!input) {
return [];
}
return Object.keys(JSON.parse(input));
""";

-- Extracting ARA API source registration details being passed by a given third-party (passed as "key")
CREATE TEMP FUNCTION jsonObjectValues(input STRING, key STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (!input) {
return [];
}
const jsonObject = JSON.parse(input);
const values = jsonObject[key] || [];
const result = [];

values.forEach(value => {
if (value.toLowerCase().startsWith('attribution-reporting-register-source|')) {
// Strip the prefix by length so that a differently cased header name is removed too
const parts = value.slice('attribution-reporting-register-source|'.length).split('|');
parts.forEach(part => {
if (part.startsWith('destination=')) {
const destinations = part.replace('destination=', '').split(',');
destinations.forEach(destination => {
result.push('destination=' + destination.trim());
});
} else {
result.push(part.trim());
}
});
}
});

return result;
""";

WITH ara_features AS (
SELECT
NET.REG_DOMAIN(page) AS publisher,
third_party_domain,
CASE
WHEN ara LIKE 'destination=%' THEN NET.REG_DOMAIN(REPLACE(ara, 'destination=', ''))
ELSE NULL
END AS destination,
CASE
WHEN ara LIKE 'epsilon=%' THEN SAFE_CAST(REPLACE(ara, 'epsilon=', '') AS FLOAT64)
ELSE NULL
END AS epsilon
FROM `httparchive.all.pages`,
UNNEST(jsonObjectKeys(JSON_QUERY(custom_metrics, '$.privacy-sandbox.privacySandBoxAPIUsage'))) AS third_party_domain,
UNNEST(jsonObjectValues(JSON_QUERY(custom_metrics, '$.privacy-sandbox.privacySandBoxAPIUsage'), third_party_domain)) AS ara
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 1000000
)
SELECT
destination,
COUNT(DISTINCT third_party_domain) AS third_party_count
FROM ara_features
WHERE destination IS NOT NULL
GROUP BY destination
HAVING third_party_count > 0
ORDER BY third_party_count DESC;
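To see what the `jsonObjectValues` UDF in this file returns, the sketch below follows its logic as a standalone JavaScript function (it strips the prefix by length rather than with `replace`). The function name `araSourceParts`, the sample third-party domain, and the sample header string are hypothetical; the real shape of `privacySandBoxAPIUsage` values is assumed from the parsing code, not confirmed.

```javascript
const ARA_SOURCE_PREFIX = 'attribution-reporting-register-source|';

// Standalone sketch of the jsonObjectValues UDF logic, runnable outside BigQuery.
function araSourceParts(input, key) {
  if (!input) return [];
  const values = JSON.parse(input)[key] || [];
  const result = [];
  for (const value of values) {
    // Keep only ARA source registrations; the match is case-insensitive.
    if (!value.toLowerCase().startsWith(ARA_SOURCE_PREFIX)) continue;
    for (const part of value.slice(ARA_SOURCE_PREFIX.length).split('|')) {
      if (part.startsWith('destination=')) {
        // Emit one entry per comma-separated destination.
        for (const destination of part.slice('destination='.length).split(',')) {
          result.push('destination=' + destination.trim());
        }
      } else {
        result.push(part.trim());
      }
    }
  }
  return result;
}

// Hypothetical sample input keyed by third-party domain.
const sampleUsage = JSON.stringify({
  'tracker.example': [
    'attribution-reporting-register-source|destination=https://shop.example, https://store.example|source_event_id=123',
    'some-other-api|ignored',
  ],
});
const sampleParts = araSourceParts(sampleUsage, 'tracker.example');
console.log(sampleParts);
```

Splitting multi-destination registrations into separate `destination=` entries is what lets the outer query count each advertiser domain on its own row after `UNNEST`.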
@@ -0,0 +1,78 @@
-- ara-trigger-registrations-for-different-destinations-by-third-parties.sql
-- Analysis of Attribution Reporting API (ARA) triggers registered for different destinations, grouped by third-party (TP) domain:
-- 1. Number of destinations registered by a given TP
-- 2. Min. epsilon -- MIN(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS min_epsilon
-- 3. Avg. epsilon -- AVG(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS avg_epsilon
-- 4. Max. epsilon -- MAX(CASE WHEN epsilon IS NOT NULL THEN epsilon END) AS max_epsilon
-- [The lower the epsilon, the stronger the privacy protection] [Epsilon is always undefined, so the last 3 columns are omitted for this year]
-- Output comprises 17 rows and 2 columns (third_party_domain, destination_count).

-- Extracting third-parties observed using ARA API on a publisher
CREATE TEMP FUNCTION jsonObjectKeys(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (!input) {
return [];
}
return Object.keys(JSON.parse(input));
""";

-- Extracting ARA API source registration details being passed by a given third-party (passed as "key")
CREATE TEMP FUNCTION jsonObjectValues(input STRING, key STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (!input) {
return [];
}
const jsonObject = JSON.parse(input);
const values = jsonObject[key] || [];
const result = [];

values.forEach(value => {
if (value.toLowerCase().startsWith('attribution-reporting-register-source|')) {
// Strip the prefix by length so that a differently cased header name is removed too
const parts = value.slice('attribution-reporting-register-source|'.length).split('|');
parts.forEach(part => {
if (part.startsWith('destination=')) {
const destinations = part.replace('destination=', '').split(',');
destinations.forEach(destination => {
result.push('destination=' + destination.trim());
});
} else {
result.push(part.trim());
}
});
}
});

return result;
""";

WITH ara_features AS (
SELECT
NET.REG_DOMAIN(page) AS publisher,
third_party_domain,
CASE
WHEN ara LIKE 'destination=%' THEN NET.REG_DOMAIN(REPLACE(ara, 'destination=', ''))
ELSE NULL
END AS destination,
CASE
WHEN ara LIKE 'epsilon=%' THEN SAFE_CAST(REPLACE(ara, 'epsilon=', '') AS FLOAT64)
ELSE NULL
END AS epsilon
FROM `httparchive.all.pages`,
UNNEST(jsonObjectKeys(JSON_QUERY(custom_metrics, '$.privacy-sandbox.privacySandBoxAPIUsage'))) AS third_party_domain,
UNNEST(jsonObjectValues(JSON_QUERY(custom_metrics, '$.privacy-sandbox.privacySandBoxAPIUsage'), third_party_domain)) AS ara
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 1000000
)
SELECT
third_party_domain,
COUNT(DISTINCT destination) AS destination_count
FROM ara_features
WHERE third_party_domain IS NOT NULL
GROUP BY third_party_domain
HAVING destination_count > 0
ORDER BY destination_count DESC;
@@ -0,0 +1,31 @@
-- attested-domains-in-top-1M-using-privacy-sandbox.sql
-- Contributed by: @yohhaan
-- URLs that may have a `/.well-known/related-website-set.json` file for the Privacy Sandbox Related Website Sets proposal, and
-- URLs that may have a `/.well-known/privacy-sandbox-attestations.json` file for Privacy Sandbox API attestation
-- Note: this only extracts a list of candidate origins to feed to another crawler (https://github.com/privacysandstorm/well-known-crawler), which checks whether the file is valid by parsing it
-- Test query on `httparchive.sample_data.pages_1k` and `TABLESAMPLE SYSTEM (1.0 PERCENT)` with the latest crawl date
-- Final query on `httparchive.all.pages`


WITH wellknown AS (
SELECT
NET.HOST(page) AS host,
CAST(JSON_VALUE(custom_metrics, '$.well-known."/.well-known/related-website-set.json".found') AS BOOL) AS rws,
CAST(JSON_VALUE(custom_metrics, '$.well-known."/.well-known/privacy-sandbox-attestations.json".found') AS BOOL) AS attestation
FROM
`httparchive.all.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 1000000
)

SELECT DISTINCT
host,
CASE WHEN rws THEN 1 ELSE 0 END AS related_websites_set,
CASE WHEN attestation THEN 1 ELSE 0 END AS privacy_sandbox_attestation
FROM
wellknown
WHERE
rws OR attestation;
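The filtering in the query above can be sketched in JavaScript as follows. The shape of the `well-known` custom metric is assumed from the JSONPath expressions in the query, and the function name `wellKnownCandidates` and the sample pages are hypothetical. A missing `found` value behaves like SQL `NULL` and drops the row unless the other flag is true.

```javascript
const RWS_PATH = '/.well-known/related-website-set.json';
const ATTESTATION_PATH = '/.well-known/privacy-sandbox-attestations.json';

// Return distinct (host, flags) rows for pages that report at least one file.
function wellKnownCandidates(pages) {
  const rows = new Map();
  for (const { page, customMetrics } of pages) {
    const wellKnown = JSON.parse(customMetrics)['well-known'] || {};
    const rws = wellKnown[RWS_PATH]?.found === true;
    const attestation = wellKnown[ATTESTATION_PATH]?.found === true;
    if (!rws && !attestation) continue;
    const host = new URL(page).hostname; // stands in for NET.HOST(page)
    // SELECT DISTINCT: key on every output column.
    rows.set(`${host}|${rws}|${attestation}`, {
      host,
      relatedWebsitesSet: rws ? 1 : 0,
      privacySandboxAttestation: attestation ? 1 : 0,
    });
  }
  return [...rows.values()];
}

// Hypothetical sample pages.
const candidates = wellKnownCandidates([
  {
    page: 'https://a.example/',
    customMetrics: JSON.stringify({
      'well-known': { [RWS_PATH]: { found: true }, [ATTESTATION_PATH]: { found: false } },
    }),
  },
  {
    page: 'https://b.example/',
    customMetrics: JSON.stringify({ 'well-known': { [ATTESTATION_PATH]: { found: false } } }),
  },
  {
    page: 'https://ads.c.example/',
    customMetrics: JSON.stringify({ 'well-known': { [ATTESTATION_PATH]: { found: true } } }),
  },
]);
console.log(candidates);
```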
29 changes: 29 additions & 0 deletions sql/2024/privacy/common_ads_variables.sql
@@ -0,0 +1,29 @@
WITH pages AS (
SELECT
NET.REG_DOMAIN(page) AS page,
custom_metrics
FROM `httparchive.all.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page = TRUE AND
rank <= 10000
), ads AS (
SELECT
page,
variable,
COUNT(DISTINCT page) OVER() AS total_publishers
FROM pages,
UNNEST(JSON_VALUE_ARRAY(custom_metrics, '$.ads.ads.variables')) AS variable
WHERE
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_types.reseller.account_count') AS INT64) > 0 OR
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_types.direct.account_count') AS INT64) > 0
)

SELECT
variable,
COUNT(DISTINCT page) AS publishers_count,
ANY_VALUE(total_publishers) AS total_publishers
FROM ads
GROUP BY variable
ORDER BY publishers_count DESC
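One subtlety in the query above is that `total_publishers` is computed after the `UNNEST`, so a page with an empty `variables` array never contributes to it. The JavaScript sketch below models that behaviour. The function name `commonAdsVariables`, the field names, and the sample rows are hypothetical.

```javascript
// Model of the query: per-variable publisher counts plus the shared total.
function commonAdsVariables(pages) {
  const publishersByVariable = new Map();
  const allPublishers = new Set();
  for (const { page, variables, directAccounts, resellerAccounts } of pages) {
    // WHERE: keep pages with at least one direct or reseller account.
    if (!(directAccounts > 0 || resellerAccounts > 0)) continue;
    for (const variable of variables) {
      // A page is only seen through its unnested variables, so pages with
      // an empty array never reach the window COUNT(DISTINCT page) OVER().
      allPublishers.add(page);
      if (!publishersByVariable.has(variable)) publishersByVariable.set(variable, new Set());
      publishersByVariable.get(variable).add(page);
    }
  }
  return [...publishersByVariable.entries()]
    .map(([variable, publishers]) => ({
      variable,
      publishersCount: publishers.size,
      totalPublishers: allPublishers.size,
    }))
    .sort((a, b) => b.publishersCount - a.publishersCount);
}

// Hypothetical sample rows.
const variableCounts = commonAdsVariables([
  { page: 'a.example', variables: ['contact', 'ownerdomain'], directAccounts: 3, resellerAccounts: 0 },
  { page: 'b.example', variables: ['ownerdomain'], directAccounts: 0, resellerAccounts: 2 },
  { page: 'c.example', variables: [], directAccounts: 1, resellerAccounts: 0 },
  { page: 'd.example', variables: ['contact'], directAccounts: 0, resellerAccounts: 0 },
]);
console.log(variableCounts);
```

In this sample `c.example` has accounts but no variables, so the total is 2 rather than 3. Percentages derived from `total_publishers` are therefore relative to publishers that declare at least one variable.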