Response bodies tables too big to load into BigQuery in one go (15 TB) #225
Comments
The Dataflow job was started and should take 8.5 hours to complete. I also want to note that this will result in ~300k response bodies (0.11%) being truncated, not omitted; I think that was a mistake in my math above.

One workaround to the 15 TB limit may be to write to two tables and merge them in a post-load job. But all this work is making me wonder if it's worth the effort. I actively discourage everyone from querying the response_bodies tables anyway, given how expensive they are to scan. With the recent support for custom metrics, for researchers interested in querying over all response bodies on BigQuery, we could encourage them to write a new custom metric, which also has the benefit of enriching the cheaper-to-query pages tables.

@paulcalvano @pmeenan WDYT?
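To make that concrete, here is a rough sketch of what the cheaper path could look like once such a custom metric exists. The metric name `_exampleMetric` is hypothetical, and the query assumes the metric would be exposed through the pages table's HAR payload:

```python
# Illustrative only: query a hypothetical custom metric (_exampleMetric) from
# the pages table instead of scanning the much larger response_bodies table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  url,
  JSON_EXTRACT_SCALAR(payload, '$._exampleMetric') AS example_metric
FROM
  `httparchive.pages.2020_08_01_mobile`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._exampleMetric') IS NOT NULL
LIMIT 100
"""

for row in client.query(query).result():
    print(row.url, row.example_metric)
```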
The job with the 2 MB per-row limit still failed. It truncated 217,352 response bodies, but the table still exceeded the 15 TB limit. I'm rerunning it now with a 1.5 MB per-row limit.
1.5 MB is still failing. Experimenting with 2 MB again but omitting the new
Still failing. Trying a different approach. There seem to be many response bodies for images:

```sql
SELECT
  format,
  COUNT(0) AS freq
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  client = 'mobile' AND
  type = 'image' AND
  body IS NOT NULL
GROUP BY
  format
ORDER BY
  freq DESC
```

Let's try omitting any images that are not SVG with the 2 MB truncation.
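As an illustration of that filter, here is a hedged sketch of how the pipeline might decide which bodies to keep. It is not the actual bigquery_import.py code, and it assumes each request row is a dict whose keys match the summary_response_bodies columns (type, format, body):

```python
# Hypothetical sketch, not the real Dataflow pipeline code.
def should_keep_body(row):
    """Keep response bodies for non-image content and for SVG images only."""
    if row.get('body') is None:
        return False
    if row.get('type') != 'image':
        return True
    # SVG is the only image format worth keeping as queryable text.
    return row.get('format') == 'svg'

def drop_unwanted_bodies(row):
    """Null out bodies we decided not to keep, ahead of the 2 MB truncation."""
    if should_keep_body(row):
        return row
    return dict(row, body=None)
```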
It also looks like the response_bodies crawls have stopped after October 2020 for both mobile and desktop. I assume they're not completing successfully for this same reason?
Also (FYI), a number of people internally at Google have suddenly started to notice this (many related to perf questions) because they've been unable to query data. I understand the argument that most people probably don't want to be paying for queries to the response bodies table, but for folks who aren't paying for it, there still may be value in being able to perform ad-hoc queries.
Last year the response bodies processing was commented out since the dataset size exceeded 15 TB and was causing issues with the Dataflow pipeline. You can see where this is commented out here: https://github.com/HTTPArchive/bigquery/blob/master/dataflow/python/bigquery_import.py#L331 The HAR files still exist in GCS, so once this is resolved we could backfill the data.
+1 to @paulcalvano. I'll add that if the metrics are useful for everyone, we should consider baking them into the

Sorry for the inconvenience @housseindjirdeh @philipwalton!
I'd like to repurpose this issue from triage to designing a solution.

One issue with the current pipeline is that we wait until the crawl is done before generating the BQ tables. It may be better to insert test data one row at a time as soon as it's ready. This avoids huge 15+ TB load jobs and makes partial data available sooner, though on the other hand I can see how it'd be confusing for a table to appear incomplete. It would also require a significant change to the data pipeline.

Another option is to shard the data before loading. There are a couple of ways we could do this: naively bisecting the tables so each half is roughly 8 TB and combining them in a post-processing step, or staging a sharded, serialized copy of the table in GCS and letting BQ reassemble the shards in a single load job.

I'm most optimistic about the GCS sharding option, but if a total GCP rewrite of the test pipeline is in our future, maybe streaming inserts are the way to go.

@pmeenan @paulcalvano @tunetheweb any thoughts/suggestions on the path forward?
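Here is a minimal sketch of the GCS sharding option, assuming the pipeline has already serialized the table into newline-delimited JSON shards in a staging bucket. Bucket and table names are placeholders, and a real job would supply an explicit schema:

```python
# Illustrative sketch of loading many GCS shards in a single BigQuery load job.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    autodetect=True,  # a real job would pass the response_bodies schema instead
)

# The wildcard URI lets BigQuery reassemble all shards in one load job.
load_job = client.load_table_from_uri(
    'gs://example-staging-bucket/response_bodies/2020_09_01_mobile-*.json.gz',
    'httparchive.response_bodies.2020_09_01_mobile',
    job_config=job_config,
)
load_job.result()  # Block until the load completes and surface any errors.
```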
Is there a table size limit as well or is it just the load job? If you stream individual files as they arrive then you need to remember which files you have uploaded. Two possible alternatives:

1. Keep the processing at the end of the crawl but break it up (shard) and process one day at a time. The test IDs are all prefixed by the day when they were run, so that should give ~15-20 shards, all well under the limit.
2. Process each "chunk" of tests as they are uploaded. WPT for HA (because of IA legacy reasons) groups the tests into groups of 5-10k (YYMMDD_GROUP_xxxxx). They are all uploaded as a group, though that was something I was hoping to move away from. The upload process can create a .done file for each group and the load job would just need to keep track of the groups, not individual tests.

Option 1 feels the cleanest with the least amount of additional tracking. Could we have an in-process set of data tables that we load data into and then rename them to the crawl table name when all of the load jobs are done?
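To illustrate option 1 plus the rename-at-the-end idea, here is a hedged sketch: each day's shard is loaded into its own staging table, and a multi-source copy job stands in for the rename once every load succeeds. Dataset and table names are placeholders, and BigQuery has no true table rename:

```python
# Illustrative sketch of per-day staging tables combined by a copy job.
from google.cloud import bigquery

client = bigquery.Client()

day_prefixes = ['200901', '200902', '200903']  # assumed test ID date prefixes
staging_tables = [
    f'httparchive.scratchspace.response_bodies_2020_09_01_mobile_{day}'
    for day in day_prefixes
]

# One load job per staging table would run here, each comfortably under 15 TB.

# Once every per-day load has succeeded, combine the shards into the crawl table.
copy_job = client.copy_table(
    staging_tables,
    'httparchive.response_bodies.2020_09_01_mobile',
)
copy_job.result()
```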
If you do go with an approach that streams chunks/shards/etc. of data into a table, then one request is to ensure that there's some cheap-to-query flag that gets set once you're reasonably confident that the data for a given month is complete. I currently have some automated BigQuery logic that reads from this table, and I'd like to ensure that I could delay executing any expensive queries until a month's results are finalized.
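For concreteness, here is a sketch of the kind of cheap gate being asked for; the `crawl_status` table and its `crawl_complete` flag are hypothetical, not existing HTTP Archive tables:

```python
# Hypothetical completeness check before running expensive monthly queries.
from google.cloud import bigquery

client = bigquery.Client()

status_query = """
SELECT crawl_complete
FROM `httparchive.metadata.crawl_status`  -- hypothetical status table
WHERE date = DATE '2020-09-01' AND client = 'mobile'
"""

rows = list(client.query(status_query).result())
if rows and rows[0].crawl_complete:
    print('Crawl finalized; safe to run the expensive response_bodies queries.')
else:
    print('Crawl not finalized yet; deferring expensive queries.')
```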
We were able to remediate @jeffposnick's Workbox use case by creating a
Fixed by HTTPArchive/bigquery#123
The September 2020 crawl completed successfully except for the response_bodies table for mobile. Inspecting the Dataflow logs shows the load job failing because it exceeded 16492674416640 bytes; that's 15 TB, the maximum size per BigQuery load job. The corresponding September 2020 desktop table weighs in at 10.82 TB and the August 2020 mobile table is 14.47 TB, so it's plausible that the mobile table finally exceeded 15 TB this month. The underlying CrUX dataset is continuing to grow, so this is another of the stresses on the data pipeline's capacity.
Here's a look at how the response body sizes were distributed in 2020_08_01_mobile:
Some hand-wavey math later, I think what this is telling me is that if we reduce the row limit back down to 2 MB from 100 MB, we can save up to 606 GB, assuming each row also has an average of 100 bytes of bookkeeping (page, url, truncated, requestid). This should be enough headroom to offset the dataset growth and get us under the limit.
I'll rerun the Dataflow job with a limit of 2 MB and report back if it works.
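For reference, here is a hedged sketch of what that 2 MB cap might look like on the pipeline side. This is not the actual bigquery_import.py code, and the field names and UTF-8 handling are assumptions:

```python
# Hypothetical sketch of enforcing a 2 MB per-row body limit before loading.
MAX_BODY_BYTES = 2 * 1024 * 1024  # the 2 MB cap being tested

def truncate_body(row):
    """Truncate oversized response bodies and flag that they were truncated."""
    body = row.get('body')
    if body is None:
        return row

    encoded = body.encode('utf-8')
    if len(encoded) <= MAX_BODY_BYTES:
        return row

    # Cut at the byte limit, dropping any partial trailing character.
    clipped = encoded[:MAX_BODY_BYTES].decode('utf-8', errors='ignore')
    return dict(row, body=clipped, truncated=True)
```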