Approximately half of all downloads are not being recorded in BigQuery #30

I ran a script today from 01:07 to 09:39 UTC, downloading a file every six minutes. It made a total of 86 downloads, but only 40 of them are in the database (query results omitted).

Looking at the data more generally, there are 1440 minutes in a day, but on 05-23 only 748 of them had any data recorded (results omitted).

Browsing through those results shows many extended gaps in the data, e.g. from 22:45 to 22:54, and from 21:36 to 22:16.

I know this service isn't widely known, but isn't it intended to be a complete record of all PyPI downloads? If so, something is clearly going wrong.
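For reference, a probe along those lines might look like the sketch below. This is a reconstruction for illustration only — the actual script wasn't shared, and the download URL is a placeholder:

```python
# Rough sketch of the probe described above: fetch one file from PyPI
# every six minutes and log each request time in UTC, so the requests
# can later be matched against per-minute rows in BigQuery.
import time
import urllib.request

# Placeholder URL -- any small sdist or wheel on PyPI would do.
URL = "https://files.pythonhosted.org/packages/.../pip-18.0.tar.gz"

while True:
    urllib.request.urlopen(URL).read()
    print("downloaded at", time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()))
    time.sleep(6 * 60)
```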
Appears linehaul service may be flapping, as reported by Fastly:
(screenshot omitted)
Linehaul is getting killed by the OS with some frequency:
(screenshot omitted)
Trying to wrap my head around the impact of this. It's probably not a big deal at all for folks who are using the bigquery data to compute relative statistics (e.g., "what percent of my downloads are still python 2?"), since (so far) it sounds like the missing data is randomly distributed in a fairly homogeneous way through time. But it is kind of a big deal for folks who are using it to compute absolute download counts, like we tend to do in grant proposals ("numpy gets XX downloads/month") or the manylinux download counts here. In theory it should be possible to "correct" the historical counts by estimating how much data was lost each day, though I'm not sure whether there's any reasonable way to do this within the constraints of bigquery's query language...
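One way that correction might look from outside BigQuery's query language is sketched below, using the google-cloud-bigquery Python client. It treats minutes with no recorded rows as lost data and scales each day's recorded count up by its coverage — only a rough estimate, and only valid if the losses really are spread evenly through the day as observed above. The date range here is just an example:

```python
# Hedged sketch (not project code): estimate per-day undercounting by
# treating minutes with no recorded rows as lost data, then scale the
# recorded count by the fraction of the day that has any data at all.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  _TABLE_SUFFIX AS day,
  COUNT(*) AS recorded,
  COUNT(DISTINCT TIMESTAMP_TRUNC(timestamp, MINUTE)) AS minutes_with_data
FROM
  `the-psf.pypi.downloads*`
WHERE
  _TABLE_SUFFIX BETWEEN "20180501" AND "20180531"  -- example range
GROUP BY
  day
ORDER BY
  day
"""

for row in client.query(QUERY).result():
    coverage = row.minutes_with_data / 1440  # 1440 minutes in a day
    estimated = row.recorded / coverage if coverage else 0
    print(f"{row.day}: recorded={row.recorded:,} "
          f"coverage={coverage:.0%} estimated={estimated:,.0f}")
```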
Some cheap memory profiling with objgraph: the counts of most object types hold steady, but three types keep growing. A snapshot: (objgraph output omitted)
So all but the three problem types are steadyish! This leads me to believe that we need to look into the parse method at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L149-L176 and the Download class at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L127-L139
```diff
diff --git a/linehaul/core.py b/linehaul/core.py
index af76f4f..626e971 100644
--- a/linehaul/core.py
+++ b/linehaul/core.py
@@ -24,6 +24,9 @@ from ._queue import CloseableFlowControlQueue, QueueClosed
 from .syslog.protocol import SyslogProtocol
 from .user_agents import UnknownUserAgentError
 
+import gc
+import objgraph
+
 BATCH_SIZE = 500
 
 MAX_WAIT = 5 * 60  # 5 minutes
@@ -164,8 +167,12 @@ async def send(client, queue, *, loop):
     rows = list(rows)
     suffix = rows[0]["json"]["timestamp"].format("YYYYMMDD")
 
+    logger.info('flushing to bq')
     await bq.insert_all(
         rows,
         template_suffix=suffix,
         skip_invalid_rows=True,
     )
+    gc.collect()
+    with open("/var/log/linehaul/objgraph.log", 'a') as objgraph_log:
+        objgraph.show_most_common_types(file=objgraph_log)
```

Diff used to pull objgraph data.
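As a possible refinement of the hook above (a sketch, not what was actually deployed), objgraph can report deltas directly: show_growth() prints only the types whose instance counts increased since the previous call, which makes leaking types stand out without comparing snapshots by hand. The log path here just mirrors the one in the diff.

```python
# Variant of the logging hook above: log only the object types whose
# counts grew since the last call, instead of absolute counts.
import gc
import objgraph

def log_growth(path="/var/log/linehaul/objgraph-growth.log"):
    gc.collect()  # collect garbage first so only real leaks are counted
    with open(path, "a") as log:
        objgraph.show_growth(file=log)
```

From there, calling objgraph.show_backrefs() on a few instances of a growing type can usually point at whatever is keeping them alive.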
@dstufft can you take a look when you have a chance?
Yea I'll dig into it.
I have a small website for aggregate pypi downloads and noticed a sharp decline in records recently on GBQ. Was doing a bit of research about it and came across this issue. I'm assuming this is directly related. Just wanted to confirm.
I've just deployed a new version of Linehaul (pypa/linehaul@trio), and so far we've had ~1 hour of uptime and our memory usage has been flat, with no OOM kills in that hour. I'm going to continue to monitor the new version and make sure it stays stable, but so far it looks very promising.
Oh and just to double check, I've run:

```sql
SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) minute,
  COUNT(*)
FROM
  `the-psf.pypi.downloads*`
WHERE
  _TABLE_SUFFIX BETWEEN "20180727" AND "20180727"
GROUP BY
  minute
ORDER BY
  minute DESC
```

at 12:37 UTC, and there were 37 results, indicating that so far the new linehaul is at least populating some data in every minute, instead of only for some of them.
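That spot-check could also be scripted. Below is a hypothetical watcher (not something that was deployed) that lists which minutes of today's table contain any rows and warns about recent empty ones, allowing a couple of minutes for ingestion lag:

```python
# Hypothetical watcher (not part of Linehaul): find the minutes in
# today's table that have any rows, then warn about recent minutes
# with none -- the signature of the old Linehaul's gaps.
import datetime

from google.cloud import bigquery

client = bigquery.Client()

def minutes_with_data(day: str) -> set:
    query = f"""
    SELECT DISTINCT TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute
    FROM `the-psf.pypi.downloads{day}`
    """
    return {row.minute for row in client.query(query).result()}

now = datetime.datetime.now(datetime.timezone.utc).replace(second=0, microsecond=0)
seen = minutes_with_data(now.strftime("%Y%m%d"))
# Skip the current (partial) minute and allow ~2 minutes of lag.
recent = [now - datetime.timedelta(minutes=m) for m in range(2, 8)]
missing = [m for m in recent if m not in seen]
if missing:
    print("no rows recorded for:", ", ".join(m.strftime("%H:%M") for m in missing))
```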
I'm going to close this. So far we've dropped no data on the floor (except when we've been explicitly shutting down Linehaul or restarting it) and everything appears to be stable. I've also improved our handling of unknown user agents to still record those downloads (we just won't have any information about the target Python or such). Thanks everyone for bringing this to our attention!