
Approximately half of all downloads are not being recorded in BigQuery #30

Closed
mhsmith opened this issue May 25, 2018 · 14 comments


mhsmith commented May 25, 2018

I ran a script today from 01:07 to 09:39 UTC, downloading a file every six minutes. It made a total of 86 downloads, but only 40 of them are in the database:

SELECT *
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    TIMESTAMP("2018-05-25"),
    TIMESTAMP("2018-05-25")
  )
WHERE file.project = "six"
  AND file.version = "0.9.0"
  AND country_code = "DE"
  AND details.installer.version = "10.0.1"
ORDER BY timestamp DESC
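(Side note: details.installer.version = "10.0.1" is presumably pip's version, so the probe was most likely pip running on a timer. A hypothetical reconstruction of such a script, not the actual one used:)

import subprocess
import tempfile
import time

# Hypothetical probe script: fetch six 0.9.0 with pip once every six minutes.
# --no-cache-dir forces a real download from PyPI on every iteration, so each
# run should produce exactly one row in the BigQuery downloads table.
for _ in range(86):
    with tempfile.TemporaryDirectory() as dest:
        subprocess.run(
            ["pip", "download", "six==0.9.0", "--no-deps", "--no-cache-dir",
             "-d", dest],
            check=True,
        )
    time.sleep(6 * 60)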

Looking at the data more generally, there are 1440 minutes in a day, but on 05-23 only 748 of them had any data recorded:

#standardSQL
SELECT TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute, COUNT(*)
FROM `the-psf.pypi.downloads*`
WHERE _TABLE_SUFFIX BETWEEN "20180523" AND "20180523"
GROUP BY minute
ORDER BY minute DESC

Browsing through those results shows many extended gaps in the data, e.g. from 22:45 to 22:54, and 21:36 to 22:16.

I know this service isn't widely known, but isn't it intended to be a complete record of all PyPI downloads? If so, something is clearly going wrong.


ewdurbin (Member)

It appears the linehaul service may be flapping, as reported by Fastly:

Last error: read tcp REDACTED-Fastly-IP:42645->REDACTED-PSF-IP:48175: i/o timeout
Last error time: 2018-05-31T21:43:17.391210755Z
Broken since: 2018-05-31T21:42:46.555578684Z

ewdurbin (Member)

Linehaul is getting killed by the OS with some frequency:

ernestd@linehaul01:/var/log/upstart$ sudo journalctl -f -u linehaul
-- Logs begin at Thu 2018-05-31 14:45:52 UTC. --
May 31 21:17:38 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Failed with result 'signal'.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Service hold-off time over, scheduling restart.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: Stopped Linehaul Stats Collection Daemon.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: Started Linehaul Stats Collection Daemon.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Main process exited, code=killed, status=9/KILL
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Unit entered failed state.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Failed with result 'signal'.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Service hold-off time over, scheduling restart.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: Stopped Linehaul Stats Collection Daemon.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: Started Linehaul Stats Collection Daemon.

ewdurbin (Member)

It appears that linehaul is leaking RAM and getting OOM-killed, based on this measurement of RAM usage:
[screenshot: RAM usage graph, taken 2018-05-31 5:50 PM]


njsmith commented May 31, 2018

Trying to wrap my head around the impact of this.

It's probably not a big deal at all for folks who are using the BigQuery data to compute relative statistics (e.g., "what percent of my downloads are still Python 2?"), since (so far) it sounds like the missing data is randomly distributed in a fairly homogeneous way through time.

But it is kind of a big deal for folks who are using it to compute absolute download counts, like we tend to do in grant proposals ("numpy gets XX downloads/month") or for the manylinux download counts here.

In theory it should be possible to "correct" the historical counts by estimating how much data was lost each day, though I'm not sure whether there's any reasonable way to do this within the constraints of BigQuery's query language...
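For illustration, here's a minimal sketch of that correction in plain Python rather than in BigQuery's query language, assuming losses are uniform within each day (the per-minute counts would come from a query like the one above; the function name and numbers are made up):

def estimate_true_daily_count(observed_count: int, minutes_with_data: int) -> int:
    """Scale an observed daily count by the fraction of the day's 1440
    minutes that recorded any data at all, assuming uniform loss."""
    if not 0 < minutes_with_data <= 1440:
        raise ValueError("minutes_with_data must be in 1..1440")
    coverage = minutes_with_data / 1440
    return round(observed_count / coverage)

# Made-up example: on 2018-05-23 only 748 of 1440 minutes had any data, so an
# observed count of 1,000,000 would scale up to about 1,925,134.
print(estimate_true_daily_count(1_000_000, 748))

Note this would still undercount if the minutes that recorded some data are themselves missing events.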

ewdurbin (Member) commented Jun 1, 2018

Some cheap memory profiling with objgraph.show_most_common_types shows that we accumulate hundreds of thousands of persistent dict, Arrow, and tzutc objects over long runs of linehaul.

The counts of Arrow and tzutc objects track 1:1, while the dicts grow steadily over time.

Two snapshots, taken some time apart:

dict                       383879
tzutc                      104560
Arrow                      104560
function                   10556
tuple                      5290
list                       4358
weakref                    3252
builtin_function_or_method 2275
getset_descriptor          1720
type                       1663

dict                       531610
tzutc                      144512
Arrow                      144512
function                   10556
tuple                      5285
list                       4367
weakref                    3252
builtin_function_or_method 2286
getset_descriptor          1720
type                       1663

So all but the three problem types are steadyish!
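As an aside, objgraph can also diff these counts automatically between calls; a minimal sketch of that approach (the helper name and log path are hypothetical, and this assumes a reasonably recent objgraph):

import gc

import objgraph

def log_object_growth(logfile="/tmp/objgraph-growth.log"):
    # show_growth() remembers the counts from its previous call and prints
    # only the types that grew, so the leaking types (dict, Arrow, tzutc)
    # stand out without manually diffing two snapshots.
    gc.collect()  # discard collectable garbage first, leaving real leaks
    with open(logfile, "a") as logf:
        objgraph.show_growth(limit=10, file=logf)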

This leads me to believe that we need to look into the parse method at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L149-L176 and the Download class at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L127-L139.

ewdurbin (Member) commented Jun 1, 2018

diff --git a/linehaul/core.py b/linehaul/core.py
index af76f4f..626e971 100644
--- a/linehaul/core.py
+++ b/linehaul/core.py
@@ -24,6 +24,9 @@ from ._queue import CloseableFlowControlQueue, QueueClosed
 from .syslog.protocol import SyslogProtocol
 from .user_agents import UnknownUserAgentError
 
+import gc
+import objgraph
+
 
 BATCH_SIZE = 500
 MAX_WAIT = 5 * 60  # 5 minutes
@@ -164,8 +167,12 @@ async def send(client, queue, *, loop):
                 rows = list(rows)
                 suffix = rows[0]["json"]["timestamp"].format("YYYYMMDD")
 
+                logger.info('flushing to bq')
                 await bq.insert_all(
                     rows,
                     template_suffix=suffix,
                     skip_invalid_rows=True,
                 )
+                gc.collect()
+                with open("/var/log/linehaul/objgraph.log", 'a') as objgraph_log:
+                    objgraph.show_most_common_types(file=objgraph_log)

Diff used to pull the objgraph data.

ewdurbin (Member) commented Jun 1, 2018

@dstufft can you take a look when you have a chance?

dstufft (Member) commented Jun 1, 2018

Yea I'll dig into it.

crflynn (Contributor) commented Jul 26, 2018

I have a small website for aggregate PyPI downloads and noticed a sharp decline in records recently on BigQuery. I was doing a bit of research about it and came across this issue; I'm assuming it's directly related. Just wanted to confirm.

dstufft (Member) commented Jul 27, 2018

I've just deployed a new version of Linehaul (pypa/linehaul@trio), and so far we've had about an hour of uptime with flat memory usage and no OOM kills. I'm going to continue to monitor the new version and make sure it stays stable, but so far it looks very promising.

dstufft (Member) commented Jul 27, 2018

Oh, and just to double-check, I've run:

SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) minute,
  COUNT(*)
FROM
  `the-psf.pypi.downloads*`
WHERE
  _TABLE_SUFFIX BETWEEN "20180727" AND "20180727"
GROUP BY
  minute
ORDER BY
  minute DESC

That was at 12:37 UTC, and there were 37 results, indicating that so far the new Linehaul is populating at least some data in every minute, instead of only in some of them.
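A check like this could also be scripted to watch coverage going forward. A rough sketch, assuming the google-cloud-bigquery client library and read access to the the-psf project (the function name is made up):

from google.cloud import bigquery

def minute_coverage(day: str) -> float:
    """Fraction of `day`'s 1440 minutes with at least one recorded download,
    where `day` is a table suffix like "20180727"."""
    client = bigquery.Client()
    sql = """
        SELECT COUNT(DISTINCT TIMESTAMP_TRUNC(timestamp, MINUTE)) AS minutes
        FROM `the-psf.pypi.downloads*`
        WHERE _TABLE_SUFFIX = @day
    """
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("day", "STRING", day)]
        ),
    )
    minutes = list(job.result())[0].minutes
    return minutes / 1440

# A healthy, completed day should be at or very near 1.0; during the outage,
# 2018-05-23 had only 748/1440 ≈ 0.52 of its minutes covered.
print(minute_coverage("20180523"))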

dstufft (Member) commented Jul 27, 2018

It appears that the data loss had gotten much worse recently. Here's a visualization showing the number of download events per day over the last 90 days. I included today on this chart, but obviously we've only just started on today:

[screenshot: chart of download events per day over the last 90 days, taken 2018-07-26 21:31]

dstufft (Member) commented Jul 27, 2018

I'm going to close this. So far we've dropped no data on the floor (except when we've been explicitly shutting down Linehaul or restarting it) and everything appears to be stable. I've also improved our handling of unknown user agents to still record those downloads (we just won't have any information about the target Python or such).
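Conceptually, the unknown-user-agent fix amounts to something like the following sketch (the names here are stand-ins, not linehaul's actual code, though UnknownUserAgentError is the exception imported in the diff above):

class UnknownUserAgentError(Exception):
    """Stand-in for linehaul's exception of the same name."""

def parse_user_agent(raw: str) -> dict:
    """Placeholder parser; the real one extracts installer/Python details."""
    raise UnknownUserAgentError(raw)

def installer_details(raw_user_agent: str):
    # Previously an unrecognized user agent meant the whole download row was
    # dropped; returning None instead lets the download still be recorded,
    # just with no installer or Python metadata attached.
    try:
        return parse_user_agent(raw_user_agent)
    except UnknownUserAgentError:
        return None

print(installer_details("SomeBrandNewInstaller/1.0"))  # -> None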

Thanks everyone for bringing this to our attention!
