
Approximately half of all downloads are not being recorded in BigQuery #30

Closed
mhsmith opened this issue May 25, 2018 · 14 comments


mhsmith commented May 25, 2018

I ran a script today from 01:07 to 09:39 UTC, downloading a file every six minutes. It made a total of 86 downloads, but only 40 of them are in the database:

SELECT *
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    TIMESTAMP("2018-05-25"),
    TIMESTAMP("2018-05-25")
  )
WHERE file.project = "six"
  AND file.version = "0.9.0"
  AND country_code = "DE"
  AND details.installer.version = "10.0.1"
ORDER BY timestamp DESC
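(Side note: details.installer.version = "10.0.1" is presumably pip's version, so the probe was most likely pip running on a timer. A hypothetical reconstruction of such a script, not the actual one used:)

import subprocess
import tempfile
import time

# Hypothetical probe script: fetch six 0.9.0 with pip once every six minutes.
# --no-cache-dir forces a real download from PyPI on every iteration, so each
# run should produce exactly one row in the BigQuery downloads table.
for _ in range(86):
    with tempfile.TemporaryDirectory() as dest:
        subprocess.run(
            ["pip", "download", "six==0.9.0", "--no-deps", "--no-cache-dir",
             "-d", dest],
            check=True,
        )
    time.sleep(6 * 60)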

Looking at the data more generally, there are 1440 minutes in a day, but on 05-23 only 748 of them had any data recorded:

#standardSQL
SELECT TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute, COUNT(*)
FROM `the-psf.pypi.downloads*`
WHERE _TABLE_SUFFIX BETWEEN "20180523" AND "20180523"
GROUP BY minute
ORDER BY minute DESC

Browsing through those results shows many extended gaps in the data, e.g. from 22:45 to 22:54, and 21:36 to 22:16.

I know this service isn't widely known, but isn't it intended to be a complete record of all PyPI downloads? If so, something is clearly going wrong.


ewdurbin (Member)

It appears the linehaul service may be flapping, as reported by Fastly:

Last error: read tcp REDACTED-Fastly-IP:42645->REDACTED-PSF-IP:48175: i/o timeout
Last error time: 2018-05-31T21:43:17.391210755Z
Broken since: 2018-05-31T21:42:46.555578684Z

ewdurbin (Member)

Linehaul is getting killed by the OS with some frequency:

ernestd@linehaul01:/var/log/upstart$ sudo journalctl -f -u linehaul
-- Logs begin at Thu 2018-05-31 14:45:52 UTC. --
May 31 21:17:38 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Failed with result 'signal'.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Service hold-off time over, scheduling restart.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: Stopped Linehaul Stats Collection Daemon.
May 31 21:17:39 linehaul01.iad1.psf.io systemd[1]: Started Linehaul Stats Collection Daemon.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Main process exited, code=killed, status=9/KILL
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Unit entered failed state.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Failed with result 'signal'.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: linehaul.service: Service hold-off time over, scheduling restart.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: Stopped Linehaul Stats Collection Daemon.
May 31 21:44:15 linehaul01.iad1.psf.io systemd[1]: Started Linehaul Stats Collection Daemon.

ewdurbin (Member)

It appears that linehaul is leaking RAM and getting OOM-killed, based on this measurement of RAM usage:
[screenshot: RAM usage graph, taken 2018-05-31 5:50 PM]


njsmith commented May 31, 2018

Trying to wrap my head around the impact of this.

It's probably not a big deal at all for folks who are using the BigQuery data to compute relative statistics (e.g., "what percent of my downloads are still Python 2?"), since (so far) it sounds like the missing data is randomly distributed in a fairly homogeneous way through time.

But it is kind of a big deal for folks who are using it to compute absolute download counts, like we tend to do in grant proposals ("numpy gets XX downloads/month") or for the manylinux download counts here.

In theory it should be possible to "correct" the historical counts by estimating how much data was lost each day, though I'm not sure whether there's any reasonable way to do this within the constraints of BigQuery's query language...
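For illustration, here's a minimal sketch of that correction in plain Python rather than in BigQuery's query language, assuming losses are uniform within each day (the per-minute counts would come from a query like the one above; the function name and numbers are made up):

def estimate_true_daily_count(observed_count: int, minutes_with_data: int) -> int:
    """Scale an observed daily count by the fraction of the day's 1440
    minutes that recorded any data at all, assuming uniform loss."""
    if not 0 < minutes_with_data <= 1440:
        raise ValueError("minutes_with_data must be in 1..1440")
    coverage = minutes_with_data / 1440
    return round(observed_count / coverage)

# Made-up example: on 2018-05-23 only 748 of 1440 minutes had any data, so an
# observed count of 1,000,000 would scale up to about 1,925,134.
print(estimate_true_daily_count(1_000_000, 748))

Note this would still undercount if the minutes that recorded some data are themselves missing events.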

ewdurbin (Member) commented Jun 1, 2018

Some cheap memory profiling with objgraph.show_most_common_types shows that we accumulate hundreds of thousands of persistent dict, Arrow, and tzutc objects over long runs of linehaul.

The counts of Arrow and tzutc objects track 1:1, while the dicts grow steadily over time.

Two snapshots, taken some time apart:

dict                       383879
tzutc                      104560
Arrow                      104560
function                   10556
tuple                      5290
list                       4358
weakref                    3252
builtin_function_or_method 2275
getset_descriptor          1720
type                       1663

dict                       531610
tzutc                      144512
Arrow                      144512
function                   10556
tuple                      5285
list                       4367
weakref                    3252
builtin_function_or_method 2286
getset_descriptor          1720
type                       1663

So all but the three problem types are steadyish!
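As an aside, objgraph can also diff these counts automatically between calls; a minimal sketch of that approach (the helper name and log path are hypothetical, and this assumes a reasonably recent objgraph):

import gc

import objgraph

def log_object_growth(logfile="/tmp/objgraph-growth.log"):
    # show_growth() remembers the counts from its previous call and prints
    # only the types that grew, so the leaking types (dict, Arrow, tzutc)
    # stand out without manually diffing two snapshots.
    gc.collect()  # discard collectable garbage first, leaving real leaks
    with open(logfile, "a") as logf:
        objgraph.show_growth(limit=10, file=logf)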

This leads me to believe that we need to look into the parse method at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L149-L176 and the Download class at https://github.com/pypa/linehaul/blob/0aea04507c6997e0b367949e5513515e6abc97ed/linehaul/parser.py#L127-L139.

ewdurbin (Member) commented Jun 1, 2018

diff --git a/linehaul/core.py b/linehaul/core.py
index af76f4f..626e971 100644
--- a/linehaul/core.py
+++ b/linehaul/core.py
@@ -24,6 +24,9 @@ from ._queue import CloseableFlowControlQueue, QueueClosed
 from .syslog.protocol import SyslogProtocol
 from .user_agents import UnknownUserAgentError
 
+import gc
+import objgraph
+
 
 BATCH_SIZE = 500
 MAX_WAIT = 5 * 60  # 5 minutes
@@ -164,8 +167,12 @@ async def send(client, queue, *, loop):
                 rows = list(rows)
                 suffix = rows[0]["json"]["timestamp"].format("YYYYMMDD")
 
+                logger.info('flushing to bq')
                 await bq.insert_all(
                     rows,
                     template_suffix=suffix,
                     skip_invalid_rows=True,
                 )
+                gc.collect()
+                with open("/var/log/linehaul/objgraph.log", 'a') as objgraph_log:
+                    objgraph.show_most_common_types(file=objgraph_log)

Diff used to pull the objgraph data.

ewdurbin (Member) commented Jun 1, 2018

@dstufft can you take a look when you have a chance?

dstufft (Member) commented Jun 1, 2018

Yea I'll dig into it.

crflynn (Contributor) commented Jul 26, 2018

I have a small website for aggregate PyPI downloads and noticed a sharp decline in records recently on BigQuery. I was doing a bit of research about it and came across this issue; I'm assuming it's directly related. Just wanted to confirm.

dstufft (Member) commented Jul 27, 2018

I've just deployed a new version of Linehaul (pypa/linehaul@trio), and so far we've had about an hour of uptime with flat memory usage and no OOM kills. I'm going to continue to monitor the new version and make sure it stays stable, but so far it looks very promising.

dstufft (Member) commented Jul 27, 2018

Oh, and just to double-check, I've run:

SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) minute,
  COUNT(*)
FROM
  `the-psf.pypi.downloads*`
WHERE
  _TABLE_SUFFIX BETWEEN "20180727" AND "20180727"
GROUP BY
  minute
ORDER BY
  minute DESC

That was at 12:37 UTC, and there were 37 results, indicating that so far the new Linehaul is populating at least some data in every minute, instead of only in some of them.
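A check like this could also be scripted to watch coverage going forward. A rough sketch, assuming the google-cloud-bigquery client library and read access to the the-psf project (the function name is made up):

from google.cloud import bigquery

def minute_coverage(day: str) -> float:
    """Fraction of `day`'s 1440 minutes with at least one recorded download,
    where `day` is a table suffix like "20180727"."""
    client = bigquery.Client()
    sql = """
        SELECT COUNT(DISTINCT TIMESTAMP_TRUNC(timestamp, MINUTE)) AS minutes
        FROM `the-psf.pypi.downloads*`
        WHERE _TABLE_SUFFIX = @day
    """
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("day", "STRING", day)]
        ),
    )
    minutes = list(job.result())[0].minutes
    return minutes / 1440

# A healthy, completed day should be at or very near 1.0; during the outage,
# 2018-05-23 had only 748/1440 ≈ 0.52 of its minutes covered.
print(minute_coverage("20180523"))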

dstufft (Member) commented Jul 27, 2018

It appears that the data loss had gotten much worse recently. Here's a visualization showing the number of download events per day over the last 90 days. I included today on this chart, but obviously we've only just started on today:

[screenshot: chart of download events per day over the last 90 days, taken 2018-07-26 21:31]

dstufft (Member) commented Jul 27, 2018

I'm going to close this. So far we've dropped no data on the floor (except when we've been explicitly shutting down Linehaul or restarting it) and everything appears to be stable. I've also improved our handling of unknown user agents to still record those downloads (we just won't have any information about the target Python or such).
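Conceptually, the unknown-user-agent fix amounts to something like the following sketch (the names here are stand-ins, not linehaul's actual code, though UnknownUserAgentError is the exception imported in the diff above):

class UnknownUserAgentError(Exception):
    """Stand-in for linehaul's exception of the same name."""

def parse_user_agent(raw: str) -> dict:
    """Placeholder parser; the real one extracts installer/Python details."""
    raise UnknownUserAgentError(raw)

def installer_details(raw_user_agent: str):
    # Previously an unrecognized user agent meant the whole download row was
    # dropped; returning None instead lets the download still be recorded,
    # just with no installer or Python metadata attached.
    try:
        return parse_user_agent(raw_user_agent)
    except UnknownUserAgentError:
        return None

print(installer_details("SomeBrandNewInstaller/1.0"))  # -> None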

Thanks everyone for bringing this to our attention!
