Integration testing requirements #1
Noah, it would be great to get your thoughts on what sort of testing would help surface the issues you've been having.
Thanks so much for setting up this repo! I think it will be a useful collaboration point. ccing @spencerkclark @oliverwm1 @frodre for attention. We are particularly attuned to the I/O related issues.
Your suggestions are good. Also, my experience is that coarse-graining operations tend to cause many problems. Unlike a time-mean, the output will be too large for memory, but the data reduction is enough that the input and output chunk sizes should differ. Generally, I think "writes" are less robust than "reads", but the latter is more frequently used by this community.
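To make the coarse-graining scenario concrete, here is a minimal sketch (the bucket paths, variable names, coarsening factor, and chunk sizes are placeholders, not taken from this thread) of a write whose output chunks differ from the input chunks:

```python
import fsspec
import xarray as xr

# Open a large dataset lazily; these are the *input* chunks.
ds = xr.open_zarr(fsspec.get_mapper("gs://example-bucket/fine-res.zarr"))

# Block-average to coarse-grain: the result is much smaller than the input
# but still too large to hold in memory.
coarse = ds.coarsen(x=8, y=8).mean()

# Rechunk so the output chunks are a sensible size rather than inheriting
# the now-shrunken input chunking, then stream the result back to the bucket.
coarse = coarse.chunk({"time": 40, "x": 256, "y": 256})
coarse.to_zarr(fsspec.get_mapper("gs://example-bucket/coarse-res.zarr"), mode="w")
```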
I think so, but I need a better sense of the error rate per GB/HTTP request.
This often shows up in Apache Beam jobs for us, since we don't use dask clusters. We still use dask + xarray as a "lazy" array format, just not so much for parallelism.
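As a hypothetical illustration of that pattern (not code from this thread): each Beam worker opens the store lazily with xarray and reads only its own slice, so the parallelism comes from Beam rather than a dask cluster:

```python
import apache_beam as beam
import fsspec
import xarray as xr

def process_timestep(i):
    # xarray/dask only describe the data here; nothing is read yet.
    ds = xr.open_zarr(fsspec.get_mapper("gs://example-bucket/input.zarr"))
    # Only this one timestep is actually pulled from cloud storage.
    return float(ds["temperature"].isel(time=i).mean().values)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(range(100))     # one element per timestep
        | beam.Map(process_timestep)  # parallelism comes from Beam, not dask
    )
```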
Agreed that we can build/push docker images from github CI. For actually running the jobs, a cronjob in an existing k8s cluster might simplify deployment since there will be no need to authenticate access via dask gateway, firewall punching, kubernetes keys, etc. Also, E2E testing can take a long time, so CI is more costly, especially if the compute is not actually happening on the CI server. This is the philosophy of CD tools like https://argoproj.github.io/argo-cd/.
This would be great! I think I should be able to justify investing our time on this effort to my bosses, especially since we rely on this stack so much and already contribute to it.
@scottyhq will be interested. If our NASA proposal gets funded, we were planning to put some time towards something like this.
Common problems arise from using chunks that are too small. It is easy to exceed Google's limits of 1000 API requests/s (for reads), 100 API requests/s (for writes), and 1 request/s (for writes to the same object).
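As a rough back-of-the-envelope illustration (all numbers are made up for the example), small chunks turn a modest dataset into a very large number of objects, and therefore requests:

```python
# Hypothetical example: a 1 TB array stored as 1 MB chunks.
dataset_size = 1 * 2**40   # bytes
chunk_size = 1 * 2**20     # bytes per chunk (too small)
n_chunks = dataset_size // chunk_size
print(n_chunks)            # ~1 million objects, i.e. ~1 million GET requests to read it all

# With 100 workers each issuing ~20 reads/s, the aggregate rate is 2000 requests/s,
# well above the ~1000 requests/s read limit mentioned above, so the client must
# back off and retry or the job starts seeing rate-limit errors.
workers, reads_per_second = 100, 20
print(workers * reads_per_second)
```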
What would you expect the client libraries to do to handle these limits, particularly in a distributed context? Retries? Exponential backoff?
Exponential backoff is Google's recommended solution. Gcsfs has this, but not for all operations (see fsspec/gcsfs#387) and the test suite doesn't have great coverage for it (e.g. here's a bug fix I recently contributed: fsspec/gcsfs#380).
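For reference, here is a generic sketch of retry with exponential backoff; this is not gcsfs's actual implementation, just an illustration of the pattern:

```python
import random
import time

def call_with_backoff(func, *args, max_retries=6, base_delay=1.0, **kwargs):
    """Retry func on transient errors, doubling the wait each time (plus jitter)."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except OSError:  # stand-in for rate-limit / transient HTTP errors
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2**attempt + random.uniform(0, 1))
```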
So the idea here is that the sort of integration testing we are proposing would help surface specific issues (like the ones you already raised), which would help guide upstream development?
Exactly. Currently I think gcsfs relies on user-provided bug reports to find these problems, but usually these problems only occur at scale, and it is hard for users to construct minimal examples that are divorced from their idiosyncratic infrastructure.
The usual workflow would be
Thanks for getting the ball rolling on this @rabernat! I think the tests you propose would reveal the main issues that have challenged us and make sure these issues don't creep back into the codebase once they are fixed. At least on the gcsfs side, I get the impression that a lot of people have been challenged by intermittent I/O issues: fsspec/gcsfs#327, fsspec/gcsfs#323, fsspec/gcsfs#315, fsspec/gcsfs#316, fsspec/gcsfs#290. Beyond just reliability and catching bugs, it would be cool if these tests were also leveraged to monitor performance.
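A sketch of what a performance-monitoring check could look like on top of the same integration tests (the store path and time budget are placeholders):

```python
import time

import fsspec
import xarray as xr

def test_read_throughput():
    # Time a full read of a known benchmark store so that I/O performance
    # regressions show up alongside correctness failures.
    store = fsspec.get_mapper("gs://example-bucket/benchmark.zarr")
    start = time.perf_counter()
    xr.open_zarr(store).load()
    elapsed = time.perf_counter() - start
    print(f"read took {elapsed:.1f}s")
    assert elapsed < 300  # placeholder budget; the trend over time is what matters
```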
@scottyhq and @dcherian - when do you expect to know about the NASA proposal? My understanding is that the scope is a lot broader than just cloud i/o. Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded? |
Another simple way to get some coverage is to install a fixed set of libraries and run their test suites. This will often find errors related to API changes. This is the usual process of preparing a "distribution" like nix or Debian. FYI, I have been working on packaging some of the pangeo stack with the nix package manager. Here's a package I added for gcsfs. As a result, there is already some level of integration testing being done by nix's CI servers against specific versions of dask/xarray, etc.: https://hydra.nixos.org/build/142494940.
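A rough sketch of that idea in script form (assuming the pinned packages ship their test modules so `pytest --pyargs` can find them; the package list and versions are purely illustrative):

```python
import subprocess
import sys

# Illustrative pins; a real "distribution" would keep these in a lock file.
pinned = ["gcsfs==2021.4.0", "dask==2021.4.0", "xarray==0.17.0", "zarr==2.7.0"]
subprocess.run([sys.executable, "-m", "pip", "install", *pinned], check=True)

# Run each package's own test suite against the fixed environment; failures
# often point to API changes between the pinned versions.
for pkg in ["gcsfs", "dask", "xarray", "zarr"]:
    subprocess.run([sys.executable, "-m", "pytest", "--pyargs", pkg], check=True)
```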
🤷‍♂️
Yeah I think so.
Of course! I don't see an issue at all.
To give you a ballpark timeline, our proposal has a start date of 6/30/2021. If funded, I'd definitely be interested in collaborating on this. I don't know how many issues are specific to gcsfs, but NASA is still pretty focused on AWS, so we'd want to use S3 storage as well.
So here is one version of the steps that need to happen to get this working. On each cloud provider we wish to test
In this repo
Just wondering if it would be simpler to put a k8s cron job like this in each cluster:
Then, this repo wouldn't need to authenticate against each cluster (or even know they exist) beyond providing a link to the integration test reports. Also, the CI wouldn't need to wait around while the jobs finish. Basically, this decouples the integration tests from the update cycle of this repo (i.e. continuous delivery). We could run some basic checks (e.g. read/write from a bucket, run the test suite) from this repo's CI before publishing the docker image.
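For concreteness, here is a rough sketch of the kind of round-trip check such an in-cluster job might run (bucket path, array sizes, and chunking are placeholders):

```python
import fsspec
import numpy as np
import xarray as xr

# Write a moderately sized random dataset to the test bucket, read it back,
# and verify the round trip, exercising xarray/dask/zarr/gcsfs together.
ds = xr.Dataset({"a": (("time", "x"), np.random.rand(1000, 10000))}).chunk({"time": 10})

store = fsspec.get_mapper("gs://example-bucket/integration-test.zarr")
ds.to_zarr(store, mode="w")

roundtrip = xr.open_zarr(store)
xr.testing.assert_allclose(ds, roundtrip)
print("round-trip write/read succeeded")
```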
How about we run a github self-hosted runner on a Kubernetes cluster that also has dask gateway configured? That way, we get all the goodness of GitHub actions without having to deal with nasty auth stuff. |
A couple of notes:
Thanks for the input Martin. The plan is to indeed spend real money to test the whole stack at scale on real cloud providers. |
On today's Pangeo ML WG call, @nbren12 re-raised the perennial issue of the need for integration testing of the many layers of our stack, particularly when it comes to cloud i/o.
This has been discussed in many different related issues:
The feeling from heavy users is that there are still lingering intermittent bugs related to cloud i/o that render workflows unstable. One way to get to the bottom of this is to get more rigorous about system-wide integration testing.
Some questions we need to resolve to move forward with this idea, and my initial responses, are:
If we can agree on a clearly scoped plan, I think we can support the cloud computing costs. We can also consider asking Anaconda (@martindurant) to spend some time on this via our Pangeo Forge contract with them.
cc @jhamman @cisaacstern @TomNicholas @dcherian