Integration testing requirements #1
Noah, it would be great to get your thoughts on what sort of testing would help surface the issues you've been having.
Thanks so much for setting up this repo! I think it will be a useful collaboration point. ccing @spencerkclark @oliverwm1 @frodre for attention. We are particularly attuned to the I/O related issues.
Your suggestions are good. Also, my experience is that coarse-graining operations tend to cause many problems. Unlike a time-mean, the output will be too large for memory, but the data reduction is enough that the input and output chunk sizes should differ. Generally, I think "writes" are less robust than "reads", but the latter is more frequently used by this community.
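To make the coarse-graining scenario concrete, here is a minimal sketch (the bucket paths, variable names, coarsening factor, and chunk sizes are placeholders, not taken from this thread) of a write whose output chunks differ from the input chunks:

```python
import fsspec
import xarray as xr

# Open a large dataset lazily; these are the *input* chunks.
ds = xr.open_zarr(fsspec.get_mapper("gs://example-bucket/fine-res.zarr"))

# Block-average to coarse-grain: the result is much smaller than the input
# but still too large to hold in memory.
coarse = ds.coarsen(x=8, y=8).mean()

# Rechunk so the output chunks are a sensible size rather than inheriting
# the now-shrunken input chunking, then stream the result back to the bucket.
coarse = coarse.chunk({"time": 40, "x": 256, "y": 256})
coarse.to_zarr(fsspec.get_mapper("gs://example-bucket/coarse-res.zarr"), mode="w")
```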
I think so, but I need a better sense of the error rate per GB/HTTP request.
This often shows up in Apache Beam jobs for us, since we don't use dask clusters. We still use dask + xarray as a "lazy" array format, just not so much for parallelism.
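As a hypothetical illustration of that pattern (not code from this thread): each Beam worker opens the store lazily with xarray and reads only its own slice, so the parallelism comes from Beam rather than a dask cluster:

```python
import apache_beam as beam
import fsspec
import xarray as xr

def process_timestep(i):
    # xarray/dask only describe the data here; nothing is read yet.
    ds = xr.open_zarr(fsspec.get_mapper("gs://example-bucket/input.zarr"))
    # Only this one timestep is actually pulled from cloud storage.
    return float(ds["temperature"].isel(time=i).mean().values)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(range(100))     # one element per timestep
        | beam.Map(process_timestep)  # parallelism comes from Beam, not dask
    )
```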
Agreed that we can build/push docker images from github CI. For actually running the jobs, a cronjob in an existing k8s cluster might simplify deployment since there will be no need to authenticate access via dask gateway, firewall punching, kubernetes keys, etc. Also, E2E testing can take a long time, so CI is more costly, especially if the compute is not actually happening on the CI server. This is the philosophy of CD tools like https://argoproj.github.io/argo-cd/.
This would be great! I think I should be able to justify investing our time on this effort to my bosses, especially since we rely on this stack so much and already contribute to it.
@scottyhq will be interested. If our NASA proposal gets funded, we were planning to put some time towards something like this.
Common problems arise from using chunks that are too small. It is easy to exceed Google's limits of 1000 API requests/s (for reads), 100 API requests/s (for writes), and 1 request/s (for writes to the same object).
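As a rough back-of-the-envelope illustration (all numbers are made up for the example), small chunks turn a modest dataset into a very large number of objects, and therefore requests:

```python
# Hypothetical example: a 1 TB array stored as 1 MB chunks.
dataset_size = 1 * 2**40   # bytes
chunk_size = 1 * 2**20     # bytes per chunk (too small)
n_chunks = dataset_size // chunk_size
print(n_chunks)            # ~1 million objects, i.e. ~1 million GET requests to read it all

# With 100 workers each issuing ~20 reads/s, the aggregate rate is 2000 requests/s,
# well above the ~1000 requests/s read limit mentioned above, so the client must
# back off and retry or the job starts seeing rate-limit errors.
workers, reads_per_second = 100, 20
print(workers * reads_per_second)
```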
What would you expect the client libraries to do to handle these limits, particularly in a distributed context? Retries? Exponential backoff?
Exponential backoff is Google's recommended solution. Gcsfs has this, but not for all operations (see fsspec/gcsfs#387) and the test suite doesn't have great coverage for it (e.g. here's a bug fix I recently contributed: fsspec/gcsfs#380).
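For reference, here is a generic sketch of retry with exponential backoff; this is not gcsfs's actual implementation, just an illustration of the pattern:

```python
import random
import time

def call_with_backoff(func, *args, max_retries=6, base_delay=1.0, **kwargs):
    """Retry func on transient errors, doubling the wait each time (plus jitter)."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except OSError:  # stand-in for rate-limit / transient HTTP errors
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2**attempt + random.uniform(0, 1))
```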
So the idea here is that the sort of integration testing we are proposing would help surface specific issues (like the ones you already raised), which would help guide upstream development?
Exactly. Currently I think gcsfs relies on user-provided bug reports to find these problems, but usually these problems only occur at scale, and it is hard for users to construct minimal examples that are divorced from their idiosyncratic infrastructure.
The usual workflow would be
Thanks for getting the ball rolling on this @rabernat! I think the tests you propose would reveal the main issues that have challenged us and make sure these issues don't creep back into the codebase once they are fixed. At least on the gcsfs side, I get the impression that a lot of people have been challenged by intermittent I/O issues: fsspec/gcsfs#327, fsspec/gcsfs#323, fsspec/gcsfs#315, fsspec/gcsfs#316, fsspec/gcsfs#290. Beyond just reliability and catching bugs, it would be cool if these tests were also leveraged to monitor performance.
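A sketch of what a performance-monitoring check could look like on top of the same integration tests (the store path and time budget are placeholders):

```python
import time

import fsspec
import xarray as xr

def test_read_throughput():
    # Time a full read of a known benchmark store so that I/O performance
    # regressions show up alongside correctness failures.
    store = fsspec.get_mapper("gs://example-bucket/benchmark.zarr")
    start = time.perf_counter()
    xr.open_zarr(store).load()
    elapsed = time.perf_counter() - start
    print(f"read took {elapsed:.1f}s")
    assert elapsed < 300  # placeholder budget; the trend over time is what matters
```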
@scottyhq and @dcherian - when do you expect to know about the NASA proposal? My understanding is that the scope is a lot broader than just cloud i/o. Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded? |
Another simple way to get some coverage is to install a fixed set of libraries and run their test suites. This will often find errors related to API changes. This is the usual process of preparing a "distribution" like nix or Debian. FYI, I have been working on packaging some of the pangeo stack with the nix package manager. Here's a package I added for gcsfs. As a result, there is already some level of integration testing being done by nix's CI servers against specific versions of dask/xarray, etc.: https://hydra.nixos.org/build/142494940.
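A rough sketch of that idea in script form (assuming the pinned packages ship their test modules so `pytest --pyargs` can find them; the package list and versions are purely illustrative):

```python
import subprocess
import sys

# Illustrative pins; a real "distribution" would keep these in a lock file.
pinned = ["gcsfs==2021.4.0", "dask==2021.4.0", "xarray==0.17.0", "zarr==2.7.0"]
subprocess.run([sys.executable, "-m", "pip", "install", *pinned], check=True)

# Run each package's own test suite against the fixed environment; failures
# often point to API changes between the pinned versions.
for pkg in ["gcsfs", "dask", "xarray", "zarr"]:
    subprocess.run([sys.executable, "-m", "pytest", "--pyargs", pkg], check=True)
```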
🤷‍♂️
Yeah I think so.
Of course! I don't see an issue at all.
To give you a ballpark timeline, our proposal has a start date of 6/30/2021. If funded, I'd definitely be interested in collaborating on this. I don't know how many issues are specific to gcsfs, but NASA is still pretty focused on AWS, so we'd want to use S3 storage as well.
So here is one version of the steps that need to happen to get this working. On each cloud provider we wish to test
In this repo
Just wondering if it would be simpler to put a k8s cron job like this in each cluster:
Then, this repo wouldn't need to authenticate against each cluster (or even know they exist) beyond providing a link to the integration test reports. Also, the CI wouldn't need to wait around while the jobs finish. Basically, this decouples the integration tests from the update cycle of this repo (i.e. continuous delivery). We could run some basic checks (e.g. read/write from a bucket, run the test suite) from this repo's CI before publishing the docker image.
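For concreteness, here is a rough sketch of the kind of round-trip check such an in-cluster job might run (bucket path, array sizes, and chunking are placeholders):

```python
import fsspec
import numpy as np
import xarray as xr

# Write a moderately sized random dataset to the test bucket, read it back,
# and verify the round trip, exercising xarray/dask/zarr/gcsfs together.
ds = xr.Dataset({"a": (("time", "x"), np.random.rand(1000, 10000))}).chunk({"time": 10})

store = fsspec.get_mapper("gs://example-bucket/integration-test.zarr")
ds.to_zarr(store, mode="w")

roundtrip = xr.open_zarr(store)
xr.testing.assert_allclose(ds, roundtrip)
print("round-trip write/read succeeded")
```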
How about we run a github self-hosted runner on a Kubernetes cluster that also has dask gateway configured? That way, we get all the goodness of GitHub actions without having to deal with nasty auth stuff. |
A couple of notes:
Thanks for the input Martin. The plan is to indeed spend real money to test the whole stack at scale on real cloud providers. |
On today's Pangeo ML WG call, @nbren12 re-raised the perennial issue of the need for integration testing of the many layers of our stack, particularly when it comes to cloud i/o.
This has been discussed in many different related issues:
The feeling from heavy users is that there are still lingering intermittent bugs related to cloud i/o that render workflows unstable. One way to get to the bottom of this is to get more rigorous about system-wide integration testing.
Some questions we need to resolve to move forward with this idea, and my initial responses, are:
If we can agree on a clearly scoped plan, I think we can support the cloud computing costs. We can also consider asking Anaconda (@martindurant) to spend some time on this via our Pangeo Forge contract with them.
cc @jhamman @cisaacstern @TomNicholas @dcherian