-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revised integration test framework #12368
Conversation
This is a big PR. It is all-in-one so folks can see the whole picture. If it helps, this can be broken into smaller chunks, at loss of overall context. Here's a quick summary of what's new vs. what's refactored:
Also, as noted before, this PR moves authorization aside into separate files. Authorization is not yet enabled. |
Continuing to whack build details. Who knew that each task did a pre-build before the build in which the pre-build builds everything except distribution. This cases |
Sorry, had to do some major surgery to the Maven module structure, which required renaming the modules and their directories. See the description in maven.md. Other than that, only minor tweaks as I try to run the gauntlet of the zillions of checks run on the code. |
The new IT task passed, hooray! Whacked a few more static checking issues. There is one I don't understand. It appears that we've got JS problems, but I didn't change anything in JS:
Is this saying that the build itself has broken code? If so, maybe it will go away on the next build? |
5787848
to
ed9db43
Compare
Rebased on latest master to try to fix the prior issue. Unfortunately, the issue didn't resolve. Now getting a different unrelated failure:
|
New commit. We have to exclude test test code from Jacoco since it is not unit tested. That was painful because the test classes were in "generic" Druid packages. Moved the test code into a dedicated package so we can just exclude that one package. Migrated the remainder of the batch index tests. This showed some redundancy in the required test code, so created a test runner to hide that boilerplate. Test conversion is now very easy -- at least for the sample of tests converted thus far. Also includes other minor doc changes and build issue fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for re-organizing the content, @paul-rogers . It's much easier to follow now.
I have given a partial feedback, mostly minor nitpicks/suggestions.
I am going through the rest of it and will try to finish my review soon.
core/src/main/java/org/apache/druid/metadata/MetadataStorageConnectorConfig.java
Outdated
Show resolved
Hide resolved
@@ -23,6 +23,19 @@ | |||
import java.util.Map; | |||
|
|||
/** | |||
* Configuration for tests. Opinionated about the shape of the cluster: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these!
integration-tests/src/main/java/org/apache/druid/testing/IntegrationTestingConfig.java
Outdated
Show resolved
Hide resolved
@LazySingleton | ||
@SuppressForbidden(reason = "System#err") | ||
public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle) | ||
public static CuratorFramework createCurator(CuratorConfig config, EnsembleProvider ensembleProvider) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I guess this method can be private now. Also, does it need to be static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tricky one. The original code creates a curator framework via a Guice provider. We have to keep that as that's what Druid services require.
The test code wants to use ZK, via curator, but without Guice, since Guice adds extra complexity. It turns out that what we want is to use the builder from the two config items. Rather than copy/paste that code, this refactoring makes it available outside of Guice.
The new static method have to be public so tests can reach them. I believe that the existing instance methods have to also be public so Guice can call them.
pom.xml
Outdated
<exclude>org/apache/druid/server/coordination/ServerManagerForQueryErrorTest.class</exclude> | ||
<exclude>org/apache/druid/guice/SleepModule.class</exclude> | ||
<exclude>org/apache/druid/guice/CustomNodeRoleClientModule.class</exclude> | ||
<exclude>org/apache/druid/cli/CustomNodeRoleCommandCreator.class</exclude> | ||
<exclude>org/apache/druid/cli/QueryRetryTestCommandCreator.class</exclude> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you have put them in a separate package, I guess these exclusions are not needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are now for the "old" versions. Added a comment to clarify.
docker-tests/it-base/src/test/java/org/apache/druid/testing2/cluster/ClusterClient.java
Outdated
Show resolved
Hide resolved
docker-tests/it-base/src/test/java/org/apache/druid/test/TestClusterConfig.java
Outdated
Show resolved
Hide resolved
pom.xml
Outdated
@@ -1505,6 +1525,7 @@ | |||
<!--@TODO After fixing https://github.com/apache/druid/issues/4964 remove this parameter--> | |||
-Ddruid.indexing.doubleStorage=double | |||
</argLine> | |||
<skipTests>${skipUTs}</skipTests> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this okay? Wouldn't this end up skipping all tests and not just UTs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment. Surefire runs the UTs. It's sister plugin, Failsafe, runs the ITs. Here, we want to skip the Surefire tests only. Let me know if the comment makes this clear, else I'll add to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great work, @paul-rogers !
Overall it looks great to me!
I ran the tests
- it's pretty easy to build, start, stop the docker cluster now
- running single tests from the IDE is also a lot of help
- tests can easily be run and re-run any number of times (depending on whether they are performing the teardown cleanup)
- logs etc are populated properly in the
target/shared
folders
I have some questions/requests:
- As all the tests would now run in a single maven command, would there be a way to retry only failed tests after a failure happens in the first run?
- The documents are fairly well detailed, but they seem to be more from the point of view of implementation details rather than usage. Given the size of this, it would be nice to have a usage doc which just lists out the steps (or points to another doc) for typical actions: writing a new test group, migrating a test group from old ITs, configuring the cluster for a test, debugging a test, running all tests, running all tests of a group, etc. Most of this stuff is already there but spread out.
- As we start to migrate the existing tests, test flakiness is something we would need to detect and address. What would be an approach to do that? (maybe we could add a section in
conversion.md
for tips and pitfalls)
import org.apache.druid.testing.IntegrationTestingConfig; | ||
import org.apache.druid.testing.guice.TestClient; | ||
import org.apache.druid.testing.utils.SqlTestQueryHelper; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be some imports from the original integration-tests
. Do we want to retain these as is?
I guess this is why there is a dependency on druid-integration-tests
in the pom.xml for this test group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The thought is to reuse the original code where possible. The classes that migrated to the new page are for the cases where something in the code needed to change, typically something about configuration. It is a bit awkward that all the IT "framework" code is mixed in with the actual IT tests in the old structure: that's why we have to depend on the entire druid-integration-tests
module.
If we can migrate all the tests, then we can merge the old and new files to create a single base project.
# Starts the test-specific test cluster using Docker compose using | ||
# versions and other settings gathered when building the images. | ||
|
||
SCRIPT_DIR=$(cd $(dirname $0) && pwd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I see it, the contents of cluster.sh
would be the same for every test group. Only the cluster config changes. Is it possible to avoid the duplication of the cluster.sh
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy the
cluster.sh
script from an existing test. Add lines to copy
any test-specific files into thetarget/shared
folder.
I see this mentioned in docs/conversion.md
as something that might prevent this. Just guessing here, but couldn't that be done in some other way, say by putting them in src/test/resources/shared
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've been thinking about how to do this. The two "groups" we have now ended up needing the same setup. I'm waiting to see how a few more groups work out to determine if these are special cases, or if the script really does end up being the same. If the same, we can bump it to the parent directory.
@kfaraz, thank you for your thorough review, and for trying out the new setup. Always great to know it runs on a machine other than my own! You mentioned flaky test and how to retry them. Two thoughts on that. First, we should not have flaky tests. IMHO, such tests either:
The new framework eliminates the first issue. The framework ensures that services are ready before launching tests. This means that either the test or Druid is flaky. Either way, we should fix he issue: remove the test if it is not useful, else fix it or fix Druid (perhaps adding a way to synchronize when needed for testing.) |
All that said, there is the second issues: rerunning specific tests. This is a harder issue than one would think. The reason to combine tests is that, in this new system, the bulk of the time for each "test group" is consumed with building Druid. If we keep the tests split up, we end up rebuilding Druid over and over and over. Allowing retries means retaining our current extravagant abuse of Travis resources. The obvious solution to the redundancy issue is to build Druid and the image once, then run all the test groups that use that particular configuration. Since we have multiple configurations, the various configurations would still run in parallel, but the test "groups" would run in series within each configuration. Of course, if we retain flaky tests, then we want to play "whack-a-mole" in builds: keep rerunning only those tests that failed until we get lucky and they pass. By combining tests, we decrease the probability of getting lucky. As mentioned above, the obvious answer is to fix the flaky tests, we we are starting to do. Another constraint is how Travis seems to work. We can only rerun jobs within Travis's build matrix. It does not seem we can parameterize the job to say, "just run the ITs, with only these two projects." To be able to rerun one test "group" we have to let each group (for each configuration) build all of Druid, which gets us back to the redundancy issue. Short term, I'm thinking to do an experiment in which each test "group" is triggered by a separate Maven profile. We can then also have an "all-its" profile that enables all the groups. Until we resolve flaky tests, we can opt to waste resources and build profile-by-profile (that is, group-by-group) as we do today. Later, when tests are fixed (or if we identify groups which are not flaky), we can combine them via profiles. I'll try that in a separate commit so I can easily back it out if it does not work out. |
@kfaraz, good point on the docs. Yes, the docs started as my attempt to remember the many details of how the original tests worked, and what I changed as I created this new version. Per your suggestion, I created a The idea on conversion is to try out a few groups here, then convert the others over time. I was perhaps lucky: the groups I converted so far mostly "just worked." I've encountered no flakiness in those tests, in my limited runs, after I made sure the cluster was up before running the tests. We'll have to see, as we convert more, if the others are as easy, or if there will be tricky bits. If there are test that are still flaky, we'll have to sort that out on a case-by-case basis, as suggested above. Let's also remember that there there is a big chunk of work not addressed in this PR: running a secured cluster. There is code in the old ITs to create certs, configure security, etc. Tests run that way will be very difficult to debug, by definition. That whole areas is left as an open issue in this PR, in part because this one is already large enough. |
dc800ea
to
0d5d227
Compare
This branch has been open long enough that it drifted out of sync with master. Rebasing ran into the usual issues when a zillion files change. So, squashed commits so the rebase would succeed. Fortunately, the squashed commits are those that have already been reviewed, no additional changes were made before squashing occurred. New changes show up as new commits on top of the squash. In this latest commit, updated the project from 0.23.0 to 0.24.0 so that the builds will work. |
Getting an odd failure:
First, the project on which this fails does not include the artifact. Second, the project that does use it already built, so the artifact should be cached locally. Third,, why is the peer not authenticated? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments, @paul-rogers .
+1 after CI passes.
You might have mentioned this already but could you please confirm the phase of Travis CI where these new docker-tests would be executed?
.travis.yml
Outdated
@@ -73,6 +73,19 @@ jobs: | |||
stage: Tests - phase 1 | |||
script: ${MVN} animal-sniffer:check --fail-at-end | |||
|
|||
# Experimental run of the revised ITs. Done early to debug issues | |||
# Move later once things work. | |||
- name: "experimental docker tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these have to be moved to a later stage before we can merge this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and disabled this step for now: it has done its task of proving that the new ITs work in Travis. We'll revisit as we decide how to migrate from the existing IT groups to the new ones.
@kfaraz, thanks for the review. It's been a long slog to resolve the many Maven issues with all our many static checks. You asked about the "experimental docker tests" task in this PR. Yes, it is experimental: I'll remove (or disable) it before we commit. For now, I envision we won't run the tests in the maven build since they duplicate existing tests. Instead, a good next step would be to migrate each IT one by one: convert it to the new framework, replace the current IT tasks with a new version (based on the "experimental" one), and verify the results. The plan is to get a clean build. Once that is done, I'll remove the experimental step and we can commit this monster. As we move ahead, the new framework will run in phase 2, in place of the existing items. During the interim, we can mix-and-match mechanisms: the Travis builds are all independent of one another. That is a problem in general (we redo the same work over and over) but turns out to be a help during the transition. |
Currently trying to track down a mysterious error. In `it-high-availability, Maven is unable to find a particular jar file. Looks like it works one time (in Java 8), but fails another time (in Java 11):
This is pushing against the edge of my Maven knowledge. I'm hoping it is just something silly I'm doing that shows up as the above mysterious error. Anyone seen something similar? For example, did the first attempt above succeed? Or, is the error a failure in that attempt, but the error is reported later? Or, is Maven trying to get the jar twice, the first worked, but the second failed? Seems the 5.5.1 release is still available, so that isn't a problem. (It is old, and has vulnerabilities, so we should probably upgrade.) This seems to be a transitive dependency brought in from |
4b8d802
to
af6a3c8
Compare
Rebased on latest master to try to overcome the perpetual "confluent" jar errors. Let's see if this let's us get a clean build. |
Sad. This latest run should have worked, but it seems there are issues on Travis with finding the JRE. Sigh. We have to wait for those issues to be resolved, then can some committer please restart the build. |
7b55fb6
to
87db1a2
Compare
Getting a clean build is proving quite difficult. Out of desperation, we'll pull two groups of changes out of this PR into separate PRs so the build issues are easier to debug. In particular, it is hard at present to separate out actual errors in the "old" ITs from the flaky ITs. Let's get those other two PRs done, then we'll rebase this on the updated master so that only the new IR code remains. That way, if an old IT fails, we'll have some confidence that it is just flaky, not broken. |
This commit contains changes made to the existing ITs to support the new ITs. Changes: - Make the "custom node role" code usable by the new ITs. - Use flag `-DskipITs` to skips the integration tests but runs unit tests. - Use flag `-DskipUTs` skips unit tests but runs the "new" integration tests. - Expand the existing Druid profile, `-P skip-tests` to skip both ITs and UTs.
Restructuring of the integration tests to make the code simpler and much easier for developers to use on their own machine. * New project, docker-tests for this work * Project to build the test Docker image * Projects per test group * Restructuring of configuration * Two example test groups For now, this version exists parallel to the original version.
2a0c56f
to
e62a56c
Compare
* Moved test code to a single module, as previously * Initialization handles customization * Shared cluster config across categories * Revised to use JUnit categories * Rebased on latest master * Corresponding doc updates
e62a56c
to
d50fe25
Compare
Support for all of the IntegrationTestingConfig properties Wrapper script for IT operations Additional documentation Build fix
Improved env var-based configuration Test createion guide Enable new ITs in travis Parameterized tests
a1a3687
to
e034f60
Compare
Revised to prepare for merge:
|
The latest PR converted two ITs to use the new framework. Both pass in Travis. (And there was much rejoicing.) However, two of the old ITs fail for obscure reasons. A security test fails with an auth failure, but the same test was clean in a prior build. Another IT can't find its input file, though this PR changes none of the input files or paths in the old tests. These are not the usual flaky IT suspects. So we probably have to sort out what's-what. Once we do, this should be good to go. |
@kfaraz, thanks for your approval of this PR. The final changes are in for this PR and the build is clean. Please take a quick final look, and merge the PR at your convenience. |
|
||
<profiles> | ||
<profile> | ||
<id>IT-HighAvailability</id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stylistic nit: I think that the indentation looks a bit off in the <profile>...</profile>
tags.
Thanks for the update, @paul-rogers ! I will take another look today and merge this. |
Description
Issue #12359 proposes an approach to simplify and streamline integration tests, especially around the developer experience, but also for Travis. See that issue for the background.
This PR is big, but most of that comes from creating revised versions of existing files. Unfortunately, there is no good way using GitHub to compare two copies of the same file. For the most part, these are config files and you can assume that the new versions work (because, when they didn't, the cluster stubbornly refused to start or stay up.)
Developer Experience
With this framework, it is possible to:
The result is that integration tests go from being a nightmare to being an efficient way to develop and test code changes. This author used it to create tests for PR #12222. The process was quick and easy. Not as efficient as just using unit tests (we still want the single-process server), but still pretty good. (By contrast, the new tests were ported to the existing framework, and that is still difficult for the reasons we're trying to address here.)
One huge win is that, with this approach, one can start a Docker cluster and leave it up indefinitely to try out APIs, to create or refactor tests, etc. Though there are many details to get right to use Docker and Docker Compose, once those are addressed, using the cluster becomes quite simple and productive.
Contents of this First Cut
This PR is a first draft of the approach which provides:
integration-tests-ex
that holds the new integration test structure. (For now, the existingintegration-tests
is left unchanged.)druid-it-tools
to hold code placed into the Docker image.druid-it-image
to build the Druid-only test image from the tarball produced indistribution
. (Dependencies live in their "official" image.)druid-it-cases
that holds the revised tests and the framework itself. The framework includes file-based test configuration, test-specific clients, test initialization and updated versions of some of the common test support classes.The integration test setup is primarily a huge mass of details. This approach refactors many of those details: from how the image is built and configured to how the Docker Compose scripts are structured to test configuration. An extensive set of "readme" files explains those details. Rather than repeat that material here, please consult those files for explanations.
Limitations
This version is the result of several months of iteration to work out details around builds on various systems. The framework itself is now pretty solid, as is the Druid image. This PR includes two converted tests, and lessons from several others which are in-flight. We expect to refine the framework as we create and convert other tests.
For now, the new framework is intended to exist parallel to the current one so we experiment. The new framework is ignored unless you select the Maven profiles which enable it. (See the docs for details.) Eventually we will retire the
integration-tests
versions in favor of theintegration-tests-ex
versions, but we will do so only after the new versions are rock-solid.There are many other test groups not yet touched. A good approach is to use this framework for new integration tests, and to convert old ones when someone needs to modify them. The cost of converting to this framework is low, and the productivity gain is large.
Other limitations include:
Next Steps
This PR itself will continue to evolve as some of the final details are sorted out. However, it is at the stage where it will benefit from others taking a look and making suggestions.
The thought is that this PR is large enough already: let's get it reviewed, then tackle the additional issues listed above as the opportunity arrises and step-by-step.
Alternatives
The approach in the PR is based on the existing approach, but re-arranges the parts. Since the integration test are pretty much "nothing but details", there are many approaches that could be taken. Here are a few that were considered.
That said, this PR is all about details. Your thoughts, suggestions and corrections are encouraged to ensure we've got our bases covered.
Detailed Changes
A number of specific changes are worth calling out that do not appear in the docs.
pom.xml
file as dependencies.runtime.properties
file. Instead, properties come from a newdocker.yaml
configuration file, from a binding to environment variables, or from command line options. Of these,docker.yaml
is preferred for fixed or default properties, environment variables for properties (such as credentials) that vary per run. Avoid use of the command line as that makes test hard to debug in an IDE.DruidTestRunner
provides a way to add test-specific Guice modules, along with other configuration.This PR has: