
Revised integration test framework #12368

Merged (8 commits) on Aug 24, 2022
Conversation

@paul-rogers (Contributor) commented Mar 25, 2022

Description

Issue #12359 proposes an approach to simplify and streamline integration tests, especially around the developer experience, but also for Travis. See that issue for the background.

This PR is big, but most of that comes from creating revised versions of existing files. Unfortunately, GitHub offers no good way to compare two copies of the same file. For the most part, these are config files, and you can assume that the new versions work (because, when they didn't, the cluster stubbornly refused to start or stay up).

Developer Experience

With this framework, it is possible to:

  • Do a normal distribution build.
  • Build the Docker image in less than a minute. (Most of that is Maven determining what not to do. After the first build, you can use a script to rebuild in a few seconds, depending on what Docker must rebuild.)
  • Launch the cluster in a few seconds.
  • Debug an integration test as a JUnit test in your favorite IDE.

The result is that integration tests go from being a nightmare to being an efficient way to develop and test code changes. This author used it to create tests for PR #12222. The process was quick and easy. Not as efficient as just using unit tests (we still want the single-process server), but still pretty good. (By contrast, the new tests were ported to the existing framework, and that is still difficult for the reasons we're trying to address here.)

One huge win is that, with this approach, one can start a Docker cluster and leave it up indefinitely to try out APIs, to create or refactor tests, etc. Though there are many details to get right to use Docker and Docker Compose, once those are addressed, using the cluster becomes quite simple and productive.

Contents of this First Cut

This PR is a first draft of the approach which provides:

  • A new directory, integration-tests-ex that holds the new integration test structure. (For now, the existing integration-tests is left unchanged.)
  • Maven module druid-it-tools to hold code placed into the Docker image.
  • Maven module druid-it-image to build the Druid-only test image from the tarball produced in distribution. (Dependencies live in their "official" image.)
  • Maven module druid-it-cases that holds the revised tests and the framework itself. The framework includes file-based test configuration, test-specific clients, test initialization and updated versions of some of the common test support classes.

The integration test setup is primarily a huge mass of details. This approach refactors many of those details: from how the image is built and configured to how the Docker Compose scripts are structured to test configuration. An extensive set of "readme" files explains those details. Rather than repeat that material here, please consult those files for explanations.

Limitations

This version is the result of several months of iteration to work out details around builds on various systems. The framework itself is now pretty solid, as is the Druid image. This PR includes two converted tests, and lessons from several others which are in-flight. We expect to refine the framework as we create and convert other tests.

For now, the new framework is intended to exist in parallel with the current one so we can experiment. The new framework is ignored unless you select the Maven profiles which enable it. (See the docs for details.) Eventually we will retire the integration-tests versions in favor of the integration-tests-ex versions, but only after the new versions are rock-solid.

There are many other test groups not yet touched. A good approach is to use this framework for new integration tests, and to convert old ones when someone needs to modify them. The cost of converting to this framework is low, and the productivity gain is large.

Other limitations include:

  • The original tests appear to run not only in Docker, but also against a local QuickStart cluster and against Kubernetes. Neither of those other two modes has been tested in the new framework. (Though, it is now so easy to start and use a Docker cluster that it may be easier to use Docker than the QuickStart cluster.)
  • The original tests always have security enabled. While it is important to test security, having security enabled makes debugging far harder (by design). So, this draft has security disabled; the various scripts and configs are pulled aside. The thought is to enable security as an option when needed, and run without it when debugging things other than the security mechanism.
  • The supporting classes have the basics, but have been used for only the one integration test group.
  • This framework is not yet integrated into Travis. A test that exists only in the new framework won't run in the Travis build. We have a working version of the Travis build in a private branch, but that step will be commented out in this PR prior to merge; we'll enable Travis builds as a separate PR as we transition old tests to the new framework.

Next Steps

This PR itself will continue to evolve as some of the final details are sorted out. However, it is at the stage where it will benefit from others taking a look and making suggestions.

The thought is that this PR is large enough already: let's get it reviewed, then tackle the additional issues listed above step by step as the opportunity arises.

Alternatives

The approach in this PR is based on the existing approach, but rearranges the parts. Since the integration tests are pretty much "nothing but details", there are many approaches that could be taken. Here are a few that were considered.

  • Run the tests as-is in an AWS instance. Because the tests are very difficult to run on a developer machine, many folks set up an AWS instance to run them. While this can work, it is slow: one has to shuffle code from the laptop to the instance and back. Or, just do development on the instance. The tests are not really set up for debugging, so even on the instance, it is still tedious to make and debug test changes.
  • Run the tests in Travis as part of a PR. This is the default approach. However, it is akin to the development process of old: submit the changes to a batch run, wait many hours for the answers, plow through the logs, find issues, fix them, and repeat. That process was not efficient in the era of punch cards, and is still not very efficient today. A turnaround of a minute or less is the target, which the Travis approach cannot provide.
  • Modify the existing integration tests. This is the obvious approach. But, the set of existing ITs is so large that attempting to change everything in one go becomes overwhelming. The chosen approach allows incremental test-by-test conversion without breaking the great mass of existing tests.
  • Status-quo. I'm working on a project that requires many integration tests. It is faster to fix the test framework once, and do the tests quickly, than to fight with the framework for each of the required tests.

That said, this PR is all about details. Your thoughts, suggestions and corrections are encouraged to ensure we've got our bases covered.

Detailed Changes

A number of specific changes are worth calling out that do not appear in the docs.

  • The tests use Guice to create various Druid objects. However, they do not use the Druid extension mechanism: the tests don't have visibility to a Druid installation. Instead, any required extensions are expected to appear as normal jars on the class path. That is, they should be listed in the pom.xml file as dependencies.
  • Tests don't have access to the usual runtime.properties file. Instead, properties come from a new docker.yaml configuration file, from a binding to environment variables, or from command-line options. Of these, docker.yaml is preferred for fixed or default properties, environment variables for properties (such as credentials) that vary per run. Avoid use of the command line, as that makes tests hard to debug in an IDE.
  • The tests use "official" Docker images for dependencies such as MySQL, ZooKeeper and Kafka. A solution for Hadoop is under investigation.
  • A custom DruidTestRunner provides a way to add test-specific Guice modules, along with other configuration.
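
To make the configuration layering above concrete, here is a minimal sketch using Python's ChainMap. The keys, values, and the override order (command line over environment over docker.yaml) are illustrative assumptions, not the framework's actual implementation:

```python
from collections import ChainMap

# Hypothetical illustration of layered test configuration. Assumes
# command-line options override environment bindings, which override
# defaults from docker.yaml; all keys here are invented.
docker_yaml = {"routerUrl": "http://localhost:8888", "retries": "3"}
env_bindings = {"retries": "5"}  # e.g., a per-run credential or setting
cli_options = {}                 # avoided, so tests stay easy to debug in an IDE

# ChainMap performs lookups left to right, so earlier layers win.
config = ChainMap(cli_options, env_bindings, docker_yaml)

print(config["routerUrl"])  # default from the file
print(config["retries"])    # environment overrides the file default
```

The point of the layering is that a test can rely on stable defaults in docker.yaml while per-run values slot in from the environment without editing any file.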

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster (in the sense that this PR is for running such a cluster in Docker.)

@clintropolis added the Area - Testing and Area - Dev labels on Mar 28, 2022
@paul-rogers (Contributor, Author)

This is a big PR. It is all-in-one so folks can see the whole picture. If it helps, this can be broken into smaller chunks, at the loss of overall context.

Here's a quick summary of what's new vs. what's refactored:

  • docker-tests/docs is all new and is meant to capture all the learnings from this exercise, along with information needed to move forward. This is the main resource for understanding this PR.
  • docker-tests project is new so it does not conflict with the existing integration-tests project: both can exist.
  • docker-tests/base-test is mostly new. It contains the revised test config code and a new cluster client.
    • ClusterConfig is the YAML-based config mechanism.
    • Initializer has a bunch of complexity to force the server-minded Guice config to work in a client. Loads the DB. Etc.
  • docker-tests/test-image is a greatly refactored set of Docker build scripts. The Dockerfile is heavily refactored to remove the third-party dependencies and rearrange how Druid is laid out (unpacked from the distribution tarball). DB setup is removed.
  • docker-tests/testing-tools is mostly a copy/paste of extensions-core/testing-tools with the custom node role from integration-tests added.
  • docker-tests/high-availability is a refactor of one test from integration-tests. The Docker Compose script is specific to this one test, refactored from those in integration-tests. The idea is that this test contains just the files for this "group". Other groups will follow this pattern.
  • Other files are mostly clean-up uncovered while debugging. In some cases, code was refactored so the test "clients" could use code that was previously tightly coupled with the server.
  • yaml files: refactor of Docker Compose with new test config.
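
For orientation, a per-group Compose file in this refactored style might look roughly like the sketch below. The service names, image names, and tags are assumptions for illustration, not the PR's actual files:

```yaml
# Hypothetical sketch: dependencies come from "official" images,
# while Druid services use the image built from the distribution tarball.
services:
  zookeeper:
    image: zookeeper:3.5
  metadata:
    image: mysql:5.7
    environment:
      MYSQL_DATABASE: druid
  coordinator:
    image: apache/druid-test:local
    depends_on:
      - zookeeper
      - metadata
    volumes:
      - ./target/shared:/shared   # logs etc. land in the shared folder
```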

Also, as noted before, this PR moves authorization aside into separate files. Authorization is not yet enabled.

@paul-rogers (Contributor, Author)

Continuing to whack build details. Who knew that each task does a pre-build before the actual build, and that the pre-build builds everything except distribution? This causes test-image to fail when looking for the non-existent distribution dependency. Using a bit of profile magic to hide this dependency from the pre-build.

@paul-rogers (Contributor, Author)

Sorry, had to do some major surgery to the Maven module structure, which required renaming the modules and their directories. See the description in maven.md.

Other than that, only minor tweaks as I try to run the gauntlet of the zillions of checks run on the code.

@paul-rogers (Contributor, Author)

The new IT task passed, hooray! Whacked a few more static checking issues.

There is one I don't understand. It appears that we've got JS problems, but I didn't change anything in JS:

added 235 packages from 867 contributors and audited 235 packages in 12.562s
found 4 vulnerabilities (2 moderate, 1 high, 1 critical)
  run `npm audit fix` to fix them, or `npm audit` for details
events.js:183
      throw er; // Unhandled 'error' event
      ^
TypeError: Cannot read property 'forEach' of undefined
    at unpackage (/home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:27:14)
    at /home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:114:22
    at Parser.<anonymous> (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:304:18)
    at emitOne (events.js:116:13)
    at Parser.emit (events.js:211:7)
    at SAXParser.onclosetag (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:262:26)
    at emit (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:624:35)
    at emitNode (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:629:5)
    at closeTag (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:889:7)
    at SAXParser.write (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:1436:13)
    at Parser.exports.Parser.Parser.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:323:31)
    at Parser.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:5:59)
    at exports.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:369:19)
    at Object.parse.parseContent (/home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:107:5)
    at /home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:129:15
    at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:511:3)
****FAILED****

Is this saying that the build itself has broken code? If so, maybe it will go away on the next build?

@paul-rogers (Contributor, Author) commented Apr 13, 2022

Rebased on latest master to try to fix the prior issue. Unfortunately, the issue didn't resolve.

Now getting a different unrelated failure:

[ERROR] org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.testGrowingOverflowingInteger  Time elapsed: 0.003 s  <<< ERROR!
java.lang.OutOfMemoryError
	at sun.misc.Unsafe.allocateMemory(Native Method)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
	at org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.makeGrouper(BufferHashGrouperTest.java:187)

@paul-rogers (Contributor, Author)

New commit. We have to exclude the test code from Jacoco since it is not unit tested. That was painful because the test classes were in "generic" Druid packages. Moved the test code into a dedicated package so we can just exclude that one package.

Migrated the remainder of the batch index tests. This showed some redundancy in the required test code, so created a test runner to hide that boilerplate. Test conversion is now very easy -- at least for the sample of tests converted thus far.

Also includes other minor doc changes and build issue fixes.

@kfaraz (Contributor) left a comment

Thanks for re-organizing the content, @paul-rogers . It's much easier to follow now.
I have given a partial feedback, mostly minor nitpicks/suggestions.

I am going through the rest of it and will try to finish my review soon.

@@ -23,6 +23,19 @@
import java.util.Map;

/**
* Configuration for tests. Opinionated about the shape of the cluster:
Contributor:

Thanks for adding these!

@LazySingleton
@SuppressForbidden(reason = "System#err")
public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
public static CuratorFramework createCurator(CuratorConfig config, EnsembleProvider ensembleProvider)
Contributor:

Nit: I guess this method can be private now. Also, does it need to be static?

Contributor Author:

This is a tricky one. The original code creates a curator framework via a Guice provider. We have to keep that as that's what Druid services require.

The test code wants to use ZK, via curator, but without Guice, since Guice adds extra complexity. It turns out that what we want is to use the builder from the two config items. Rather than copy/paste that code, this refactoring makes it available outside of Guice.

The new static methods have to be public so tests can reach them. I believe that the existing instance methods also have to be public so Guice can call them.

pom.xml
Comment on lines 1195 to 1208
<exclude>org/apache/druid/server/coordination/ServerManagerForQueryErrorTest.class</exclude>
<exclude>org/apache/druid/guice/SleepModule.class</exclude>
<exclude>org/apache/druid/guice/CustomNodeRoleClientModule.class</exclude>
<exclude>org/apache/druid/cli/CustomNodeRoleCommandCreator.class</exclude>
<exclude>org/apache/druid/cli/QueryRetryTestCommandCreator.class</exclude>
Contributor:

Now that you have put them in a separate package, I guess these exclusions are not needed anymore?

Contributor Author:

These are now for the "old" versions. Added a comment to clarify.

pom.xml
@@ -1505,6 +1525,7 @@
<!--@TODO After fixing https://github.com/apache/druid/issues/4964 remove this parameter-->
-Ddruid.indexing.doubleStorage=double
</argLine>
<skipTests>${skipUTs}</skipTests>
Contributor:

Is this okay? Wouldn't this end up skipping all tests and not just UTs?

@paul-rogers (Contributor, Author) commented May 6, 2022:

Added a comment. Surefire runs the UTs. Its sister plugin, Failsafe, runs the ITs. Here, we want to skip the Surefire tests only. Let me know if the comment makes this clear, else I'll add to it.
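
For readers unfamiliar with the split, the relationship looks roughly like this in a pom (a simplified sketch, not the exact configuration in this PR):

```xml
<!-- Simplified sketch. Surefire runs unit tests during the "test" phase
     and honors the skipUTs flag; Failsafe runs integration tests during
     "integration-test"/"verify", so ITs still run when UTs are skipped. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <skipTests>${skipUTs}</skipTests>
  </configuration>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```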

@kfaraz (Contributor) left a comment

This is really great work, @paul-rogers !
Overall it looks great to me!
I ran the tests

  • it's pretty easy to build, start, stop the docker cluster now
  • running single tests from the IDE is also a lot of help
  • tests can easily be run and re-run any number of times (depending on whether they are performing the teardown cleanup)
  • logs etc are populated properly in the target/shared folders

I have some questions/requests:

  • As all the tests would now run in a single maven command, would there be a way to retry only failed tests after a failure happens in the first run?
  • The documents are fairly well detailed, but they seem to be more from the point of view of implementation details rather than usage. Given the size of this, it would be nice to have a usage doc which just lists out the steps (or points to another doc) for typical actions: writing a new test group, migrating a test group from old ITs, configuring the cluster for a test, debugging a test, running all tests, running all tests of a group, etc. Most of this stuff is already there but spread out.
  • As we start to migrate the existing tests, test flakiness is something we would need to detect and address. What would be an approach to do that? (maybe we could add a section in conversion.md for tips and pitfalls)

Comment on lines 32 to 34
import org.apache.druid.testing.IntegrationTestingConfig;
import org.apache.druid.testing.guice.TestClient;
import org.apache.druid.testing.utils.SqlTestQueryHelper;
Contributor:

There seem to be some imports from the original integration-tests. Do we want to retain these as is?

I guess this is why there is a dependency on druid-integration-tests in the pom.xml for this test group.

Contributor Author:

Right. The thought is to reuse the original code where possible. The classes that migrated to the new project are for the cases where something in the code needed to change, typically something about configuration. It is a bit awkward that all the IT "framework" code is mixed in with the actual IT tests in the old structure: that's why we have to depend on the entire druid-integration-tests module.

If we can migrate all the tests, then we can merge the old and new files to create a single base project.

# Starts the test-specific test cluster using Docker compose using
# versions and other settings gathered when building the images.

SCRIPT_DIR=$(cd $(dirname $0) && pwd)
@kfaraz (Contributor) commented Apr 22, 2022:

As I see it, the contents of cluster.sh would be the same for every test group. Only the cluster config changes. Is it possible to avoid the duplication of the cluster.sh?

Contributor:

Copy the cluster.sh script from an existing test. Add lines to copy
any test-specific files into the target/shared folder.

I see this mentioned in docs/conversion.md as something that might prevent this. Just guessing here, but couldn't that be done in some other way, say by putting them in src/test/resources/shared?

Contributor Author:

Yes, I've been thinking about how to do this. The two "groups" we have now ended up needing the same setup. I'm waiting to see how a few more groups work out to determine if these are special cases, or if the script really does end up being the same. If the same, we can bump it to the parent directory.

@paul-rogers (Contributor, Author)

@kfaraz, thank you for your thorough review, and for trying out the new setup. Always great to know it runs on a machine other than my own!

You mentioned flaky tests and how to retry them. Two thoughts on that.

First, we should not have flaky tests. IMHO, such tests either:

  • Are flaky because they start running before the cluster is stable,
  • Are not telling us anything if the tests themselves are flaky (because they depend on timing, or on behavior which is inherently non-deterministic, such as the ordering of events from different services), or
  • Are pointing out actual issues with Druid: that clients would have to retry operations. We should either a) fix that issue, or b) document it. Either way, the tests should be prepared for whatever race or non-deterministic condition is in question.

The new framework eliminates the first issue: it ensures that services are ready before launching tests. This means that either the test or Druid is flaky. Either way, we should fix the issue: remove the test if it is not useful, else fix it or fix Druid (perhaps adding a way to synchronize when needed for testing).
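
The "wait until services are ready" idea can be sketched as a generic polling helper. This is an illustration of the concept only, not the framework's actual code:

```python
import time

def wait_for_ready(check, timeout=60.0, interval=0.5):
    """Poll `check` until it returns True or `timeout` seconds pass.

    `check` might, for example, hit a service's health endpoint and
    return True once it responds. Raises TimeoutError on failure.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("service did not become ready in time")
        time.sleep(interval)

# Example with a fake check that succeeds on the third poll:
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

wait_for_ready(fake_check, timeout=5.0, interval=0.01)
```

Running the readiness gate once, before any test starts, removes the whole class of "cluster not up yet" flakiness.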

@paul-rogers (Contributor, Author)

All that said, there is the second issue: rerunning specific tests. This is a harder issue than one would think.

The reason to combine tests is that, in this new system, the bulk of the time for each "test group" is consumed with building Druid. If we keep the tests split up, we end up rebuilding Druid over and over and over. Allowing retries means retaining our current extravagant abuse of Travis resources.

The obvious solution to the redundancy issue is to build Druid and the image once, then run all the test groups that use that particular configuration. Since we have multiple configurations, the various configurations would still run in parallel, but the test "groups" would run in series within each configuration.

Of course, if we retain flaky tests, then we want to play "whack-a-mole" in builds: keep rerunning only those tests that failed until we get lucky and they pass. By combining tests, we decrease the probability of getting lucky. As mentioned above, the obvious answer is to fix the flaky tests, which we are starting to do.

Another constraint is how Travis seems to work. We can only rerun jobs within Travis's build matrix. It does not seem we can parameterize the job to say, "just run the ITs, with only these two projects." To be able to rerun one test "group" we have to let each group (for each configuration) build all of Druid, which gets us back to the redundancy issue.

Short term, I'm thinking to do an experiment in which each test "group" is triggered by a separate Maven profile. We can then also have an "all-its" profile that enables all the groups. Until we resolve flaky tests, we can opt to waste resources and build profile-by-profile (that is, group-by-group) as we do today. Later, when tests are fixed (or if we identify groups which are not flaky), we can combine them via profiles.
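
The profile-per-group idea could look roughly like this in the parent pom; the profile and module names here are illustrative guesses, not settled choices:

```xml
<!-- Hypothetical sketch: one profile per test group, plus an
     aggregate profile that enables all of them. -->
<profiles>
  <profile>
    <id>IT-HighAvailability</id>
    <modules>
      <module>it-high-availability</module>
    </modules>
  </profile>
  <profile>
    <id>all-its</id>
    <modules>
      <module>it-high-availability</module>
      <!-- one entry per converted test group -->
    </modules>
  </profile>
</profiles>
```

With something along these lines, `mvn verify -P IT-HighAvailability` would run just that group, while `-P all-its` would run every group against a single build.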

I'll try that in a separate commit so I can easily back it out if it does not work out.

@paul-rogers (Contributor, Author)

@kfaraz, good point on the docs. Yes, the docs started as my attempt to remember the many details of how the original tests worked, and what I changed as I created this new version. Per your suggestion, I created a quickstart.md page with usage info. We can expand that as we figure out what additional information is most often needed. I added links into the more detailed docs for when someone needs more information.

The idea on conversion is to try out a few groups here, then convert the others over time. I was perhaps lucky: the groups I converted so far mostly "just worked." I've encountered no flakiness in those tests, in my limited runs, after I made sure the cluster was up before running the tests.

We'll have to see, as we convert more, if the others are as easy, or if there will be tricky bits. If there are tests that are still flaky, we'll have to sort that out on a case-by-case basis, as suggested above.

Let's also remember that there is a big chunk of work not addressed in this PR: running a secured cluster. There is code in the old ITs to create certs, configure security, etc. Tests run that way will be very difficult to debug, by definition. That whole area is left as an open issue in this PR, in part because this one is already large enough.

@paul-rogers (Contributor, Author)

This branch has been open long enough that it drifted out of sync with master. Rebasing ran into the usual issues when a zillion files change, so I squashed commits so the rebase would succeed. Fortunately, the squashed commits are those that have already been reviewed; no additional changes were made before squashing. New changes show up as new commits on top of the squash. In this latest commit, I updated the project version from 0.23.0 to 0.24.0 so that the builds will work.

@paul-rogers (Contributor, Author) commented May 19, 2022

Getting an odd failure:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (process-resource-bundles) 
on project it-high-availability: 
Error resolving project artifact: 
Could not transfer artifact io.confluent:kafka-schema-registry-client:pom:5.5.1 
from/to sigar (https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/): 
Transfer failed for https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom 
for project io.confluent:kafka-schema-registry-client:jar:5.5.1: 
peer not authenticated -> [Help 1]

First, the project on which this fails does not include the artifact. Second, the project that does use it already built, so the artifact should be cached locally. Third, why is the peer not authenticated?

@kfaraz (Contributor) left a comment

Thanks for addressing the comments, @paul-rogers .
+1 after CI passes.

You might have mentioned this already but could you please confirm the phase of Travis CI where these new docker-tests would be executed?

.travis.yml
@@ -73,6 +73,19 @@ jobs:
stage: Tests - phase 1
script: ${MVN} animal-sniffer:check --fail-at-end

# Experimental run of the revised ITs. Done early to debug issues
# Move later once things work.
- name: "experimental docker tests"
Contributor:

Do these have to be moved to a later stage before we can merge this PR?

Contributor Author:

I went ahead and disabled this step for now: it has done its task of proving that the new ITs work in Travis. We'll revisit as we decide how to migrate from the existing IT groups to the new ones.

@paul-rogers (Contributor, Author) commented May 23, 2022

@kfaraz, thanks for the review. It's been a long slog to resolve the many Maven issues with all our many static checks.

You asked about the "experimental docker tests" task in this PR. Yes, it is experimental: I'll remove (or disable) it before we commit. For now, I envision we won't run the tests in the maven build since they duplicate existing tests. Instead, a good next step would be to migrate each IT one by one: convert it to the new framework, replace the current IT tasks with a new version (based on the "experimental" one), and verify the results.

The plan is to get a clean build. Once that is done, I'll remove the experimental step and we can commit this monster.

As we move ahead, the new framework will run in phase 2, in place of the existing items. During the interim, we can mix-and-match mechanisms: the Travis builds are all independent of one another. That is a problem in general (we redo the same work over and over) but turns out to be a help during the transition.

@paul-rogers
Contributor Author

Currently trying to track down a mysterious error. In `it-high-availability`, Maven is unable to find a particular jar file. It looks like the download works one time (in Java 8) but fails another time (in Java 11):

[INFO] Downloading from google-maven-central: https://maven-central.storage-download.googleapis.com/maven2/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sonatype: https://oss.sonatype.org/content/repositories/releases/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sonatype-apache: https://repository.apache.org/content/repositories/releases/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from apache.snapshots: https://repository.apache.org/snapshots/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sigar: https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (process-resource-bundles) on project it-high-availability: 
Error resolving project artifact: 
Could not transfer artifact io.confluent:kafka-schema-registry-client:pom:5.5.1 from/to sigar (https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/): 
Transfer failed for https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom for project io.confluent:kafka-schema-registry-client:jar:5.5.1: 
peer not authenticated -> [Help 1]

This is pushing against the edge of my Maven knowledge. I'm hoping it is just something silly I'm doing that shows up as the above mysterious error. Anyone seen something similar? For example, did the first attempt above succeed? Or, is the error a failure in that attempt, but the error is reported later? Or, is Maven trying to get the jar twice, the first worked, but the second failed?

Seems the 5.5.1 release is still available, so that isn't a problem. (It is old, and has vulnerabilities, so we should probably upgrade.)

This seems to be a transitive dependency brought in from druid-integration-tests. Yet, that module seems to compile. Still scratching my head...
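Transitive dependencies like this can usually be traced with the standard Maven dependency plugin. A minimal sketch, using the artifact coordinate from the error above (the actual invocation needs Maven and a Druid checkout, so it is left commented):

```shell
# Sketch: find which module pulls in the confluent client transitively.
# The artifact coordinate is taken from the error message above.
ARTIFACT="io.confluent:kafka-schema-registry-client"
CMD="mvn dependency:tree -Dincludes=${ARTIFACT}"
echo "${CMD}"
# To actually trace it, run from the repository root (requires Maven):
# eval "${CMD}"
```

Running this against each candidate module would show the dependency path from `druid-integration-tests` (or elsewhere) down to the confluent artifact.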

@paul-rogers
Contributor Author

Rebased on latest master to try to overcome the perpetual "confluent" jar errors. Let's see if this lets us get a clean build.

@paul-rogers
Contributor Author

Sad. This latest run should have worked, but it seems there are issues on Travis with finding the JRE. Sigh. We have to wait for those issues to be resolved; once they are, could some committer please restart the build?

@paul-rogers
Contributor Author

Getting a clean build is proving quite difficult. Out of desperation, we'll pull two groups of changes out of this PR into separate PRs so the build issues are easier to debug. In particular, it is hard at present to separate actual errors in the "old" ITs from the flaky ITs. Let's get those other two PRs done, then we'll rebase this on the updated master so that only the new IT code remains. That way, if an old IT fails, we'll have some confidence that it is just flaky, not broken.

kfaraz pushed a commit that referenced this pull request Jun 23, 2022
This commit contains the cleanup needed for the new integration test framework.

Changes:
- Fix log lines, misspellings, docs, etc.
- Allow the use of some of Druid's "JSON config" objects in tests
- Fix minor bug in `BaseNodeRoleWatcher`
paul-rogers added a commit to paul-rogers/druid that referenced this pull request Jun 24, 2022
@paul-rogers paul-rogers marked this pull request as draft June 25, 2022 18:43
kfaraz pushed a commit that referenced this pull request Jun 25, 2022
This commit contains changes made to the existing ITs to support the new ITs.

Changes:
- Make the "custom node role" code usable by the new ITs. 
- Use flag `-DskipITs` to skip the integration tests but run the unit tests.
- Use flag `-DskipUTs` to skip the unit tests but run the "new" integration tests.
- Expand the existing Druid profile, `-P skip-tests` to skip both ITs and UTs.
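The three skip options above can be sketched as Maven argument strings (illustrative only; actual invocations run against a Druid checkout):

```shell
# The three skip options described above, composed as Maven arguments.
unit_only="install -DskipITs"      # run unit tests, skip integration tests
its_only="install -DskipUTs"       # run the new ITs, skip unit tests
no_tests="install -P skip-tests"   # skip both
echo "mvn ${unit_only}"
echo "mvn ${its_only}"
echo "mvn ${no_tests}"
```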
Restructuring of the integration tests to make the code
simpler and much easier for developers to use on their
own machine.

* New project, docker-tests for this work
* Project to build the test Docker image
* Projects per test group
* Restructuring of configuration
* Two example test groups

For now, this version exists parallel to the original
version.
* Moved test code to a single module, as previously
* Initialization handles customization
* Shared cluster config across categories
* Revised to use JUnit categories
* Rebased on latest master
* Corresponding doc updates
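The "JUnit categories" item above maps onto Maven's standard mechanism for selecting tests: the surefire/failsafe `groups` property accepts a category class name. A sketch; the fully qualified class name here is hypothetical, not taken from this PR:

```shell
# Select ITs by JUnit 4 category via the failsafe `groups` property.
# The category class name below is a made-up example.
CATEGORY_CLASS="org.apache.druid.testsEx.categories.HighAvailability"
CMD="mvn verify -Dgroups=${CATEGORY_CLASS}"
echo "${CMD}"
```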
@paul-rogers paul-rogers changed the title First cut at restructuring the integration tests Revised integration test framework Aug 16, 2022
Support for all of the IntegrationTestingConfig properties
Wrapper script for IT operations
Additional documentation
Build fix
@paul-rogers paul-rogers marked this pull request as ready for review August 19, 2022 22:05
Improved env var-based configuration
Test creation guide
Enable new ITs in travis
Parameterized tests
@paul-rogers
Contributor Author

Revised to prepare for merge:

  • Parameterized tests
  • Test creation guide
  • Main IT script: it.sh
  • Enhanced configuration options: env vars, etc.
  • Test runner supports parameterized tests
  • Test runner allows test-specific code configuration (add Guice modules, etc.)
  • Other cleanup, bug fixes
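Put together, a developer session with the new framework might look like the following sketch. The `it.sh` subcommand names and the category name are assumptions for illustration, not taken from this PR:

```shell
# Hypothetical end-to-end developer workflow (all names are illustrative).
BUILD="mvn clean install -P dist -DskipTests"  # normal distribution build
IMAGE="./it.sh image"                          # build the test Docker image
UP="./it.sh up HighAvailability"               # start the cluster for a category
TEST="./it.sh test HighAvailability"           # run that category's ITs
DOWN="./it.sh down HighAvailability"           # tear the cluster down
for step in "$BUILD" "$IMAGE" "$UP" "$TEST" "$DOWN"; do
  echo "$step"
done
```

Because the cluster stays up between `up` and `down`, tests can also be run repeatedly from an IDE against the same cluster, which is the debugging workflow described at the top of this PR.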

@paul-rogers
Contributor Author

The latest PR converted two ITs to use the new framework. Both pass in Travis. (And there was much rejoicing.)

However, two of the old ITs fail for obscure reasons. A security test fails with an auth failure, but the same test was clean in a prior build. Another IT can't find its input file, though this PR changes none of the input files or paths in the old tests. These are not the usual flaky IT suspects, so we probably have to sort out what's what. Once we do, this should be good to go.

@paul-rogers
Contributor Author

@kfaraz, thanks for your approval of this PR. The final changes are in for this PR and the build is clean. Please take a quick final look, and merge the PR at your convenience.


<profiles>
<profile>
<id>IT-HighAvailability</id>
Contributor

stylistic nit: I think that the indentation looks a bit off in the <profile>...</profile> tags.

@kfaraz
Contributor

kfaraz commented Aug 24, 2022

Thanks for the update, @paul-rogers ! I will take another look today and merge this.

@kfaraz kfaraz merged commit cfed036 into apache:master Aug 24, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
Labels: Area - Dev, Area - Testing