
Add screenshot testing #1068

Closed

Conversation

@tihuan (Contributor) commented Apr 16, 2020:

Description of proposed changes

This PR adds jest-image-snapshot and jest.retryTimes() to do basic screenshot diffing of the website's functionality.
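For readers unfamiliar with jest-image-snapshot, a test in this style looks roughly like the following minimal sketch (not the exact code in this PR; the dataset path, wait strategy, and test names are illustrative assumptions):

```js
// Sketch of a jest-image-snapshot + jest-puppeteer screenshot test.
import { toMatchImageSnapshot } from 'jest-image-snapshot';

// Register the matcher so expect(...).toMatchImageSnapshot() is available.
expect.extend({ toMatchImageSnapshot });

describe("Zika", () => {
  it("matches the stored screenshot", async () => {
    // BASE_URL is the globally configured dev-server address; the /zika path is illustrative.
    await page.goto(`${BASE_URL}/zika`, { waitUntil: "networkidle2" });
    const image = await page.screenshot();
    // Compares against a PNG stored under __image_snapshots__/ and writes a diff image on failure.
    expect(image).toMatchImageSnapshot();
  });
});
```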

Related issue(s)

Related to #917

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Please help test this PR locally by:

  1. Make sure you have the default zika.json data via: curl http://data.nextstrain.org/zika.json --compressed -o data/zika.json

  2. Run npm run dev

  3. In another terminal tab, run npm run integration-test; all snapshot tests should pass:

[Screenshot: npm run integration-test output with all snapshot tests passing]

Thank you for contributing to Nextstrain!

@tihuan force-pushed the Add-screenshot-testing branch from d347ebd to 6ef74f1 (April 16, 2020)
.eslintrc
@@ -7,6 +7,7 @@ globals:
browser: true
context: true
jestPuppeteer: true
BASE_URL: true
@tihuan (Contributor, Author) commented:

This was missing from .eslintrc, thus causing the linter error. Sorry!

@@ -16,3 +16,6 @@ s3/
node_modules/
npm-debug.log
*tgz

### IDE ###
.vscode/*
@tihuan (Contributor, Author) commented:

Ignoring .vscode settings

@@ -7,6 +10,8 @@ import { setDefaultOptions } from 'expect-puppeteer';
jest.setTimeout(30 * 1000);
setDefaultOptions({ timeout: 3 * 1000 });

jest.retryTimes(4);
@tihuan (Contributor, Author) commented:

Retry up to 4 times, for a total of 5 test runs, before Jest reports a failed test. This is useful for avoiding minor flakiness when doing UI testing.
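For context, the surrounding setup file then reads roughly like this (a sketch annotating the values shown in the diff above; the note about jest-circus is general Jest behaviour rather than something this PR changes):

```js
import { setDefaultOptions } from 'expect-puppeteer';

// Each individual test may run for up to 30 seconds before Jest fails it.
jest.setTimeout(30 * 1000);
// expect-puppeteer matchers (e.g. expect(page).toMatchElement(...)) poll for up to 3 seconds.
setDefaultOptions({ timeout: 3 * 1000 });
// Re-run a failing test up to 4 more times (5 attempts total) before reporting a failure.
// jest.retryTimes() only takes effect when the jest-circus test runner is used.
jest.retryTimes(4);
```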

@bcoe (Contributor) commented:

I was finding it relatively hard to force an error (I was trying to do so by mucking with CSS, to cause a noticeable visual change); I think that's partially because retries were set high enough that it was taking quite some time to exit.

In deciding on your retry/timeout settings, would you mind doing the same, just to confirm that an expected failure occurs in a reasonable period of time and bubbles up appropriately?

src/components/controls/color-by.js (comment thread resolved)
@@ -0,0 +1,32 @@
const WAIT_BETWEEN_SCREENSHOTS_MS = 100;

export async function waitUntilScreenStable() {
@tihuan (Contributor, Author) commented:

This function ensures the screen's pixels are no longer changing, before we run toMatchImageSnapshot()
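The idea is roughly the following (a sketch of the approach rather than the exact implementation in this branch; the attempt cap is an illustrative assumption added so the loop cannot spin forever):

```js
const WAIT_BETWEEN_SCREENSHOTS_MS = 100;
const MAX_ATTEMPTS = 50; // illustrative upper bound on how long we wait for stability

export async function waitUntilScreenStable() {
  let previous = null;
  for (let i = 0; i < MAX_ATTEMPTS; i++) {
    // Take a screenshot and compare it byte-for-byte with the previous one.
    const current = (await page.screenshot()).toString("base64");
    if (previous !== null && current === previous) return; // pixels have stopped changing
    previous = current;
    // page.waitFor(ms) pauses; newer Puppeteer versions call this page.waitForTimeout(ms).
    await page.waitFor(WAIT_BETWEEN_SCREENSHOTS_MS);
  }
}
```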

@tihuan mentioned this pull request (Apr 16, 2020)
@jameshadfield (Member) commented:

This is really nice. I had a chance to test it out and the code seemed good (although I am new to these kinds of tests), and I'm excited about all the things we can do here. Do you know why the images I'm getting may have failed? (They look the same as your screenshots, to my eyes at least.)

 PASS  test/sum.test.js
 PASS  test/sortedDomain.test.js
 FAIL  test/integration/zika.test.js (176.319s)
  ● Zika › Color by › author › matches the screenshot

    Expected image to match or be a close match to snapshot but was 2.158760730131173% different from snapshot (28649 differing pixels).
    See diff for details: /Users/naboo/github/nextstrain/auspice/test/integration/__image_snapshots__/__diff_output__/Color by: author-diff.png

(I think it's fine to use the "live" zika build as this isn't changing currently, but one day soon i'll make a bunch of "test" datasets available which we can guarantee won't change.)

Also, could I ask you to

@bcoe (Contributor) commented Apr 16, 2020:

@tihuan @jameshadfield I'm excited to see this approach in action; I've talked to peers who swear by image snapshots for integration tests (I know this approach is used heavily at Facebook, as an example).

Thought for improvement

For CI/CD my only concern is: can we get it so a user contributing to the project can easily see the visual difference they've caused? Reading up on Jest Image Snapshot, it sounds like we could upload the image somewhere custom. What if we did something slick like leaving the image as a comment on a failing PR? I can dig into how practical this would be.

DEV_DOCS.md vs. CONTRIBUTING.md

@jameshadfield this is absolutely a nit, so feel free to ignore me 😆 but it's fairly common to have a file called CONTRIBUTING.md to describe how to start contributing to a project; including steps like getting tests up and running. I believe GitHub actually picks up on this file, and displays it in their UI in a few places.

Would you be open to CONTRIBUTING.md rather than DEV_DOCS.md?

})
describe("region", () => {
it("matches the screenshot", async () => {
await toMatchImageSnapshot("region", async () => {
A contributor commented:

I like this approach: you're ensuring that specific components on the page match visually (this is quite nice).

@tihuan force-pushed the Add-screenshot-testing branch from 6ef74f1 to 91d057d (April 16, 2020)
@tihuan (Contributor, Author) commented Apr 16, 2020:

Thanks for the quick feedback on this, everyone 🤩🙏!!

Thanks for the heads-up! I just rebased this from upstream/master, so it seems to be picking up the GH Actions now. Thanks for adding that, @bcoe! Super exciting to get more automation in place 🎉

Ah thanks for the reminder, I will add some doc today!

Regarding the image diff sometimes failing even though our eyes can't tell the difference: I read something yesterday about how different machines, graphics cards, etc. can cause subtle pixel differences that only machines can detect. That's part of the reason why I added blur, but it looks like that's not enough lol

We might need to tweak the threshold for failing, like 3%, 4%, 5% or something? Depends on how much margin of error we're comfortable with here!

First time doing image diffing, so I'm not sure what other heuristics people configure to prevent flakiness 😆

@bcoe Love the idea of uploading the image diff and showing it to people! Depending on how Auspice is set up, we'll need to hook it up to an S3 bucket, manage permissions, etc. so the GH Action can upload it and post the image link. That'd be super convenient and awesome!!

@tihuan (Contributor, Author) commented Apr 16, 2020:

Just checked the failed integration test output and we're getting almost 18% diff! So probably a real failure

[Screenshot: failed integration test output showing an ~18% diff]

From what @jameshadfield was saying, it sounds like the "live" zika dataset is probably different from the seeded one, so that could be why?

If that's the case, should we add seed datasets in this PR and run the tests against them to have consistency?

Thanks all 😄 !

@tihuan (Contributor, Author) commented Apr 16, 2020:

Uhh interesting, I just added a 5% failure threshold, and the tests magically passed now -.- Given that it was almost an 18% diff, I'm not sure what made it pass now! HMM

@jameshadfield (Member) commented:

but it's fairly common to have a file called CONTRIBUTING.md to describe how to start contributing to a project;

Until recently we had a contributing.md but PR #978 moved these to DEV_DOCS so that all projects can pull in a generic "nextstrain" contributing.md. Happy to discuss directions in another issue, but for this PR let's add to dev_docs.

Re: CI tests. I see a multi-step process:

  • As it stands, my understanding is that these will be run via the GitHub workflow. Failures might be hard to reason about, but it's a big win to know that a PR resulted in them failing.
  • In a separate PR we can add automatic posting of why the test failed. I can definitely make an S3 bucket and add credentials as a GitHub secret (and share them with you both for development purposes) to make the GitHub job more intelligent.

it sounds like the "live" zika dataset is probably different from the seeded one

Zika hasn't been updated since last year, so that's not the case. But more generally we should add a "test set" of JSONs to the S3 bucket which nextstrain draws from so that datasets don't change underneath us. This can be done in a future PR. (Please don't add dataset JSONs to this repo, they will add too much weight!)


I think the only thing for this PR is understanding why the image diff percentages are varying. I don't think there's anything stochastic about how we render auspice. I'll also look into this today.

@tihuan (Contributor, Author) commented Apr 16, 2020:

Oh great! Love the idea of adding test set JSONs to S3 instead of the repo 🎉

Yeah, another possibility is that the waitUntilScreenStable() helper function waits 100ms between snapshots, so if the CI machine is slow, the screen might not update much between two snapshots, and by the time the real toMatchImageSnapshot() call happens, it's diffing an incomplete screenshot against the expected snapshot.

I think I might take a different approach: compare the screenshot against the expected snapshot up to N times, or until a timeout (say 10 seconds), so we don't need to rely on waitUntilScreenStable() at all (something like the sketch below). This paradigm is also closer to how jest-puppeteer works, in the sense that its expect API also polls until timeout to see whether the expected element eventually shows up in the DOM.
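Roughly, such a helper could look like this (a sketch of the polling idea only; the time budget, retry delay, and use of customSnapshotIdentifier are illustrative assumptions, not necessarily what the final commit does):

```js
const MAX_WAIT_MS = 10 * 1000; // overall time budget for the screenshot to match
const RETRY_DELAY_MS = 500;    // pause between attempts

export async function toMatchImageSnapshot(snapshotName, setupFn) {
  await setupFn();
  const deadline = Date.now() + MAX_WAIT_MS;
  for (;;) {
    const image = await page.screenshot();
    try {
      // Re-run the comparison; if the page hasn't finished rendering yet, this throws.
      expect(image).toMatchImageSnapshot({ customSnapshotIdentifier: snapshotName });
      return; // matched the stored snapshot
    } catch (error) {
      if (Date.now() > deadline) throw error; // give up and surface the last diff
      await page.waitFor(RETRY_DELAY_MS);
    }
  }
}
```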

Will update the PR later today! Thank you!

@tihuan force-pushed the Add-screenshot-testing branch from 8004f3f to 9e3202c (April 17, 2020)
@tihuan (Contributor, Author) commented Apr 17, 2020:

OK I updated the code to use the new approach, and GH Action tests have been passing consistently!

Can anyone try running it locally too as a sanity check?

If that works, then the last thing I need to do is to update DEV_DOCS!

Thanks all 😄 !

@bcoe (Contributor) left a review:

Left a couple comments from my own testing, @tihuan 👌 I'm excited to see if I can figure out how to make screenshots upload to the PR with GitHub Actions as follow-on work.


* https://github.com/americanexpress/jest-image-snapshot#%EF%B8%8F-api
*/
const SNAPSHOT_CONFIG = {
failureThreshold: 5,
@bcoe (Contributor) commented:

Similar to my prior comment: I found I was able to cause quite a few visual oddities on the page before I triggered an error with the 5% threshold. To decide on the failureThreshold and blur values, I think a good approach might be to:

  1. purposely remove a couple elements from the page that aren't tested (make sure the missing elements trigger a failure in a reasonable amount of time).
  2. muck with CSS to make the page look noticeably visually different, and make sure we get a failure in a reasonable amount of time 👌

@tihuan (Contributor, Author) commented Apr 17, 2020:

Thanks for the excellent feedback, @bcoe ! Will play with the failureThreshold 👌

As for retryTimes(), maybe we can start with 2 instead of 4 and see if that resolves most of the random flakiness?

@tihuan force-pushed the Add-screenshot-testing branch 8 times, most recently from b4c6ce9 to 89f4896 (April 19, 2020)
@tihuan (Contributor, Author) commented Apr 19, 2020:

Hi all!

I removed failureThreshold, and GH Action's integration test is now consistently getting a roughly 17% diff.

Digging a little deeper into why it was passing before, I learned that failureThreshold actually expects a value between 0 and 1, so when I previously set it to 5 it actually meant 500% 😂, thus any image diff would pass. Sorry, MY BAD 🙏!
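For reference, with that in mind, a "fail above 5% difference" configuration would look roughly like this (a sketch based on the jest-image-snapshot API linked above; whether to keep any threshold at all is still the open question here):

```js
const SNAPSHOT_CONFIG = {
  // With failureThresholdType 'percent', failureThreshold is a ratio in [0, 1],
  // so 0.05 means 5% of pixels may differ (and the earlier value of 5 meant 500%).
  failureThreshold: 0.05,
  failureThresholdType: "percent",
};

// Inside a test:
it("matches the screenshot", async () => {
  const image = await page.screenshot();
  expect(image).toMatchImageSnapshot(SNAPSHOT_CONFIG);
});
```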

Also, thanks to @bcoe's awesome image diff comment service here (super convenient! Thank you! ✨👏), it seems like the 17% is caused by rendering differences between macOS and Linux, e.g. fonts, default browser element styling, etc.

And looking at this article, they suggest running image diffing in a Docker image for a consistent environment, which could increase the scope of this PR quite a bit. Or maybe we can somehow come up with a creative temporary solution to reap enough of the benefits of screenshot testing now?

@bcoe (Contributor) commented Apr 19, 2020:

It seems like the 17% is caused by the rendering difference between macOS and Linux?

@tihuan, GitHub Actions actually supports a macOS test runner. Perhaps we could test switching out the runner we use and see if it reduces the delta? (I added you as a collaborator to my testing repo, if you want to experiment with the macOS option.)

Another thing worth checking ...

Have any of the datasets changed significantly since your initial snapshots? If so, you might try pulling a fresh set of data.

I like @jameshadfield's idea of having a "test set", would help eliminate false positives in the future.

@anescobar1991 commented:

Zika hasn't been updated since last year, so that's not the case. But more generally we should add a "test set" of JSONs to the S3 bucket which nextstrain draws from so that datasets don't change underneath us. This can be done in a future PR. (Please don't add dataset JSONs to this repo, they will add too much weight!)

I like that idea, but I would make sure that those test assets are versioned; otherwise tests may start failing without any changes to the code, which is probably not the intention.

@anescobar1991 commented:

You may want to enable git lfs so that over time as more commits and image snapshots are added git does not slow to a crawl.

@tihuan force-pushed the Add-screenshot-testing branch 2 times, most recently from c62c0bd to 22f2bfd (April 20, 2020)
@tihuan (Contributor, Author) commented Apr 20, 2020:

Thanks so much for all the amazing suggestions, @bcoe and @anescobar1991 !

Changing GH Action's integration-test's runs-on to macos-10.15 worked 🎉 👏 ! Do we need to worry about contributors who don't run integration tests locally on a Mac?

Will look into adding git lfs today! Thank you!

CC: @jameshadfield for the macos question 😄 Thanks!

@bcoe (Contributor) commented Apr 20, 2020:

Do we need to worry about contributors who don't run integration tests locally on a Mac?

@jameshadfield has the real say on this matter (as I've just swooped in and started volunteering 😆).... but,

If macos-10.15 gets us to a starting point, I'm supportive ... but it might be worth following up soon after (like you suggest) with thorough documentation on how to run the tests through a Docker container. At that point, we should re-snapshot on Linux using that container. Since Puppeteer talks to the browser over a wire protocol, it should be easy to communicate with the container without mounting a local filesystem, so I think it ultimately won't be too painful for folks.

Changing GH Action's integration-test's runs-on to macos-10.15 worked 🎉 👏

How did testing go with purposefully breaking a few elements? One thought: could we check that we would have caught this real regression reported by @jameshadfield:

Great! I tested it on a few examples, it worked, I merged it. But this broke legend display for numeric legend traits, which I didn't test on 😢 . These are the kind of things I'd love testing to automate out of my life!.

👆If we could catch this breakage, it would be a great indicator that our comparison is sensitive enough.

@tihuan force-pushed the Add-screenshot-testing branch from 22f2bfd to 86a3453 (April 21, 2020)
@tihuan (Contributor, Author) commented Apr 21, 2020:

@bcoe Sounds great! Yeah I also feel like getting some snapshots to work now is better than nothing, but definitely up to @jameshadfield 😆

I tried to get git lfs to work, but given that macOS Catalina has tightened security, the tool won't work properly for me until they get a developer signature from Apple: git-lfs/git-lfs#3714. I have a feeling they'll get a certificate soon, so maybe we can create an issue to add git lfs as a separate task?

Re: intentionally breaking the app to make sure the snapshots work: since I removed failureThreshold, it's definitely catching any pixel diff as a failure now! For example, I changed Phylogeny to hello, and the test caught that successfully 😄

@jameshadfield (Member) commented Apr 22, 2020:

Hey @tihuan, @bcoe et al., this all looks great. I'm happy where this branch has gotten to -- it introduces integration tests, gives examples, and provides easy-to-follow documentation. Since they work in CI, they are immediately valuable for finding regressions on future PRs 🎉


The tests do appear to be very browser-specific. I'm using macOS 10.14.6 and the screenshot tests fail for me, with all differences appearing to be related to text rendering:
[Screenshot: snapshot diff highlighting text-rendering differences]
If I regenerate the snapshots myself then the tests pass 💯

So that's to say that I think the testing approach is fine, and the browser-specificity can be tackled in a separate PR. Following your comments above, before adding further tests, I think we should tackle the following in separate PRs:


One last question before merge (edit: I think we just do the first option below):

The 4 screenshots already add ~1Mb to the repo. This isn't a deal-breaker, but we can't keep going in this direction. So, what are your thoughts on what we should do here?

  • Merge as is, and move to git-lfs for future snapshot tests (I believe all collaborators will need git lfs. This is probably my preferred option)
  • Remove ~3/4 of the pngs from the history in this branch to save space & add them back later when we have git-lfs.
  • Use git-lfs now (requires going back in the history of this branch I think)
  • Consider another place to store screenshots (S3?, my least preferable option)

Is there anything else to consider before merging this?


Other notes:

  • As an aside, the legend-display-bug linked above knocked out the whole tree, so it would have been caught at almost any % threshold!
  • Running the tests in headful mode is pretty cool to be able to see what's going on!

@tihuan (Contributor, Author) commented Apr 23, 2020:

Hi all!

GOOD NEWS, there's a workaround for installing git lfs on macOS Catalina! But we will still need @jameshadfield's help to set up git lfs in auspice first, before forks can use it.

I just tried pushing up a commit to my fork and got:

batch response: @tihuan can not upload new objects to public fork tihuan/auspice

To install git lfs, I wrote the following .md update (which I can't push to my fork yet because of lfs, so pasting below):

Install git lfs

  1. We use git lfs to store large files, such as expected test image snapshots

  2. If you're on macOS, until this issue is resolved, please follow the steps here to ensure you can run the tool successfully:

    1. In terminal, sudo spctl --master-disable
    2. Install git-lfs as normal. E.g., brew install git-lfs or port install git-lfs
    3. Run sudo spctl --master-enable

After git lfs is installed, we can start tracking files:

  1. git lfs install
  2. git lfs track "*.png"
  3. git add .
  4. Make a git commit and merge the PR into auspice!

That should do 🤞
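For reference, the git lfs track "*.png" step above simply records the pattern in .gitattributes, which is the file that actually gets committed; the entry it writes looks like this:

```
*.png filter=lfs diff=lfs merge=lfs -text
```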

Re: Docker, yeah I think that's the only sure way to have OS/browser-consistent snapshot testing, although I'm not super familiar with Docker, so it would be helpful to see if other people want to take on that task!

Thanks all!

jameshadfield added a commit that referenced this pull request Apr 24, 2020
See additions to dev-docs for rationale. Having this in master should allow us to rebase #1068 and therefore have git-lfs used from day zero.
@jameshadfield (Member) commented:

Thanks so much @tihuan, @bcoe and @anescobar1991. This has now been merged via #1084 which was a rebase of this onto master so that we use git-lfs from the get-go 😄

@tihuan deleted the Add-screenshot-testing branch (July 3, 2020)