
Add screenshot testing #1068

Closed

Conversation

@tihuan (Contributor) commented Apr 16, 2020:

Description of proposed changes

This PR adds jest-image-snapshot and jest.retryTimes() to do basic screenshot diffing of the website's functionality.
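For readers unfamiliar with jest-image-snapshot, a test in this style looks roughly like the following minimal sketch (not the exact code in this PR; the dataset path, wait strategy, and test names are illustrative assumptions):

```js
// Sketch of a jest-image-snapshot + jest-puppeteer screenshot test.
import { toMatchImageSnapshot } from 'jest-image-snapshot';

// Register the matcher so expect(...).toMatchImageSnapshot() is available.
expect.extend({ toMatchImageSnapshot });

describe("Zika", () => {
  it("matches the stored screenshot", async () => {
    // BASE_URL is the globally configured dev-server address; the /zika path is illustrative.
    await page.goto(`${BASE_URL}/zika`, { waitUntil: "networkidle2" });
    const image = await page.screenshot();
    // Compares against a PNG stored under __image_snapshots__/ and writes a diff image on failure.
    expect(image).toMatchImageSnapshot();
  });
});
```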

Related issue(s)

Related to #917

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Please help test this PR locally by:

  1. Make sure you have the default zika.json data via: curl http://data.nextstrain.org/zika.json --compressed -o data/zika.json

  2. Run npm run dev

  3. In another terminal tab, run npm run integration-test; all snapshot tests should pass:

[Screenshot: npm run integration-test output with all snapshot tests passing]

Thank you for contributing to Nextstrain!

@tihuan force-pushed the Add-screenshot-testing branch from d347ebd to 6ef74f1 (April 16, 2020)
.eslintrc
@@ -7,6 +7,7 @@ globals:
browser: true
context: true
jestPuppeteer: true
BASE_URL: true
@tihuan (Contributor, Author) commented:

This was missing from .eslintrc, thus causing the linter error. Sorry!

@@ -16,3 +16,6 @@ s3/
node_modules/
npm-debug.log
*tgz

### IDE ###
.vscode/*
@tihuan (Contributor, Author) commented:

Ignoring .vscode settings

@@ -7,6 +10,8 @@ import { setDefaultOptions } from 'expect-puppeteer';
jest.setTimeout(30 * 1000);
setDefaultOptions({ timeout: 3 * 1000 });

jest.retryTimes(4);
@tihuan (Contributor, Author) commented:

Retry up to 4 times, for a total of 5 test runs, before Jest reports a failed test. This is useful for avoiding minor flakiness when doing UI testing.
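For context, the surrounding setup file then reads roughly like this (a sketch annotating the values shown in the diff above; the note about jest-circus is general Jest behaviour rather than something this PR changes):

```js
import { setDefaultOptions } from 'expect-puppeteer';

// Each individual test may run for up to 30 seconds before Jest fails it.
jest.setTimeout(30 * 1000);
// expect-puppeteer matchers (e.g. expect(page).toMatchElement(...)) poll for up to 3 seconds.
setDefaultOptions({ timeout: 3 * 1000 });
// Re-run a failing test up to 4 more times (5 attempts total) before reporting a failure.
// jest.retryTimes() only takes effect when the jest-circus test runner is used.
jest.retryTimes(4);
```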

@bcoe (Contributor) commented:

I was finding it relatively hard to force an error (I was trying to do so by mucking with CSS, to cause a noticeable visual change); I think that's partially because retries were set high enough that it was taking quite some time to exit.

In deciding on your retry/timeout settings, would you mind doing the same, just to confirm that an expected failure occurs in a reasonable period of time and bubbles up appropriately?

src/components/controls/color-by.js (comment thread resolved)
@@ -0,0 +1,32 @@
const WAIT_BETWEEN_SCREENSHOTS_MS = 100;

export async function waitUntilScreenStable() {
@tihuan (Contributor, Author) commented:

This function ensures the screen's pixels are no longer changing, before we run toMatchImageSnapshot()
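The idea is roughly the following (a sketch of the approach rather than the exact implementation in this branch; the attempt cap is an illustrative assumption added so the loop cannot spin forever):

```js
const WAIT_BETWEEN_SCREENSHOTS_MS = 100;
const MAX_ATTEMPTS = 50; // illustrative upper bound on how long we wait for stability

export async function waitUntilScreenStable() {
  let previous = null;
  for (let i = 0; i < MAX_ATTEMPTS; i++) {
    // Take a screenshot and compare it byte-for-byte with the previous one.
    const current = (await page.screenshot()).toString("base64");
    if (previous !== null && current === previous) return; // pixels have stopped changing
    previous = current;
    // page.waitFor(ms) pauses; newer Puppeteer versions call this page.waitForTimeout(ms).
    await page.waitFor(WAIT_BETWEEN_SCREENSHOTS_MS);
  }
}
```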

@tihuan mentioned this pull request (Apr 16, 2020)
@jameshadfield (Member) commented:

This is really nice. I had a chance to test it out and the code seemed good (although I am new to these kinds of tests), and I'm excited about all the things we can do here. Do you know why the images I'm getting may have failed? (They look the same as your screenshots, to my eyes at least.)

 PASS  test/sum.test.js
 PASS  test/sortedDomain.test.js
 FAIL  test/integration/zika.test.js (176.319s)
  ● Zika › Color by › author › matches the screenshot

    Expected image to match or be a close match to snapshot but was 2.158760730131173% different from snapshot (28649 differing pixels).
    See diff for details: /Users/naboo/github/nextstrain/auspice/test/integration/__image_snapshots__/__diff_output__/Color by: author-diff.png

(I think it's fine to use the "live" zika build as this isn't changing currently, but one day soon i'll make a bunch of "test" datasets available which we can guarantee won't change.)

Also, could I ask you to

@bcoe (Contributor) commented Apr 16, 2020:

@tihuan @jameshadfield I'm excited to see this approach in action; I've talked to peers who swear by image snapshots for integration tests (I know this approach is used heavily at Facebook, as an example).

Thought for improvement

For CI/CD my only concern is: can we get it so a user contributing to the project can easily see the visual difference they've caused? Reading up on Jest Image Snapshot, it sounds like we could upload the image somewhere custom. What if we did something slick like leaving the image as a comment on a failing PR? I can dig into how practical this would be.

DEV_DOCS.md vs. CONTRIBUTING.md

@jameshadfield this is absolutely a nit, so feel free to ignore me 😆 but it's fairly common to have a file called CONTRIBUTING.md to describe how to start contributing to a project; including steps like getting tests up and running. I believe GitHub actually picks up on this file, and displays it in their UI in a few places.

Would you be open to CONTRIBUTING.md rather than DEV_DOCS.md?

})
describe("region", () => {
it("matches the screenshot", async () => {
await toMatchImageSnapshot("region", async () => {
A contributor commented:

I like this approach: you're ensuring that specific components on the page match visually (this is quite nice).

@tihuan force-pushed the Add-screenshot-testing branch from 6ef74f1 to 91d057d (April 16, 2020)
@tihuan (Contributor, Author) commented Apr 16, 2020:

Thanks for the quick feedback on this, everyone 🤩🙏!!

Thanks for the heads-up! I just rebased this from upstream/master, so it seems to be picking up the GH Actions now. Thanks for adding that, @bcoe! Super exciting to get more automation in place 🎉

Ah thanks for the reminder, I will add some doc today!

Regarding the image diff sometimes failing even though our eyes can't tell the difference: I read something yesterday about how different machines, graphics cards, etc. can cause subtle pixel differences that only machines can detect. That's part of the reason why I added blur, but it looks like that's not enough lol

We might need to tweak the threshold for failing, like 3%, 4%, 5% or something? Depends on how much margin of error we're comfortable with here!

First time doing image diffing, so I'm not sure what other heuristics people configure to prevent flakiness 😆

@bcoe Love the idea of uploading the image diff and showing it to people! Depending on how Auspice is set up, we'll need to hook it up to an S3 bucket, manage permissions, etc. so the GH Action can upload it and post the image link. That'd be super convenient and awesome!!

@tihuan (Contributor, Author) commented Apr 16, 2020:

Just checked the failed integration test output and we're getting almost 18% diff! So probably a real failure

[Screenshot: failed integration test output showing an ~18% diff]

From what @jameshadfield was saying, it sounds like the "live" zika dataset is probably different from the seeded one, so that could be why?

If that's the case, should we add seed datasets in this PR and run the tests against them to have consistency?

Thanks all 😄 !

@tihuan (Contributor, Author) commented Apr 16, 2020:

Uhh interesting, I just added a 5% failure threshold, and the tests magically passed now -.- Given that it was almost an 18% diff, I'm not sure what made it pass now! HMM

@jameshadfield (Member) commented:

but it's fairly common to have a file called CONTRIBUTING.md to describe how to start contributing to a project;

Until recently we had a contributing.md but PR #978 moved these to DEV_DOCS so that all projects can pull in a generic "nextstrain" contributing.md. Happy to discuss directions in another issue, but for this PR let's add to dev_docs.

Re: CI tests. I see a multi-step process:

  • As it stands, my understanding is that these will be run via the GitHub workflow. Failures might be hard to reason about, but it's a big win to know that a PR resulted in them failing.
  • In a separate PR we can add automatic posting of why the test failed. I can definitely make an S3 bucket and add credentials as a GitHub secret (and share them with you both for development purposes) to make the GitHub job more intelligent.

it sounds like the "live" zika dataset is probably different from the seeded one

Zika hasn't been updated since last year, so that's not the case. But more generally we should add a "test set" of JSONs to the S3 bucket which nextstrain draws from so that datasets don't change underneath us. This can be done in a future PR. (Please don't add dataset JSONs to this repo, they will add too much weight!)


I think the only thing for this PR is understanding why the image diff percentages are varying. I don't think there's anything stochastic about how we render auspice. I'll also look into this today.

@tihuan (Contributor, Author) commented Apr 16, 2020:

Oh great! Love the idea of adding test set JSONs to S3 instead of the repo 🎉

Yeah, another possibility is that the waitUntilScreenStable() helper function waits 100ms between snapshots, so if the CI machine is slow, the screen might not update much between two snapshots, and by the time the real toMatchImageSnapshot() call happens, it's diffing an incomplete screenshot against the expected snapshot.

I think I might take a different approach: compare the screenshot against the expected snapshot up to N times, or until a timeout (say 10 seconds), so we don't need to rely on waitUntilScreenStable() at all (something like the sketch below). This paradigm is also closer to how jest-puppeteer works, in the sense that its expect API also polls until timeout to see whether the expected element eventually shows up in the DOM.
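Roughly, such a helper could look like this (a sketch of the polling idea only; the time budget, retry delay, and use of customSnapshotIdentifier are illustrative assumptions, not necessarily what the final commit does):

```js
const MAX_WAIT_MS = 10 * 1000; // overall time budget for the screenshot to match
const RETRY_DELAY_MS = 500;    // pause between attempts

export async function toMatchImageSnapshot(snapshotName, setupFn) {
  await setupFn();
  const deadline = Date.now() + MAX_WAIT_MS;
  for (;;) {
    const image = await page.screenshot();
    try {
      // Re-run the comparison; if the page hasn't finished rendering yet, this throws.
      expect(image).toMatchImageSnapshot({ customSnapshotIdentifier: snapshotName });
      return; // matched the stored snapshot
    } catch (error) {
      if (Date.now() > deadline) throw error; // give up and surface the last diff
      await page.waitFor(RETRY_DELAY_MS);
    }
  }
}
```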

Will update the PR later today! Thank you!

@tihuan force-pushed the Add-screenshot-testing branch from 8004f3f to 9e3202c (April 17, 2020)
@tihuan (Contributor, Author) commented Apr 17, 2020:

OK I updated the code to use the new approach, and GH Action tests have been passing consistently!

Can anyone try running it locally too as a sanity check?

If that works, then the last thing I need to do is to update DEV_DOCS!

Thanks all 😄 !

@bcoe (Contributor) left a review:

Left a couple comments from my own testing, @tihuan 👌 I'm excited to see if I can figure out how to make screenshots upload to the PR with GitHub Actions as follow-on work.


* https://github.com/americanexpress/jest-image-snapshot#%EF%B8%8F-api
*/
const SNAPSHOT_CONFIG = {
failureThreshold: 5,
@bcoe (Contributor) commented:

Similar to my prior comment: I found I was able to cause quite a few visual oddities on the page before I triggered an error with the 5% threshold. To decide on the failureThreshold and blur values, I think a good approach might be to:

  1. purposely remove a couple elements from the page that aren't tested (make sure the missing elements trigger a failure in a reasonable amount of time).
  2. muck with CSS to make the page look noticeably visually different, and make sure we get a failure in a reasonable amount of time 👌

@tihuan (Contributor, Author) commented Apr 17, 2020:

Thanks for the excellent feedback, @bcoe ! Will play with the failureThreshold 👌

As for retryTimes(), maybe we can start with 2 instead of 4 and see if that resolves most of the random flakiness?

@tihuan force-pushed the Add-screenshot-testing branch 8 times, most recently from b4c6ce9 to 89f4896 (April 19, 2020)
@tihuan (Contributor, Author) commented Apr 19, 2020:

Hi all!

I removed failureThreshold, and GH Action's integration test is now consistently getting a roughly 17% diff.

Digging a little deeper into why it was passing before, I learned that failureThreshold actually expects a value between 0 and 1, so when I previously set it to 5 it actually meant 500% 😂, thus any image diff would pass. Sorry, MY BAD 🙏!
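For reference, with that in mind, a "fail above 5% difference" configuration would look roughly like this (a sketch based on the jest-image-snapshot API linked above; whether to keep any threshold at all is still the open question here):

```js
const SNAPSHOT_CONFIG = {
  // With failureThresholdType 'percent', failureThreshold is a ratio in [0, 1],
  // so 0.05 means 5% of pixels may differ (and the earlier value of 5 meant 500%).
  failureThreshold: 0.05,
  failureThresholdType: "percent",
};

// Inside a test:
it("matches the screenshot", async () => {
  const image = await page.screenshot();
  expect(image).toMatchImageSnapshot(SNAPSHOT_CONFIG);
});
```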

Also, thanks to @bcoe's awesome image diff comment service here (super convenient! Thank you! ✨👏), it seems like the 17% is caused by rendering differences between macOS and Linux, e.g. fonts, default browser element styling, etc.

And looking at this article, they suggest running image diffing in a Docker image for a consistent environment, which could increase the scope of this PR quite a bit. Or maybe we can somehow come up with a creative temporary solution to reap enough of the benefits of screenshot testing now?

@bcoe (Contributor) commented Apr 19, 2020:

It seems like the 17% is caused by the rendering difference between macOS and Linux?

@tihuan, GitHub Actions actually supports a macOS test runner. Perhaps we could test switching out the runner we use and see if it reduces the delta? (I added you as a collaborator to my testing repo, if you want to experiment with the macOS option.)

Another thing worth checking ...

Have any of the datasets changed significantly since your initial snapshots? If so, you might try pulling a fresh set of data.

I like @jameshadfield's idea of having a "test set", would help eliminate false positives in the future.

@anescobar1991 commented:

Zika hasn't been updated since last year, so that's not the case. But more generally we should add a "test set" of JSONs to the S3 bucket which nextstrain draws from so that datasets don't change underneath us. This can be done in a future PR. (Please don't add dataset JSONs to this repo, they will add too much weight!)

I like that idea, but I would make sure that those test assets are versioned; otherwise tests may start failing without any changes to the code, which is probably not the intention.

@anescobar1991 commented:

You may want to enable git lfs so that over time as more commits and image snapshots are added git does not slow to a crawl.

@tihuan force-pushed the Add-screenshot-testing branch 2 times, most recently from c62c0bd to 22f2bfd (April 20, 2020)
@tihuan (Contributor, Author) commented Apr 20, 2020:

Thanks so much for all the amazing suggestions, @bcoe and @anescobar1991 !

Changing GH Action's integration-test's runs-on to macos-10.15 worked 🎉 👏 ! Do we need to worry about contributors who don't run integration tests locally on a Mac?

Will look into adding git lfs today! Thank you!

CC: @jameshadfield for the macos question 😄 Thanks!

@bcoe (Contributor) commented Apr 20, 2020:

Do we need to worry about contributors who don't run integration tests locally on a Mac?

@jameshadfield has the real say on this matter (as I've just swooped in and started volunteering 😆).... but,

If macos-10.15 gets us to a starting point, I'm supportive ... but it might be worth following up soon after (like you suggest) with thorough documentation on how to run the tests through a Docker container. At that point, we should re-snapshot on Linux using that container. Since Puppeteer talks to the browser over a wire protocol, it should be easy to communicate with the container without mounting a local filesystem, so I think it ultimately won't be too painful for folks.

Changing GH Action's integration-test's runs-on to macos-10.15 worked 🎉 👏

How did testing go with purposefully breaking a few elements? One thought: could we check that we would have caught this real regression reported by @jameshadfield:

Great! I tested it on a few examples, it worked, I merged it. But this broke legend display for numeric legend traits, which I didn't test on 😢 . These are the kind of things I'd love testing to automate out of my life!.

👆If we could catch this breakage, it would be a great indicator that our comparison is sensitive enough.

@tihuan force-pushed the Add-screenshot-testing branch from 22f2bfd to 86a3453 (April 21, 2020)
@tihuan (Contributor, Author) commented Apr 21, 2020:

@bcoe Sounds great! Yeah I also feel like getting some snapshots to work now is better than nothing, but definitely up to @jameshadfield 😆

I tried to get git lfs to work, but given that macOS Catalina has tightened security, the tool won't work properly for me until they get a developer signature from Apple: git-lfs/git-lfs#3714. I have a feeling they'll get a certificate soon, so maybe we can create an issue to add git lfs as a separate task?

Re: intentionally breaking the app to make sure the snapshots work: since I removed failureThreshold, it's definitely catching any pixel diff as a failure now! For example, I changed Phylogeny to hello, and the test caught that successfully 😄

@jameshadfield (Member) commented Apr 22, 2020:

Hey @tihuan, @bcoe et al., this all looks great. I'm happy where this branch has gotten to -- it introduces integration tests, gives examples, and provides easy-to-follow documentation. Since they work in CI, they are immediately valuable for finding regressions on future PRs 🎉


The tests do appear to be very browser-specific. I'm using macOS 10.14.6 and the screenshot tests fail for me, with all differences appearing to be related to text rendering:
[Screenshot: snapshot diff highlighting text-rendering differences]
If I regenerate the snapshots myself then the tests pass 💯

So that's to say that I think the testing approach is fine, and the browser-specificity can be tackled in a separate PR. Following your comments above, before adding further tests, I think we should tackle the following in separate PRs:


One last question before merge (edit: I think we just do the first option below):

The 4 screenshots already add ~1Mb to the repo. This isn't a deal-breaker, but we can't keep going in this direction. So, what are your thoughts on what we should do here?

  • Merge as is, and move to git-lfs for future snapshot tests (I believe all collaborators will need git lfs. This is probably my preferred option)
  • Remove ~3/4 of the pngs from the history in this branch to save space & add them back later when we have git-lfs.
  • Use git-lfs now (requires going back in the history of this branch I think)
  • Consider another place to store screenshots (S3?, my least preferable option)

Is there anything else to consider before merging this?


Other notes:

  • As an aside, the legend-display-bug linked above knocked out the whole tree, so it would have been caught at almost any % threshold!
  • Running the tests in headful mode is pretty cool to be able to see what's going on!

@tihuan (Contributor, Author) commented Apr 23, 2020:

Hi all!

GOOD NEWS, there's a workaround for installing git lfs on macOS Catalina! But we will still need @jameshadfield's help to set up git lfs in auspice first, before forks can use it.

I just tried pushing up a commit to my fork and got:

batch response: @tihuan can not upload new objects to public fork tihuan/auspice

To install git lfs, I wrote the following .md update (which I can't push to my fork yet because of lfs, so pasting below):

Install git lfs

  1. We use git lfs to store large files, such as expected test image snapshots

  2. If you're on macOS, until this issue is resolved, please follow the steps here to ensure you can run the tool successfully:

    1. In terminal, sudo spctl --master-disable
    2. Install git-lfs as normal. E.g., brew install git-lfs or port install git-lfs
    3. Run sudo spctl --master-enable

After git lfs is installed, we can start tracking files:

  1. git lfs install
  2. git lfs track "*.png"
  3. git add .
  4. Make a git commit and merge the PR into auspice!

That should do 🤞
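For reference, the git lfs track "*.png" step above simply records the pattern in .gitattributes, which is the file that actually gets committed; the entry it writes looks like this:

```
*.png filter=lfs diff=lfs merge=lfs -text
```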

Re: Docker, yeah I think that's the only sure way to have OS/browser-consistent snapshot testing, although I'm not super familiar with Docker, so it would be helpful to see if other people want to take on that task!

Thanks all!

jameshadfield added a commit that referenced this pull request Apr 24, 2020
See additions to dev-docs for rationale. Having this in master should allow us to rebase #1068 and therefore have git-lfs used from day zero.
@jameshadfield (Member) commented:

Thanks so much @tihuan, @bcoe and @anescobar1991. This has now been merged via #1084 which was a rebase of this onto master so that we use git-lfs from the get-go 😄

@tihuan deleted the Add-screenshot-testing branch (July 3, 2020)