Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix e2e flake by not consuming abort signal in apiq query options helper #2614

Merged
merged 1 commit into from
Dec 11, 2024

Conversation

david-crespo
Copy link
Collaborator

@david-crespo david-crespo commented Dec 11, 2024

Short version

Passing React Query's abort signal into our API fetch calls means the calls get aborted when queries get canceled due to being unmounted. One inconvenient time for this to happen is when React Strict Mode does its thing in dev mode, causing an unmount and remount in the middle of a prefetch in a loader, which doesn't itself blow up because prefetchQuery eats errors, but components that expect prefetched data do blow up because the prefetch failed to populate the query cache. There isn't much advantage to canceling queries, so we can fix this by just not passing the signal through.

Also read this: https://tanstack.com/query/v5/docs/framework/react/guides/query-cancellation

Medium version

With #2597 we started using an options helper with prefetchQuery that passes the abort signal from React Query through into our API calls, which means they get aborted when RQ cancels a query. The existing prefetchQuery did not pass through the signal.

React's strict mode does everything twice (in development mode only) in order to surface things that should be cleaned up and redone between renders, and when it unmounts and remounts the firewall rules query, that causes RQ to cancel the fetch, which was causing our API call to abort, which means it wasn't in the query cache when usePrefetchedQuery expects it to be. I assume it only happens some of the time because of a race, though I don't understand yet what is racing. It's not a race with the request completing because lowering the mock API's random latency of 200-400ms to 50-100ms does not seem to affect the rate of failure.

In any case, after reading the React Query doc on Query Cancellation, I don't think we actually ever need to pass through the abort signal, whether for prefetches or regular queries. There's very little advantage to canceling a request that's already in flight unless you're trying to avoid downloading some big response. Mutations might make more sense to cancel. We take advantage of this during image upload, for example, where we have 6 requests in flight. It makes sense to cancel outstanding ones rather than letting them complete. Mutations are unaffected by this apiq helper — for now we are still using the old useApiMutation for all mutations.

The story of my pain

Beginning with #2597 (though I didn't notice for a few days) we started having a test flake in the create VPC e2e test (example failure). It was reproducible locally in multiple browsers. It failed about half the time with a prefetch error after the transition to VPC detail for a newly created VPC (line 40).

// New VPC form
await page.getByRole('link', { name: 'New Vpc' }).click()
await page.getByRole('textbox', { name: 'Name', exact: true }).fill('another-vpc')
await page.getByRole('textbox', { name: 'DNS name' }).fill('another-vpc')
await page.getByRole('button', { name: 'Create VPC' }).click()
// now we're on the VPC detail, on the firewall rules tab
await expect(page.getByRole('heading', { name: 'another-vpc' })).toBeVisible()

The error showed up as a failure of the firewall rules prefetch to seed the cache. The query was there after the prefetch call but stuck in pending:

error shot

It turned out this one line from #2597 was enough to cause the problem.

 VpcFirewallRulesTab.loader = async ({ params }: LoaderFunctionArgs) => {
   const { project, vpc } = getVpcSelector(params)
-  await apiQueryClient.prefetchQuery('vpcFirewallRulesView', { query: { project, vpc } })
+  await queryClient.prefetchQuery(apiq('vpcFirewallRulesView', { query: { project, vpc } }))
   return null
 }

See if you can spot the difference between the two implementations involved:

apiq

queryFn: ({ signal }) => api[method](params, { signal }).then(handleResult(method)),

apiQueryClient.prefetchQuery

queryFn: () => api[method](params).then(handleResult(method)),

CancelledError

The cause was hard to figure out with prefetchQuery because that function eats all errors with a .catch(noop). To see what was actually going wrong, I changed it to a fetchQuery, which does through errors if they happen, and showed me this was a CancelledError coming out RQ cleaning up a subscribe due to commitDoubleInvokeEffectsInDEV, is a strict mode thing.

CancelledError stack trace
Error: CancelledError

CancelledError — retryer.ts:62
cancel — retryer.ts:84
removeObserver — query.ts:329
destroy — queryObserver.ts:143
onUnsubscribe — queryObserver.ts:119
(anonymous function) — subscribable.ts:15
safelyCallDestroy — react-dom.development.js:22971
commitHookEffectListUnmount — react-dom.development.js:23139
invokePassiveEffectUnmountInDEV — react-dom.development.js:25246
invokeEffectsInDev — react-dom.development.js:27390
commitDoubleInvokeEffectsInDEV — react-dom.development.js:27363
flushPassiveEffectsImpl — react-dom.development.js:27095
flushPassiveEffects — react-dom.development.js:27023
commitRootImpl — react-dom.development.js:26974
commitRoot — react-dom.development.js:26721
performSyncWorkOnRoot — react-dom.development.js:26156
flushSyncCallbacks — react-dom.development.js:12042
(anonymous function) — react-dom.development.js:25690

Copy link

vercel bot commented Dec 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
console ✅ Ready (Inspect) Visit Preview Dec 11, 2024 0:24am

@david-crespo david-crespo merged commit 7a8ee0a into main Dec 11, 2024
8 checks passed
@david-crespo david-crespo deleted the dont-abort-queries branch December 11, 2024 16:27
david-crespo added a commit to oxidecomputer/omicron that referenced this pull request Dec 13, 2024
oxidecomputer/console@927c8b6...c1ebd8d

* [c1ebd8d9](oxidecomputer/console@c1ebd8d9) take inventory switches tab back out, DB is empty
* [2a139028](oxidecomputer/console@2a139028) tone down empty switches table message
* [2b0c3f12](oxidecomputer/console@2b0c3f12) oxidecomputer/console#2621
* [c1fbf631](oxidecomputer/console@c1fbf631) oxidecomputer/console#2622
* [1dd25f63](oxidecomputer/console@1dd25f63) oxidecomputer/console#2615
* [11486f8b](oxidecomputer/console@11486f8b) oxidecomputer/console#2620
* [ada302ce](oxidecomputer/console@ada302ce) oxidecomputer/console#2488
* [bc3161ae](oxidecomputer/console@bc3161ae) minor: remove stub e2e test
* [9e1d53c6](oxidecomputer/console@9e1d53c6) sentence case idp form heading
* [aaf1154f](oxidecomputer/console@aaf1154f) add extra assert for instance create with additional disks test flake
* [79d610dc](oxidecomputer/console@79d610dc) oxidecomputer/console#2618
* [bdbc02b7](oxidecomputer/console@bdbc02b7) oxidecomputer/console#2589
* [7a8ee0ab](oxidecomputer/console@7a8ee0ab) oxidecomputer/console#2614
* [0b5220a1](oxidecomputer/console@0b5220a1) npm audit fix
* [0c873cf4](oxidecomputer/console@0c873cf4) oxidecomputer/console#2610
* [d031c8ff](oxidecomputer/console@d031c8ff) bump playwright for navigation hang fix
* [dbd8545e](oxidecomputer/console@dbd8545e) oxidecomputer/console#2609
* [dc5562fe](oxidecomputer/console@dc5562fe) move error log to avoid failing to log certain errors
david-crespo added a commit to oxidecomputer/omicron that referenced this pull request Dec 14, 2024
oxidecomputer/console@927c8b6...c1ebd8d

* [c1ebd8d9](oxidecomputer/console@c1ebd8d9)
take inventory switches tab back out, DB is empty
* [2a139028](oxidecomputer/console@2a139028)
tone down empty switches table message
* [2b0c3f12](oxidecomputer/console@2b0c3f12)
oxidecomputer/console#2621
* [c1fbf631](oxidecomputer/console@c1fbf631)
oxidecomputer/console#2622
* [1dd25f63](oxidecomputer/console@1dd25f63)
oxidecomputer/console#2615
* [11486f8b](oxidecomputer/console@11486f8b)
oxidecomputer/console#2620
* [ada302ce](oxidecomputer/console@ada302ce)
oxidecomputer/console#2488
* [bc3161ae](oxidecomputer/console@bc3161ae)
minor: remove stub e2e test
* [9e1d53c6](oxidecomputer/console@9e1d53c6)
sentence case idp form heading
* [aaf1154f](oxidecomputer/console@aaf1154f)
add extra assert for instance create with additional disks test flake
* [79d610dc](oxidecomputer/console@79d610dc)
oxidecomputer/console#2618
* [bdbc02b7](oxidecomputer/console@bdbc02b7)
oxidecomputer/console#2589
* [7a8ee0ab](oxidecomputer/console@7a8ee0ab)
oxidecomputer/console#2614
* [0b5220a1](oxidecomputer/console@0b5220a1)
npm audit fix
* [0c873cf4](oxidecomputer/console@0c873cf4)
oxidecomputer/console#2610
* [d031c8ff](oxidecomputer/console@d031c8ff)
bump playwright for navigation hang fix
* [dbd8545e](oxidecomputer/console@dbd8545e)
oxidecomputer/console#2609
* [dc5562fe](oxidecomputer/console@dc5562fe)
move error log to avoid failing to log certain errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant