[Flaky] rspec ./spec/system/consumer/caching/shops_caching_spec.rb:17 #11010

Closed · Tracked by #8293
filipefurtad0 opened this issue Jun 14, 2023 · 25 comments · Fixed by #11019, #11284 or #12585

@filipefurtad0
Contributor

What we should change and why (this is tech debt)

rspec ./spec/system/consumer/caching/shops_caching_spec.rb:17

Failures:

  1) Shops caching caching enterprises AMS data caches data for all enterprises, with the provided options
     Failure/Error: expect(Rails.cache.exist?(key, options)).to be true
     
       expected true
            got false
     
     [Screenshot Image]: /home/runner/work/openfoodnetwork/openfoodnetwork/tmp/capybara/screenshots/failures_r_spec_example_groups_shops_caching_caching_enterprises_ams_data_caches_data_for_all_enterprises__with_the_provided_options_155.png

Context

https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/5263043188/jobs/9512817149
https://openfoodnetwork.slack.com/archives/C012LE8LLDS/p1686718490274599?thread_ts=1686601349.929099&cid=C012LE8LLDS

Impact and timeline

@filipefurtad0
Contributor Author

@mkllnk
Member

mkllnk commented Jun 15, 2023

Looks like it got introduced by merging:

Confusingly, the next merge introduced another failing spec:

@mkllnk
Member

mkllnk commented Jun 15, 2023

I can't reproduce this. 🤔

@filipefurtad0
Contributor Author

Hmm, this one always fails for me locally - which is a good thing, I guess 😁

@mkllnk
Member

mkllnk commented Jun 16, 2023

This error still appears after merging your pull request, @filipefurtad0.

@mkllnk mkllnk reopened this Jun 16, 2023
@rioug rioug self-assigned this Jun 16, 2023
@rioug
Collaborator

rioug commented Jun 16, 2023

I managed to make it fail once in 100 tries...
I noticed that we use the file system as the cache store for the test environment:

config.cache_store = :file_store, Rails.root.join("tmp", "cache", "paralleltests#{ENV['TEST_ENV_NUMBER']}")

The only thing I can think of is maybe the hard drive being overloaded and the cache entry not being written yet by the time we check for it. Adding a sleep might fix the issue? @mkllnk @filipefurtad0 any other ideas?
Another potential avenue: the cache being cleared by another process?
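If a slow write really is the culprit, one variant of the "add a sleep" idea would be to poll for the entry instead of sleeping blindly. A minimal sketch only - the helper name and timeout below are made up for illustration and are not part of the spec:

# Hypothetical helper, not part of the spec: poll for the cache entry for a
# short while instead of asserting immediately, in case the file store write
# lands a moment after the page render.
def expect_cached_eventually(key, options, timeout: 5)
  deadline = Time.now + timeout
  sleep 0.1 until Rails.cache.exist?(key, options) || Time.now > deadline
  expect(Rails.cache.exist?(key, options)).to be true
end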

@mkllnk
Member

mkllnk commented Jun 16, 2023

Interesting. It did fail for Filipe almost every time. I don't think a file system would be that slow. I would also expect a filesystem to keep track of pending writes and block reads if needed.

All environments except test use Redis. And we do set up Redis on CI because it's used by several parts of the application. Maybe we should try to switch test to Redis as well? It may even be faster, who knows. Definitely more realistic.

@mkllnk
Member

mkllnk commented Jun 16, 2023

This patch works for me, but I have no idea whether it will still be flaky on CI:

diff --git a/config/environments/test.rb b/config/environments/test.rb
index 71228c5e3..70ebf9678 100644
--- a/config/environments/test.rb
+++ b/config/environments/test.rb
@@ -14,7 +14,11 @@ Openfoodnetwork::Application.configure do
   config.public_file_server.headers = { 'Cache-Control' => 'public, max-age=3600' }
 
   # Separate cache stores when running in parallel
-  config.cache_store = :file_store, Rails.root.join("tmp", "cache", "paralleltests#{ENV['TEST_ENV_NUMBER']}")
+  config.cache_store = :redis_cache_store, {
+    driver: :hiredis,
+    url: ENV.fetch("OFN_REDIS_URL", "redis://localhost:6379/1"),
+    reconnect_attempts: 1
+  }
 
   # Show full error reports and disable caching
   config.consider_all_requests_local       = true

@rioug
Collaborator

rioug commented Jun 16, 2023

I think I found the issue. I managed to reproduce it by pausing shops_caching_spec.rb before the expect and then running another test that calls Rails.cache.clear; I used spec/models/spree/preferences/store_spec.rb.
Rails.cache.clear just clears everything, so I don't think moving to Redis would fix the issue. I'll see if I can remove the Rails.cache.clear and replace it with something more granular.
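For illustration, "more granular" could mean deleting only the entries a spec actually writes instead of wiping the whole store. The key names below are hypothetical, not OFN's real cache keys:

# Hypothetical sketch: target the preference entries rather than everything.
# (Key names are illustrative; the real keys would come from Spree's preference store.)
Rails.cache.delete("spree/app_configuration/currency")
Rails.cache.delete("spree/app_configuration/default_country_id")
# Rails.cache.delete_matched is another option, but its matcher semantics
# differ between the file store (regexp) and the Redis store (glob string).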

@mkllnk
Member

mkllnk commented Jun 16, 2023

I tried the Redis version on CI and it still failed. So your finding is much better.

@mkllnk
Member

mkllnk commented Jun 16, 2023

CI is running many containers with groups of specs, but each container runs only one rspec process, not in parallel, as far as I know. Is Redis shared on GitHub Actions? Is cache clearing executed asynchronously?

We had the flaky spec before Filipe introduced the explicit cache clearing in the two caching specs. Is the cache automatically cleared by RSpec? In that case it would be difficult to be more granular.

Can we isolate our cache with a Thread id or something similar? Others must have had this problem before...

@rioug
Collaborator

rioug commented Jun 16, 2023

Interestingly, I just found this bit of code in spec/base_spec_helper.rb:

 config.before(:each) do
    reset_spree_preferences do |spree_config|
      # These are all settings that differ from Spree's defaults
      spree_config.default_country_id = default_country_id
      spree_config.checkout_zone = checkout_zone
      spree_config.currency = currency
      spree_config.shipping_instructions = true
    end
  end

reset_spree_preferences does Rails.cache.clear, so the cache gets cleared before each example. But as you mentioned, rspec processes are not running in parallel. Feels like I'm looking at the wrong thing.
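For context, the helper presumably behaves roughly like this - a paraphrase inferred from the thread, not the actual OFN/Spree source:

# Rough paraphrase of reset_spree_preferences as described above: it wipes the
# whole Rails cache and then yields Spree's configuration for the per-spec
# overrides shown in the before(:each) block.
def reset_spree_preferences
  Rails.cache.clear # the global clear that races with the caching specs
  yield(Spree::Config) if block_given?
end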

@rioug
Collaborator

rioug commented Jun 16, 2023

Can we isolate our cache with a Thread id or something similar? Others must have had this problem before...

Indeed: rails/rails#48341, but by the look of it Rails doesn't offer any solution.

@rioug
Collaborator

rioug commented Jun 16, 2023

I am out of ideas. The only thing I can say is that @filipefurtad0's fix isn't changing anything, because a Rails.cache.clear already happens before each example (see comment above).

@filipefurtad0
Contributor Author

Thanks for that investigation @rioug. I guess that explains why it still keeps failing...

Locally, it went from constantly failing to passing; hence, I was confident that the change would bring an improvement. Let's revert it once we have another approach for this 👍

@rioug
Collaborator

rioug commented Jun 19, 2023

No worries, this one seems to fail consistently on my fork. But it's a weird one, because we put something in the cache and check the cache straight after, and the entry is missing 😕
Anyway, @mkllnk made some config changes to use Redis for test as well: #11075. Let's see if this still occurs once it's merged.

@rioug rioug removed their assignment Jun 20, 2023
@rbroemeling

Can we isolate our cache with a Thread id or something similar?

@mkllnk Sorry that this is biting you folks as well. I'm dropping in because I noticed the link from rails/rails#48341 back to this.

I was able to isolate my cache by thread identifier using this sort of code as part of the parallelization spin-up in test/test_helper.rb:

# Protect from collisions during parallelized testing by namespacing our caching prefix by our worker identifier.
# Without this protection, all test executors share the same cache, which can lead to transient failures due to cache collisions.
# This protection is partial -- even with it, cache state still travels across tests executing in the same worker, which the author must be cautious of.
parallelize_setup do |worker|
  Rails.cache.options[:namespace] += "#{Time.now.to_i}.#{worker}:"
end

I used #{worker} because that gives the worker number in Rails test parallelization, but you could easily use a thread identifier (Thread.current.object_id) as well. Just be very careful that the adjustment to the :namespace takes place after parallelization (i.e., if you are running threads you need to ensure that it executes after the threads have split off; if it runs before the threads are created then it won't achieve the desired ends).

Not sure if this will help you, but I thought maybe it would so I'd drop it in here in case it was useful -- hope that you find a relatively easy way past this contention!
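For an RSpec suite like this one, which parallelizes with parallel_tests (hence the TEST_ENV_NUMBER in the cache store path above) rather than Rails' built-in test parallelization, the same idea might be adapted along these lines. This is only a sketch and hasn't been tried here; the hook location is an assumption:

# Sketch for spec/spec_helper.rb: namespace the cache per parallel_tests
# process so executors never share entries. TEST_ENV_NUMBER is set by
# parallel_tests; the thread id fallback is the variant mentioned above.
RSpec.configure do |config|
  config.before(:suite) do
    worker = ENV.fetch("TEST_ENV_NUMBER") { Thread.current.object_id.to_s }
    Rails.cache.options[:namespace] = "test.#{Time.now.to_i}.#{worker}"
  end
end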

@abdellani
Member

Another new occurrence (just FYI).
./spec/system/consumer/caching/shops_caching_spec.rb:22

filipefurtad0 added a commit to filipefurtad0/openfoodnetwork that referenced this issue Jul 26, 2023
As discussed here (openfoodfoundation#11010 (comment)), reset_spree_preferences already does Rails.cache.clear
@mkllnk mkllnk self-assigned this Jul 26, 2023
@mkllnk
Member

mkllnk commented Jul 26, 2023

I've been thinking about this spec some more. The cache entry expires after 15 seconds, which seems like plenty, but there may be a reason why the system is waiting somewhere. It is the first spec in the spec run, which can trigger extra boot time, like compiling assets. So here are ways it can fail:

      visit shops_path

      # wait until cache expires
      sleep 16

      key, options = CacheService::FragmentCaching.ams_shops
      expect_cached "views/#{key}", options
      visit shops_path
      sleep 10
      visit shops_path
      sleep 6

      key, options = CacheService::FragmentCaching.ams_shops
      expect_cached "views/#{key}", options

I tried the second example to see whether another spec could influence this, but having the visits in two different it blocks doesn't fail the spec, because the cache is reset between specs. So I'm looking for a delay between the write to the cache and the rendering of the page.

Locally, I'm using Spring to run tests more quickly, but to simulate CI conditions I stopped Spring and tried again. It usually passes on my machine. But if I add sleep 12 then it fails. The whole test takes 17 seconds, so there's a delay of at least 3 seconds on my machine. The CI machines can be a lot slower, and there we may see a delay of 15 seconds, which causes the cache to expire before we can test for it.

Then I had the idea of clearing the cache with rails tmp:clear and running the spec again. Suddenly it took ages. It's running a nodejs process, compiling JavaScript, I guess. Now the result is even worse:

Ferrum::PendingConnectionsError
 RuntimeError: Requests did not finish in 60 seconds

After that, I ran the test again. Some JS had been compiled and the rest wasn't taking as long. The test ran for 36 seconds and failed. I tried to surround the example with Timecop.freeze but it still failed. I guess Redis has its own clock. Maybe that would have worked with the file system cache. So my last idea is to preload the page once so that all the JS is compiled when we do the real test. Seems to work. I'll open a PR for that.
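A sketch of that last idea: warm the page up once before the timed part of the example, so JS/asset compilation doesn't eat into the 15-second cache lifetime (the exact hook used in the PR may differ):

# Sketch: preload the shop front once so first-request work (asset/JS
# compilation) happens before the example starts counting cache expiry.
before do
  visit shops_path
end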

@rioug
Collaborator

rioug commented Jul 27, 2023

Nice find!

@mkllnk
Member

mkllnk commented Jan 15, 2024

This came back:

The previous PR fixed one spec example but if the execution order of examples changed then this could happen again. I'm looking into a more sustainable solution.

@filipefurtad0
Contributor Author

Some failures were also tracked in this issue.

@github-project-automation github-project-automation bot moved this to All the things in OFN Delivery board Feb 15, 2024
@RachL RachL moved this from All the things to In Dev 💪 in OFN Delivery board Feb 15, 2024
@sigmundpetersen
Contributor

@filipefurtad0 @mkllnk are any of you planning to work on this issue?
If not, we should move it back to All the things.

@filipefurtad0
Contributor Author

filipefurtad0 commented Jun 3, 2024

Thank you for pinging, @sigmundpetersen. From my side, we can move it to All the things. It would be great to fix it, but I don't think this is a priority.

@mkllnk
Member

mkllnk commented Jun 4, 2024

I might continue this again this week. Otherwise I'll move it back.
