Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Successive requests for "sitemap-theses.xml" returns differing sets of results: related to #3289 #3361

Closed
jefferya opened this issue Jan 30, 2024 · 5 comments
Assignees

Comments

@jefferya
Copy link
Contributor

jefferya commented Jan 30, 2024

Possibly one cause of Google Search Console errors: #3289

Requests to https://era.library.ualberta.ca/sitemap-theses.xml will intermittently return a different set of results.

For example, the following is a recording of multiple requests in a short timespan. Notice the differing size of the resulting file. The resulting file with the smaller size seems to miss some (maybe all) theses.

j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-01-30 09:20:43--  https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.1'

sitemap-theses.xml.1          [ <=>                               ]   6.32M  --.-KB/s    in 0.1s

2024-01-30 09:20:43 (53.3 MB/s) - 'sitemap-theses.xml.1' saved [6631714]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-01-30 09:20:47--  https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.2'

sitemap-theses.xml.2          [ <=>                               ]   6.99M  40.0MB/s    in 0.2s

2024-01-30 09:20:48 (40.0 MB/s) - 'sitemap-theses.xml.2' saved [7326358]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-01-30 09:21:23--  https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.3'

sitemap-theses.xml.3          [ <=>                               ]   6.32M  --.-KB/s    in 0.1s

2024-01-30 09:21:23 (48.4 MB/s) - 'sitemap-theses.xml.3' saved [6631714]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-01-30 09:21:28--  https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.4'

sitemap-theses.xml.4          [ <=>                               ]   6.32M  --.-KB/s    in 0.1s

2024-01-30 09:21:29 (52.6 MB/s) - 'sitemap-theses.xml.4' saved [6631714]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-01-30 09:21:33--  https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.5'

sitemap-theses.xml.5          [ <=>                               ]   6.99M  41.0MB/s    in 0.2s

2024-01-30 09:21:33 (41.0 MB/s) - 'sitemap-theses.xml.5' saved [7326358]

Interesting correlations:

j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ grep 97bd4b30-9e0d-41b3-b668-f58bcd1ea06b *
sitemap-items.xml:    <loc>https://era.library.ualberta.ca/items/97bd4b30-9e0d-41b3-b668-f58bcd1ea06</loc>
sitemap-theses.xml.1:    <loc>https://era.library.ualberta.ca/items/97bd4b30-9e0d-41b3-b668-f58bcd1ea06b</loc>
sitemap-theses.xml.3:    <loc>https://era.library.ualberta.ca/items/97bd4b30-9e0d-41b3-b668-f58bcd1ea06b</loc>
sitemap-theses.xml.4:    <loc>https://era.library.ualberta.ca/items/97bd4b30-9e0d-41b3-b668-f58bcd1ea06b</loc>
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ grep 269f3d9e-eecb-4902-84eb-16cd70ff7347 *
sitemap-theses.xml:    <loc>https://era.library.ualberta.ca/items/269f3d9e-eecb-4902-84eb-16cd70ff7347</loc>
sitemap-theses.xml.2:    <loc>https://era.library.ualberta.ca/items/269f3d9e-eecb-4902-84eb-16cd70ff7347</loc>
sitemap-theses.xml.5:    <loc>https://era.library.ualberta.ca/items/269f3d9e-eecb-4902-84eb-16cd70ff7347</loc>
  • Is the sitemap-thesis.xml route intermittently returning the sitemap-item.xml results?

To Reproduce

Expected behavior

@jefferya
Copy link
Contributor Author

I also tested this with the sitemap-items.xml route and experienced a similar issue where the result sizes and contents differ (see below). Is there a correlation with the load balancing (e.g., is one server correct and the other wrong)?

j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-01-30 10:07:46--  https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml.1'

sitemap-items.xml.1           [ <=>                               ]   6.32M  --.-KB/s    in 0.1s

2024-01-30 10:07:46 (44.8 MB/s) - 'sitemap-items.xml.1' saved [6631714]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-01-30 10:07:51--  https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml.2'

sitemap-items.xml.2           [ <=>                               ]   6.99M  39.5MB/s    in 0.2s

2024-01-30 10:07:52 (39.5 MB/s) - 'sitemap-items.xml.2' saved [7326358]
j@598739:~/dev/google_search_console/site-map/2024-01-30/rev2$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-01-30 10:07:58--  https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.104
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml.3'

sitemap-items.xml.3           [ <=>                               ]   6.32M  --.-KB/s    in 0.1s

@jefferya jefferya changed the title Successive requests for "sitemap-theses.xml" returns differing sets of results Successive requests for "sitemap-theses.xml" returns differing sets of results: related to #3289 Feb 28, 2024
@jefferya
Copy link
Contributor Author

jefferya commented Feb 28, 2024

Summary:

  • local dev env functions as expected (caching disabled by default); prod weird/inconsistent results for items and theses sitemap (likely due to multiple app servers)
  • next steps:
    • check if related to cache in app/views/sitemap/_object.xml.builder and/or prod env and/or multiple app servers or another
      • by default the local dev instance doesn't have the memory cache enable whereas prod does
      • after enabling caching in the local dev environment, one can reproduce problems with the theses and items sitemaps -- seems like a shared cache key as the first of the two routes cached is returned for both routes
      • is the cache key based on the partial _object.xml.builder thus causing the theses and items sitemap routes to used same cached partial view for both?

Test 1. Does a local environment

Results: unable to reproduce in a dev environment

Test 2. Does each production application server return the same response?

Test: change local /etc/hosts to direct era.library.ualberta.ca to the individual application server IPs and test

Results: sitemap-theses.xml and sitemap-item.xml are identical on each app server. Why the discrepancy with past results?
Expectation: the results should be two distinct lists for items and theses

$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-02-28 15:30:13-- https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.54
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.6'

sitemap-theses.xml.6 [ <=> ] 6.99M --.-KB/s in 0.1s

2024-02-28 15:30:14 (65.0 MB/s) - 'sitemap-theses.xml.6' saved [7326358]

$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-02-28 15:31:32-- https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.53
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.14'

sitemap-theses.xml.14 [ <=> ] 6.99M --.-KB/s in 0.09s

2024-02-28 15:31:33 (80.7 MB/s) - 'sitemap-theses.xml.14' saved [7326358]

$ wget https://era.library.ualberta.ca/sitemap-theses.xml
--2024-02-28 15:32:05-- https://era.library.ualberta.ca/sitemap-theses.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.52
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-theses.xml.15'

sitemap-theses.xml.15 [ <=> ] 6.99M --.-KB/s in 0.09s

2024-02-28 15:32:06 (75.1 MB/s) - 'sitemap-theses.xml.15' saved [7326358]

$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-02-28 15:32:28-- https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.52
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml'

sitemap-items.xml [ <=> ] 6.99M --.-KB/s in 0.1s

2024-02-28 15:32:29 (65.2 MB/s) - 'sitemap-items.xml' saved [7326358]

$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-02-28 15:32:54-- https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.53
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml.1'

sitemap-items.xml.1 [ <=> ] 6.99M --.-KB/s in 0.1s

2024-02-28 15:32:54 (68.7 MB/s) - 'sitemap-items.xml.1' saved [7326358]

$ wget https://era.library.ualberta.ca/sitemap-items.xml
--2024-02-28 15:33:13-- https://era.library.ualberta.ca/sitemap-items.xml
Resolving era.library.ualberta.ca (era.library.ualberta.ca)... 129.128.216.54
Connecting to era.library.ualberta.ca (era.library.ualberta.ca)|129.128.216.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: 'sitemap-items.xml.2'

sitemap-items.xml.2 [ <=> ] 6.99M --.-KB/s in 0.1s

2024-02-28 15:33:14 (54.9 MB/s) - 'sitemap-items.xml.2' saved [7326358]

@jefferya
Copy link
Contributor Author

jefferya commented Feb 29, 2024

Testing the hypothesis that the sitemap-items.xml and sitemap-theses.xml routes reuse the same cache key.

The sitemap-items.xml and sitemap-theses.xml routes rely on the views views/sitemap/items.xml.builder and views/sitemap/items.xml.builder with both relying on the partial views/sitemap/_object.xml.builder (that contains the cache instruction). The hypothesis is that the cache key for the _object partial is the same for both the item and thesis routes therefore causing the response sometimes contain theses content for the item route and items content for the theses route (depending on the time until the cache expires and which application server delivers the response)

To verify the hypothesis

  1. Setup a local dev environment with seeds
  2. Enable caching
diff --git a/config/environments/development.rb b/config/environments/development.rb
index b5f7947c..3bbaf5ab 100644
--- a/config/environments/development.rb
+++ b/config/environments/development.rb
@@ -30,9 +30,9 @@ Rails.application.configure do
       'Cache-Control' => "public, max-age=#{2.days.to_i}"
     }
   else
-    config.action_controller.perform_caching = false
+    config.action_controller.perform_caching = true
 
-    config.cache_store = :null_store
+    config.cache_store = :memory_store
   end
  1. Start the Rails Console bundle exec rails console -e dev --sandbox
  2. Test the Theses route app.get "http://era.lvh.me/sitemap-theses.xml"
Started GET "/sitemap-theses.xml" for 127.0.0.1 at 2024-02-29 21:45:37 +0000
...
Processing by SitemapController#theses as XML
  Parameters: {"subdomain"=>"era"}
...
  Thesis Count (45.5ms)  SELECT COUNT(*) FROM "theses" WHERE "theses"."visibility" = $1  [["visibility", "http://terms.library.ualberta.ca/public"]]
  Rendering sitemap/theses.xml.builder
  Thesis Load (625.1ms)  SELECT "theses".* FROM "theses" WHERE "theses"."visibility" = $1  [["visibility", "http://terms.library.ualberta.ca/public"]]
  Rendered sitemap/_object.xml.builder (Duration: 33817.6ms | Allocations: 29713703) [cache miss]
  Rendered sitemap/theses.xml.builder (Duration: 33829.4ms | Allocations: 29713952)
Completed 200 OK in 33880ms (Views: 33205.7ms | ActiveRecord: 671.2ms | Allocations: 29716336)
4.a. notice the [cache miss] on the sitemap/_object.xml.builder
  1. What does the cache key list look like? Rails.cache.instance_variable_get(:@data).keys
=> ["read_only_mode.first.enabled", "views/sitemap/_object:29b55f4c76b6c3fb21553304686aae88/sitemap"]
5.a. notice that nothing uniquely identifies that the theses route was used to build the cache key
  1. Testing the items route: app.get "http://era.lvh.me/sitemap-items.xml"
Started GET "/sitemap-items.xml" for 127.0.0.1 at 2024-02-29 21:46:33 +0000
...
Processing by SitemapController#items as XML
  Parameters: {"subdomain"=>"era"}
...
  Item Count (0.9ms)  SELECT COUNT(*) FROM "items" WHERE "items"."visibility" = $1  [["visibility", "http://terms.library.ualberta.ca/public"]]
  Rendering sitemap/items.xml.builder
  Rendered sitemap/_object.xml.builder (Duration: 14.1ms | Allocations: 117) [cache hit]
  Rendered sitemap/items.xml.builder (Duration: 102.8ms | Allocations: 267)
Completed 200 OK in 115ms (Views: 103.3ms | ActiveRecord: 5.2ms | Allocations: 3783)
6.a. notice the [cache hit] on the sitemap/_object.xml.builder -- this should be a cache miss as this is a new route but the previous _object.xml.builder cached content (theses content) is being used to populate items.
  1. Let's view the content of the cache again Rails.cache.instance_variable_get(:@data).keys
=> ["read_only_mode.first.enabled", "views/sitemap/_object:29b55f4c76b6c3fb21553304686aae88/sitemap"]
7.a. key list is unchanged
  1. To test contents: document = Nokogiri::XML(@response.body)
    8.a. theses in the sitemap when the last request was for items

To fix

  • when the theses and items view calls the partial, add an additional parameter to differentiate the cache key in views/sitemap/_object.xml.builder

References

jefferya added a commit that referenced this issue Mar 1, 2024
…3361

* adds a distinct cache key to the partial used by both the theses and items sitemap view
@jefferya jefferya self-assigned this Mar 1, 2024
@jefferya
Copy link
Contributor Author

jefferya commented Mar 4, 2024

After changes that cause sitemap/_object.xml.bulder to generate a unique cache key based on the view (theses or items) calling the partial (notice the unique cache key):

irb(main):004> Rails.cache.instance_variable_get(:@data).keys
=>
["views/sitemap/_object:29b55f4c76b6c3fb21553304686aae88/sitemap-theses",
 "read_only_mode.first.enabled",
 "views/sitemap/_object:29b55f4c76b6c3fb21553304686aae88/sitemap-items"]

The output looks correct:

$ wget http://era.lvh.me:3000/sitemap-items.xml
--2024-03-04 13:54:32--  http://era.lvh.me:3000/sitemap-items.xml
Resolving era.lvh.me (era.lvh.me)... 127.0.0.1
Connecting to era.lvh.me (era.lvh.me)|127.0.0.1|:3000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24668 (24K) [application/xml]
Saving to: 'sitemap-items.xml'

sitemap-items.xml         100%[==================================>]  24.09K  --.-KB/s    in 0s

2024-03-04 13:54:32 (533 MB/s) - 'sitemap-items.xml' saved [24668/24668]
$ wget http://era.lvh.me:3000/sitemap-theses.xml
--2024-03-04 13:54:37--  http://era.lvh.me:3000/sitemap-theses.xml
Resolving era.lvh.me (era.lvh.me)... 127.0.0.1
Connecting to era.lvh.me (era.lvh.me)|127.0.0.1|:3000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20647873 (20M) [application/xml]
Saving to: 'sitemap-theses.xml'

sitemap-theses.xml        100%[==================================>]  19.69M  --.-KB/s    in 0.03s

2024-03-04 13:55:07 (726 MB/s) - 'sitemap-theses.xml' saved [20647873/20647873]

jefferya added a commit that referenced this issue Mar 4, 2024
…3361

* adds a distinct cache key to the partial used by both the theses and items sitemap view
jefferya added a commit that referenced this issue Mar 4, 2024
…3361

* adds a distinct cache key to the partial used by both the theses and items sitemap view
murny added a commit that referenced this issue Mar 14, 2024
Prevent theses and items sitemap views from using the same cache key #3361
@pgwillia
Copy link
Member

In production

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants