Skip to content

Commit

Permalink
Make HTTP tests more robust by adding retries to the tests (#9652)
Browse files Browse the repository at this point in the history
- As asked for by @hubertp who was encountering flaky test failures on CI in the Http_Spec and related ones, I'm adding retry logic to make such cases much less likely.
- I've made the test server randomly fail 50% of tests and with the retry logic the tests are still passing, so I think that should be much more robust, in practice the failure rate is much much less (I imagine <1% as most of the time these tests were working and we do a ton of requests in a single CI run).
- I move the `with_retries` method to now be `Test.with_retries` which can be used anywhere in our tests for the retry logic.
- It sleeps for 0.1s between retries. Not all kinds of tests need it, this was mostly for propagation delays in the Cloud in our tests. I was thinking if the delay should be configurable, but I think the 0.1s delay is not problematic and if our tests are sometimes failing due to high machine load, the delay could also help.
- This _does not_ add retry logic to raw HTTP operations or `Data.fetch`. We may add that later, but that needs some further design. In such case we may remove some retries from tests if they become unnecessary.
  • Loading branch information
radeusgd authored Apr 9, 2024
1 parent 2c78f4e commit 354ee94
Show file tree
Hide file tree
Showing 14 changed files with 132 additions and 123 deletions.
42 changes: 42 additions & 0 deletions distribution/lib/Standard/Test/0.0.0-dev/src/Test.enso
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@ import project.Suite.Suite
import project.Suite.Suite_Builder
import project.Test_Result.Test_Result


polyglot java import java.lang.Thread


## Contains only static methods
type Test
## Construct a Test Suite object
Expand Down Expand Up @@ -167,3 +171,41 @@ type Test
result = behavior
State.put Clue prev_clue
result

## A helper method that retries the action a few times if it panics.
It allows to make flaky tests more robust.

It waits for a short period of time between retries (in case the failures
are related e.g. to network conditions or propagation delays).

This function should be placed inside of the `specify` block. It can be
used to retry the whole test, or it can be applied to a specific block
inside of the test to only re-run that specific block (in such case, any
side-effects must be considered carefully).

The method returns the first successful value returned by the action,
or fails with the last error thrown by the action after exhausting retry
attempts.
with_retries : Any -> Any
with_retries ~action =
loc = Meta.get_source_location 1

milliseconds_between_attempts = 100
## We give the CI a bit more attempts, as failures are more annoying there.
For local development, a bit less retries is enough - there it is more
likely that a failure is not flaky but it is an actual error so we
don't want to spend too much time retrying when debugging the tests locally.
max_retries = if Environment.get "CI" . is_nothing . not then 100 else 20

go i =
Panic.catch Any action caught_panic->
# If the iterations are exhausted, we rethrow the panic.
if i > max_retries then Panic.throw caught_panic else
if i % 10 == 0 then
IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
Thread.sleep milliseconds_between_attempts
## TODO This used to be
@Tail_Call go (i+1)
We should re-add the tail call once https://github.com/enso-org/enso/issues/9251 is fixed.
go (i+1)
go 1
9 changes: 4 additions & 5 deletions test/AWS_Tests/src/S3_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@ import enso_dev.Table_Tests.Util
from Standard.Test import all

import enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries


test_credentials -> AWS_Credential | Nothing =
Expand Down Expand Up @@ -159,7 +158,7 @@ add_specs suite_builder =
Panic.with_finalizer secret_key_id.delete <|
secret_key_value = Enso_Secret.create "my_test_secret-AWS-secretkey" test_credentials.secret_access_key
secret_key_value.should_succeed
Panic.with_finalizer secret_key_value.delete <| with_retries <|
Panic.with_finalizer secret_key_value.delete <| Test.with_retries <|
r2 = S3.list_buckets (AWS_Credential.Key secret_key_id secret_key_value)
r2.should_succeed
r2.should_be_a Vector
Expand Down Expand Up @@ -328,7 +327,7 @@ add_specs suite_builder =

new_file.delete . should_succeed

with_retries <|
Test.with_retries <|
new_file.exists . should_be_false
my_writable_dir.list . should_not_contain new_file

Expand All @@ -339,7 +338,7 @@ add_specs suite_builder =
new_file.read . should_equal "Hello"

"World".write new_file on_existing_file=Existing_File_Behavior.Overwrite . should_succeed
with_retries <|
Test.with_retries <|
new_file.read . should_equal "World"

group_builder.specify "should not be able to append to a file" <|
Expand Down Expand Up @@ -529,7 +528,7 @@ add_specs suite_builder =
Panic.with_finalizer secret_key_id.delete <|
secret_key_value = Enso_Secret.create "datalink-secret-AWS-secretkey" test_credentials.secret_access_key
secret_key_value.should_succeed
Panic.with_finalizer secret_key_value.delete <| with_retries <|
Panic.with_finalizer secret_key_value.delete <| Test.with_retries <|
transformed_data_link_file.read . should_equal "Hello WORLD!"

group_builder.specify "should be able to read a data link with a custom file format set" <| with_default_credentials <|
Expand Down
3 changes: 1 addition & 2 deletions test/Base_Tests/src/Data/XML/XML_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,7 @@ add_specs suite_builder =
doc = Data.read data.test_file
doc.root_element.name . should_equal "class"

group_builder.specify "Can read from an endpoint" <|
group_builder.specify "Can read from an endpoint" <| Test.with_retries <|
doc = Data.fetch "https://enso-data-samples.s3.us-west-1.amazonaws.com/sample.xml"
doc.root_element.name . should_equal "class"
doc.root_element.at 1 . name . should_equal "teacher"
Expand Down Expand Up @@ -295,4 +295,3 @@ main filter=Nothing =
suite = Test.build suite_builder->
add_specs suite_builder
suite.run_with_filter filter

Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@ from Standard.Test import all
import Standard.Test.Test_Environment

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environment <|
suite_builder.group "DataLinks in Enso Cloud" pending=setup.real_cloud_pending group_builder->
Expand Down
26 changes: 0 additions & 26 deletions test/Base_Tests/src/Network/Enso_Cloud/Cloud_Tests_Setup.enso
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@ import Standard.Base.Errors.Illegal_State.Illegal_State
import Standard.Test.Test_Environment


polyglot java import java.lang.Thread
polyglot java import java.security.KeyStore
polyglot java import javax.net.ssl.SSLContext
polyglot java import javax.net.ssl.TrustManagerFactory
Expand Down Expand Up @@ -158,28 +157,3 @@ type Mock_Credentials
about_to_expire self -> Mock_Credentials =
new_expire_at = Date_Time.now + (Duration.new minutes=1)
Mock_Credentials.Value self.access_token new_expire_at self.refresh_token self.refresh_url self.client_id

## PRIVATE
A helper method that retries the action a few times, to allow tests that may fail due to propagation delays to pass.
This is needed, because after creating a secret, there is a slight delay before it shows up within `list`.
To make tests robust, we add this retry logic.
with_retries ~action =
loc = Meta.get_source_location 1

# Delays are in seconds
sleep_time = 0.1
total_sleep_delay = if Environment.get "CI" . is_nothing . not then 10 else 2

max_iterations = total_sleep_delay / sleep_time
go i =
Panic.catch Any action caught_panic->
# If the iterations are exhausted, we rethrow the panic.
if i > max_iterations then Panic.throw caught_panic else
if i % 10 == 0 then
IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
Thread.sleep (1000*sleep_time . floor)
## TODO This used to be
@Tail_Call go (i+1)
We should re-add the tail call once https://github.com/enso-org/enso/issues/9251 is fixed.
go (i+1)
go 1
7 changes: 3 additions & 4 deletions test/Base_Tests/src/Network/Enso_Cloud/Enso_File_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ from Standard.Test import all
import Standard.Test.Test_Environment

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environment <|
suite_builder.group "Enso Cloud Files" pending=setup.real_cloud_pending group_builder->
Expand All @@ -30,15 +29,15 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
delete_on_fail caught_panic =
my_dir.delete
Panic.throw caught_panic
Panic.catch Any handler=delete_on_fail <| with_retries <|
Panic.catch Any handler=delete_on_fail <| Test.with_retries <|
my_dir.is_directory . should_be_true
my_dir.exists . should_be_true
my_dir.name . should_equal my_name
Enso_File.root.list . should_contain my_dir

my_dir.delete . should_succeed

with_retries <|
Test.with_retries <|
Enso_File.root.list . should_not_contain my_dir

# TODO the dir still shows as 'existing' after deletion, probably because it still is there in the Trash
Expand Down Expand Up @@ -98,7 +97,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen

dir1.delete . should_succeed

with_retries <|
Test.with_retries <|
dir1.exists . should_be_false
# The inner directory should also have been trashed if its parent is removed
dir2.exists . should_be_false
Expand Down
35 changes: 17 additions & 18 deletions test/Base_Tests/src/Network/Enso_Cloud/Secrets_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@ from Standard.Test.Execution_Context_Helpers import run_with_and_without_output

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
import project.Network.Enso_Cloud.Cloud_Tests_Setup.Mock_Credentials
from project.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

polyglot java import org.enso.base.enso_cloud.EnsoSecretAccessDenied
polyglot java import org.enso.base.enso_cloud.ExternalLibrarySecretHelper
Expand All @@ -35,19 +34,19 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
my_secret.name . should_equal "my_test_secret"
my_secret.id.is_empty . should_be_false

delete_on_fail my_secret <| with_retries <|
delete_on_fail my_secret <| Test.with_retries <|
Enso_Secret.list . should_contain my_secret

my_secret.delete . should_succeed

with_retries <|
Test.with_retries <|
Enso_Secret.list . should_not_contain my_secret

group_builder.specify "should allow to get a secret by name or path" <|
created_secret = Enso_Secret.create "my_test_secret-2" "my_secret_value"
created_secret.should_succeed
Panic.with_finalizer created_secret.delete <|
with_retries <|
Test.with_retries <|
fetched_secret = Enso_Secret.get "my_test_secret-2"
fetched_secret . should_equal created_secret

Expand All @@ -62,7 +61,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
created_secret.should_succeed
wait_until_secret_is_propagated created_secret
Panic.with_finalizer created_secret.delete <|
with_retries <|
Test.with_retries <|
r1 = Enso_Secret.create "my_test_secret-3" "my_secret_value"

## If the secret was created due to race condition - we clean it up
Expand All @@ -76,7 +75,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
secret1 = Enso_Secret.create "my_test_secret-6" "Yet another Mystery"
secret1.should_succeed

Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Secret" secret1])
response.decode_as_json.at "headers" . at "X-My-Secret" . should_equal "Yet another Mystery"
Expand All @@ -85,7 +84,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
secret_token = Enso_Secret.create "my_test_secret-7" "MySecretToken"
secret_token.should_succeed

Panic.with_finalizer secret_token.delete <| with_retries <|
Panic.with_finalizer secret_token.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.authorization_bearer secret_token])
response_json = response.decode_as_json
Expand All @@ -97,7 +96,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
Panic.with_finalizer secret_username.delete <|
secret_password = Enso_Secret.create "my_test_secret-9" "MyP@ssword"
secret_password.should_succeed
Panic.with_finalizer secret_password.delete <| with_retries <|
Panic.with_finalizer secret_password.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.authorization_basic secret_username secret_password])

Expand All @@ -108,7 +107,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "should allow to derive values from secrets" <|
secret1 = Enso_Secret.create "my_test_secret-10" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
x = Derived_Secret_Value.from "X"
y = Derived_Secret_Value.from "Y"
v1 = x + y
Expand All @@ -134,7 +133,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "does not allow secrets in HTTP headers" pending=setup.httpbin_pending <|
secret1 = Enso_Secret.create "my_test_secret-11" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
uri = setup.httpbin_uri / "get"
r1 = uri.fetch headers=[Header.new "X-My-Secret" secret1]
r1.should_fail_with Illegal_Argument
Expand All @@ -143,7 +142,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "API exposing secrets to external libraries should not be accessible from unauthorized code" <|
secret1 = Enso_Secret.create "my_test_secret-12" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
java_repr = as_hideable_value secret1
Test.expect_panic EnsoSecretAccessDenied <|
ExternalLibrarySecretHelper.resolveValue java_repr
Expand All @@ -156,7 +155,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret.should_succeed

delete_on_fail nested_secret <|
with_retries <|
Test.with_retries <|
Enso_Secret.list parent=subdirectory . should_contain nested_secret
Enso_Secret.exists "my-nested-secret-1" parent=subdirectory . should_be_true
Enso_Secret.get "my-nested-secret-1" parent=subdirectory . should_equal nested_secret
Expand All @@ -169,7 +168,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret.delete . should_succeed

# Secret should disappear from the list after deletion:
with_retries <|
Test.with_retries <|
Enso_Secret.list parent=subdirectory . should_not_contain nested_secret
Enso_Secret.exists "my-nested-secret-1" parent=subdirectory . should_be_false
Enso_Secret.get "my-nested-secret-1" parent=subdirectory . should_fail_with Not_Found
Expand All @@ -181,7 +180,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret = Enso_Secret.create "my-nested-secret-2" "NESTED_secret_value" parent=subdirectory
nested_secret.should_succeed
Panic.with_finalizer nested_secret.delete <|
with_retries <|
Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Nested-Secret" nested_secret])
response.decode_as_json.at "headers" . at "X-My-Nested-Secret" . should_equal "NESTED_secret_value"
Expand All @@ -193,15 +192,15 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret = Enso_Secret.create "my-nested-secret-3" "Value-A" parent=subdirectory
nested_secret.should_succeed
Panic.with_finalizer nested_secret.delete <|
with_retries <|
Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Nested-Secret" nested_secret])
response.decode_as_json.at "headers" . at "X-My-Nested-Secret" . should_equal "Value-A"

nested_secret.update_value "Value-New-B" . should_succeed

# Not exactly sure if retries are needed here, but for test stability preferred to keep them.
with_retries <|
Test.with_retries <|
# Flushing caches to avoid the old value getting stuck after the first retry fails due to lack of propagation yet.
Enso_User.flush_caches
https = setup.httpbin_secure_client
Expand All @@ -221,7 +220,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
Enso_Secret.create ("foo"+Random.uuid) "baz" . should_fail_with Forbidden_Operation

# Get should still work
with_retries <| Enso_Secret.get "my_test_secret-13" . should_equal secret1
Test.with_retries <| Enso_Secret.get "my_test_secret-13" . should_equal secret1

group_builder.specify "should be able to retry fetching a secret if the token is expired" pending=setup.httpbin_pending <|
mock_setup = Cloud_Tests_Setup.prepare_mock_setup
Expand Down Expand Up @@ -250,7 +249,7 @@ main filter=Nothing =


wait_until_secret_is_propagated secret =
with_retries <| Enso_Secret.list . should_contain secret
Test.with_retries <| Enso_Secret.list . should_contain secret

delete_on_fail resource ~action =
on_failure caught_panic =
Expand Down
Loading

0 comments on commit 354ee94

Please sign in to comment.