Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HTTP tests more robust by adding retries to the tests #9652

Merged
merged 9 commits into from
Apr 9, 2024
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions distribution/lib/Standard/Test/0.0.0-dev/src/Test.enso
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@ import project.Suite.Suite
import project.Suite.Suite_Builder
import project.Test_Result.Test_Result


polyglot java import java.lang.Thread


## Contains only static methods
type Test
## Construct a Test Suite object
Expand Down Expand Up @@ -167,3 +171,39 @@ type Test
result = behavior
State.put Clue prev_clue
result

## A helper method that retries the action a few times if it panics.
It allows to make flaky tests more robust.

It waits for a short period of time between retries (in case the failures
are related e.g. to network conditions or propagation delays).

This function should be placed inside of the `specify` block. It can be
used to retry the whole test, or it can be applied to a specific block
inside of the test to only re-run that specific block (in such case, any
side-effects must be considered carefully).

The method returns the first successful value returned by the action,
or fails with the last error thrown by the action after exhausting retry
attempts.
with_retries : Any -> Any
with_retries ~action =
loc = Meta.get_source_location 1

# Delays are in seconds
sleep_time = 0.1
total_sleep_delay = if Environment.get "CI" . is_nothing . not then 10 else 2

max_iterations = total_sleep_delay / sleep_time
go i =
Panic.catch Any action caught_panic->
# If the iterations are exhausted, we rethrow the panic.
if i > max_iterations then Panic.throw caught_panic else
if i % 10 == 0 then
IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
Thread.sleep (1000*sleep_time . floor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The total time spent might be significantly longer if the action itself takes non-negligible time; it might be better to check the current time against (start_time + total_sleep_delay) rather than using a counter.

Copy link
Member Author

@radeusgd radeusgd Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, but I feel like the current behaviour is what we want.

If the action takes 3s to complete due to bad network conditions and it fails on a timeout, then with a retry delay of 2s - it will not retry at all... But the whole point of this is to do some retries. I think it's better to do the same number of retries regardless of how long the underlying action is taking.

The total_sleep_delay is just used to approximate the total wait time. But I guess I can rephrase this to just be max_retries counter and remove the total_sleep_delay altogether, if that will be clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a 2-linear backoff? E.g., first retry waits for 2 seconds, another for 4 seconds, another for 8 seconds, 16 secs, etc.... The way you coded it, it will wait for 100 seconds on the CI after every retry, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a 2-linear backoff? E.g., first retry waits for 2 seconds, another for 4 seconds, another for 8 seconds, 16 secs, etc.... The way you coded it, it will wait for 100 seconds on the CI after every retry, right?

It will wait for 100 milliseconds between every retry, not 100 seconds 😅

I feel like this is unnecessarily complicating stuff. I want the test to finish as soon as possible, so increasing the wait time does not seem to make that better. The strategy we have here was already successfully used for running cloud tests with propagation delays. I don't think there's value in complicating this strategy until we have a reason to do so. For now, I don't see any reasons - it works good enough and is simple.

## TODO This used to be
@Tail_Call go (i+1)
We should re-add the tail call once https://github.com/enso-org/enso/issues/9251 is fixed.
go (i+1)
go 1
9 changes: 4 additions & 5 deletions test/AWS_Tests/src/S3_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@ import enso_dev.Table_Tests.Util
from Standard.Test import all

import enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries


test_credentials -> AWS_Credential | Nothing =
Expand Down Expand Up @@ -159,7 +158,7 @@ add_specs suite_builder =
Panic.with_finalizer secret_key_id.delete <|
secret_key_value = Enso_Secret.create "my_test_secret-AWS-secretkey" test_credentials.secret_access_key
secret_key_value.should_succeed
Panic.with_finalizer secret_key_value.delete <| with_retries <|
Panic.with_finalizer secret_key_value.delete <| Test.with_retries <|
r2 = S3.list_buckets (AWS_Credential.Key secret_key_id secret_key_value)
r2.should_succeed
r2.should_be_a Vector
Expand Down Expand Up @@ -328,7 +327,7 @@ add_specs suite_builder =

new_file.delete . should_succeed

with_retries <|
Test.with_retries <|
new_file.exists . should_be_false
my_writable_dir.list . should_not_contain new_file

Expand All @@ -339,7 +338,7 @@ add_specs suite_builder =
new_file.read . should_equal "Hello"

"World".write new_file on_existing_file=Existing_File_Behavior.Overwrite . should_succeed
with_retries <|
Test.with_retries <|
new_file.read . should_equal "World"

group_builder.specify "should not be able to append to a file" <|
Expand Down Expand Up @@ -529,7 +528,7 @@ add_specs suite_builder =
Panic.with_finalizer secret_key_id.delete <|
secret_key_value = Enso_Secret.create "datalink-secret-AWS-secretkey" test_credentials.secret_access_key
secret_key_value.should_succeed
Panic.with_finalizer secret_key_value.delete <| with_retries <|
Panic.with_finalizer secret_key_value.delete <| Test.with_retries <|
transformed_data_link_file.read . should_equal "Hello WORLD!"

group_builder.specify "should be able to read a data link with a custom file format set" <| with_default_credentials <|
Expand Down
3 changes: 1 addition & 2 deletions test/Base_Tests/src/Data/XML/XML_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,7 @@ add_specs suite_builder =
doc = Data.read data.test_file
doc.root_element.name . should_equal "class"

group_builder.specify "Can read from an endpoint" <|
group_builder.specify "Can read from an endpoint" <| Test.with_retries <|
doc = Data.fetch "https://enso-data-samples.s3.us-west-1.amazonaws.com/sample.xml"
doc.root_element.name . should_equal "class"
doc.root_element.at 1 . name . should_equal "teacher"
Expand Down Expand Up @@ -295,4 +295,3 @@ main filter=Nothing =
suite = Test.build suite_builder->
add_specs suite_builder
suite.run_with_filter filter

Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@ from Standard.Test import all
import Standard.Test.Test_Environment

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environment <|
suite_builder.group "DataLinks in Enso Cloud" pending=setup.real_cloud_pending group_builder->
Expand Down
26 changes: 0 additions & 26 deletions test/Base_Tests/src/Network/Enso_Cloud/Cloud_Tests_Setup.enso
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@ import Standard.Base.Errors.Illegal_State.Illegal_State
import Standard.Test.Test_Environment


polyglot java import java.lang.Thread
polyglot java import java.security.KeyStore
polyglot java import javax.net.ssl.SSLContext
polyglot java import javax.net.ssl.TrustManagerFactory
Expand Down Expand Up @@ -158,28 +157,3 @@ type Mock_Credentials
about_to_expire self -> Mock_Credentials =
new_expire_at = Date_Time.now + (Duration.new minutes=1)
Mock_Credentials.Value self.access_token new_expire_at self.refresh_token self.refresh_url self.client_id

## PRIVATE
A helper method that retries the action a few times, to allow tests that may fail due to propagation delays to pass.
This is needed, because after creating a secret, there is a slight delay before it shows up within `list`.
To make tests robust, we add this retry logic.
with_retries ~action =
loc = Meta.get_source_location 1

# Delays are in seconds
sleep_time = 0.1
total_sleep_delay = if Environment.get "CI" . is_nothing . not then 10 else 2

max_iterations = total_sleep_delay / sleep_time
go i =
Panic.catch Any action caught_panic->
# If the iterations are exhausted, we rethrow the panic.
if i > max_iterations then Panic.throw caught_panic else
if i % 10 == 0 then
IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
Thread.sleep (1000*sleep_time . floor)
## TODO This used to be
@Tail_Call go (i+1)
We should re-add the tail call once https://github.com/enso-org/enso/issues/9251 is fixed.
go (i+1)
go 1
7 changes: 3 additions & 4 deletions test/Base_Tests/src/Network/Enso_Cloud/Enso_File_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ from Standard.Test import all
import Standard.Test.Test_Environment

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
from enso_dev.Base_Tests.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environment <|
suite_builder.group "Enso Cloud Files" pending=setup.real_cloud_pending group_builder->
Expand All @@ -30,15 +29,15 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
delete_on_fail caught_panic =
my_dir.delete
Panic.throw caught_panic
Panic.catch Any handler=delete_on_fail <| with_retries <|
Panic.catch Any handler=delete_on_fail <| Test.with_retries <|
my_dir.is_directory . should_be_true
my_dir.exists . should_be_true
my_dir.name . should_equal my_name
Enso_File.root.list . should_contain my_dir

my_dir.delete . should_succeed

with_retries <|
Test.with_retries <|
Enso_File.root.list . should_not_contain my_dir

# TODO the dir still shows as 'existing' after deletion, probably because it still is there in the Trash
Expand Down Expand Up @@ -98,7 +97,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen

dir1.delete . should_succeed

with_retries <|
Test.with_retries <|
dir1.exists . should_be_false
# The inner directory should also have been trashed if its parent is removed
dir2.exists . should_be_false
Expand Down
35 changes: 17 additions & 18 deletions test/Base_Tests/src/Network/Enso_Cloud/Secrets_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@ from Standard.Test.Execution_Context_Helpers import run_with_and_without_output

import project.Network.Enso_Cloud.Cloud_Tests_Setup.Cloud_Tests_Setup
import project.Network.Enso_Cloud.Cloud_Tests_Setup.Mock_Credentials
from project.Network.Enso_Cloud.Cloud_Tests_Setup import with_retries

polyglot java import org.enso.base.enso_cloud.EnsoSecretAccessDenied
polyglot java import org.enso.base.enso_cloud.ExternalLibrarySecretHelper
Expand All @@ -35,19 +34,19 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
my_secret.name . should_equal "my_test_secret"
my_secret.id.is_empty . should_be_false

delete_on_fail my_secret <| with_retries <|
delete_on_fail my_secret <| Test.with_retries <|
Enso_Secret.list . should_contain my_secret

my_secret.delete . should_succeed

with_retries <|
Test.with_retries <|
Enso_Secret.list . should_not_contain my_secret

group_builder.specify "should allow to get a secret by name or path" <|
created_secret = Enso_Secret.create "my_test_secret-2" "my_secret_value"
created_secret.should_succeed
Panic.with_finalizer created_secret.delete <|
with_retries <|
Test.with_retries <|
fetched_secret = Enso_Secret.get "my_test_secret-2"
fetched_secret . should_equal created_secret

Expand All @@ -62,7 +61,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
created_secret.should_succeed
wait_until_secret_is_propagated created_secret
Panic.with_finalizer created_secret.delete <|
with_retries <|
Test.with_retries <|
r1 = Enso_Secret.create "my_test_secret-3" "my_secret_value"

## If the secret was created due to race condition - we clean it up
Expand All @@ -76,7 +75,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
secret1 = Enso_Secret.create "my_test_secret-6" "Yet another Mystery"
secret1.should_succeed

Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Secret" secret1])
response.decode_as_json.at "headers" . at "X-My-Secret" . should_equal "Yet another Mystery"
Expand All @@ -85,7 +84,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
secret_token = Enso_Secret.create "my_test_secret-7" "MySecretToken"
secret_token.should_succeed

Panic.with_finalizer secret_token.delete <| with_retries <|
Panic.with_finalizer secret_token.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.authorization_bearer secret_token])
response_json = response.decode_as_json
Expand All @@ -97,7 +96,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
Panic.with_finalizer secret_username.delete <|
secret_password = Enso_Secret.create "my_test_secret-9" "MyP@ssword"
secret_password.should_succeed
Panic.with_finalizer secret_password.delete <| with_retries <|
Panic.with_finalizer secret_password.delete <| Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.authorization_basic secret_username secret_password])

Expand All @@ -108,7 +107,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "should allow to derive values from secrets" <|
secret1 = Enso_Secret.create "my_test_secret-10" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
x = Derived_Secret_Value.from "X"
y = Derived_Secret_Value.from "Y"
v1 = x + y
Expand All @@ -134,7 +133,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "does not allow secrets in HTTP headers" pending=setup.httpbin_pending <|
secret1 = Enso_Secret.create "my_test_secret-11" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
uri = setup.httpbin_uri / "get"
r1 = uri.fetch headers=[Header.new "X-My-Secret" secret1]
r1.should_fail_with Illegal_Argument
Expand All @@ -143,7 +142,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
group_builder.specify "API exposing secrets to external libraries should not be accessible from unauthorized code" <|
secret1 = Enso_Secret.create "my_test_secret-12" "Something"
secret1.should_succeed
Panic.with_finalizer secret1.delete <| with_retries <|
Panic.with_finalizer secret1.delete <| Test.with_retries <|
java_repr = as_hideable_value secret1
Test.expect_panic EnsoSecretAccessDenied <|
ExternalLibrarySecretHelper.resolveValue java_repr
Expand All @@ -156,7 +155,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret.should_succeed

delete_on_fail nested_secret <|
with_retries <|
Test.with_retries <|
Enso_Secret.list parent=subdirectory . should_contain nested_secret
Enso_Secret.exists "my-nested-secret-1" parent=subdirectory . should_be_true
Enso_Secret.get "my-nested-secret-1" parent=subdirectory . should_equal nested_secret
Expand All @@ -169,7 +168,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret.delete . should_succeed

# Secret should disappear from the list after deletion:
with_retries <|
Test.with_retries <|
Enso_Secret.list parent=subdirectory . should_not_contain nested_secret
Enso_Secret.exists "my-nested-secret-1" parent=subdirectory . should_be_false
Enso_Secret.get "my-nested-secret-1" parent=subdirectory . should_fail_with Not_Found
Expand All @@ -181,7 +180,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret = Enso_Secret.create "my-nested-secret-2" "NESTED_secret_value" parent=subdirectory
nested_secret.should_succeed
Panic.with_finalizer nested_secret.delete <|
with_retries <|
Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Nested-Secret" nested_secret])
response.decode_as_json.at "headers" . at "X-My-Nested-Secret" . should_equal "NESTED_secret_value"
Expand All @@ -193,15 +192,15 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
nested_secret = Enso_Secret.create "my-nested-secret-3" "Value-A" parent=subdirectory
nested_secret.should_succeed
Panic.with_finalizer nested_secret.delete <|
with_retries <|
Test.with_retries <|
https = setup.httpbin_secure_client
response = https.request (Request.get (setup.httpbin_secure_uri / "get") headers=[Header.new "X-My-Nested-Secret" nested_secret])
response.decode_as_json.at "headers" . at "X-My-Nested-Secret" . should_equal "Value-A"

nested_secret.update_value "Value-New-B" . should_succeed

# Not exactly sure if retries are needed here, but for test stability preferred to keep them.
with_retries <|
Test.with_retries <|
# Flushing caches to avoid the old value getting stuck after the first retry fails due to lack of propagation yet.
Enso_User.flush_caches
https = setup.httpbin_secure_client
Expand All @@ -221,7 +220,7 @@ add_specs suite_builder setup:Cloud_Tests_Setup = setup.with_prepared_environmen
Enso_Secret.create ("foo"+Random.uuid) "baz" . should_fail_with Forbidden_Operation

# Get should still work
with_retries <| Enso_Secret.get "my_test_secret-13" . should_equal secret1
Test.with_retries <| Enso_Secret.get "my_test_secret-13" . should_equal secret1

group_builder.specify "should be able to retry fetching a secret if the token is expired" pending=setup.httpbin_pending <|
mock_setup = Cloud_Tests_Setup.prepare_mock_setup
Expand Down Expand Up @@ -250,7 +249,7 @@ main filter=Nothing =


wait_until_secret_is_propagated secret =
with_retries <| Enso_Secret.list . should_contain secret
Test.with_retries <| Enso_Secret.list . should_contain secret

delete_on_fail resource ~action =
on_failure caught_panic =
Expand Down
6 changes: 3 additions & 3 deletions test/Base_Tests/src/Network/Http/Http_Auto_Parse_Spec.enso
Original file line number Diff line number Diff line change
Expand Up @@ -27,11 +27,11 @@ add_specs suite_builder =
. add_query_argument "Content-Type" "text/plain; charset=windows-1250"
. add_query_argument "base64_response_data" (Base_64.encode_text content_windows_1250 Encoding.windows_1250)

group_builder.specify "should detect the encoding from Content-Type in fetch" <|
group_builder.specify "should detect the encoding from Content-Type in fetch" <| Test.with_retries <|
url_utf8.fetch . should_equal content_utf
url_windows_1250.fetch . should_equal content_windows_1250

group_builder.specify "should detect the encoding from Content-Type in decode_as_text" <|
group_builder.specify "should detect the encoding from Content-Type in decode_as_text" <| Test.with_retries <|
r1 = url_utf8.fetch format=Raw_Response
r1.decode_as_text . should_equal content_utf

Expand All @@ -42,7 +42,7 @@ add_specs suite_builder =
# We may override the encoding detected from Content-Type:
r3.decode_as_text Encoding.ascii . should_fail_with Encoding_Error

group_builder.specify "should detect the encoding from Content-Type in decode_as_json" <|
group_builder.specify "should detect the encoding from Content-Type in decode_as_json" <| Test.with_retries <|
r1 = url_utf8.fetch format=Raw_Response
r1.decode_as_json . should_equal ["x", "Hello! 😊👍 ąę"]

Expand Down
Loading
Loading