Skip to content
This repository has been archived by the owner on May 16, 2023. It is now read-only.

Submission Svc: Timeout handling #602

Merged
merged 4 commits into from
Jul 1, 2020
Merged

Conversation

stevesap
Copy link
Contributor

@stevesap stevesap commented Jun 18, 2020

Checklist

  • Test your changes as thoroughly as possible before you commit them. Preferably, automate your test by unit/integration tests.
  • Make sure mvn install runs for the whole project and, if you touched any code in the respective service, submission and distribution service can be run with spring-boot:run

Description

This PR is a proposal taking into account the analyses in #423

Introduces two submission service related timeouts:

FeignClient timeouts

feign:
  client:
    config:
      default:
        connect-timeout: 5000
        read-timeout: 5000
  • These timeouts are required in order to prevent blocked submission service worker threads in the case where the cwa-verification-server does not respond in timely fashion, likely due to a flood of invalid TAN requests.
  • The above values are similar to those in cwa-verification-portal and considerably less than the 60 second default value.
  • A FeignClient timeout results in Feign.RetyrableExeption which gets handed over to DeferredResult and then immediately interrupts the synchronous endpoint execution, resulting in a HTTP 500 Internal Server Error.

Transaction timeout

spring:
  transaction:
    default-timeout: 20
  • This timeout is required in order to prevent blocked submission service worker threads in the case where the database does not respond in a timely fashion. Note that this is not as likely to happen as much as a cwa-verification-server slow response, however if the database does not respond for whatever reason, the thread will continue waiting for it indefinitely (This was tested this introducing a large delay into the SQL INSERT INTO statement)
  • The above value (in seconds) was chosen as the maximum amount of time that the diagnosis key insertion transaction should reasonably be concluded by.
  • A transaction timeout results in org.springframework.transaction.TransactionTimedOutException which gets handed over to DeferredResult and then immediately interrupts the synchronous endpoint execution, resulting in a HTTP 500 Internal Server Error.

Note: There is no need to set a timeout on DeferredResult or the submission service endpoint itself since due to the submission endpoint currently being implemented synchronously, a timeout on DeferredResult has no effect. As noted, the above 2 timeouts will be handed over to DeferredResult which would interrupt the synchronous endpoint execution immediately.

@RomanNess
Copy link
Contributor

I have two remarks that might not carry weight in the specific case.

  1. spring.transaction.default-timeout is a global setting that will apply a timeout to all functions annotated with @Transactional. It will also trigger for service functions annotated with '@Transactional` that do not call the DB at all.

    A local timeout on function level can be configured with @Transactional(timeout = 20, propagation = Propagation.REQUIRES_NEW) where Propagation.REQUIRES_NEW makes sure that a new transaction with the timeout is used even if a transaction without timeout is already active.

  2. To me it seems a bit odd that HTTP 500 is returned for both timeouts in FeignClient and timeouts against the DB. In order to distinguish these timeouts that indicate load issues in other components from actual server errors in cwa-server, I would have used something like HTTP 504 Gateway Timeout.

Once again, I don't think that this will be an issue for cwa-server in production. These are just two thoughts that came into my mind, reading this pull request.

@stevesap
Copy link
Contributor Author

  1. This is a good point. However, the function in question is the only one in the submission service which uses @Transactional annotation. The main reason I went with the global application configuration setting here is that @Transactional's timeout parameter only accepts constants and I would prefer to use an application configuration property here in case this value needs to be adjusted quickly in production.
    By any chance that additional functions require @Transactional annotation down the road, we can
    look at the somewhat more invasive change of using TransactionTemplate to apply a different timeout to each, using application configuration properties.

  2. I agree that HTTP 504 is more semantically correct here, however I'm not sure what additional value that would bring. If the mobile clients used this information somehow, that would make more sense. It would indeed expose externally that something is wrong with our verification-server or database, but this is something our Ops team would already be able to identify internally since these very specific exceptions (Feign.RetyrableExeption and org.springframework.transaction.TransactionTimedOutException) are getting logged.

@RomanNess
Copy link
Contributor

  1. The flexibility to change the timeout parameter without recompilation is a good point.

  2. The additional value comes when the mobile client reacts to different 5xx status codes. For example the client could display Please try again later for 504 (since Gateway timeouts are typically caused by too much load on the system) and something like Unexpected error for 500.

    Also, it might be easier to group exceptions by status code if you don't have a dedicated Ops team and you have to do the Ops for your own code. 🙂

@stevesap stevesap marked this pull request as ready for review June 19, 2020 12:55
@stevesap stevesap linked an issue Jun 29, 2020 that may be closed by this pull request
@sonarqubecloud
Copy link

sonarqubecloud bot commented Jul 1, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@stevesap stevesap merged commit 2e2655a into master Jul 1, 2020
@stevesap stevesap deleted the optimize/subsvc-timeouts branch July 1, 2020 08:05
christian-kirschnick added a commit that referenced this pull request Jul 10, 2020
* Introduce a new 1.1.0-SNAPSHOT (#465)

* Fix postgres endpoint in README (#467)

* Documentation: Fix broken link in architecture-overview.md (#469)

* Adherence to micrometer naming conventions (#468)

Co-authored-by: Steve BE <[email protected]>

* alternate postgres address (from container) (#479)

* Mitigate CVE-2020-13692 (#477)

* Add profile documentation to readme (#475)

* Add hostname verification (#473)

* Optimize Docker build with mvn go-offline (#470)

* Fix CVE-2020-13692 in services too (#484)

* S3 compatible storage multithreading (#466)

* Number of bytes should not depend on Locale (#478)

Charset.defaultCharset() depends on the locale
of the JVM and OS. Tests would fail e.g. with
UTF-16 (2 byte per character) encoding. Use US-ASCII
instead to make it obvious we use a 1 byte per
character encoding for test data.

* Add CSP header (#492)

* add file license header checklist item (#491)

* correct dev profile argument (#490)

* Fix grep command (#481)

Without "--", grep interprets search string "-t" as flag

* fix sonarlint-warning in tests (#471)

- replace size-comparison with "hasSameSizeAs"
- remove redundant "isNotNull()"

* Consistently use UTC for LocalDateTime (#488)

* docs for running maven build with services from docker-compose (#494)

* Updating readme (#516)

* Update dependencies (#487)

* remove use of ForkJoinPool in SubmissionController (#399) (#518)

* Fix http client connection pool size (#519)

* Update android app versions (#520)

* Add "ssl-client-verification-verify-hostname" profile (#521)

* Revert "Update android app versions (#520)" (#524)

* Add TLS Ciphers compliant to BSI-TR-02102-2 (#525)

* Restrict TLS Version

* Specify explicit ciphers

* Add ciphers

Co-authored-by: Pit Humke <[email protected]>

* Add stubs for planned documentation (#528)

* Update android app versions (#529)

* remove unnecessary import (#532)

* Restict allowed actuator routes (#540)

* Renaming singing decorator (#539)

Co-authored-by: Steve BE <[email protected]>

* Set test root log level to 'off' for distribution (#543)

* Add tests to ensure requests are monitored (#544)

Co-authored-by: Christian Kirschnick <[email protected]>

* Fix batch counter concurrency issue (#551)

* Enabling roles also for local development & docker setup (#552)

* rename app.coronawarn.server.services.distribution.assembly.structure.directory.decorators package in test to 'decorator' to match main package (#555)

* Change SSL-related profiles to be secure by default (#541)

* Fix connection pool size configuration (#561)

* Updated timestamp format (#566)

* Fix SSL defaults (#564)

* moved details to description property (#558)

* Update log level for invalid TAN syntax check (#553)

* Streamline log level for submissionservice (#550)

* Fix fake delay updates (#570)

* final adjustments for params (#576)

* adjusting log statement (#583)

* adjusting log statement

* code review

* S3 Continuation Token (#572)

* Object Store / S3 Publisher Test (#585)

* add test for upload error

* add ObjectStoreAccess Tests

* Add validator for submissionTimestamp (closes #557) (#573)

* Add validator for submissionTimestamp
* [refactor] extract and reuse buildDiagnosisKey helper methods

* Add additional unit tests (#588)

* Doc: TLS (#571)

* Disable UserDetailsServiceAutoConfiguration (#594)

* nit: corrected typo (#592)

* Declutter SubmissionController (#554)

* DOC: Distribution Service/Spring Profiles (#559)

* DOC: Submission Service\Spring Profiles & Impact (#534)

* use version path parts from application configuration (#596)

* Fix/appconfig validation (#598)

- Fixes bug where app configurations could be written even if validation failed at runtime.
- Adds unit test for AppConfigurationDirectory
- Corrects JavaDoc in AppConfigurationDirectory
- Extracts test helper method
- Refactors AppConfigurationProvider into Configuration that provides ApplicationConfiguration bean

* Use non-root user inside the containers (#597)

* Add missing blanks in warn messages (#599)

* Create runner tests (#604)

* Add unit test for ObjectStoreAccess construction bucketExists check

* wip

* cleanup

* Clean up new lines

* Remove unnecessary public modifiers

Co-authored-by: Michael Burwig <[email protected]>

* replace all CR, LF or CRLF (#591)

* Specify TLS named groups (#603)

* Clearing up possible misunderstanding of full architecture vs cwa server (#605)

* Clearing up possible misunderstanding of full architecture vs cwa server (#605)

* Add random key padding (#609)

* Minor code prettification (#612)

* Minor code quality improvements (#575)

* Decrease max random key multiplier to 25 (#615)

* Add manual Log4J shutdown hook (close #589) (#590)

* Object Store Docs (#560)

* Added distribution

* remove links

* Add assembly

* formatting + spelling

* Cleanup files

* Readd DISTRIBUTION.md

* Update docs/DISTRIBUTION.md

Co-authored-by: Pit Humke <[email protected]>

* Update docs/DISTRIBUTION.md

* pr remarks

* add some information for retention

* linting

* Add info about file naming

* Add minor tweaks

Co-authored-by: MKusber <[email protected]>
Co-authored-by: Christian Kirschnick <[email protected]>
Co-authored-by: Pit Humke <[email protected]>
Co-authored-by: Michael Burwig <[email protected]>

* Test for VerificationServiceHealthIndicator, remove null check (#619)

* Fix sonar issues (#623)

* Add unit tests for export.bin.checksum (#621)

* Add unit tests for export.bin.checksum

* add license file header

* fix code smell

* Moved non-static import, use of helper for DiagnosisKey

* Make use of AssertJ dedicated methods

* Remove unused import

Co-authored-by: Kevin Pontes <[email protected]>

* Fix/mime type (#627)

* depending on the s3Key we identify the mime type

608: Index Files: Incorrect mime type  - #608

* get the content type from the localfile and add to headers map

608: Index Files: Incorrect mime type  - #608

* forward the added CONTENT_TYPE header to the s3Client (requestBuilder)

608: Index Files: Incorrect mime type  - #608

* removed keyword: 'final'

608: Index Files: Incorrect mime type

Task-Url: #608

* removed dependency on awssdk

608: Index Files: Incorrect mime type

Task-Url: #608

* Use native query for retention policy (#628)

* #556: Performance Improvement: Disable spring.jpa.open-in-view in (#632)

submission service

Task-Url: #556

* Config/adjust min android version (#633)

* Update app-version-config.yaml

Adjust min android version

* Update app-version-config.yaml

* OSP is OpenStack Platform, OpenShift is OCP (#625)

* Mitigate CVE-2020-11996 (#635)

* Integration Testing on CircleCI (#580)

* failsafe

* add failsafe int test to circle ci

* Docker object store on circle ci

Co-authored-by: Christian Kirschnick <[email protected]>

* Enable Docker Layer Caching on CircleCi (#639)

* Submission Svc: Timeout handling  (#602)

* introduce spring transaction default timeout
* introduce feign client timeouts
* add timeout test

* Simplifying DateAggregatingDecorator (#629)

* Create unit test for DiagnosisKeysHourDirectory.

* Add getDistributionTime function to DiagnosisKeyBundler.

* Add tests to DiagnosisKeysDateDirectory and improve HourDirectoryTests.

* Remove DateAggregatingDecorator.

* Add license header.

* Remove unnecessary public visibility for DiagnosisKeysHourDirectoryTest.

* Use native query for retention policy (#628)

* Code formatting.

* Split parameterized test for DiagnosisKeysDateDirectory into separate tests.

* Split parameterized test for DiagnosisKeysHourDirectory into separate tests.

* Remove null value return when the associated date is current distribution date.

* Remove null value return when the associated date is current distribution date.

* Remove TestTuple class.

* Made NilDirectory public and test coverage.

* Made NilDirectory public and test coverage.

* Made NilDirectory public and test coverage.

* Use optional instead of NilDirectory.

* Add minor formatting tweaks

Co-authored-by: KevponSAP <[email protected]>
Co-authored-by: Pit Humke <[email protected]>
Co-authored-by: Steve BE <[email protected]>
Co-authored-by: Michael Burwig <[email protected]>

* Revert "Consistently use UTC for LocalDateTime (#488)" (#641)

This reverts commit ec099f3.

* switch to spring-2.3.1 (#574)

* switch to spring-2.3.1
* remove CVE-2020-11996 mitigation 
is now resolved by upgrade to spring-2.3.1

* Use maven wrapper as the single source of truth for maven version in the project (#194)

* use maven wrapper as a single source of truth for maven version
* Use ./mvnw instead of mvn
* -set JAVA_HOME variable to jdk 11
* - set export JAVA_HOME without sudo
* - set in the SonarCloud analysis the JAVA_HOME
* - add line for consistency
* - change to use mvnw in docker-compose-it file
* force mvnw script and properties file to use LF eol for windows machines

* Include date index file generation for current date in demo profile. (#643)

* fix javadoc generation errors (#646)

* Update codeowners file (#645)

* Fix client SSL properties (#648)

* Replace data-jpa with data-jdbc (#631)

* Replace data-jpa with data-jdbc
* Remove default constructor from DiagnosisKey
* Clarify test.database.replace statement

* downscale gracefully (#647)

* Failing requests in down-scaling scenario

closes #547

Following
https://dzone.com/articles/graceful-shutdown-spring-boot-applications

We let the tomcat finish it's work (send open async responses -
DeferredResult), before it's terminated (SIGTERM coming from
kubernetes).

* #589 Log4j improper shutdown

same thing as in distribution service

* checkstyle complains about missing javadoc

why is it doing that for this method but not for the others without
javadoc?!?! bug in checkstyle?

* ignore null executors

#547

* add missing file comment 'license'

#547

* fix issues found by sonar

#547

* let's use the spring framework solution

#547

https://dzone.com/articles/configuring-graceful-shutdown-readiness-and-livene

* prefer `DisposableBean` over `@PreDestroy`

#589

* use k8s setting `terminationGracePeriodSeconds`

#547

* Spring Cloud Vault Integration (#624)

* Remove missing .yaml attribute validation test (#652)

* Add dependecies for javadoc generation (#649)

* Move dependency one level up (#654)

* Update with links to cwa-server documentation and JavaDoc (#655)

* Fix javadoc search 404 result issue (#657)

* Fix/attenuation value range (#642)

* Change value range lt to le

* Change parameter lt_10_dbm to le_10_dbm in yaml-files, add unit test for naming mismatch, fix unit test path for broken syntax

* Removed unused enum entry, refactor test

Co-authored-by: Kevin Pontes <[email protected]>
Co-authored-by: Pit Humke <[email protected]>

* Addition to CODEOWNERS.md (#653)

* Addition to codeowners

* Specify entire teams in codeowners file

* Revert "Specify entire teams in codeowners file"

This reverts commit e101f43.

* Added Frederico to  codeowners.md

Co-authored-by: Michael Burwig <[email protected]>

Co-authored-by: Ole Lilienthal <[email protected]>
Co-authored-by: Jan Sievers <[email protected]>
Co-authored-by: selectAll <[email protected]>
Co-authored-by: Steve BE <[email protected]>
Co-authored-by: Pit Humke <[email protected]>
Co-authored-by: Timon G <[email protected]>
Co-authored-by: MKusber <[email protected]>
Co-authored-by: Walter S <[email protected]>
Co-authored-by: Nils Knappmier (work account) <[email protected]>
Co-authored-by: Christian Kirschnick <[email protected]>
Co-authored-by: Michael Burwig <[email protected]>
Co-authored-by: Paul-Christian Volkmer <[email protected]>
Co-authored-by: Roman Neß <[email protected]>
Co-authored-by: Matthias Wessendorf <[email protected]>
Co-authored-by: Stefan Weil <[email protected]>
Co-authored-by: Björn Wilmsmann <[email protected]>
Co-authored-by: Roberto Duessmann <[email protected]>
Co-authored-by: Michael <[email protected]>
Co-authored-by: Kevin Pontes <[email protected]>
Co-authored-by: Hilmar Falkenberg <[email protected]>
Co-authored-by: KevponSAP <[email protected]>
Co-authored-by: Hilmar Falkenberg <[email protected]>
Co-authored-by: Christopher Schmitz <[email protected]>
Co-authored-by: dabelenda <[email protected]>
Co-authored-by: Sorin Stefan Iovita <[email protected]>
Co-authored-by: Michael Risthaus <[email protected]>
Co-authored-by: Driptaroop Das <[email protected]>
Co-authored-by: ioangut <[email protected]>
Co-authored-by: EugenM-SAP <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review timeout thresholds
3 participants