Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SCD Traffic concurrent prober check fails on cloud deployment #1002

Closed
barroco opened this issue Feb 15, 2024 · 6 comments
Closed

SCD Traffic concurrent prober check fails on cloud deployment #1002

barroco opened this issue Feb 15, 2024 · 6 comments
Labels
bug Software behaves incorrectly because of this issue P0 Highest priority; blocking usage or development

Comments

@barroco
Copy link
Contributor

barroco commented Feb 15, 2024

Describe the bug
The prober fails when testing a cloud deployment. Failure in appended at the end of this report.

Problem has been identified when deploying the DSS on GCP using helm.

To Reproduce

  • Deploy a Kubernetes cluster on GCP using terraform
  • Deploy the DSS using Helm
  • Run the oauth dummy server locally
  • Run the prober using interuss/monitoring:v0.3.0

Note that the test has been conducted from a local machine (based in Switzerland) and the cluster is deployed in US east region.

Stacktrace

============================= test session starts ==============================
platform linux -- Python 3.11.7, pytest-6.2.4, py-1.11.0, pluggy-0.13.1
rootdir: /app/monitoring/prober
plugins: Faker-8.1.0, mock-3.6.1
collected 315 items

aux_/test_token_validation.py ...                                        [  0%]
aux_/test_version.py .                                                   [  1%]
monitorlib/test_geo.py .                                                 [  1%]
rid/v1/test_isa_expiry.py .......                                        [  3%]
rid/v1/test_isa_simple.py ...................                            [  9%]
rid/v1/test_isa_simple_heavy_traffic_concurrent.py .....                 [ 11%]
rid/v1/test_isa_validation.py .........                                  [ 14%]
rid/v1/test_subscription_isa_interactions.py ......                      [ 16%]
rid/v1/test_subscription_isa_slightly_overlapping.py .....               [ 17%]
rid/v1/test_subscription_simple.py ............                          [ 21%]
rid/v1/test_subscription_validation.py ........                          [ 24%]
rid/v1/test_token_validation.py ......                                   [ 26%]
rid/v2/test_isa_expiry.py .......                                        [ 28%]
rid/v2/test_isa_simple.py ...................                            [ 34%]
rid/v2/test_isa_validation.py .........                                  [ 37%]
rid/v2/test_subscription_isa_interactions.py ......                      [ 39%]
rid/v2/test_subscription_simple.py ............                          [ 42%]
rid/v2/test_subscription_validation.py ........                          [ 45%]
rid/v2/test_token_validation.py ......                                   [ 47%]
scd/test_constraint_simple.py ...................                        [ 53%]
scd/test_constraints_with_subscriptions.py ...........                   [ 56%]
scd/test_operation_references_error_cases.py .......................     [ 64%]
scd/test_operation_references_state_transition.py ......                 [ 66%]
scd/test_operation_simple.py .....................                       [ 72%]
scd/test_operation_simple_heavy_traffic.py ...............               [ 77%]
scd/test_operation_simple_heavy_traffic_concurrent.py ....Fs.            [ 79%]
scd/test_operation_special_cases.py .......                              [ 81%]
scd/test_operations_simple.py ......................                     [ 88%]
scd/test_subscription_queries.py ..........                              [ 92%]
scd/test_subscription_query_time.py .                                    [ 92%]
scd/test_subscription_simple.py ...........                              [ 95%]
scd/test_subscription_update_validation.py ...........                   [ 99%]
scd/test_uss_availability.py ..                                          [100%]

=================================== FAILURES ===================================
__________________________ test_mutate_ops_concurrent __________________________

ids = <function ids.<locals>.<lambda> at 0x2aaaeb2642c0>, scd_api = '1.0.0'
scd_session = <monitoring.monitorlib.infrastructure.UTMClientSession object at 0x2aaaeaf11610>
scd_session_async = <monitoring.monitorlib.infrastructure.AsyncUTMTestSession object at 0x2aaaeb219910>

    @for_api_versions(scd.API_0_3_17)
    @default_scope(SCOPE_SC)
    @depends_on(test_create_ops_concurrent)
    def test_mutate_ops_concurrent(ids, scd_api, scd_session, scd_session_async):
        start_time = datetime.datetime.utcnow()
        op_req_map = {}
        op_resp_map = {}
        op_map = {}

        # Build mutate requests
        for idx, op_id in enumerate(map(ids, OP_TYPES)):
            op_req_map[op_id] = _build_mutate_request(
                idx, op_id, op_map, scd_session, scd_api
            )
        assert len(op_req_map) == len(OP_TYPES)

        # Mutate operations in parallel
        loop = asyncio.get_event_loop()
>       results = loop.run_until_complete(
            asyncio.gather(
                *[
                    _put_operation_async(req, op_id, scd_session_async, scd_api, False)
                    for op_id, req in op_req_map.items()
                ]
            )
        )

scd/test_operation_simple_heavy_traffic_concurrent.py:419:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.11/asyncio/base_events.py:653: in run_until_complete
    return future.result()
scd/test_operation_simple_heavy_traffic_concurrent.py:158: in _put_operation_async
    result = await scd_session_async.put(req_url, data=req), req_url, req
../monitorlib/infrastructure.py:199: in put
    async with self._client.put(url, **kwargs) as response:
/usr/local/lib/python3.11/site-packages/aiohttp/client.py:1187: in __aenter__
    self._resp = await self._coro
/usr/local/lib/python3.11/site-packages/aiohttp/client.py:601: in _request
    await resp.start(conn)
/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py:960: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <aiohttp.helpers.TimerContext object at 0x2aaaeb21b010>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x2aaaeaf3d200>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()

        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py:735: TimeoutError
----------------------------- Captured stdout call -----------------------------
0 (2d): (-56.0, 178.0) 0.0-120.0 2024-02-14T17:38:42.682161Z-2024-02-14T18:38:42.682161Z
1 (2d): (-56.1, 178.0) 0.0-120.0 2024-02-14T17:38:42.817100Z-2024-02-14T18:38:42.817100Z
2 (2d): (-56.2, 178.0) 0.0-120.0 2024-02-14T17:38:42.943341Z-2024-02-14T18:38:42.943341Z
3 (2d): (-56.3, 178.0) 0.0-120.0 2024-02-14T17:38:43.076327Z-2024-02-14T18:38:43.076327Z
4 (2d): (-56.4, 178.0) 0.0-120.0 2024-02-14T17:38:43.209786Z-2024-02-14T18:38:43.209786Z
5 (2d): (-56.5, 178.0) 0.0-120.0 2024-02-14T17:38:43.345533Z-2024-02-14T18:38:43.345533Z
6 (2d): (-56.6, 178.0) 0.0-120.0 2024-02-14T17:38:43.493194Z-2024-02-14T18:38:43.493194Z
7 (altitude): (-56.0, 178.0) 140.0-159.0 2024-02-14T17:38:43.637506Z-2024-02-14T18:38:43.637506Z
8 (altitude): (-56.0, 178.0) 160.0-179.0 2024-02-14T17:38:43.782837Z-2024-02-14T18:38:43.782837Z
9 (altitude): (-56.0, 178.0) 180.0-199.0 2024-02-14T17:38:43.934621Z-2024-02-14T18:38:43.934621Z
10 (altitude): (-56.0, 178.0) 200.0-219.0 2024-02-14T17:38:44.079218Z-2024-02-14T18:38:44.079218Z
11 (altitude): (-56.0, 178.0) 220.0-239.0 2024-02-14T17:38:44.229315Z-2024-02-14T18:38:44.229315Z
12 (altitude): (-56.0, 178.0) 240.0-259.0 2024-02-14T17:38:44.384903Z-2024-02-14T18:38:44.384903Z
13 (altitude): (-56.0, 178.0) 260.0-279.0 2024-02-14T17:38:44.540394Z-2024-02-14T18:38:44.540394Z
14 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T22:18:44.698535Z-2024-02-14T22:27:44.698535Z
15 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T22:38:44.845532Z-2024-02-14T22:47:44.845532Z
16 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T22:58:44.985477Z-2024-02-14T23:07:44.985477Z
17 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T23:18:45.143788Z-2024-02-14T23:27:45.143788Z
18 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T23:38:45.300040Z-2024-02-14T23:47:45.300040Z
19 (time): (-56.0, 178.0) 0.0-120.0 2024-02-14T23:58:45.456198Z-2024-02-15T00:07:45.456198Z
=============================== warnings summary ===============================
../../../usr/local/lib/python3.11/site-packages/pvlib/tools.py:7
  /usr/local/lib/python3.11/site-packages/pvlib/tools.py:7: DeprecationWarning:
  Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
  (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
  but was not found to be installed on your system.
  If this would cause problems for you,
  please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

    import pandas as pd

-- Docs: https://docs.pytest.org/en/stable/warnings.html
--------------------- generated xml file: /app/test_result ---------------------
=========================== short test summary info ============================
SKIPPED [1] infrastructure.py:46: Prerequisite task did not pass
======= 1 failed, 313 passed, 1 skipped, 1 warning in 319.78s (0:05:19) ========
+ '[' '' == true ']'
+ echo 'Prober did not succeed.'
Prober did not succeed.
+ exit 1
@barroco
Copy link
Contributor Author

barroco commented Feb 15, 2024

The check fails for the following dss image deployments (Note that the cockroach db has not been reset between runs):

Additional checks:

  • CockroachDB reports 10 sec latency on some queries (more investigation required)
  • Time is synchronized between crdb containers, dss-core-services and local machine running the prober.

@BenjaminPelletier BenjaminPelletier added P0 Highest priority; blocking usage or development bug Software behaves incorrectly because of this issue labels Feb 15, 2024
@barroco
Copy link
Contributor Author

barroco commented Feb 16, 2024

  • Upgrade of cockroachdb to version v22.2.18 do not show improvements.
  • AWS deployment shows similar results.
  • CRDB transactions analysis: Multiple FULL SCAN in query plans of long lasting transactions.

@barroco
Copy link
Contributor Author

barroco commented Feb 22, 2024

Under concurrent load, the transactions are invalidated late leading to multiple retry as seen on the following diagram. The invalidation usually occurs when the notification index is incremented.

Screenshot 2024-02-22 at 17 10 58

@barroco
Copy link
Contributor Author

barroco commented Feb 23, 2024

Successful prober tests have been run following the merge of #1004 on a similar configuration.

Result:

============================= test session starts ==============================
platform linux -- Python 3.11.7, pytest-6.2.4, py-1.11.0, pluggy-0.13.1
rootdir: /app/monitoring/prober
plugins: Faker-8.1.0, mock-3.6.1
collected 315 items

aux_/test_token_validation.py ...                                        [  0%]
aux_/test_version.py .                                                   [  1%]
monitorlib/test_geo.py .                                                 [  1%]
rid/v1/test_isa_expiry.py .......                                        [  3%]
rid/v1/test_isa_simple.py ...................                            [  9%]
rid/v1/test_isa_simple_heavy_traffic_concurrent.py .....                 [ 11%]
rid/v1/test_isa_validation.py .........                                  [ 14%]
rid/v1/test_subscription_isa_interactions.py ......                      [ 16%]
rid/v1/test_subscription_isa_slightly_overlapping.py .....               [ 17%]
rid/v1/test_subscription_simple.py ............                          [ 21%]
rid/v1/test_subscription_validation.py ........                          [ 24%]
rid/v1/test_token_validation.py ......                                   [ 26%]
rid/v2/test_isa_expiry.py .......                                        [ 28%]
rid/v2/test_isa_simple.py ...................                            [ 34%]
rid/v2/test_isa_validation.py .........                                  [ 37%]
rid/v2/test_subscription_isa_interactions.py ......                      [ 39%]
rid/v2/test_subscription_simple.py ............                          [ 42%]
rid/v2/test_subscription_validation.py ........                          [ 45%]
rid/v2/test_token_validation.py ......                                   [ 47%]
scd/test_constraint_simple.py ...................                        [ 53%]
scd/test_constraints_with_subscriptions.py ...........                   [ 56%]
scd/test_operation_references_error_cases.py .......................     [ 64%]
scd/test_operation_references_state_transition.py ......                 [ 66%]
scd/test_operation_simple.py .....................                       [ 72%]
scd/test_operation_simple_heavy_traffic.py ...............               [ 77%]
scd/test_operation_simple_heavy_traffic_concurrent.py .......            [ 79%]
scd/test_operation_special_cases.py .......                              [ 81%]
scd/test_operations_simple.py ......................                     [ 88%]
scd/test_subscription_queries.py ..........                              [ 92%]
scd/test_subscription_query_time.py .                                    [ 92%]
scd/test_subscription_simple.py ...........                              [ 95%]
scd/test_subscription_update_validation.py ...........                   [ 99%]
scd/test_uss_availability.py ..                                          [100%]

=============================== warnings summary ===============================
../../../usr/local/lib/python3.11/site-packages/pvlib/tools.py:7
  /usr/local/lib/python3.11/site-packages/pvlib/tools.py:7: DeprecationWarning:
  Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
  (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
  but was not found to be installed on your system.
  If this would cause problems for you,
  please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

    import pandas as pd

-- Docs: https://docs.pytest.org/en/stable/warnings.html
--------------------- generated xml file: /app/test_result ---------------------
================== 315 passed, 1 warning in 298.65s (0:04:58) ==================
+ echo 'Prober succeeded.'
Prober succeeded.

3 runs have been performed with similar results.

@barroco barroco closed this as completed Feb 23, 2024
@barroco
Copy link
Contributor Author

barroco commented Feb 23, 2024

Related to #742 and #814.

@Shastick
Copy link
Contributor

Shastick commented Sep 12, 2024

@barroco, little non-urgent question about the early locking added in #1004:

is there a particular reason we do not filter out subscriptions that aren't within the time-range of interest? We seem to look for all subscriptions that cover a cell, regardless if this subscription's time boundaries overlaps with the OIR being created or mutated.

As a consequence we might be locking more subscriptions than necessary: if filtering out unwanted subscriptions can be done efficiently (I assume we have the correct indices for this) we may be able to reduce contention even further?

Another observation: when we do r.LockSubscriptionsOnCells(ctx, validParams.cells) in upsertOperationalIntentReference we're only locking cells belonging to the passed parameters, but not the cells belonging to the previous version of the OIR (when we are doing an update). That is probably fine, but there might be opportunities to improve this even further.

(I ran into this while looking at #814)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Software behaves incorrectly because of this issue P0 Highest priority; blocking usage or development
Projects
None yet
Development

No branches or pull requests

3 participants