Registry-Sweepers Error: contained no hits when hits were expected #69

Closed
sjoshi-jpl opened this issue Sep 3, 2023 · 6 comments
Labels: B14.0, i&t.skip (Skip I&T of this task/ticket), s.high (High severity), task

Comments

@sjoshi-jpl
Contributor

💡 Description

The following error has been occurring in the ATM and GEO registry-sweeper tasks and has triggered multiple notifications during every Lambda run. Please take a look:

GEO-PROD

Error found in log group '/ecs/pds-geo-prod-registry-sweeper-task':

Timestamp (UTC): 2023-09-03 13:46:48.295000
Log Stream: ecs/pds-geo-prod-registry-sweeper-container/8f0a15e6c18d40b59214433c6abbe05f
Error Message: 2023-09-03 13:46:48,295::pds.registrysweepers.utils.db::ERROR::Response for query 346d70 contained no hits when hits were expected.  Returned data is incomplete.  Response was: {'_scroll_id': 'FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxZBSjMxUGlTSFNaS3c1WUd0b0pvZmtBAAAAAAAAPJgWekZYaWNTdFRRX3k3X3NFYWNSRGZlZxZibE9CX0F6SVRyR1R0RHlYSHBNb2F3AAAAAAAAOVEWdW5nTDhrYTJRSmVkdnQzWnAxaDNWZxZBSjMxUGlTSFNaS3c1WUd0b0pvZmtBAAAAAAAAPJcWekZYaWNTdFRRX3k3X3NFYWNSRGZlZw==', 'took': 2, 'timed_out': False, 'terminated_early': False, '_shards': {'total': 3, 'successful': 3, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3460390, 'relation': 'eq'}, 'max_score': 1.0, 'hits': []}}

ATM-PROD

Error found in log group '/ecs/pds-atm-prod-registry-sweeper-task':

Timestamp (UTC): 2023-09-03 13:29:26.877000
Log Stream: ecs/pds-atm-prod-registry-sweeper-container/5ab74f8d4b4d45a6b19cda6cfb953956
Error Message: 2023-09-03 13:29:26,877::pds.registrysweepers.utils.db::ERROR::Response for query 945e60 contained no hits when hits were expected.  Returned data is incomplete.  Response was: {'_scroll_id': 'FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxZ3UktlUnNsdVEzU3g2eGx6SVpYaFhRAAAAAAAAACMWdUltaGl1SG9UTU9rSzNWYmo3QTRpdxZ3UktlUnNsdVEzU3g2eGx6SVpYaFhRAAAAAAAAACQWdUltaGl1SG9UTU9rSzNWYmo3QTRpdxZWc180MnowYVFwR2hhb25hd0VncFpRAAAAAAAAAAwWejFZSERmV0xSN2kzclBtNjE1c0c1QQ==', 'took': 1, 'timed_out': False, 'terminated_early': False, '_shards': {'total': 3, 'successful': 2, 'skipped': 0, 'failed': 1, 'failures': [{'shard': -1, 'index': None, 'reason': {'type': 'illegal_state_exception', 'reason': 'node [z1YHDfWLR7i3rPm615sG5A] is not available'}}]}, 'hits': {'total': {'value': 649002, 'relation': 'eq'}, 'max_score': 1.0, 'hits': []}}
@sjoshi-jpl sjoshi-jpl added B14.0 i&t.skip Skip I&T of this task/ticket task labels Sep 3, 2023
@alexdunnjpl
Contributor

Self-note - check whether or not scroll_id is being updated on every request, as it should be
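
For reference, a minimal sketch of what "updated on every request" would look like. This is not the registry-sweepers implementation: `scroll_all_hits`, `search`, and `scroll` are hypothetical names, and the two callables stand in for whatever client calls actually issue the initial query and the follow-up scroll requests. The only thing taken from the logs above is the response shape (`_scroll_id`, `hits.total.value`, `hits.hits`).

```python
from typing import Callable, Dict, Iterator


def scroll_all_hits(
    search: Callable[[], Dict],
    scroll: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Yield every hit, always passing the most recently returned _scroll_id.

    `search` (hypothetical) issues the initial query; `scroll` (hypothetical)
    fetches the next page for a given scroll id.
    """
    response = search()
    expected = response["hits"]["total"]["value"]
    served = 0
    while served < expected:
        hits = response["hits"]["hits"]
        if not hits:
            # Same condition the logged error describes: an empty page
            # arrived before the reported total was reached.
            raise RuntimeError(
                f"got {served} of {expected} hits before an empty page"
            )
        yield from hits
        served += len(hits)
        if served < expected:
            # Take the scroll id from the latest response rather than
            # reusing the one returned by the initial search.
            response = scroll(response["_scroll_id"])
```

Passing the callables in keeps the loop testable without a live cluster; a stale id would show up as `scroll` being called with the same value twice.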

@alexdunnjpl
Contributor

The ATM-PROD failure appears to be on the OpenSearch side; see the shard failure in the response:

{
  'shard': -1,
  'index': None,
  'reason': {
    'type': 'illegal_state_exception',
    'reason': 'node [z1YHDfWLR7i3rPm615sG5A] is not available'
  }
}

@tloubrieu-jpl @sjoshi-jpl I'm not sure where to take this one, as it appears to be a failure of one of the shards associated with that node. Any ideas?

GEO-PROD is a bit weirder: it's not showing any shard failures or any errors. I'll introduce some improved logging; hopefully that yields useful new information.
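
A rough sketch of the kind of check that separates the two cases, assuming only the response fields visible in the logs above (`_shards.failed`, `_shards.failures`, `timed_out`, `hits`). `describe_empty_page` is a hypothetical helper, not something that exists in registry-sweepers.

```python
from typing import Dict, List


def describe_empty_page(response: Dict) -> List[str]:
    """Return human-readable reasons why a scroll page may be incomplete."""
    findings: List[str] = []
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed > 0:
        failures = shards.get("failures", [])
        for failure in failures:
            reason = failure.get("reason", {})
            findings.append(
                f"shard failure: {reason.get('type')}: {reason.get('reason')}"
            )
        if not failures:
            findings.append(f"{failed} shard(s) failed without details")
    if response.get("timed_out"):
        findings.append("query timed out")
    hits = response.get("hits", {})
    total = hits.get("total", {}).get("value", 0)
    if not hits.get("hits") and total > 0 and not findings:
        # The GEO-PROD case: nothing in the response explains the empty page.
        findings.append(
            f"no hits returned although total is {total} "
            "and no shard reported failure"
        )
    return findings
```

Run against the two logged responses, this would report the `illegal_state_exception` for ATM-PROD and the unexplained empty page for GEO-PROD, which is the case the extra logging needs to shed light on.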

@jordanpadams
Member

@sjoshi-jpl to look at this more closely to see whether the shard failure is happening regularly.

@alexdunnjpl
Contributor

alexdunnjpl commented Sep 12, 2023

Logging improvements introduced in #72, #73
572ac41
81cb14a

Timeout thresholds have also been extended, though it won't be clear for a little while whether or not they're relevant to the timeouts in question.

@alexdunnjpl
Contributor

@sjoshi-jpl @jordanpadams I'm fairly certain this is resolved by #77

Closing on that basis. @sjoshi-jpl would you please remove any exclusion rules in the error log escalation lambda? We can re-open if this issue reappears.

@sjoshi-jpl
Contributor Author

@alexdunnjpl done. This issue statement has been removed from the sweepers Lambda exceptions list, so if the errors do occur again we should know.
