Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416
Comments
Standard investigation: fix if reasonable, otherwise cut off a new ticket. Putting at a 3.
Example of BGS retry implementation: https://github.com/department-of-veterans-affairs/caseflow/pull/15316/files
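For readers who cannot open the linked PR, here is a minimal sketch of what a retry around a transient connection error could look like. It is not taken from that PR: `with_transient_retry`, `TRANSIENT_ERRORS`, and the `sleeper` parameter are hypothetical names introduced for illustration, and the only error class retried is `Errno::ECONNRESET`, the one this ticket's fix rescues.

```ruby
# Hypothetical helper, not part of Caseflow: retries the block when a
# transient connection error is raised, backing off between attempts.
TRANSIENT_ERRORS = [Errno::ECONNRESET].freeze

def with_transient_retry(max_attempts: 3, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *TRANSIENT_ERRORS
    # Give up and re-raise the original error once the attempts are used up.
    raise if attempts >= max_attempts

    # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
    sleeper.call(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Usage with a stand-in for the BGS call that fails twice, then succeeds.
# The no-op sleeper keeps the example fast.
calls = 0
result = with_transient_retry(sleeper: ->(_seconds) {}) do
  calls += 1
  raise Errno::ECONNRESET, "SSL_connect" if calls < 3

  "representative name"
end
puts "#{result} (attempts: #{calls})"
```

Whether retrying is appropriate here depends on whether the failures really are transient; if BGS is overloaded, retries add load rather than relieve it.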
What is the value of
calls caseflow/app/models/legacy_appeal_representative.rb Lines 51 to 56 in c1f1954
Answer is
…15482)

Resolves #15416

### Description
If a BGS connection error occurs when getting the POA representative, log a warning, allow UpdateCachedAppealsAttributesJob to complete without failure, and send the warnings to Slack.

The UpdateCachedAppealsAttributesJob runs many times per day. The warning logs should provide more info to determine if we want to automatically retry for certain situations.

There may be other errors caused by BGS, but we'll just rescue from `Errno::ECONNRESET` for now.

### Acceptance Criteria
- [ ] Recover from the `SSL_connect` error
- [ ] Code compiles correctly and tests pass

### Testing Plan
It's hard to test unless we know which appeal causes the BGS connection error.

In prod, monkey-patch the class with the code changes, skipping `cache_ama_appeals` and limiting to the first few legacy appeals:
```ruby
TEST_LIMIT = 1500

class UpdateCachedAppealsAttributesJob
  def perform
    RequestStore.store[:current_user] = User.system_user
    ama_appeals_start = Time.zone.now
    ### Ignore this for testing: cache_ama_appeals
    datadog_report_time_segment(segment: "cache_ama_appeals", start_time: ama_appeals_start)

    legacy_appeals_start = Time.zone.now
    cache_legacy_appeals
    datadog_report_time_segment(segment: "cache_legacy_appeals", start_time: legacy_appeals_start)

    datadog_report_runtime(metric_group_name: METRIC_GROUP_NAME)
  rescue StandardError => error
    log_error(@start_time, error)
  else
    puts "warnings: #{warning_msgs.count}"
    log_warning unless warning_msgs.empty?
  end

  ### overriding to limit number of appeals
  def cache_legacy_appeals
    # Avoid lazy evaluation bugs by immediately plucking all VACOLS IDs. Lazy evaluation of the LegacyAppeal.find(...)
    # was previously causing this code to insert legacy appeal attributes that corresponded to NULL ID fields.
    puts "cache_legacy_appeals"
    legacy_appeals = LegacyAppeal.includes(:available_hearing_locations)
      .where(id: open_appeals_from_tasks(LegacyAppeal.name)).limit(10) ### limit to a few
    all_vacols_ids = legacy_appeals.pluck(:vacols_id).flatten
    puts "vacols_ids count: #{all_vacols_ids.count}"

    cache_postgres_data_start = Time.zone.now
    cache_legacy_appeal_postgres_data(legacy_appeals)
    datadog_report_time_segment(segment: "cache_legacy_appeal_postgres_data", start_time: cache_postgres_data_start)

    ### cache_vacols_data_start = Time.zone.now
    ### cache_legacy_appeal_vacols_data(all_vacols_ids)
    ### datadog_report_time_segment(segment: "cache_legacy_appeal_vacols_data", start_time: cache_vacols_data_start)
  end

  def cache_legacy_appeal_postgres_data(legacy_appeals)
    # this transaction times out so let's try to do this in batches
    ### limit to a subset of appeals; may need to increase in order to get to an appeal that causes a BGS error
    legacy_appeals.first(TEST_LIMIT).in_groups_of(POSTGRES_BATCH_SIZE, false) do |batch_legacy_appeals|
      values_to_cache = batch_legacy_appeals.map do |appeal|
        puts "batch_legacy_appeals: #{appeal.id}"
        regional_office = RegionalOffice::CITIES[appeal.closest_regional_office]
        {
          vacols_id: appeal.vacols_id,
          appeal_id: appeal.id,
          appeal_type: LegacyAppeal.name,
          closest_regional_office_city: regional_office ? regional_office[:city] : COPY::UNKNOWN_REGIONAL_OFFICE,
          closest_regional_office_key: regional_office ? appeal.closest_regional_office : COPY::UNKNOWN_REGIONAL_OFFICE,
          docket_type: appeal.docket_name, # "legacy"
          power_of_attorney_name: poa_representative_name_for(appeal),
          suggested_hearing_location: appeal.suggested_hearing_location&.formatted_location
        }
      end

      puts "import values_to_cache: #{values_to_cache.count}"
      CachedAppeal.import values_to_cache, on_duplicate_key_update: { conflict_target: [:appeal_id, :appeal_type],
                                                                      columns: [
                                                                        :closest_regional_office_city,
                                                                        :closest_regional_office_key,
                                                                        :vacols_id,
                                                                        :docket_type,
                                                                        :power_of_attorney_name,
                                                                        :suggested_hearing_location
                                                                      ] }
      increment_appeal_count(batch_legacy_appeals.length, LegacyAppeal.name)
    end
  end

  # bypass PowerOfAttorney model completely and always prefer BGS cache
  def poa_representative_name_for(appeal)
    bgs_poa = fetch_bgs_power_of_attorney_by_file_number(appeal.veteran_file_number)
    # both representative_name calls can result in BGS connection error
    bgs_poa&.representative_name || appeal.representative_name
  rescue Errno::ECONNRESET => error
    puts "RESCUED error for: #{appeal.id}"
    warning_msgs << "#{appeal.class.name} #{appeal.id}: #{error}" if warning_msgs.count < 100
    nil
  end

  def warning_msgs
    @warning_msgs ||= []
  end

  def log_warning
    slack_msg = "[WARN] UpdateCachedAppealsAttributesJob first 100 warnings: \n#{warning_msgs.join("\n")}"
    slack_service.send_notification(slack_msg)
  end
end

# Then run it. Keep an eye on #appeals-job-alerts Slack channel
UpdateCachedAppealsAttributesJob.perform_now
```

You may see a bunch of `ROLLBACK`s due to BGS like the following, but the job should keep running.
```ruby
[2020-10-22 12:13:57 -0400] BgsPowerOfAttorney Load (1.6ms) SELECT "bgs_power_of_attorneys".* FROM "bgs_power_of_attorneys" WHERE "bgs_power_of_attorneys"."file_number" = $1 LIMIT $2
...
[2020-10-22 12:13:57 -0400] (1.0ms) BEGIN
[2020-10-22 12:13:57 -0400] (1.1ms) ROLLBACK
```
and
```
[2020-10-22 12:16:28 -0400] FINISHED BGS: fetch poa for file number: ...
[2020-10-22 12:16:28 -0400] (2.6ms) ROLLBACK
```
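The rescue-and-warn behavior described in this PR can be tried outside of production with a small standalone sketch. Everything in it is a stand-in: `NameCacher`, `MAX_WARNINGS`, and `flaky_lookup` are hypothetical names, and the lambda only simulates a BGS call that resets the connection for one record. It shows the same idea as `poa_representative_name_for`: one failing lookup yields `nil` and a capped warning instead of aborting the whole batch.

```ruby
# Hypothetical stand-ins for illustration; none of these names come from Caseflow.
MAX_WARNINGS = 100

class NameCacher
  attr_reader :warnings

  def initialize(lookup)
    @lookup = lookup # callable that may raise Errno::ECONNRESET
    @warnings = []
  end

  # Returns a hash of id => name, with nil for records whose lookup failed.
  def cache_all(ids)
    ids.map { |id| [id, name_for(id)] }.to_h
  end

  private

  def name_for(id)
    @lookup.call(id)
  rescue Errno::ECONNRESET => error
    # Record the failure (up to the cap) and keep going.
    @warnings << "record #{id}: #{error}" if @warnings.count < MAX_WARNINGS
    nil
  end
end

# Simulated lookup that fails for exactly one record.
flaky_lookup = lambda do |id|
  raise Errno::ECONNRESET, "SSL_connect" if id == 2

  "Representative #{id}"
end

cacher = NameCacher.new(flaky_lookup)
results = cacher.cache_all([1, 2, 3])
puts results.inspect
puts cacher.warnings
```

Capping the list keeps the Slack message bounded when BGS is failing for most records.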
Description
We are occasionally seeing errors while the `UpdateCachedAppealsAttributesJob` is running that cause it to exit before it finishes. These failures have been happening more and more frequently.
This ticket is to resolve one of the more recent issues, which happens when the `UpdateCachedAppealsAttributesJob` tries to fetch the power of attorney from BGS and can't because of an `SSL_connect` error from BGS. Judging by the error message, this seems to be a transient error, and Caseflow could probably recover from it more gracefully. It could also be that we are overloading the BGS servers, in which case we might need to think about ways to reduce the number of calls we make to BGS.
Acceptance criteria
- [ ] Recover from the `SSL_connect` error
Background/context/resources
Full error stack trace:
Slack Thread: https://dsva.slack.com/archives/CN3FQR4A1/p1602592080030200
Sentry Alert: https://sentry.prod.appeals.va.gov/department-of-veterans-affairs/caseflow/issues/11179/events/891890/?environment=production
Technical notes
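On the description's point about reducing the number of calls to BGS, one possible direction (not something this ticket prescribes) is to remember lookups per file number for the duration of a single job run. The sketch below assumes a hypothetical `MemoizedPoaLookup` wrapper and a `fetcher` callable standing in for the real BGS request; neither exists in Caseflow.

```ruby
# Hypothetical wrapper, not part of Caseflow: remembers results per file
# number for the lifetime of one job run so repeated lookups skip the remote call.
class MemoizedPoaLookup
  attr_reader :remote_calls

  def initialize(fetcher)
    @fetcher = fetcher # callable standing in for the real BGS request
    @results = {}
    @remote_calls = 0
  end

  def representative_name(file_number)
    # Hash#key? lets a legitimately nil result be remembered too.
    return @results[file_number] if @results.key?(file_number)

    @remote_calls += 1
    @results[file_number] = @fetcher.call(file_number)
  end
end

lookup = MemoizedPoaLookup.new(->(file_number) { "POA for #{file_number}" })
%w[111 222 111 111].each { |file_number| lookup.representative_name(file_number) }
puts "remote calls: #{lookup.remote_calls}"
```

If the fetcher raises, nothing is stored for that file number, so a later lookup for the same file number tries the remote call again.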