Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

ferristseng · 2020-10-13T15:48:12Z

Description

We are occasionally seeing errors while the UpdateCachedAppealsAttributeJob is running that is causing it to exit before it finishes. This has been becoming increasingly more consistent.

This ticket is to resolve one of the more recent issues, which happens when the UpdateCachedAppealsAttribute job tries to fetch the power of attorney from BGS, and can't because of an SSL_connect error from BGS. Judging by the error message, this seems to be a transient error, and Caseflow could probably recover from it more gracefully. This also could be an issue where we are overloading the BGS servers, and might need to think about ways to reduce the number of calls we make to BGS.

Acceptance criteria

Recover from the SSL_connect error

Background/context/resources

Full error stack trace:

Errno::ECONNRESET: Connection reset by peer - SSL_connect
  from httpclient/ssl_socket.rb:103:in `connect'
  from httpclient/ssl_socket.rb:103:in `ssl_connect'
  from httpclient/ssl_socket.rb:41:in `initialize'
  from httpclient/ssl_socket.rb:26:in `new'
  from httpclient/ssl_socket.rb:26:in `create_socket'
  from httpclient/session.rb:752:in `block in connect'
  from timeout.rb:93:in `block in timeout'
  from timeout.rb:103:in `timeout'
  from httpclient/session.rb:748:in `connect'
  from httpclient/session.rb:511:in `query'
  from httpclient/session.rb:177:in `query'
  from httpclient.rb:1242:in `do_get_block'
  from new_relic/agent/instrumentation/httpclient.rb:41:in `do_get_block_with_newrelic'
  from httpclient.rb:1019:in `block in do_request'
  from httpclient.rb:1133:in `protect_keep_alive_disconnected'
  from httpclient.rb:1014:in `do_request'
  from httpclient.rb:856:in `request'
  from httpi/adapter/httpclient.rb:26:in `request'
  from httpi.rb:161:in `request'
  from httpi.rb:133:in `post'
  from savon/operation.rb:94:in `block in call_with_logging'
  from savon/request_logger.rb:12:in `log'
  from savon/operation.rb:94:in `call_with_logging'
  from savon/operation.rb:54:in `call'
  from savon/client.rb:36:in `call'
  from bgs/base.rb:152:in `request'
  from bgs/services/person.rb:19:in `find_by_ssn'
  from app/services/external_api/bgs_service.rb:124:in `block in fetch_person_by_ssn'
  from app/services/metrics_service.rb:14:in `block in record'
  from benchmark.rb:293:in `measure'
  from app/services/metrics_service.rb:13:in `record'
  from app/services/external_api/bgs_service.rb:121:in `fetch_person_by_ssn'
  from app/models/person.rb:141:in `fetch_bgs_record_by_ssn'
  from app/models/person.rb:124:in `fetch_bgs_record'
  from app/models/concerns/associated_bgs_record.rb:68:in `try_and_retry_bgs_record'
  from app/models/concerns/associated_bgs_record.rb:62:in `bgs_record'
  from app/models/concerns/associated_bgs_record.rb:104:in `not_found?'
  from app/models/person.rb:113:in `found?'
  from app/models/person.rb:38:in `find_or_create_by_ssn'
  from app/models/legacy_appeal.rb:315:in `person_for_appellant'
  from app/models/legacy_appeal.rb:1022:in `claimant_participant_id'
  from app/models/legacy_appeal.rb:1016:in `new_power_of_attorney'
  from app/models/legacy_appeal.rb:363:in `power_of_attorney'
  from app/models/legacy_appeal.rb:355:in `legacy_appeal_representative'
  from app/models/legacy_appeal.rb:340:in `representative_name'
  from app/jobs/update_cached_appeals_attributes_job.rb:114:in `block (2 levels) in cache_legacy_appeal_postgres_data'
  from app/jobs/update_cached_appeals_attributes_job.rb:103:in `map'
  from app/jobs/update_cached_appeals_attributes_job.rb:103:in `block in cache_legacy_appeal_postgres_data'
  from active_support/core_ext/array/grouping.rb:39:in `block in in_groups_of'
  from active_support/core_ext/array/grouping.rb:39:in `each'
  from active_support/core_ext/array/grouping.rb:39:in `each_slice'
  from active_support/core_ext/array/grouping.rb:39:in `in_groups_of'
  from active_record/relation/delegation.rb:71:in `in_groups_of'
  from app/jobs/update_cached_appeals_attributes_job.rb:102:in `cache_legacy_appeal_postgres_data'
  from app/jobs/update_cached_appeals_attributes_job.rb:91:in `cache_legacy_appeals'
  from app/jobs/update_cached_appeals_attributes_job.rb:23:in `perform'
  from active_job/execution.rb:39:in `block in perform_now'
  from active_support/callbacks.rb:109:in `block in run_callbacks'
  from raven/integrations/rails/active_job.rb:18:in `capture_and_reraise_with_sentry'
  from raven/integrations/rails/active_job.rb:12:in `block (2 levels) in included'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from new_relic/agent/tracer.rb:72:in `in_transaction'
  from new_relic/agent/instrumentation/active_job.rb:66:in `run_in_transaction'
  from new_relic/agent/instrumentation/active_job.rb:55:in `perform'
  from new_relic/agent/instrumentation/active_job.rb:23:in `block (4 levels) in <top (required)>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from i18n.rb:313:in `with_locale'
  from active_job/translation.rb:9:in `block (2 levels) in <module:Translation>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_job/logging.rb:26:in `block (4 levels) in <module:Logging>'
  from active_support/notifications.rb:168:in `block in instrument'
  from active_support/notifications/instrumenter.rb:23:in `instrument'
  from active_support/notifications.rb:168:in `instrument'
  from active_job/logging.rb:25:in `block (3 levels) in <module:Logging>'
  from active_job/logging.rb:46:in `block in tag_logger'
  from active_support/tagged_logging.rb:71:in `block in tagged'
  from active_support/tagged_logging.rb:28:in `tagged'
  from active_support/tagged_logging.rb:71:in `tagged'
  from active_job/logging.rb:46:in `tag_logger'
  from active_job/logging.rb:22:in `block (2 levels) in <module:Logging>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_support/callbacks.rb:136:in `run_callbacks'
  from active_job/execution.rb:38:in `perform_now'
  from active_job/execution.rb:24:in `block in execute'
  from active_support/callbacks.rb:109:in `block in run_callbacks'
  from active_job/railtie.rb:28:in `block (4 levels) in <class:Railtie>'
  from active_support/execution_wrapper.rb:87:in `wrap'
  from active_support/reloader.rb:73:in `block in wrap'
  from active_support/execution_wrapper.rb:87:in `wrap'
  from active_support/reloader.rb:72:in `wrap'
  from active_job/railtie.rb:27:in `block (3 levels) in <class:Railtie>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_support/callbacks.rb:136:in `run_callbacks'
  from active_job/execution.rb:22:in `execute'
  from shoryuken/extensions/active_job_adapter.rb:78:in `perform'
  from shoryuken/processor.rb:20:in `block in process'
  from shoryuken/middleware/chain.rb:98:in `block in invoke'
  from app/jobs/middleware/job_request_store_middleware.rb:8:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from app/jobs/middleware/job_monitoring_middleware.rb:9:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/active_record.rb:6:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/auto_extend_visibility.rb:10:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/auto_delete.rb:6:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/exponential_backoff_retry.rb:16:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/timing.rb:14:in `block in call'
  from shoryuken/logging.rb:20:in `with_context'
  from shoryuken/middleware/server/timing.rb:8:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/chain.rb:103:in `invoke'
  from shoryuken/processor.rb:19:in `process'
  from shoryuken/processor.rb:8:in `process'
  from shoryuken/manager.rb:73:in `block in assign'
  from concurrent/executor/safe_task_executor.rb:24:in `block in execute'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `block in synchronize'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `synchronize'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `synchronize'
  from concurrent/executor/safe_task_executor.rb:19:in `execute'
  from concurrent/promise.rb:563:in `block in realize'
  from concurrent/executor/ruby_thread_pool_executor.rb:363:in `run_task'
  from concurrent/executor/ruby_thread_pool_executor.rb:352:in `block (3 levels) in create_worker'
  from concurrent/executor/ruby_thread_pool_executor.rb:335:in `loop'
  from concurrent/executor/ruby_thread_pool_executor.rb:335:in `block (2 levels) in create_worker'
  from concurrent/executor/ruby_thread_pool_executor.rb:334:in `catch'
  from concurrent/executor/ruby_thread_pool_executor.rb:334:in `block in create_worker'

Slack Thread: https://dsva.slack.com/archives/CN3FQR4A1/p1602592080030200
Sentry Alert: https://sentry.prod.appeals.va.gov/department-of-veterans-affairs/caseflow/issues/11179/events/891890/?environment=production

Technical notes

The text was updated successfully, but these errors were encountered:

lomky · 2020-10-15T17:20:25Z

Standard investigation - fix if reasonable otherwise cut off new ticket. Putting at a 3.

yoomlam · 2020-10-15T19:10:12Z

Example of BGS retry implementation: https://github.com/department-of-veterans-affairs/caseflow/pull/15316/files

yoomlam · 2020-10-23T16:16:13Z

What is the value of RequestStore.store[:application] when UpdateCachedAppealsAttributesJob is run? Why I ask:

caseflow/app/jobs/update_cached_appeals_attributes_job.rb

Line 114 in 0324c07

    
           power_of_attorney_name: (bgs_poa&.representative_name || appeal.representative_name),

calls

caseflow/app/models/legacy_appeal_representative.rb

Lines 51 to 56 in c1f1954

    
           def use_representative_info_from_bgs? 
        
             RequestStore.store[:application] == "queue" || 
        
               RequestStore.store[:application] == "hearings" || 
        
               RequestStore.store[:application] == "hearing_schedule_job" || 
        
               RequestStore.store[:application] == "idt" 
        
           end

Answer is queue_job if manually running in prod's Rails console using UpdateCachedAppealsAttributesJob.perform_now, which means appeal.representative_name is effectively appeal.vacols_representative_name rather than appeal.bgs_representative_name.

…15482) Resolves #15416 ### Description If a BGS connection error occurs when getting the POA representative, log a warning and allow UpdateCachedAppealsAttributesJob to complete without failure and log the warnings to Slack. The UpdateCachedAppealsAttributesJob runs many times per day. The warning logs should provide more info to determine if we want to automatically retry for certain situations. There may be other errors caused by BGS, but we'll just rescue from `Errno::ECONNRESET` for now. ### Acceptance Criteria - [ ] Recover from the `SSL_connect` error - [ ] Code compiles correctly and tests pass ### Testing Plan It's hard to test unless we know which appeal causes the BGS connection error. In prod, monkey-patch class with code changes, skipping `cache_ama_appeals` and limiting to first few legacy appeals: ```ruby TEST_LIMIT=1500 class UpdateCachedAppealsAttributesJob def perform RequestStore.store[:current_user] = User.system_user ama_appeals_start = Time.zone.now ### Ignore this for testing: cache_ama_appeals datadog_report_time_segment(segment: "cache_ama_appeals", start_time: ama_appeals_start) legacy_appeals_start = Time.zone.now cache_legacy_appeals datadog_report_time_segment(segment: "cache_legacy_appeals", start_time: legacy_appeals_start) datadog_report_runtime(metric_group_name: METRIC_GROUP_NAME) rescue StandardError => error log_error(@start_time, error) else puts "warnings: #{warning_msgs.count}" log_warning unless warning_msgs.empty? end ### overridding to limit number of appeals def cache_legacy_appeals # Avoid lazy evaluation bugs by immediately plucking all VACOLS IDs. Lazy evaluation of the LegacyAppeal.find(...) # was previously causing this code to insert legacy appeal attributes that corresponded to NULL ID fields. puts "cache_legacy_appeals" legacy_appeals = LegacyAppeal.includes(:available_hearing_locations) .where(id: open_appeals_from_tasks(LegacyAppeal.name)).limit(10) ### limit to a few all_vacols_ids = legacy_appeals.pluck(:vacols_id).flatten puts "vacols_ids count: #{all_vacols_ids.count}" cache_postgres_data_start = Time.zone.now cache_legacy_appeal_postgres_data(legacy_appeals) datadog_report_time_segment(segment: "cache_legacy_appeal_postgres_data", start_time: cache_postgres_data_start) ### cache_vacols_data_start = Time.zone.now ### cache_legacy_appeal_vacols_data(all_vacols_ids) ### datadog_report_time_segment(segment: "cache_legacy_appeal_vacols_data", start_time: cache_vacols_data_start) end def cache_legacy_appeal_postgres_data(legacy_appeals) # this transaction times out so let's try to do this in batches ### limit to a subset of appeals; may need to increase in order to get to an appeal that causes a BGS error legacy_appeals.first(TEST_LIMIT).in_groups_of(POSTGRES_BATCH_SIZE, false) do |batch_legacy_appeals| values_to_cache = batch_legacy_appeals.map do |appeal| puts "batch_legacy_appeals: #{appeal.id}" regional_office = RegionalOffice::CITIES[appeal.closest_regional_office] { vacols_id: appeal.vacols_id, appeal_id: appeal.id, appeal_type: LegacyAppeal.name, closest_regional_office_city: regional_office ? regional_office[:city] : COPY::UNKNOWN_REGIONAL_OFFICE, closest_regional_office_key: regional_office ? appeal.closest_regional_office : COPY::UNKNOWN_REGIONAL_OFFICE, docket_type: appeal.docket_name, # "legacy" power_of_attorney_name: poa_representative_name_for(appeal), suggested_hearing_location: appeal.suggested_hearing_location&.formatted_location } end puts "import values_to_cache: #{values_to_cache.count}" CachedAppeal.import values_to_cache, on_duplicate_key_update: { conflict_target: [:appeal_id, :appeal_type], columns: [ :closest_regional_office_city, :closest_regional_office_key, :vacols_id, :docket_type, :power_of_attorney_name, :suggested_hearing_location ] } increment_appeal_count(batch_legacy_appeals.length, LegacyAppeal.name) end end # bypass PowerOfAttorney model completely and always prefer BGS cache def poa_representative_name_for(appeal) bgs_poa = fetch_bgs_power_of_attorney_by_file_number(appeal.veteran_file_number) # both representative_name calls can result in BGS connection error bgs_poa&.representative_name || appeal.representative_name rescue Errno::ECONNRESET => error puts "RESCUED error for: #{appeal.id}" warning_msgs << "#{appeal.class.name} #{appeal.id}: #{error}" if warning_msgs.count < 100 nil end def warning_msgs @warning_msgs ||= [] end def log_warning slack_msg = "[WARN] UpdateCachedAppealsAttributesJob first 100 warnings: \n#{warning_msgs.join("\n")}" slack_service.send_notification(slack_msg) end end # Then run it. Keep an eye on #appeals-job-alerts Slack channel UpdateCachedAppealsAttributesJob.perform_now ``` You may see a bunch of `ROLLBACK`s due to BGS like the following, but the job should keep running. ```ruby [2020-10-22 12:13:57 -0400] BgsPowerOfAttorney Load (1.6ms) SELECT "bgs_power_of_attorneys".* FROM "bgs_power_of_attorneys" WHERE "bgs_power_of_attorneys"."file_number" = $1 LIMIT $2 ... [2020-10-22 12:13:57 -0400] (1.0ms) BEGIN [2020-10-22 12:13:57 -0400] (1.1ms) ROLLBACK ``` and ``` [2020-10-22 12:16:28 -0400] FINISHED BGS: fetch poa for file number: ... [2020-10-22 12:16:28 -0400] (2.6ms) ROLLBACK ```

ferristseng added Product: caseflow-queue Type: Bug Integration: BGS BGS integration Team: Echo 🐬 Source: Sentry Alert created because of a Sentry alert Priority: Medium Blocking issue w/workaround, or "second in" priority for new work. labels Oct 13, 2020

lomky mentioned this issue Oct 16, 2020

Echo engineering team lines of effort 2020 #13175

Closed

lomky added this to the PI5-S6: XXX Sprint milestone Oct 19, 2020

yoomlam self-assigned this Oct 19, 2020

yoomlam mentioned this issue Oct 22, 2020

Allow UpdateCachedAppealsAttributesJob to complete and send warnings #15482

Merged

2 tasks

va-bot closed this as completed in #15482 Oct 26, 2020

yoomlam mentioned this issue Nov 6, 2020

Cache hearings related fields selectively #15577

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

ferristseng commented Oct 13, 2020

lomky commented Oct 15, 2020

yoomlam commented Oct 15, 2020

yoomlam commented Oct 23, 2020 •

edited

Loading

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

Comments

ferristseng commented Oct 13, 2020

Description

Acceptance criteria

Background/context/resources

Technical notes

lomky commented Oct 15, 2020

yoomlam commented Oct 15, 2020

yoomlam commented Oct 23, 2020 • edited Loading

yoomlam commented Oct 23, 2020 •

edited

Loading