Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

Closed
1 task
ferristseng opened this issue Oct 13, 2020 · 3 comments · Fixed by #15482
Closed
1 task

Recover from SSL_connect error in UpdateCachedAppealsAttributesJob #15416

ferristseng opened this issue Oct 13, 2020 · 3 comments · Fixed by #15482
Assignees
Labels
Integration: BGS BGS integration Priority: Medium Blocking issue w/workaround, or "second in" priority for new work. Product: caseflow-queue Source: Sentry Alert created because of a Sentry alert Team: Echo 🐬 Type: Bug

Comments

@ferristseng
Copy link
Contributor

Description

We are occasionally seeing errors while the UpdateCachedAppealsAttributeJob is running that is causing it to exit before it finishes. This has been becoming increasingly more consistent.

This ticket is to resolve one of the more recent issues, which happens when the UpdateCachedAppealsAttribute job tries to fetch the power of attorney from BGS, and can't because of an SSL_connect error from BGS. Judging by the error message, this seems to be a transient error, and Caseflow could probably recover from it more gracefully. This also could be an issue where we are overloading the BGS servers, and might need to think about ways to reduce the number of calls we make to BGS.

Acceptance criteria

  • Recover from the SSL_connect error

Background/context/resources

Full error stack trace:

Errno::ECONNRESET: Connection reset by peer - SSL_connect
  from httpclient/ssl_socket.rb:103:in `connect'
  from httpclient/ssl_socket.rb:103:in `ssl_connect'
  from httpclient/ssl_socket.rb:41:in `initialize'
  from httpclient/ssl_socket.rb:26:in `new'
  from httpclient/ssl_socket.rb:26:in `create_socket'
  from httpclient/session.rb:752:in `block in connect'
  from timeout.rb:93:in `block in timeout'
  from timeout.rb:103:in `timeout'
  from httpclient/session.rb:748:in `connect'
  from httpclient/session.rb:511:in `query'
  from httpclient/session.rb:177:in `query'
  from httpclient.rb:1242:in `do_get_block'
  from new_relic/agent/instrumentation/httpclient.rb:41:in `do_get_block_with_newrelic'
  from httpclient.rb:1019:in `block in do_request'
  from httpclient.rb:1133:in `protect_keep_alive_disconnected'
  from httpclient.rb:1014:in `do_request'
  from httpclient.rb:856:in `request'
  from httpi/adapter/httpclient.rb:26:in `request'
  from httpi.rb:161:in `request'
  from httpi.rb:133:in `post'
  from savon/operation.rb:94:in `block in call_with_logging'
  from savon/request_logger.rb:12:in `log'
  from savon/operation.rb:94:in `call_with_logging'
  from savon/operation.rb:54:in `call'
  from savon/client.rb:36:in `call'
  from bgs/base.rb:152:in `request'
  from bgs/services/person.rb:19:in `find_by_ssn'
  from app/services/external_api/bgs_service.rb:124:in `block in fetch_person_by_ssn'
  from app/services/metrics_service.rb:14:in `block in record'
  from benchmark.rb:293:in `measure'
  from app/services/metrics_service.rb:13:in `record'
  from app/services/external_api/bgs_service.rb:121:in `fetch_person_by_ssn'
  from app/models/person.rb:141:in `fetch_bgs_record_by_ssn'
  from app/models/person.rb:124:in `fetch_bgs_record'
  from app/models/concerns/associated_bgs_record.rb:68:in `try_and_retry_bgs_record'
  from app/models/concerns/associated_bgs_record.rb:62:in `bgs_record'
  from app/models/concerns/associated_bgs_record.rb:104:in `not_found?'
  from app/models/person.rb:113:in `found?'
  from app/models/person.rb:38:in `find_or_create_by_ssn'
  from app/models/legacy_appeal.rb:315:in `person_for_appellant'
  from app/models/legacy_appeal.rb:1022:in `claimant_participant_id'
  from app/models/legacy_appeal.rb:1016:in `new_power_of_attorney'
  from app/models/legacy_appeal.rb:363:in `power_of_attorney'
  from app/models/legacy_appeal.rb:355:in `legacy_appeal_representative'
  from app/models/legacy_appeal.rb:340:in `representative_name'
  from app/jobs/update_cached_appeals_attributes_job.rb:114:in `block (2 levels) in cache_legacy_appeal_postgres_data'
  from app/jobs/update_cached_appeals_attributes_job.rb:103:in `map'
  from app/jobs/update_cached_appeals_attributes_job.rb:103:in `block in cache_legacy_appeal_postgres_data'
  from active_support/core_ext/array/grouping.rb:39:in `block in in_groups_of'
  from active_support/core_ext/array/grouping.rb:39:in `each'
  from active_support/core_ext/array/grouping.rb:39:in `each_slice'
  from active_support/core_ext/array/grouping.rb:39:in `in_groups_of'
  from active_record/relation/delegation.rb:71:in `in_groups_of'
  from app/jobs/update_cached_appeals_attributes_job.rb:102:in `cache_legacy_appeal_postgres_data'
  from app/jobs/update_cached_appeals_attributes_job.rb:91:in `cache_legacy_appeals'
  from app/jobs/update_cached_appeals_attributes_job.rb:23:in `perform'
  from active_job/execution.rb:39:in `block in perform_now'
  from active_support/callbacks.rb:109:in `block in run_callbacks'
  from raven/integrations/rails/active_job.rb:18:in `capture_and_reraise_with_sentry'
  from raven/integrations/rails/active_job.rb:12:in `block (2 levels) in included'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from new_relic/agent/tracer.rb:72:in `in_transaction'
  from new_relic/agent/instrumentation/active_job.rb:66:in `run_in_transaction'
  from new_relic/agent/instrumentation/active_job.rb:55:in `perform'
  from new_relic/agent/instrumentation/active_job.rb:23:in `block (4 levels) in <top (required)>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from i18n.rb:313:in `with_locale'
  from active_job/translation.rb:9:in `block (2 levels) in <module:Translation>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_job/logging.rb:26:in `block (4 levels) in <module:Logging>'
  from active_support/notifications.rb:168:in `block in instrument'
  from active_support/notifications/instrumenter.rb:23:in `instrument'
  from active_support/notifications.rb:168:in `instrument'
  from active_job/logging.rb:25:in `block (3 levels) in <module:Logging>'
  from active_job/logging.rb:46:in `block in tag_logger'
  from active_support/tagged_logging.rb:71:in `block in tagged'
  from active_support/tagged_logging.rb:28:in `tagged'
  from active_support/tagged_logging.rb:71:in `tagged'
  from active_job/logging.rb:46:in `tag_logger'
  from active_job/logging.rb:22:in `block (2 levels) in <module:Logging>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_support/callbacks.rb:136:in `run_callbacks'
  from active_job/execution.rb:38:in `perform_now'
  from active_job/execution.rb:24:in `block in execute'
  from active_support/callbacks.rb:109:in `block in run_callbacks'
  from active_job/railtie.rb:28:in `block (4 levels) in <class:Railtie>'
  from active_support/execution_wrapper.rb:87:in `wrap'
  from active_support/reloader.rb:73:in `block in wrap'
  from active_support/execution_wrapper.rb:87:in `wrap'
  from active_support/reloader.rb:72:in `wrap'
  from active_job/railtie.rb:27:in `block (3 levels) in <class:Railtie>'
  from active_support/callbacks.rb:118:in `instance_exec'
  from active_support/callbacks.rb:118:in `block in run_callbacks'
  from active_support/callbacks.rb:136:in `run_callbacks'
  from active_job/execution.rb:22:in `execute'
  from shoryuken/extensions/active_job_adapter.rb:78:in `perform'
  from shoryuken/processor.rb:20:in `block in process'
  from shoryuken/middleware/chain.rb:98:in `block in invoke'
  from app/jobs/middleware/job_request_store_middleware.rb:8:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from app/jobs/middleware/job_monitoring_middleware.rb:9:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/active_record.rb:6:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/auto_extend_visibility.rb:10:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/auto_delete.rb:6:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/exponential_backoff_retry.rb:16:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/server/timing.rb:14:in `block in call'
  from shoryuken/logging.rb:20:in `with_context'
  from shoryuken/middleware/server/timing.rb:8:in `call'
  from shoryuken/middleware/chain.rb:100:in `block in invoke'
  from shoryuken/middleware/chain.rb:103:in `invoke'
  from shoryuken/processor.rb:19:in `process'
  from shoryuken/processor.rb:8:in `process'
  from shoryuken/manager.rb:73:in `block in assign'
  from concurrent/executor/safe_task_executor.rb:24:in `block in execute'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `block in synchronize'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `synchronize'
  from concurrent/synchronization/mutex_lockable_object.rb:41:in `synchronize'
  from concurrent/executor/safe_task_executor.rb:19:in `execute'
  from concurrent/promise.rb:563:in `block in realize'
  from concurrent/executor/ruby_thread_pool_executor.rb:363:in `run_task'
  from concurrent/executor/ruby_thread_pool_executor.rb:352:in `block (3 levels) in create_worker'
  from concurrent/executor/ruby_thread_pool_executor.rb:335:in `loop'
  from concurrent/executor/ruby_thread_pool_executor.rb:335:in `block (2 levels) in create_worker'
  from concurrent/executor/ruby_thread_pool_executor.rb:334:in `catch'
  from concurrent/executor/ruby_thread_pool_executor.rb:334:in `block in create_worker'

Slack Thread: https://dsva.slack.com/archives/CN3FQR4A1/p1602592080030200
Sentry Alert: https://sentry.prod.appeals.va.gov/department-of-veterans-affairs/caseflow/issues/11179/events/891890/?environment=production

Technical notes

@ferristseng ferristseng added Product: caseflow-queue Type: Bug Integration: BGS BGS integration Team: Echo 🐬 Source: Sentry Alert created because of a Sentry alert Priority: Medium Blocking issue w/workaround, or "second in" priority for new work. labels Oct 13, 2020
@lomky
Copy link
Contributor

lomky commented Oct 15, 2020

Standard investigation - fix if reasonable otherwise cut off new ticket. Putting at a 3.

@yoomlam
Copy link
Contributor

yoomlam commented Oct 15, 2020

@yoomlam
Copy link
Contributor

yoomlam commented Oct 23, 2020

What is the value of RequestStore.store[:application] when UpdateCachedAppealsAttributesJob is run? Why I ask:

power_of_attorney_name: (bgs_poa&.representative_name || appeal.representative_name),

calls
def use_representative_info_from_bgs?
RequestStore.store[:application] == "queue" ||
RequestStore.store[:application] == "hearings" ||
RequestStore.store[:application] == "hearing_schedule_job" ||
RequestStore.store[:application] == "idt"
end

Answer is queue_job if manually running in prod's Rails console using UpdateCachedAppealsAttributesJob.perform_now, which means appeal.representative_name is effectively appeal.vacols_representative_name rather than appeal.bgs_representative_name.

va-bot pushed a commit that referenced this issue Oct 26, 2020
…15482)

Resolves #15416 

### Description
If a BGS connection error occurs when getting the POA representative, log a warning and allow UpdateCachedAppealsAttributesJob to complete without failure and log the warnings to Slack.

The UpdateCachedAppealsAttributesJob runs many times per day. The warning logs should provide more info to determine if we want to automatically retry for certain situations.

There may be other errors caused by BGS, but we'll just rescue from `Errno::ECONNRESET` for now.

### Acceptance Criteria
- [ ] Recover from the `SSL_connect` error
- [ ] Code compiles correctly and tests pass

### Testing Plan
It's hard to test unless we know which appeal causes the BGS connection error.

In prod, monkey-patch class with code changes, skipping `cache_ama_appeals` and limiting to first few legacy appeals:
```ruby
TEST_LIMIT=1500
class UpdateCachedAppealsAttributesJob 
  def perform
    RequestStore.store[:current_user] = User.system_user
    ama_appeals_start = Time.zone.now
    ### Ignore this for testing: cache_ama_appeals
    datadog_report_time_segment(segment: "cache_ama_appeals", start_time: ama_appeals_start)

    legacy_appeals_start = Time.zone.now
    cache_legacy_appeals
    datadog_report_time_segment(segment: "cache_legacy_appeals", start_time: legacy_appeals_start)

    datadog_report_runtime(metric_group_name: METRIC_GROUP_NAME)
  rescue StandardError => error
    log_error(@start_time, error)
  else
    puts "warnings: #{warning_msgs.count}"
    log_warning unless warning_msgs.empty?
  end

### overridding to limit number of appeals
  def cache_legacy_appeals
    # Avoid lazy evaluation bugs by immediately plucking all VACOLS IDs. Lazy evaluation of the LegacyAppeal.find(...)
    # was previously causing this code to insert legacy appeal attributes that corresponded to NULL ID fields.
    puts "cache_legacy_appeals"
    legacy_appeals = LegacyAppeal.includes(:available_hearing_locations)
      .where(id: open_appeals_from_tasks(LegacyAppeal.name)).limit(10) ### limit to a few
    all_vacols_ids = legacy_appeals.pluck(:vacols_id).flatten
    puts "vacols_ids count: #{all_vacols_ids.count}"

    cache_postgres_data_start = Time.zone.now
    cache_legacy_appeal_postgres_data(legacy_appeals)
    datadog_report_time_segment(segment: "cache_legacy_appeal_postgres_data", start_time: cache_postgres_data_start)

    ### cache_vacols_data_start = Time.zone.now
    ### cache_legacy_appeal_vacols_data(all_vacols_ids)
    ### datadog_report_time_segment(segment: "cache_legacy_appeal_vacols_data", start_time: cache_vacols_data_start)
  end

  def cache_legacy_appeal_postgres_data(legacy_appeals)
    # this transaction times out so let's try to do this in batches
### limit to a subset of appeals; may need to increase in order to get to an appeal that causes a BGS error
    legacy_appeals.first(TEST_LIMIT).in_groups_of(POSTGRES_BATCH_SIZE, false) do |batch_legacy_appeals| 
      values_to_cache = batch_legacy_appeals.map do |appeal| 
        puts "batch_legacy_appeals: #{appeal.id}"
        regional_office = RegionalOffice::CITIES[appeal.closest_regional_office]
        {
          vacols_id: appeal.vacols_id,
          appeal_id: appeal.id,
          appeal_type: LegacyAppeal.name,
          closest_regional_office_city: regional_office ? regional_office[:city] : COPY::UNKNOWN_REGIONAL_OFFICE,
          closest_regional_office_key: regional_office ? appeal.closest_regional_office : COPY::UNKNOWN_REGIONAL_OFFICE,
          docket_type: appeal.docket_name, # "legacy"
          power_of_attorney_name: poa_representative_name_for(appeal),
          suggested_hearing_location: appeal.suggested_hearing_location&.formatted_location
        }
      end

      puts "import values_to_cache: #{values_to_cache.count}"
      CachedAppeal.import values_to_cache, on_duplicate_key_update: { conflict_target: [:appeal_id, :appeal_type],
                                                                      columns: [
                                                                        :closest_regional_office_city,
                                                                        :closest_regional_office_key,
                                                                        :vacols_id,
                                                                        :docket_type,
                                                                        :power_of_attorney_name,
                                                                        :suggested_hearing_location
                                                                      ] }
      increment_appeal_count(batch_legacy_appeals.length, LegacyAppeal.name)
    end
  end

  # bypass PowerOfAttorney model completely and always prefer BGS cache
  def poa_representative_name_for(appeal)
    bgs_poa = fetch_bgs_power_of_attorney_by_file_number(appeal.veteran_file_number)
    # both representative_name calls can result in BGS connection error
    bgs_poa&.representative_name || appeal.representative_name
  rescue Errno::ECONNRESET => error
    puts "RESCUED error for: #{appeal.id}"
    warning_msgs << "#{appeal.class.name} #{appeal.id}: #{error}" if warning_msgs.count < 100
    nil
  end

  def warning_msgs
    @warning_msgs ||= []
  end

  def log_warning
    slack_msg = "[WARN] UpdateCachedAppealsAttributesJob first 100 warnings: \n#{warning_msgs.join("\n")}"
    slack_service.send_notification(slack_msg)
  end
end

# Then run it. Keep an eye on #appeals-job-alerts Slack channel
UpdateCachedAppealsAttributesJob.perform_now
```

You may see a bunch of `ROLLBACK`s due to BGS like the following, but the job should keep running.
```ruby
[2020-10-22 12:13:57 -0400]   BgsPowerOfAttorney Load (1.6ms)  SELECT  "bgs_power_of_attorneys".* FROM "bgs_power_of_attorneys" WHERE "bgs_power_of_attorneys"."file_number" = $1 LIMIT $2 ...
[2020-10-22 12:13:57 -0400]    (1.0ms)  BEGIN
[2020-10-22 12:13:57 -0400]    (1.1ms)  ROLLBACK
```
and
```
[2020-10-22 12:16:28 -0400] FINISHED BGS: fetch poa for file number: ...
[2020-10-22 12:16:28 -0400]    (2.6ms)  ROLLBACK
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Integration: BGS BGS integration Priority: Medium Blocking issue w/workaround, or "second in" priority for new work. Product: caseflow-queue Source: Sentry Alert created because of a Sentry alert Team: Echo 🐬 Type: Bug
Projects
None yet
3 participants