3.6.8.29 #103

Merged
merged 29 commits into from
Sep 27, 2023

PalNilsson
Collaborator

@PalNilsson PalNilsson commented Sep 27, 2023

  • Redirecting stdout/stderr to files for trace service curls
    • This should prevent thread deadlocks in Python's standard subprocess.communicate() function when a command produces an overwhelming amount of stdout/stderr. Since subprocess.communicate() is no longer used, its built-in timeout is no longer available either; the timeout has been reimplemented with a threading timer that sets the relevant error code if necessary
    • The ‘last’ output from curl is stored in trace_curl_last.stdout/stderr and is appended to trace_curl.stdout/stderr
    • The trace_curl_last.stdout/stderr files are searched for curl errors (the curl command always returns exit code 0, even when there was an error, so the output has to be parsed)
    • A failed rucio trace curl operation is now reported with job metrics
      • Example: rucioTraceError=N
    • Increased connection timeout from 20s to 100s to be in line with panda server curl operations (where we don’t see any problems)
    • Related JIRA ticket: https://its.cern.ch/jira/browse/ATLASPANDA-835
  • Reporting prmon read_bytes/total_input_size with job metrics (‘readbyterate’)
    • Information to be used for optimizing brokerage
    • Requested by J. Elmsheuser, R. Walker
  • Extended usage of psutil
    • Job monitoring is now using psutil to discover prmon pid
    • If psutil is not available (as is the case on e.g. marenostrum), the code falls back to the old ps command
  • Added protection against expired job objects in job_monitor loop
    • Reported by W. Guan/Z. Yang
  • Updated GitHub Action workflows
    • Unit tests and flake8 are now independent workflows
    • Moved to latest flake8 version 6.1.0 for flake8 verification
    • All tests are run for python versions 3.8, 3.9, 3.10 and 3.11
  • Tested pilot running under python versions 3.9.18 and 3.11.5
    • Grid jobs are currently running under python version 3.9.14 but will soon switch to 3.9.18 to be in line with user tools (like rucio)
    • Python version 3.11.5 will be the default version on EL9
    • Requested by A. De Silva
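
The stdout/stderr redirection described in the first item can be illustrated with a minimal sketch. It is not the pilot's actual code: the function names `run_with_file_redirect` and `find_curl_errors` are hypothetical, and the error scan assumes curl's usual `curl: (NN) message` format on stderr. The command writes directly to the `*_last` files, a `threading.Timer` kills it on timeout, and the output is then appended to the cumulative files.

```python
import subprocess
import sys
import threading
from pathlib import Path


def run_with_file_redirect(cmd, workdir, timeout, label="trace_curl"):
    """Run cmd with stdout/stderr written to files rather than pipes.

    Returns (exit_code, timed_out, stdout_text, stderr_text).
    Hypothetical helper, for illustration only.
    """
    workdir = Path(workdir)
    last_out = workdir / f"{label}_last.stdout"
    last_err = workdir / f"{label}_last.stderr"
    timed_out = threading.Event()

    with open(last_out, "w") as out, open(last_err, "w") as err:
        # No pipes are involved, so a large amount of output cannot
        # fill a pipe buffer and block the child process.
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)

        def on_timeout():
            # Only flag and kill the process if it is still running.
            if proc.poll() is None:
                timed_out.set()
                proc.kill()

        timer = threading.Timer(timeout, on_timeout)
        timer.start()
        try:
            exit_code = proc.wait()
        finally:
            timer.cancel()

    stdout_text = last_out.read_text()
    stderr_text = last_err.read_text()

    # Append the latest output to the cumulative files.
    for text, suffix in ((stdout_text, "stdout"), (stderr_text, "stderr")):
        with open(workdir / f"{label}.{suffix}", "a") as cumulative:
            cumulative.write(text)

    return exit_code, timed_out.is_set(), stdout_text, stderr_text


def find_curl_errors(text):
    """Return lines that look like curl error messages (assumed format)."""
    return [line for line in text.splitlines() if line.startswith("curl: (")]
```

A caller would check the timeout flag first and then scan the stderr text, since the exit code alone is not treated as reliable here.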

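The psutil fallback mentioned under "Extended usage of psutil" could look roughly like the sketch below. This is an assumption about the general pattern, not the pilot's implementation: `find_pids_by_name` and `parse_ps_output` are hypothetical names, and matching on the bare process name is a simplification.

```python
import subprocess

try:
    import psutil
except ImportError:  # psutil may be unavailable on some systems
    psutil = None


def parse_ps_output(output, name):
    """Extract PIDs from 'ps -o pid= -o comm=' style output for a given name."""
    pids = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return sorted(pids)


def find_pids_by_name(name):
    """Return sorted PIDs of processes named `name`.

    Uses psutil when it can be imported, otherwise falls back to ps.
    """
    if psutil is not None:
        return sorted(
            proc.info["pid"]
            for proc in psutil.process_iter(["pid", "name"])
            if proc.info["name"] == name
        )
    # Fallback: ask ps for PID and command name, without header lines.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps_output(result.stdout, name)
```

Calling `find_pids_by_name("prmon")` would return a list of candidate PIDs on either code path; an empty list means no matching process was found.
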
PalNilsson and others added 29 commits September 23, 2021 14:08
Initial Pilot 2 to Pilot 3 changes
…ded additional exception handling. Cleanup
@PalNilsson PalNilsson merged commit efecbae into master Sep 27, 2023
12 checks passed