Include status information for failed event logs in core tool #1187

Merged (3 commits) on Jul 15, 2024

@parthosa commented Jul 12, 2024

Fixes #1164. This PR adds status entries for event logs that failed due to an exception (e.g. File Not Found, Authentication, or any other CSP exception) to the rapids_4_spark_qualification_output_status.csv file.

CMD:

spark_rapids qualification -p databricks-azure --eventlogs \
 "test_log,~/Work/event-logs/*-cpu/*,abfss://[email protected]/path/app-20230330112539-000" \
  --tools_jar $SPARK_RAPIDS_TOOLS_JAR

Output

File: qual_2024xxx/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_status.csv

After this change:

|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Event Log                                                                   | Status  | AppID                   | Description                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| file:/path/event-logs/databricks-azure-cpu/eventlog                         | SUCCESS | app-20231212234153-0000 | Took 96633ms to process                                                                                                                                                                                                                                                                                                                                                                                                               |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| file:/path/event-logs/dataproc-l4-cpu/eventlog-test-2                       | SKIPPED | N/A                     | GpuEventLogException: Cannot parse event logs from GPU run: skipping this file                                                                                                                                                                                                                                                                                                                                                        |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| test_log                                                                    | FAILURE | N/A                     | File test_log does not exist                                                                                                                                                                                                                                                                                                                                                                                                          |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| file:/path/event-logs/databricks-aws-cpu/eventlog                           | SUCCESS | app-20231212214826-0000 | Took 2616ms to process                                                                                                                                                                                                                                                                                                                                                                                                                |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| abfss://[email protected]/path/app-20230330112539-000 | FAILURE | N/A                     | Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://test-bucket.dfs.core.windows.net/test-c/?upn=false&action=getAccessControl&timeout=90                                                                                                                                                                                                                                               |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| gs://test-gs-bucket/path/application_1666141048720_000                      | FAILURE | N/A                     | Error reading credential file from environment variable GOOGLE_APPLICATION_CREDENTIALS, value '/Users/psarthi/.config/gcloud/application_default_credentials.json': 400 Bad Request\nPOST https://oauth2.googleapis.com/token\n{\n  "error" : "invalid_grant",\n  "error_description" : "reauth related error (invalid_rapt)",\n  "error_uri" : "https://support.google.com/a/answer/9368756",\n  "error_subtype" : "invalid_rapt"\n} |
|-----------------------------------------------------------------------------|---------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

Before this change:

|-------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------|
| Event Log                                             | Status  | AppID                   | Description                                                                    |
|-------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------|
| file:/path/event-logs/databricks-azure-cpu/eventlog   | SUCCESS | app-20231212234153-0000 | Took 99079ms to process                                                        |
|-------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------|
| file:/path/event-logs/dataproc-l4-cpu/eventlog-test-2 | SKIPPED | N/A                     | GpuEventLogException: Cannot parse event logs from GPU run: skipping this file |
|-------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------|
| file:/path/event-logs/databricks-aws-cpu/eventlog     | SUCCESS | app-20231212214826-0000 | Took 2866ms to process                                                         |
|-------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------|

Changes:

  • Added a case class FailedEventLog as a wrapper for failed event logs
  • Modified getEventLogInfo() to return FailedEventLog for failed cases
  • Refactored the Qualification/Profiling (Q/P) tools to move common variables to a base class:
    • Created ToolBase, which initialises some of the variables common to both Q/P tools
    • Defined a method handleFailedEventLogs() to add failed event logs to the status report
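
The changes listed above can be pictured with a minimal Scala sketch. Only the names FailedEventLog, getEventLogInfo, ToolBase and handleFailedEventLogs come from this PR; every field, signature, and helper type below (EventLogInfo, ValidEventLog, StatusRow, the `exists` parameter) is an assumption made for illustration and may not match the merged code.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Assumed common shape for whatever the path resolver returns.
sealed trait EventLogInfo { def path: String }
case class ValidEventLog(path: String) extends EventLogInfo
// Wrapper that keeps the failure reason next to the path that caused it.
case class FailedEventLog(path: String, reason: String) extends EventLogInfo

// Assumed row layout, mirroring the four columns of the status report.
case class StatusRow(eventLog: String, status: String, appId: String, description: String)

object EventLogResolver {
  // Stand-in for getEventLogInfo(): rather than dropping a path when the
  // lookup throws, wrap the exception message in a FailedEventLog.
  // `exists` is a hypothetical hook replacing the real file-system check.
  def getEventLogInfo(path: String, exists: String => Boolean): EventLogInfo =
    try {
      if (!exists(path)) {
        throw new java.io.FileNotFoundException(s"File $path does not exist")
      }
      ValidEventLog(path)
    } catch {
      case NonFatal(e) => FailedEventLog(path, e.getMessage)
    }
}

// Assumed shape of the base class shared by the Qualification and Profiling tools.
abstract class ToolBase {
  private val statusRows: ArrayBuffer[StatusRow] = ArrayBuffer.empty

  // Adds one FAILURE row per failed event log; valid logs are left to the
  // normal processing path, which is not modelled here.
  def handleFailedEventLogs(logs: Seq[EventLogInfo]): Unit =
    logs.foreach {
      case FailedEventLog(path, reason) =>
        statusRows += StatusRow(path, "FAILURE", "N/A", reason)
      case _ => ()
    }

  def report: Seq[StatusRow] = statusRows.toSeq
}
```

In this sketch, resolving `test_log` against a file system that does not contain it yields a FailedEventLog, which handleFailedEventLogs() turns into a FAILURE row with the exception message as its description.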

Testing

  • Evaluated the status report for both the Profiling and Qualification tools.

Notes

  • The next step would be to propagate this information to the Python tool.
  • Kept the error message at full length to avoid losing critical information.
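
Keeping the full message means a description can span several lines, as in the GOOGLE_APPLICATION_CREDENTIALS row above, where line breaks appear as `\n`. The following is a minimal sketch of one way to flatten such a message into a single CSV field. The object StatusFormat, the helper toCsvField, and the escaping rules are assumptions for illustration, not the tool's actual formatter.

```scala
object StatusFormat {
  // Hypothetical helper: flatten a multi-line error message into one CSV field.
  def toCsvField(message: String): String = {
    // Replace real line breaks with a literal backslash-n so the row stays on one line.
    val oneLine = message.replace("\r\n", "\\n").replace("\n", "\\n")
    // Quote the field and double any embedded quotes, following RFC 4180 conventions.
    "\"" + oneLine.replace("\"", "\"\"") + "\""
  }
}
```

Nothing is truncated in this approach, so the whole error text stays available to whoever reads the report.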

@parthosa added the labels bug (Something isn't working) and core_tools (Scope the core module (scala)) Jul 12, 2024
@parthosa self-assigned this Jul 12, 2024

@amahussein left a comment

Thanks @parthosa !
LGTME.
Missing copyrights

@tgravescs

so this is just for the scala code then?
is there an issue for python code to rollup? are there cases where python code could fail or skip something that scala code never sees?

Signed-off-by: Partho Sarthi <[email protected]>

parthosa commented Jul 15, 2024

Yes @tgravescs. This PR is only for the Scala code.

is there an issue for python code to rollup?

Follow-up issue for the Python tool to roll up this information: #1126

# Conflicts:
#	core/src/main/scala/com/nvidia/spark/rapids/tool/EventLogPathProcessor.scala
@parthosa requested a review from amahussein July 15, 2024 16:45

@amahussein left a comment

Thanks @Partho

@tgravescs

My question is answered; don't wait on me for anything.

Development

Successfully merging this pull request may close these issues.

[BUG] Profiling/Qualification Tool does not contain status info for failed event log
3 participants