Add plugin mechanism for dataset-specific preprocessing in qualx #1148

Merged
leewyang merged 5 commits into NVIDIA:dev from qualx_plugin on Jun 28, 2024

Conversation

leewyang
Collaborator

This PR adds a plugin mechanism to invoke dataset-specific code to modify the pandas dataframe returned by the qualx load_profiles() function. This is intended to allow custom handlers for one-off cases which shouldn't be introduced into the main codebase.

The path to the plugin module should be specified within the dataset JSON file with the "load_profiles_hook" key, e.g.

{
    "nds": {
        "eventlogs": [
            "/path/to/eventlogs"
        ],
        "app_meta": { ... },
        "load_profiles_hook": "/path/to/plugin/module.py"
    }
}
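
As a rough illustration of how the "load_profiles_hook" key could be read from a dataset definition, here is a minimal sketch. The helper name find_hook_paths is hypothetical and not part of qualx, and the empty "app_meta" object is only a placeholder for the elided contents above.

```python
import json

# Hypothetical dataset definition mirroring the example above.
# "app_meta" is left empty here purely as a placeholder.
DATASET_JSON = """
{
    "nds": {
        "eventlogs": ["/path/to/eventlogs"],
        "app_meta": {},
        "load_profiles_hook": "/path/to/plugin/module.py"
    }
}
"""


def find_hook_paths(datasets: dict) -> dict:
    """Map each dataset name to its optional hook path (None if the key is absent)."""
    return {name: cfg.get("load_profiles_hook") for name, cfg in datasets.items()}


hook_paths = find_hook_paths(json.loads(DATASET_JSON))
print(hook_paths)
```

Datasets that omit the key simply map to None in this sketch, so the hook stays optional.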

The plugin module should define a function with the following signature:

import pandas as pd

def load_profiles_hook(df: pd.DataFrame) -> pd.DataFrame:
    # add dataset-specific modifications here, then return the dataframe
    return df
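
For readers unfamiliar with path-based plugins, the sketch below shows one way a module could be loaded from a file path and its hook applied to a dataframe, using the standard importlib machinery. This is not the PR's actual implementation: load_hook, apply_hook, the module name "qualx_plugin_example", and the example plugin body are all assumptions made for illustration.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical plugin source; a real plugin would live at the path named in the dataset JSON.
PLUGIN_SOURCE = '''
import pandas as pd

def load_profiles_hook(df: pd.DataFrame) -> pd.DataFrame:
    # example dataset-specific fix: drop rows with a missing appName
    return df.dropna(subset=["appName"])
'''


def load_hook(path: str):
    """Import the module at `path` and return its load_profiles_hook, or None if undefined."""
    spec = importlib.util.spec_from_file_location("qualx_plugin_example", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plugin module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "load_profiles_hook", None)


def apply_hook(df: pd.DataFrame, path) -> pd.DataFrame:
    """Apply the plugin hook when a path is configured; otherwise return df unchanged."""
    hook = load_hook(path) if path else None
    return hook(df) if hook else df


with tempfile.TemporaryDirectory() as tmp:
    plugin_path = os.path.join(tmp, "module.py")
    with open(plugin_path, "w") as f:
        f.write(PLUGIN_SOURCE)
    profiles = pd.DataFrame({"appName": ["q1", None, "q2"], "duration": [10, 20, 30]})
    result = apply_hook(profiles, plugin_path)
    unchanged = apply_hook(profiles, None)
```

Keeping the hook optional means datasets without a plugin pass through untouched.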

Changes

  1. Add plugin mechanism for dataset-specific manipulation of the profile dataframe.
  2. Move injection of the "jobName" from the "description" field to a suffix of the "appName" field. This allows the "description" field to retain its original value for inferred app_meta cases, which can be useful inside the load_profiles_hook.
  3. Strip the injected "jobName" when filtering out test sets by "appName".
  4. Add --output-sql-ids-aligned argument to Profiler invocations (for future use).
  5. Fix logger deprecation warnings.
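
Items 2 and 3 above can be sketched roughly as follows. The PR text does not state how the "jobName" suffix is delimited, so the ":" separator and all three function names here are assumptions for illustration only, not the qualx implementation.

```python
SEP = ":"  # hypothetical separator; the actual suffix format is not given in this PR


def inject_job_name(app_name: str, job_name: str) -> str:
    """Append the job name to the app name as a suffix."""
    return f"{app_name}{SEP}{job_name}"


def strip_job_name(app_name: str) -> str:
    """Remove the injected suffix, if any, to recover the original app name."""
    base, sep, _ = app_name.rpartition(SEP)
    return base if sep else app_name


def filter_out_test_apps(app_names, test_app_names):
    """Drop entries whose original (suffix-stripped) app name is in the test set."""
    test = set(test_app_names)
    return [name for name in app_names if strip_job_name(name) not in test]


tagged = [
    inject_job_name("q1", "jobA"),
    inject_job_name("q2", "jobA"),
    inject_job_name("q3", "jobB"),
]
remaining = filter_out_test_apps(tagged, ["q2"])
```

Comparing against the stripped name is what lets a test-set list written with plain app names keep matching after the suffix is injected.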

Test

The following commands have been tested:

Internal Usage:

python qualx_main.py preprocess
python qualx_main.py train
python qualx_main.py evaluate
python qualx_main.py compare

@leewyang leewyang added the user_tools Scope the wrapper module running CSP, QualX, and reports (python) label Jun 27, 2024
@leewyang leewyang self-assigned this Jun 27, 2024
@amahussein
Collaborator

python qualx_main.py train
@leewyang
There is a CLI for training, spark_rapids train, right? Just making sure that the CLI is not falling behind and remains usable.

@leewyang
Collaborator Author

leewyang commented Jun 27, 2024

Yes, I recently tested spark_rapids train CLI per #1140. It pretty much just wraps the same code, so I think it's fine.

Signed-off-by: Lee Yang <[email protected]>
Collaborator

@amahussein amahussein left a comment

I am not familiar with the context to approve the PR.
Perhaps let's wait for @mattahrens' review.
@leewyang is this blocking you in any way?

Collaborator

@eordentlich eordentlich left a comment

👍

@leewyang leewyang merged commit 2eef904 into NVIDIA:dev Jun 28, 2024
14 checks passed
@leewyang leewyang deleted the qualx_plugin branch June 28, 2024 17:14
Labels
user_tools Scope the wrapper module running CSP, QualX, and reports (python)
3 participants