[WIP] Write to mlperf run_history bigquery table for mlperf runs #120

Open
wants to merge 1 commit into
base: master
from

Conversation

raymondzouu
Collaborator

Description

Parse the TensorBoard file and, for MLPerf runs only, write the results to a new run_history table in mlperf_dataset. The schema follows the design discussed in http://shortn/_Hr3sWp59UK.

Tests

Ran maxtext_sweep_gke_example_dag http://shortn/_NApHfEATw2 and wrote metrics to mlperf_dataset http://shortn/_3CfsiDPr6v

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@raymondzouu raymondzouu force-pushed the raymondzou-mlperf-table branch 5 times, most recently from 9e2597e to ecb254e Compare February 16, 2024 00:47
@raymondzouu raymondzouu marked this pull request as ready for review February 16, 2024 03:59
@raymondzouu raymondzouu changed the title [WIP] Write to mlperf run_history bigquery table for mlperf runs Write to mlperf run_history bigquery table for mlperf runs Feb 16, 2024
Collaborator

@RissyRan RissyRan left a comment

Thank you! Could you also add a run from our team's tests to ensure this does not break the post_process step?

@@ -29,6 +29,7 @@
BENCHMARK_BQ_JOB_TABLE_NAME = "job_history"
BENCHMARK_BQ_METRIC_TABLE_NAME = "metric_history"
BENCHMARK_BQ_METADATA_TABLE_NAME = "metadata_history"
+BENCHMARK_BQ_RUN_TABLE_NAME = "run_history"
Collaborator

Do we want a name that is more specific to MLPerf, e.g. mlperf_history or mlperf_result?

num_chips: int
step_time: float
throughput: float
per_device_tflops_per_sec: float
Collaborator

I see this name changed slightly, so I would like to double-check: does it indicate teraflops?

multislice_topology: str
num_params: int
global_batch_size: int
per_device_batch_size: float
Collaborator

Per core or per device? I feel "core" is widely used.

Collaborator Author

summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
"""Process metrics and dimensions from TensorBoard file.

Args:
base_id: The unique ID for this test job.
Collaborator

Nit: remove base_id from the comment.

summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
Collaborator

Dict[str, float] and Dict[str, str]? Same comment for the function below.
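
One way to read the suggestion above, as a minimal sketch: annotate the return value as a tuple of two narrower dictionaries. The function name `process_tensorboard_summary`, its parameters, and its body are hypothetical placeholders, not the PR's code; only the `Dict[str, float]` and `Dict[str, str]` types come from the review comment.

```python
from typing import Dict, Tuple


def process_tensorboard_summary(
    scalars: Dict[str, float],
    texts: Dict[str, str],
) -> Tuple[Dict[str, float], Dict[str, str]]:
  """Hypothetical stand-in: returns (aggregated metrics, metadata)."""
  # Metrics map a tag to one aggregated float; metadata maps a tag to text.
  metrics = {tag: float(value) for tag, value in scalars.items()}
  metadata = {tag: str(value) for tag, value in texts.items()}
  return metrics, metadata
```

Compared with `Dict[str, Any]`, the narrower value types let a type checker flag a caller that treats a metadata string as a number.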

* int(metadata["max_target_length/text_summary"])
) / aggregated_metrics["perf/step_time_seconds"]
precision = (
"bfloat16"
Collaborator

Will we always have either bf16 or int8, or are other quantization options available?

Collaborator Author

For now, only bf16 and int8 are available.
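
If more quantization modes show up later, an explicit lookup that rejects unknown values may be safer than a two-way conditional. The sketch below is only an illustration: the `quantization/text_summary` tag and the helper name are assumptions, not something shown in this PR.

```python
from typing import Dict

# Assumption: only these two precisions exist for now, per the reply above.
_QUANTIZATION_TO_PRECISION = {"": "bfloat16", "int8": "int8"}


def precision_from_metadata(metadata: Dict[str, str]) -> str:
  """Maps a (hypothetical) quantization tag to a precision label."""
  # A missing tag is treated as "no quantization", i.e. bfloat16.
  quantization = metadata.get("quantization/text_summary", "")
  try:
    return _QUANTIZATION_TO_PRECISION[quantization]
  except KeyError:
    raise ValueError(f"Unsupported quantization: {quantization!r}") from None
```

With this shape, a new mode fails loudly until it is added to the table, instead of being silently recorded as bfloat16.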

description=task_test_config.test_name,
platform="Cloud",
date=datetime.datetime.now(),
base_cl="",
Collaborator

What's the plan for this metadata?

Collaborator Author

I'm not quite sure yet. I think I will need to add a change in MaxText to get this value into the TensorBoard file, but for now I will leave it blank.

ici_mesh_shape=f"[{ici_data_parallelism}, {ici_fsdp_parallelism}, {ici_sequence_parallelism}, {ici_tensor_parallelism}, {ici_autoregressive_parallelism}]",
dcn_mesh_shape=f"[{dcn_data_parallelism}, {dcn_fsdp_parallelism}, {dcn_sequence_parallelism}, {dcn_tensor_parallelism}, {dcn_autoregressive_parallelism}]",
xprof="",
mfu=0.0,
Collaborator

Will this be extracted from TensorBoard?
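
If the value is not available in TensorBoard, one alternative would be to derive it from the existing per-device TFLOP/s metric. The sketch below assumes MFU is the ratio of achieved to peak per-device TFLOP/s; the helper name and the `peak_tflops_per_sec` parameter are hypothetical, and the peak figure would have to come from the accelerator's specification.

```python
def model_flops_utilization(
    per_device_tflops_per_sec: float,
    peak_tflops_per_sec: float,
) -> float:
  """Ratio of achieved to peak per-device TFLOP/s (both in the same unit)."""
  if peak_tflops_per_sec <= 0:
    raise ValueError("peak_tflops_per_sec must be positive")
  return per_device_tflops_per_sec / peak_tflops_per_sec
```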

@raymondzouu raymondzouu assigned raymondzouu and unassigned RissyRan Feb 21, 2024
@raymondzouu raymondzouu changed the title Write to mlperf run_history bigquery table for mlperf runs [WIP] Write to mlperf run_history bigquery table for mlperf runs Mar 14, 2024

2 participants