
Move explorer execution to background #1504

Merged: 2 commits into main from odjuricic/explorer-background on Dec 13, 2024
Conversation

@odjuricicTT (Contributor) commented on Dec 4, 2024

The Explorer execute command now spawns a new thread in which all model compilation and execution is handled. Results are stored as state on the server and are sent to the frontend with the next convert command.

Additionally:

  • Use subprocesses for compile and to_flatbuffer so that asserts do not kill the server
  • Set up high-level error messages for execution steps
  • Send incremental logs to the frontend
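
A minimal sketch of the flow described above, assuming hypothetical names (ModelRunner, compile_and_run, get_status); the real server code differs:

import threading


def compile_and_run(model_path):
    # Placeholder for the real compile + execute pipeline (hypothetical).
    return {"graphs": []}


class ModelRunner:
    # Sketch only: run compilation/execution in a background thread and keep
    # results as server-side state until the frontend asks for them.
    def __init__(self):
        self.progress = 0
        self.error = None
        self.result = None

    def run(self, model_path):
        # Spawn a worker so the HTTP handler can return immediately.
        worker = threading.Thread(target=self._execute, args=(model_path,), daemon=True)
        worker.start()

    def _execute(self, model_path):
        try:
            self.result = compile_and_run(model_path)
            self.progress = 100
        except Exception as e:
            self.error = str(e)

    def get_status(self):
        # Polled by the frontend; fields match the status dict used later in this PR.
        return {"isDone": self.progress == 100, "progress": self.progress,
                "total": 100, "error": self.error}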

assert result.ok
if "error" in result.json():
    print(result.json())
    assert False


def wait_for_execution_to_finish():
    for _ in range(1000):  # Try for up to 100 seconds
Contributor

Test model compilation should finish in under 100s? That might not hold in the future. Would you want to make the wait time configurable?

Contributor Author

Right. I will clean this up a bit and leave a clear error message if we time out. But I wouldn't increase this much until we have bigger models, so that we catch hang-like bugs faster for now.

I was hitting the metal regression locally while testing this PR, which is why I added timeouts to all requests in the tests.

Contributor

What I meant was defining the wait time per test via some attribute, rather than having a single value.

Contributor Author

Makes sense. Will add.
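
One way the per-test attribute could look (a sketch with assumed names, not the code in this PR): give the polling helper a timeout parameter with a small default and override it only for tests that need more time.

import time


def wait_for_execution_to_finish(get_status, timeout_seconds=100, poll_interval=0.1):
    # get_status is assumed to be a callable returning the server status dict.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        if status.get("error"):
            raise RuntimeError(status["error"])
        if status.get("isDone"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"Execution did not finish within {timeout_seconds}s")


# A test with a bigger model could then pass its own limit:
# wait_for_execution_to_finish(get_status, timeout_seconds=300)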

)
self.reset_state()

options = [
Contributor

Here the new override class should be plugged in with its toString method, also replacing the hardcoded values below. Of course, it needs to be populated from TTE frontend data first, feeding in the override values.

Contributor Author

Yes.
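
A rough illustration of the idea, with a hypothetical OptimizerOverrides class and option name; the real override handling would be populated from TTE frontend data:

from dataclasses import dataclass, field


@dataclass
class OptimizerOverrides:
    # Hypothetical container for override values fed from the TTE frontend.
    output_layouts: dict = field(default_factory=dict)  # op name -> layout string

    def to_string(self) -> str:
        # Render the overrides in the form expected by the pipeline option string.
        return ",".join(f"{op}={layout}" for op, layout in self.output_layouts.items())


# Usage: replace the hardcoded entries with the rendered string.
overrides = OptimizerOverrides(output_layouts={"add_0": "l1_interleaved"})
options = ["override-output-layout=" + overrides.to_string()]  # option name is an assumption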

error = "Error running compile TTIR to TTNN Backend Pipeline"
self.log(error)
raise ExplorerRunException(error)
self.progress = 20
Contributor

This will be more than 20 in the future, a lot more. :) It would be cool to extend it later with more granularity.

Contributor Author

Yes :) Once we have some compile and execution output, we can build a good indication of progress based on that.
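
For finer granularity later, the progress value could be derived from named stages rather than hardcoded constants; a small sketch with assumed stage names and weights:

# Hypothetical stage weights; real values would be driven by compile/execute output.
PROGRESS_BY_STAGE = {
    "ttir_to_ttnn": 20,
    "to_flatbuffer": 40,
    "run_on_device": 80,
    "collect_results": 100,
}


def update_progress(runner, stage):
    # Sketch only: set runner.progress from the stage that just completed.
    runner.log(f"Finished stage: {stage}")
    runner.progress = PROGRESS_BY_STAGE[stage]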

@@ -3,7 +3,7 @@ include(ExternalProject)
set(TT_EXPLORER_SCRIPT ${CMAKE_CURRENT_SOURCE_DIR}/run.py)
set(TTMLIR_BUILD_BIN_DIR ${TTMLIR_BINARY_DIR}/bin)

set(MODEL_EXPLORER_VERSION "95d79ec933643c3b145537ce6fd5d9f0e735683d")
set(MODEL_EXPLORER_VERSION "0ac804da35e6d42b2e946879e3b3bdd8bd17254a")
Contributor

We should probably coordinate with the model-explorer repo to sync this version to the one with the logging window (when it gets merged in)

@@ -70,9 +74,23 @@ def execute(
memory_layout_analysis_enabled = False
memory_layout_analysis_policy = None

-ttnn_ir = self.model_runner.run(
+self.model_runner.run(
model_path, memory_layout_analysis_enabled, memory_layout_analysis_policy
)

# TODO(odjuricic, #933) Parse TTNN IR and return the post optimized graph.
return {"graphs": []}
Contributor

Just to confirm: this is the expected return when execution has started, and if something goes wrong it should throw an error, right?

Contributor Author

Yes. Is that an OK flow from the frontend side?

Contributor

Yes, it is okay. I just wanted to confirm, as my previous implementation was expecting some response.

I made a change to accept an empty graphs array as the success response :)

"isDone": done,
"progress": progress,
"total": 100,
"error": error,
@madcampos (Contributor) commented on Dec 4, 2024

Would it be interesting to add properties for sending logs to the UI, i.e. actual text that gives a better picture of what is going on?

Here is the current interface for the UI, let me know if there is some property that doesn't make sense: https://github.com/tenstorrent/model-explorer/blob/724abaf54e3ae87b70b8869d1cf7716bbc565a63/src/ui/src/common/extension_command.ts#L99-L107

Contributor Author

Adding that today. Left a comment on your PR to discuss these properties.
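
As a sketch of what the status payload could look like once log text is included (the field name for logs is an assumption, not the final interface):

def get_status_payload(runner):
    # Sketch only: include accumulated log text so the UI can show what is going on.
    return {
        "isDone": runner.done,
        "progress": runner.progress,
        "total": 100,
        "error": runner.error,
        "stdout": runner.log_text,  # assumed field name for incremental logs
    }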

@odjuricicTT force-pushed the odjuricic/explorer-background branch 2 times, most recently from 405d848 to 83e6cb8 on December 11, 2024 at 11:40
-graph = mlir.build_graph(module)
-return {"graphs": [graph]}
+graph, perf_data = mlir.build_graph(module, perf_trace)
+return {"graphs": [graph], "perf_data": perf_data}
@vprajapati-tt (Contributor) commented on Dec 11, 2024

The perf_data will not be returned because it needs to be wrapped into an adapter_response and kept in graphs. It should look something like:

return to_adapter_format({
  "graphs": [graph],
  "perf_data": perf_data
})

In case the graph isn't passed, we should escalate this and change how the frontend parses the response as well, since the graphs property needs to exist.

Contributor Author

Will try it out. If that is not the case, I'll try to merge now to unblock your further work and create an issue for this.

@odjuricicTT (Contributor Author)

@svuckovicTT @mtopalovicTT The codeowners file got messed up and your review is now required here. Please take a look. I've put in a fix for codeowners in a separate PR.

@odjuricicTT force-pushed the odjuricic/explorer-background branch from 83e6cb8 to 303beef on December 12, 2024 at 14:55
@odjuricicTT enabled auto-merge (squash) on December 12, 2024 at 15:05
@odjuricicTT merged commit 30a7a9e into main on Dec 13, 2024
21 checks passed
azecevicTT pushed a commit that referenced this pull request on Dec 17, 2024