
AutoML - Can't get training progress during image training #5553

Closed · LittleLittleCloud opened this issue Dec 15, 2020 · 1 comment · Fixed by #5554
Assignees: mstfbl
Labels: AutoML.NET (Automating various steps of the machine learning process), P1 (Priority of the issue for triage purpose: Needs to be fixed soon)

Comments

@LittleLittleCloud (Contributor) commented Dec 15, 2020

System information

  • OS version/distro: Windows 10
  • .NET version (e.g., dotnet --info): 3.1.4

Issue

  • What did you do?
  • I used the AutoML API to launch an image classification training and attached a logger to the current context in order to get training progress. However, no training progress is reported after I attach the logger and start training (see the sketch after this list).
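
A minimal repro sketch, assuming the Microsoft.ML.AutoML experiment API; `trainData` stands in for a real labeled image dataset:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

static void RunExperiment(MLContext mlContext, IDataView trainData)
{
    // Attach a logger to the current context to watch training progress.
    mlContext.Log += (sender, e) => Console.WriteLine(e.Message);

    // Kick off an AutoML experiment over the image data.
    var experiment = mlContext.Auto()
        .CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 3600);

    // Expected: progress messages print while the trial trains.
    // Actual: nothing shows, because the trial runs on a separate internal context.
    var result = experiment.Execute(trainData, labelColumnName: "Label");
}
```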

What might happen

After some investigation, I believe the error is caused by a recent change to how a trial is launched. PR #5445 creates a new context instead of reusing the current context when starting a trial. So when I subscribe to the log channel while calling the API, I am actually listening to the current context's channel, where no trial is running. And since the new context the trial runs on is not exposed externally, there is currently no way to peek at the training progress.
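
The mismatch in miniature (a sketch; the internal context's name is invented here):

```csharp
using System;
using Microsoft.ML;

// A handler attached to one MLContext never fires for messages raised on another.
var userContext = new MLContext();
userContext.Log += (s, e) => Console.WriteLine(e.Message); // caller's logger

// What #5445 effectively does inside the trial runner:
var trialContext = new MLContext();
// Training runs against trialContext, so its messages fire on
// trialContext.Log and the handler above never sees them.
```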

@mstfbl self-assigned this Dec 15, 2020
@mstfbl added the AutoML.NET and P1 labels Dec 15, 2020
@justinormont (Contributor) commented

Earlier discussion -- #5445 (review)

My initial thoughts from #5445 (comment):

We can always duplicate the logger. Or attach a logger to the new context, and when called, have it pass the message to the original context.
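
A sketch of that relay idea with hypothetical names (`CreateTrialContext` is invented for illustration; the real fix lives in #5554):

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Runtime;

// Duplicate the caller's logger onto the context the trial runner creates,
// so trial log messages remain visible to the caller.
static MLContext CreateTrialContext(EventHandler<LoggingEventArgs> userLogHandler)
{
    var trialContext = new MLContext();

    if (userLogHandler != null)
        trialContext.Log += userLogHandler; // same handler, new context

    return trialContext;
}
```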


@LittleLittleCloud: What type of message are you reading from the log? Log scraping is likely the only usable method currently.

Future

In the longer term, we may want to have each component pass along a structured status message: { rows processed, percent complete, processing duration, current step name, memory, other stats }. ML.NET conveys very little information on the status of a training job.
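
A sketch of what such a structured status message could look like; no type like this exists in ML.NET today, and the fields simply mirror the list above:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical structured status message a component could emit.
public sealed record TrainingStatus(
    long RowsProcessed,
    double PercentComplete,
    TimeSpan ProcessingDuration,
    string CurrentStepName,
    long MemoryUsageBytes,
    IReadOnlyDictionary<string, double> OtherStats);
```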

The output from MAML was sometimes sufficient (examples: 1, 2, 3, 4). These give some notion of the progress of the training job.

There are related issues on having an output verbosity level besides zero & firehose.

To quote an earlier issue comment:

As mentioned in #3235, MLContext.Log() doesn't have a verbosity selection, so it's more of a firehose.

If a verbosity argument is added to MLContext.Log(), the log output from there should be human readable to see general progress.

I believe it's still hidden within the firehose of output, and once the verbosity is scaled down, you should see messages like:

LightGBM objective=multiclassova
[7] 'Loading data for LightGBM' finished in 00:00:15.6600468.
[8] 'Training with LightGBM' started.
..................................................(00:30.58)	0/200 iterations
..................................................(01:00.9)	1/200 iterations
..................................................(01:31.2)	2/200 iterations
..................................................(02:01.4)	2/200 iterations
..................................................(02:31.9)	3/200 iterations
..................................................(03:02.5)	4/200 iterations
..................................................(03:32.9)	4/200 iterations
..................................................(04:03.6)	5/200 iterations
..................................................(04:34.4)	5/200 iterations
..................................................(05:04.8)	6/200 iterations

And naively extrapolating, there's around 2.7 hours left in the LightGBM training (6 iterations in roughly 5 minutes is about 51 s/iteration, with 194 iterations remaining).
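
Until a real verbosity argument exists, one way to scale the firehose down is to filter the Log event by message kind. A sketch, assuming ML.NET 1.5+, where LoggingEventArgs exposes Kind and Source:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Runtime;

var mlContext = new MLContext();
mlContext.Log += (sender, e) =>
{
    // Drop trace-level chatter; keep info, warnings, and errors.
    if (e.Kind != ChannelMessageKind.Trace)
        Console.WriteLine($"[{e.Source}] {e.Message}");
};
```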

@mstfbl linked a pull request Dec 16, 2020 that will close this issue
@ghost locked as resolved and limited conversation to collaborators Mar 17, 2022