
Asynchronous rendering #3

Open
jlewi opened this issue Jan 18, 2018 · 9 comments

Comments

@jlewi commented Jan 18, 2018

Do GPUs help with movie rendering?

Is rendering a movie something we could do as a separate K8s job?

@cwbeitel (Owner)

Yes it could be done as a separate job.

I don't know whether GPUs will help with rendering. Rendering involves evaluating the model on the training environment and passing the resulting frames to ffmpeg to render. So the question is whether the bottleneck is in model eval or in these other two steps; if the former, then yes.

@cwbeitel (Owner)

It's better to have separate train and render jobs because the resources needed for the two may not be the same. Also it may be interesting to repeat renders with different parameters, e.g. different resolutions.

@jlewi (Author) commented Jan 19, 2018

Does the render job run at the end of the training job? Is there any way to run it in parallel so we don't need to wait for training to finish?

@cwbeitel (Owner)

Ah interesting that's a great idea. All the render job needs is checkpoints. So renders could be performed as often as each time a checkpoint is written (and as you're suggesting in parallel).

One way to do this is to have a single long-running render job that stores the global_step of the last checkpoint it rendered and polls the checkpoint dir for more recent checkpoints. It would occupy the resources needed to render even while just polling, provided there is not already a new checkpoint by the time the render for the last checkpoint completes (which would probably mean you're writing checkpoints too often).
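The polling loop could be sketched roughly like this. This is a minimal sketch: the `model.ckpt-N` filename pattern is TF's default checkpoint naming, but the `render_fn` callback and function names here are hypothetical, not the repo's actual interface.

```python
import os
import re
import time


def latest_checkpoint_step(checkpoint_dir):
    """Return the highest global_step among checkpoint files, or -1 if none."""
    steps = [-1]
    for name in os.listdir(checkpoint_dir):
        # TF writes e.g. model.ckpt-699.index alongside the data files.
        m = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps)


def poll_and_render(checkpoint_dir, render_fn, interval_secs=60):
    """Long-running loop: render every checkpoint newer than the last one seen."""
    last_rendered = -1
    while True:
        step = latest_checkpoint_step(checkpoint_dir)
        if step > last_rendered:
            render_fn(checkpoint_dir, step)  # hypothetical render entry point
            last_rendered = step
        time.sleep(interval_secs)
```

This keeps the render job decoupled from training: the only contract between them is the checkpoint directory.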

Another would be to have training jobs enqueue render jobs either in a message queue or by directly creating render jobs (the latter seems not to be separating concerns appropriately).

And another would be to have the writing of checkpoints trigger renders via a third-party storage object change notifier and file name matcher. For now it looks like minio does not support the GCS notifications API.

Triggers aside, renders could be run using lightweight fission functions, one episode at a time (allowing many in parallel). One issue with this would be models that require a lot of memory: loading the model many times in parallel, or doing so with a serverless framework that doesn't support memory requests. But it looks like this is in progress for Fission (fission/fission#193).

Introducing an event-/message-based serverless framework to kubeflow just for this purpose might be overboard, but less so if you're already using these in your production system. As a device streams events, how are those communicated to models deployed with serving? If we can assume that production systems will be expected to have serverless and messaging already deployed, then adding these is in a sense free, and having training jobs trigger renders by emitting a "checkpoint written" message might be a pretty efficient, easy, decoupled way to go initially.

But if that's not a fair assumption or out of scope then it would be most expedient to start with a render job that checks for new checkpoints as a cron job. Or perhaps produces renders on that schedule and doesn't necessarily check that they are based on new checkpoints.

Thoughts or simpler approaches?

@cwbeitel (Owner)

Of course there's also manually submitting render jobs (which you can do with a logdir for an ongoing training job), as I have in the notebook, but that isn't quite working yet:

```shell
ks param set agents-ppo logdir [log dir for running job]
ks param set agents-ppo num_cpu 1
ks param set agents-ppo run_mode render
ks apply gke -c agents-ppo
```

@cwbeitel (Owner)

See 6cdfcac

@cwbeitel (Owner)

So triggering renders over HTTP is working (render jobs run, renders are generated and uploaded to GCS); see the notebook.

The cleanest way to trigger renders seems to be to use a hook into a MonitoredTrainingSession.
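A rough sketch of the step-/time-based trigger logic such a hook could carry. In the real thing this would live inside a `tf.train.SessionRunHook`'s `after_run()`; the class and method names below are hypothetical, and the TF session machinery is omitted so only the triggering logic is shown.

```python
import time


class RenderTrigger(object):
    """Fire a render request every N global steps or every M seconds.

    Intended to be called from a SessionRunHook's after_run() with the
    current global_step; trigger_fn would submit the render job.
    """

    def __init__(self, every_n_steps=1000, every_n_secs=None, trigger_fn=None):
        self._every_n_steps = every_n_steps
        self._every_n_secs = every_n_secs
        self._trigger_fn = trigger_fn or (lambda step: None)
        self._last_step = 0
        self._last_time = time.time()
        self.render_count = 0

    def maybe_trigger(self, global_step):
        """Return True (and fire trigger_fn) if a render is due."""
        due = (global_step - self._last_step) >= self._every_n_steps
        if self._every_n_secs is not None:
            due = due or (time.time() - self._last_time) >= self._every_n_secs
        if due:
            self.render_count += 1
            self._trigger_fn(global_step)
            self._last_step = global_step
            self._last_time = time.time()
        return due
```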

Also, each render job should only produce one render instead of the 7 being produced now (despite num_agents=1 and num_episodes=1).

cwbeitel pushed a commit that referenced this issue Jan 22, 2018
- Using a hook into the MonitoredTrainingSession, every N steps or seconds, a render job request manifest is constructed of the following form:

```
INFO:tensorflow:Render trigger manifest: {'args': {'render_count': 1, 'meta': {'elapsed_time': 10.006209135055542, 'global_step': 699}, 'log_dir': 'gs://kubeflow-rl-kf/jobs/kuka-e456acf9'}, 'job_type': 'render'}
INFO:tensorflow:Triggering render number 2.
INFO:tensorflow:Render trigger manifest: {'args': {'render_count': 2, 'meta': {'elapsed_time': 20.00917911529541, 'global_step': 1480}, 'log_dir': 'gs://kubeflow-rl-kf/jobs/kuka-e456acf9'}, 'job_type': 'render'}
```

- Next up is to post that data to the ${FISSION_ROUTER}/job-trigger-render-events route

- Alternatively can port the code from tools/job-trigger/fission/render.py into trigger.py and submit jobs directly, see #3
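Building the manifest and posting it might look roughly like the sketch below. The manifest shape mirrors the logged examples above; the function names are hypothetical, and the route is the `${FISSION_ROUTER}/job-trigger-render-events` route mentioned in the commit note.

```python
import json
from urllib.request import Request, urlopen


def build_render_manifest(log_dir, render_count, global_step, elapsed_time):
    """Construct a render-trigger manifest of the form shown in the logs above."""
    return {
        "job_type": "render",
        "args": {
            "render_count": render_count,
            "log_dir": log_dir,
            "meta": {
                "global_step": global_step,
                "elapsed_time": elapsed_time,
            },
        },
    }


def post_manifest(router_url, manifest):
    """POST the manifest as JSON to the fission router's render-events route."""
    req = Request(
        router_url + "/job-trigger-render-events",
        data=json.dumps(manifest).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urlopen(req)
```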

@jlewi (Author) commented Jan 24, 2018

This is really cool. The idea of using fission to hook into monitored training session is pretty neat.

This might be useful for TB as a service because we could trigger a job for each event dir to load the data into a DB like MySQL.

@cwbeitel (Owner) commented Jan 24, 2018

Yeah, thanks man. On the positive side, it seems like a way to provide flexible subscription to training events by the broader infrastructure. I was thinking messages over topics would allow flexibility in which other services consume particular events, e.g. email/slack notify when performance exceeds X, notify when a job goes down, update hparams and restart a job if performance is below the nth %ile of the job pool, stream performance metrics to non-tensorboard dashboards, etc.

TB as a service sounds cool, assuming you're referring to syncing training event and checkpoint data. Looks like tensorboard is moving towards SQL (tensorflow/tensorboard#92), which would be valuable for many reasons. It would speed things up to log locally, or at least to a local cache, and sync the result of that as appropriate. It would probably be a better design to have training jobs emit event streams that are cached and consumed by a tensorboard service than to log to local disk and trigger separate filesystem syncs of that log directory.

Generally related, it would be nice to be able to see renders within tensorboard, and perhaps to trigger and monitor jobs from there as well.

@cwbeitel cwbeitel changed the title Do GPUs help with rendering movies? Asynchronous rendering Jan 24, 2018