
Reproduction of 80K/sec throughput #14

Open
pengsun opened this issue Jul 31, 2018 · 15 comments


pengsun commented Jul 31, 2018

Hi, I tried to reproduce the 80K/sec throughput reported in the paper, but only got around 22K/sec.

I ran the single learner on a GPU machine (the GPU is P40):

python experiment.py --job_name=learner --task=0 --num_actors=150 \
    --level_name=rooms_keys_doors_puzzle --batch_size=32 \
    --entropy_cost=0.0033391318945337044 \
    --learning_rate=0.00031866995608948655 \
    --total_environment_frames=10000000000 --reward_clipping=soft_asymmetric 

and ran 150 actors, each on its own CPU machine (each one is actually a remote Docker container allocated by a cloud service):

python experiment.py --job_name=actor --task=$i \
      --num_actors=150 --level_name=rooms_keys_doors_puzzle

where $i is the index of the i-th actor.

Could you give some hints on how to reproduce the throughput? Did it require a proprietary intranet connection?

lespeholt (Collaborator) commented

The model in the code is the larger model, so if you haven't, try with the small model. On a P100 I get 30+k FPS with the big model. Are you limited by the speed of the learner or the actors?
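
For reference, a rough sketch of a smaller convolutional torso along the lines of the small network in the paper (the layer sizes here are read off the paper, not taken from the released code, so treat them as an approximation):

    import tensorflow as tf

    def small_torso(frames):
      # Smaller conv stack (sizes assumed from the paper): 16 8x8/4 filters,
      # 32 4x4/2 filters, then a 256-unit fully connected layer.
      x = tf.cast(frames, tf.float32) / 255.0
      x = tf.layers.conv2d(x, 16, 8, strides=4, activation=tf.nn.relu)
      x = tf.layers.conv2d(x, 32, 4, strides=2, activation=tf.nn.relu)
      x = tf.layers.flatten(x)
      return tf.layers.dense(x, 256, activation=tf.nn.relu)

    # Example: a batch of 96x72 RGB frames (height/width order assumed).
    frames = tf.placeholder(tf.uint8, [None, 72, 96, 3])
    torso_output = small_torso(frames)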

lespeholt (Collaborator) commented

The memory bandwidth of the P40 is significantly lower than that of the P100, which could explain the difference between 22k and 30+k.


pengsun commented Jul 31, 2018

@lespeholt Thanks for the super quick response!

  1. I'm not sure what the bottleneck is, the CPU actors or the GPU learner; that's why I'm asking for help :) Can we observe some metrics in TensorBoard to identify the possible issues? If there isn't an existing one, could you suggest some so that I can add the code on my end?

  2. Okay, I will try it with a P100 and the small model (as on the left of Figure 3 in the paper, right?)

What could cause a CPU bottleneck, e.g., a slow CPU?

Also, what kind of network connection (between actors and learner) was used for the throughput reported in the paper? Did you rely on, say, fast Ethernet or InfiniBand? In the run I mentioned, the traffic goes over the public internet; could that be a bottleneck, or is it not an important factor?

lespeholt (Collaborator) commented

  1. You can try increasing/decreasing the number of actors to see whether it has an effect. A more precise way is to look at the TensorFlow performance timelines (a minimal example is at the end of this comment): https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d

  2. Just stick to the larger model for now. If you get to around 30k you have basically reproduced the results for the larger model.

You need to be able to transfer around 2-3 GB per second in total. If you increase the unroll length, you decrease the bandwidth requirements, so you can try that.
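
Here is a minimal, standalone example of capturing a TensorFlow timeline in TF 1.x (the toy matmul just stands in for the learner's training op; adapt it to the session.run call you care about):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Toy op standing in for the learner's training step.
    train_op = tf.matmul(tf.random_normal([1000, 1000]),
                         tf.random_normal([1000, 1000]))

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as session:
      session.run(train_op, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace; load timeline.json in chrome://tracing.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
      f.write(trace.generate_chrome_trace_format())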


pengsun commented Jul 31, 2018

Thanks!

Where does the 2-3 GB/sec figure come from (e.g., batch_size * width * height * rollout_len * bytes_per_value, etc.)? I'm still reading the tf.FIFOQueue code (with capacity=1) and struggling to understand the synchronization mechanism. I guess answering this question may help me (and others) understand how the actor code works :)
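
To make the question concrete, here is the toy picture of a capacity-1 FIFOQueue I have in mind (illustrative only, not the actual actor/learner wiring):

    import tensorflow as tf

    # With capacity=1, an enqueue can only complete when the queue is empty,
    # so a producer and a consumer end up running roughly in lock-step.
    queue = tf.FIFOQueue(capacity=1, dtypes=[tf.float32], shapes=[[]])
    enqueue_op = queue.enqueue(tf.constant(1.0))
    dequeue_op = queue.dequeue()

    with tf.Session() as session:
      session.run(enqueue_op)   # succeeds: the queue was empty
      # A second session.run(enqueue_op) would block here until a dequeue.
      print(session.run(dequeue_op))  # frees the slot for the next enqueue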

Also, I just asked around and found that I cannot get access to a P100; the best GPU at hand is only a P40... So please feel free to close the issue.


lespeholt commented Jul 31, 2018

Actually, it's much less for the big model (which has roughly the same number of parameters as the small model).

Per unroll:
Parameters: 1.6M*4b = 6.4MB
Observations: 100*96*72*3b = 2.07MB
Total: 8.5MB

30,000 FPS / 400 (unroll length 100 * action repeat 4) * 8.5MB ~= 650 MB/s excluding overhead
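
The same back-of-the-envelope calculation in code (constants taken from the numbers above):

    # Back-of-the-envelope bandwidth estimate, using the numbers above.
    params_bytes = 1.6e6 * 4        # ~1.6M float32 parameters per unroll
    obs_bytes = 100 * 96 * 72 * 3   # 100 frames of 96x72 RGB (1 byte/channel)
    unroll_bytes = params_bytes + obs_bytes   # ~8.5 MB per unroll

    fps = 30000                     # environment frames per second
    frames_per_unroll = 100 * 4     # unroll length 100 * action repeat 4
    bandwidth = fps / float(frames_per_unroll) * unroll_bytes
    print('%.0f MB/s' % (bandwidth / 1e6))  # ~636 MB/s, i.e. the ~650 MB/s above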


pengsun commented Jul 31, 2018

I see, that's very clear! Thanks so much!


pengsun commented Aug 1, 2018

Hi, some updates.

We tried the smaller net as described in the paper (by modifying the Agent code).
As I said, we don't have a P100, so we still ran it on the P40:
1 GPU learner + 150 remote CPU actors.

The throughput is still around 22k.

Also, several argument combinations were tried to see how they impact throughput:
  • batch 32, unroll 100, queue capacity 1: 22k
  • batch 64, unroll 100, queue capacity 1: 23k
  • batch 32, unroll 200, queue capacity 1: 24k
  • batch 64, unroll 200, queue capacity 1: 24k
  • batch 32, unroll 200, queue capacity 8: 25k

lespeholt (Collaborator) commented

Could be your network in that case. I suggest you take a look at TensorFlow performance timelines.


krfricke commented Sep 6, 2018

Hi,
I'm also trying to reproduce the paper results on Google Cloud and am currently unable to reach the reported numbers. Actually, I can't even get close to them. Can you recommend any particular machine setup on Google Cloud? E.g., is it sensible to use a small head node with one GPU and 1-2 larger actor nodes, or do we really need to use one machine per actor?

Also, am I correct that for distributed execution I will need to modify the cluster spec in the code and pass my own? I'm currently doing that, but I was wondering if I'm missing something there.

My current setup:

  • 1 head node (12 CPUs, 1 V100 GPU) (only learner)
  • 2 worker nodes (96 CPUs each)
  • 128 actors, all scheduled on the 2 worker nodes
  • seekavoid_arena_01
  • At ~5k environment frames/sec
  • tensorflow-gpu 1.10.1
  • python 3.6.4

I'm using Python 3.6 (I only had to make minor adjustments like replacing iteritems with items). Might that influence the results?

lespeholt (Collaborator) commented

It's correct that the cluster spec needs to be modified depending on your setup.
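
A minimal sketch of what a modified cluster spec could look like; the host names and ports here are placeholders for your own machines, and the job names match the --job_name flags used above:

    import tensorflow as tf

    NUM_ACTORS = 150  # one task per actor

    # Placeholder addresses; replace with the reachable hosts in your setup.
    cluster = tf.train.ClusterSpec({
        'learner': ['learner-host:8000'],
        'actor': ['actor-host-%d:8001' % i for i in range(NUM_ACTORS)],
    })

    # Each process then starts a server for its own job and task, e.g.:
    # server = tf.train.Server(cluster, job_name='actor', task_index=0)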

The advantage of having several machines can be that the network is less saturated. However, your setup should get much higher speeds; the speeds you get are what you would expect when running only on CPU. One thing to note, though: the network in experiment.py is the bigger network described in the paper, so the target speed should be 30+k FPS.

Can you verify that you actually run on the GPU? I find that the best way to debug performance issues in TensorFlow is to look at the performance timelines. On them you can see whether the learner is waiting on data from the actors, which operations are slow, and on which devices they run.
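
For a quick standalone check that TensorFlow actually sees and uses the GPU:

    import tensorflow as tf

    # True if TensorFlow can see a usable GPU device.
    print(tf.test.is_gpu_available())

    # Logging device placement shows which device each op is assigned to.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as session:
      session.run(tf.matmul(tf.ones([2, 2]), tf.ones([2, 2])))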


krfricke commented Sep 7, 2018

Thanks for your quick response.

Without a GPU, we only get to about 500 env frames/sec. I made a plot of the GPU utilization using nvidia-smi at 1-second intervals:

[screenshot from 2018-09-07 11-17-18: GPU utilization plot]

Should we expect to see permanently high utilization, or is this normal? The results are similar when using a single node and no dynamic batching. With dynamic batching, we see a constant utilization of about 20%.

I also reproduced the issue (both single node and distributed) on a fresh install following the Dockerfile and with the following specs:

Head node:

  • 48 CPUs
  • 4 V100 GPUs (only 1 used, but gcloud doesn't allow fewer GPUs with 48 CPUs)
  • Ubuntu 18.04
  • CUDA 9.0
  • cuDNN 7.1.4
  • python 2.7
  • tensorflow-gpu 1.9 (also tried with 1.10)

2 Child nodes:

  • 96 CPUs each
  • No GPUs, otherwise the same as above

This setup also leads to just about 5.5k environment frames/sec.

I will try to look into the timelines tonight. Is there any other reason you can think of that would hurt GPU utilization? Should we try 16 or 32 machines with a few workers each?

lespeholt (Collaborator) commented

In a distributed setup, the utilization should be constantly high. In a single-machine setup, it may be somewhat low since producing the frames will slow it down.

With the setup you mention, you should definitely see speeds similar or close to those in the paper.

Timelines for both the learner and the actors are helpful.


krfricke commented Sep 7, 2018

These are the timelines for the learner:
[screenshot from 2018-09-07 17-44-44: learner timeline]
Zoomed in:
[screenshot from 2018-09-07 17-37-21: learner timeline, zoomed in]

And for one example actor (they all look alike):
[screenshot from 2018-09-07 18-18-57: actor timeline]

The QueueDequeueManyV2 op in the learner takes most of the time, but is followed by 0.5-1.0 seconds of delay (some of that time could be attributed to the timeline tracing itself plus the TensorFlow summaries).
There are some multiple-second gaps within the actor tasks, which doesn't seem right to me. Can you help with the interpretation?

Update: we just got to 22k env frames/sec by using 32 nodes with 16 CPUs each and scheduling only 4 actors on each, so it might be a matter of resource starvation?

The paper just states the number of CPU cores used in the distributed setup, but how many actors (and how many nodes) were used? We assumed as many as there were CPUs.

lespeholt (Collaborator) commented

Yes, we used 1 CPU per actor. Can you try 150 actors with 1 CPU each?

It's a bit hard to interpret the timelines without interacting with them. Since DequeueMany is taking that much time on the learner, it looks like it is bottlenecked by the actors or by the bandwidth to them. I'm not sure why there is a gap between the actor steps. If they are waiting on enqueuing, that suggests a bottleneck in the learner or in the bandwidth; in this case it would then be the network.

Can you try creating new variables for each actor, i.e., no sharing of variables? If that is significantly faster, it's the network bandwidth.
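
Roughly what I mean, as a toy sketch (not the actual agent code): give each actor its own variable scope so nothing is shared and no parameters have to be shipped between machines:

    import tensorflow as tf

    def build_actor_variables(actor_id):
      # A distinct scope per actor means no variables are shared across actors.
      with tf.variable_scope('actor_%d' % actor_id):
        return tf.get_variable('weights', shape=[256, 256])

    unshared = [build_actor_variables(i) for i in range(4)]
    print([v.name for v in unshared])  # actor_0/weights:0, actor_1/weights:0, ...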
