
Reproduction of 80K/sec throughput #14

Open
pengsun opened this issue Jul 31, 2018 · 15 comments


pengsun commented Jul 31, 2018

Hi, I tried to reproduce the 80K/sec throughput reported in the paper, but only got around 22K/sec.

I ran the single learner on a GPU machine (the GPU is P40):

python experiment.py --job_name=learner --task=0 --num_actors=150 \
    --level_name=rooms_keys_doors_puzzle --batch_size=32 \
    --entropy_cost=0.0033391318945337044 \
    --learning_rate=0.00031866995608948655 \
    --total_environment_frames=10000000000 --reward_clipping=soft_asymmetric 

and ran 150 actors, each on its own CPU machine (each one is actually a remote Docker container allocated by a cloud service):

python experiment.py --job_name=actor --task=$i \
      --num_actors=150 --level_name=rooms_keys_doors_puzzle

where $i is the index of the i-th actor.

Could you give some hints on how to reproduce the throughput? Did it require a proprietary intranet connection?

lespeholt (Collaborator) commented

The model in the code is the larger model, so if you haven't, try with the small model. On a P100 I get 30+k FPS with the big model. Are you limited by the speed of the learner or the actors?
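
For reference, a rough sketch of a smaller convolutional torso along the lines of the small network in the paper (the layer sizes here are read off the paper, not taken from the released code, so treat them as an approximation):

    import tensorflow as tf

    def small_torso(frames):
      # Smaller conv stack (sizes assumed from the paper): 16 8x8/4 filters,
      # 32 4x4/2 filters, then a 256-unit fully connected layer.
      x = tf.cast(frames, tf.float32) / 255.0
      x = tf.layers.conv2d(x, 16, 8, strides=4, activation=tf.nn.relu)
      x = tf.layers.conv2d(x, 32, 4, strides=2, activation=tf.nn.relu)
      x = tf.layers.flatten(x)
      return tf.layers.dense(x, 256, activation=tf.nn.relu)

    # Example: a batch of 96x72 RGB frames (height/width order assumed).
    frames = tf.placeholder(tf.uint8, [None, 72, 96, 3])
    torso_output = small_torso(frames)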

lespeholt (Collaborator) commented

The memory bandwidth of the P40 is significantly lower than that of the P100, which could explain the difference between 22k and 30+k.


pengsun commented Jul 31, 2018

@lespeholt Thanks for the super quick response!

  1. I'm not sure what the bottleneck is, the CPU actors or the GPU learner; that's why I'm asking for help :) Can we observe some metrics in TensorBoard to identify the possible issues? If there isn't an existing one, could you suggest some so that I can add the code on my end?

  2. Okay, I will try it with a P100 and the small model (as on the left of Figure 3 in the paper, right?)

What could cause a CPU bottleneck, e.g., a slow CPU?

Also, what kind of network connection (between actors and learner) was used for the throughput reported in the paper? Did you rely on, say, fast Ethernet or InfiniBand? In the run I mentioned, the traffic goes over the public internet; could that be a bottleneck, or is it not an important factor?

lespeholt (Collaborator) commented

  1. You can try increasing/decreasing the number of actors to see whether it has an effect. A more precise way is to look at the TensorFlow performance timelines (a minimal example is at the end of this comment): https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d

  2. Just stick to the larger model for now. If you get to around 30k you have basically reproduced the results for the larger model.

You need to be able to transfer around 2-3 GB per second in total. If you increase the unroll length, you decrease the bandwidth requirements, so you can try that.
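
Here is a minimal, standalone example of capturing a TensorFlow timeline in TF 1.x (the toy matmul just stands in for the learner's training op; adapt it to the session.run call you care about):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Toy op standing in for the learner's training step.
    train_op = tf.matmul(tf.random_normal([1000, 1000]),
                         tf.random_normal([1000, 1000]))

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as session:
      session.run(train_op, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace; load timeline.json in chrome://tracing.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
      f.write(trace.generate_chrome_trace_format())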


pengsun commented Jul 31, 2018

Thanks!

Where does the 2-3 GB/sec figure come from (e.g., batch_size * width * height * rollout_len * bytes_per_value, etc.)? I'm still reading the tf.FIFOQueue code (with capacity=1) and struggling to understand the synchronization mechanism. I guess answering this question may help me (and others) understand how the actor code works :)
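
To make the question concrete, here is the toy picture of a capacity-1 FIFOQueue I have in mind (illustrative only, not the actual actor/learner wiring):

    import tensorflow as tf

    # With capacity=1, an enqueue can only complete when the queue is empty,
    # so a producer and a consumer end up running roughly in lock-step.
    queue = tf.FIFOQueue(capacity=1, dtypes=[tf.float32], shapes=[[]])
    enqueue_op = queue.enqueue(tf.constant(1.0))
    dequeue_op = queue.dequeue()

    with tf.Session() as session:
      session.run(enqueue_op)   # succeeds: the queue was empty
      # A second session.run(enqueue_op) would block here until a dequeue.
      print(session.run(dequeue_op))  # frees the slot for the next enqueue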

Also, I just asked around and found that I cannot get access to a P100; the best GPU at hand is only a P40... So please feel free to close the issue.


lespeholt commented Jul 31, 2018

Actually, it's much less for the big model (which has roughly the same number of parameters as the small model).

Per unroll:
Parameters: 1.6M*4b = 6.4MB
Observations: 100*96*72*3b = 2.07MB
Total: 8.5MB

30,000 FPS / 400 (unroll length 100 * action repeat 4) * 8.5MB ~= 650 MB/s excluding overhead
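
The same back-of-the-envelope calculation in code (constants taken from the numbers above):

    # Back-of-the-envelope bandwidth estimate, using the numbers above.
    params_bytes = 1.6e6 * 4        # ~1.6M float32 parameters per unroll
    obs_bytes = 100 * 96 * 72 * 3   # 100 frames of 96x72 RGB (1 byte/channel)
    unroll_bytes = params_bytes + obs_bytes   # ~8.5 MB per unroll

    fps = 30000                     # environment frames per second
    frames_per_unroll = 100 * 4     # unroll length 100 * action repeat 4
    bandwidth = fps / float(frames_per_unroll) * unroll_bytes
    print('%.0f MB/s' % (bandwidth / 1e6))  # ~636 MB/s, i.e. the ~650 MB/s above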


pengsun commented Jul 31, 2018

I see, that's very clear! Thanks so much!


pengsun commented Aug 1, 2018

Hi, some updates.

We tried the smaller net as described in the paper (by modifying the Agent code).
As I said, we don't have a P100, so we still ran it on the P40:
1 GPU learner + 150 remote CPU actors.

The throughput is still around 22k.

Also, several argument combinations were tried to see how they impact throughput:
  • batch 32, unroll 100, queue capacity 1: 22k
  • batch 64, unroll 100, queue capacity 1: 23k
  • batch 32, unroll 200, queue capacity 1: 24k
  • batch 64, unroll 200, queue capacity 1: 24k
  • batch 32, unroll 200, queue capacity 8: 25k

lespeholt (Collaborator) commented

Could be your network in that case. I suggest you take a look at TensorFlow performance timelines.


krfricke commented Sep 6, 2018

Hi,
I'm also trying to reproduce the paper results on Google Cloud and am currently unable to reach the reported numbers. Actually, I can't even get close to them. Can you recommend any particular machine setup on Google Cloud? E.g., is it sensible to use a small head node with one GPU and 1-2 larger actor nodes, or do we really need to use one machine per actor?

Also, am I correct that for distributed execution I will need to modify the cluster spec in the code and pass my own? I'm currently doing that, but I was wondering if I'm missing something there.

My current setup:

  • 1 head node (12 CPUs, 1 V100 GPU) (only learner)
  • 2 worker nodes (96 CPUs each)
  • 128 actors, all scheduled on the 2 worker nodes
  • seekavoid_arena_01
  • At ~5k environment frames/sec
  • tensorflow-gpu 1.10.1
  • python 3.6.4

I'm using Python 3.6 (I only had to make minor adjustments like replacing iteritems with items). Might that influence the results?

lespeholt (Collaborator) commented

It's correct that the cluster spec needs to be modified depending on your setup.
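
A minimal sketch of what a modified cluster spec could look like; the host names and ports here are placeholders for your own machines, and the job names match the --job_name flags used above:

    import tensorflow as tf

    NUM_ACTORS = 150  # one task per actor

    # Placeholder addresses; replace with the reachable hosts in your setup.
    cluster = tf.train.ClusterSpec({
        'learner': ['learner-host:8000'],
        'actor': ['actor-host-%d:8001' % i for i in range(NUM_ACTORS)],
    })

    # Each process then starts a server for its own job and task, e.g.:
    # server = tf.train.Server(cluster, job_name='actor', task_index=0)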

The advantage of having several machines can be that the network is less saturated. However, your setup should get much higher speeds; the speeds you get are what you would expect when running only on CPU. One thing to note, though: the network in experiment.py is the bigger network described in the paper, so the target speed should be 30+k FPS.

Can you verify that you actually run on the GPU? I find that the best way to debug performance issues in TensorFlow is to look at the performance timelines. On them you can see whether the learner is waiting on data from the actors, which operations are slow, and on which devices they run.
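
For a quick standalone check that TensorFlow actually sees and uses the GPU:

    import tensorflow as tf

    # True if TensorFlow can see a usable GPU device.
    print(tf.test.is_gpu_available())

    # Logging device placement shows which device each op is assigned to.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as session:
      session.run(tf.matmul(tf.ones([2, 2]), tf.ones([2, 2])))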


krfricke commented Sep 7, 2018

Thanks for your quick response.

Without a GPU, we only get to about 500 env frames/sec. I made a plot of the GPU utilization using nvidia-smi at 1-second intervals:

[screenshot from 2018-09-07 11-17-18: GPU utilization plot]

Should we expect to see permanently high utilization, or is this normal? The results are similar when using a single node and no dynamic batching. With dynamic batching, we see a constant utilization of about 20%.

I also reproduced the issue (both single node and distributed) on a fresh install following the Dockerfile and with the following specs:

Head node:

  • 48 CPUs
  • 4 V100 GPUs (only 1 used, but gcloud doesn't allow fewer GPUs with 48 CPUs)
  • Ubuntu 18.04
  • CUDA 9.0
  • cuDNN 7.1.4
  • python 2.7
  • tensorflow-gpu 1.9 (also tried with 1.10)

2 Child nodes:

  • 96 CPUs each
  • No GPUs, otherwise the same as above

This setup also leads to just about 5.5k environment frames/sec.

I will try to look into the timelines tonight. Is there any other reason you can think of that would hurt GPU utilization? Should we try 16 or 32 machines with a few workers each?

lespeholt (Collaborator) commented

In a distributed setup, the utilization should be constantly high. In a single-machine setup, it may be somewhat low since producing the frames will slow it down.

With the setup you mention, you should definitely see speeds similar or close to those in the paper.

Timelines for both the learner and the actors are helpful.


krfricke commented Sep 7, 2018

These are the timelines for the learner:
[screenshot from 2018-09-07 17-44-44: learner timeline]
Zoomed in:
[screenshot from 2018-09-07 17-37-21: learner timeline, zoomed in]

And for one example actor (they all look alike):
[screenshot from 2018-09-07 18-18-57: actor timeline]

The QueueDequeueManyV2 op in the learner takes most of the time, but is followed by 0.5-1.0 seconds of delay (some of that time could be attributed to the timeline tracing itself plus the TensorFlow summaries).
There are some multiple-second gaps within the actor tasks, which doesn't seem right to me. Can you help with the interpretation?

Update: we just got to 22k env frames/sec by using 32 nodes with 16 CPUs each and scheduling only 4 actors on each, so it might be a matter of resource starvation?

The paper just states the number of CPU cores used in the distributed setup, but how many actors (and how many nodes) were used? We assumed as many as there were CPUs.

lespeholt (Collaborator) commented

Yes, we used 1 CPU per actor. Can you try 150 actors with 1 CPU each?

It's a bit hard to interpret the timelines without interacting with them. Since DequeueMany is taking that much time on the learner, it looks like it is bottlenecked by the actors or by the bandwidth to them. I'm not sure why there is a gap between the actor steps. If they are waiting on enqueuing, that suggests a bottleneck in the learner or in the bandwidth; in this case it would then be the network.

Can you try creating new variables for each actor, i.e., no sharing of variables? If that is significantly faster, it's the network bandwidth.
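
Roughly what I mean, as a toy sketch (not the actual agent code): give each actor its own variable scope so nothing is shared and no parameters have to be shipped between machines:

    import tensorflow as tf

    def build_actor_variables(actor_id):
      # A distinct scope per actor means no variables are shared across actors.
      with tf.variable_scope('actor_%d' % actor_id):
        return tf.get_variable('weights', shape=[256, 256])

    unshared = [build_actor_variables(i) for i in range(4)]
    print([v.name for v in unshared])  # actor_0/weights:0, actor_1/weights:0, ...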
