Is slow training of a pruned transducer normal? #391
-
Hi, I'm training a model with my own data, based on tedlium's recipe. The GPU is an RTX 3090, and I'm not bottlenecked by IO (minimal time is spent between the end of one batch iteration and the start of the next). My environment is:
So my question is: should I expect this training speed? Is this inherent to using RNN-T? 50 batches (max-duration 500) take more than 2 minutes (closer to 3).
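For reference, a simplified sketch of the kind of per-batch timing that separates dataloader wait from forward/backward time; all names and the loop structure here are illustrative, not taken from the actual training script:

```python
import time

import torch


def time_batches(train_dl, compute_loss, optimizer, num_batches=50):
    """Rough split of wall-clock time into "waiting for the dataloader"
    vs. "forward/backward" for the first `num_batches` batches.
    `compute_loss` stands in for whatever produces the loss from a batch."""
    wait, compute = 0.0, 0.0
    mark = time.perf_counter()
    for i, batch in enumerate(train_dl):
        wait += time.perf_counter() - mark  # time spent waiting on IO / augmentation

        start = time.perf_counter()
        loss = compute_loss(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # CUDA is asynchronous; wait so the timing is real
        compute += time.perf_counter() - start

        mark = time.perf_counter()
        if i + 1 == num_batches:
            break
    print(f"dataloader wait: {wait:.1f} s, forward/backward: {compute:.1f} s")
```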
-
With a V100 32GB, when the max duration is 300, it takes about 33 to 55 seconds for 50 batches on the LibriSpeech corpus. You may use https://github.com/benfred/py-spy to find out which part is the most time-consuming.
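For what it's worth, the usual pattern is to attach py-spy to the already-running training process from another shell (e.g. `py-spy top --pid <PID>` for a live view, or `py-spy record -o profile.svg --pid <PID>` for a flamegraph). Below is a minimal sketch of starting such a recording from inside the training script itself; it assumes py-spy is installed and that the process is allowed to be traced (on many Linux setups that means running as root or relaxing ptrace_scope):

```python
import os
import subprocess

# Sample this process for 60 seconds in the background and write a
# flamegraph; training keeps running while py-spy collects stack samples.
# Assumes `py-spy` is on PATH (pip install py-spy) and ptrace is permitted.
profiler = subprocess.Popen(
    ["py-spy", "record",
     "-o", "train_profile.svg",
     "--pid", str(os.getpid()),
     "--duration", "60"]
)
# ... run the training loop as usual; train_profile.svg appears when sampling ends.
```

If the resulting flamegraph is dominated by dataloader or feature-extraction frames, the bottleneck is on the CPU/IO side; if most samples fall inside the model's forward/backward and loss computation, the GPU work itself is what takes the time.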