Add streaming ASR with Emformer RNN-T (#6)

* First working version.
* First C++ working version.
* Refactoring.
* Add streaming ASR with stateless Emformer RNN-T.
* typo fixes
* Fix comments.
* Add web interface.
* Add CI for streaming ASR.
* Minor fixes to README.
* Minor fixes.
k2-fsa · Jun 1, 2022 · commit ba865c7
1 parent 259d2b9

Showing 35 changed files with 2,282 additions and 58 deletions.
diff --git a/.github/workflows/run-streaming-test.yaml b/.github/workflows/run-streaming-test.yaml
@@ -0,0 +1,118 @@
+# Copyright      2022  Xiaomi Corp.       (author: Fangjun Kuang)
+
+# See ../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+name: Run streaming ASR tests
+
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+    branches:
+      - master
+
+jobs:
+  run_streaming_asr_tests:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-18.04, macos-10.15]
+        torch: ["1.10.0"]
+        torchaudio: ["0.10.0"]
+        python-version: [3.7, 3.8, 3.9]
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - name: Setup Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install GCC 7
+        if: startsWith(matrix.os, 'ubuntu')
+        run: |
+          sudo apt-get install -y gcc-7 g++-7
+          echo "CC=/usr/bin/gcc-7" >> $GITHUB_ENV
+          echo "CXX=/usr/bin/g++-7" >> $GITHUB_ENV
+
+      - name: Install PyTorch ${{ matrix.torch }}
+        shell: bash
+        if: startsWith(matrix.os, 'ubuntu')
+        run: |
+          python3 -m pip install -qq --upgrade pip
+          python3 -m pip install -qq wheel twine typing_extensions websockets "sentencepiece>=0.1.96"
+          python3 -m pip install -qq torch==${{ matrix.torch }}+cpu torchaudio==${{ matrix.torchaudio }}+cpu numpy -f https://download.pytorch.org/whl/cpu/torch_stable.html
+
+      - name: Install PyTorch ${{ matrix.torch }}
+        shell: bash
+        if: startsWith(matrix.os, 'macos')
+        run: |
+          python3 -m pip install -qq --upgrade pip
+          python3 -m pip install -qq wheel twine typing_extensions websockets "sentencepiece>=0.1.96"
+          python3 -m pip install -qq torch==${{ matrix.torch }} torchaudio==${{ matrix.torchaudio }} numpy -f https://download.pytorch.org/whl/cpu/torch_stable.html
+
+      - name: Cache kaldifeat
+        id: my-cache
+        uses: actions/cache@v2
+        with:
+          path: |
+            ~/tmp/kaldifeat
+          key: cache-tmp-${{ matrix.python-version }}-${{ matrix.os }}
+
+      - name: Install kaldifeat
+        if: steps.my-cache.outputs.cache-hit != 'true'
+        shell: bash
+        run: |
+          .github/scripts/install-kaldifeat.sh
+
+      - name: Install sherpa
+        shell: bash
+        run: |
+          python3 setup.py install
+
+      - name: Download pretrained model and test-data
+        shell: bash
+        run: |
+          git lfs install
+          git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01
+
+      - name: Start server
+        shell: bash
+        run: |
+          export PYTHONPATH=~/tmp/kaldifeat/kaldifeat/python:$PYTHONPATH
+          export PYTHONPATH=~/tmp/kaldifeat/build/lib:$PYTHONPATH
+
+          ./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_server.py \
+            --port 6006 \
+            --max-batch-size 50 \
+            --max-wait-ms 5 \
+            --nn-pool-size 1 \
+            --nn-model-filename ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/exp/cpu_jit-epoch-39-avg-6-use-averaged-model-1.pt \
+            --bpe-model-filename ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/data/lang_bpe_500/bpe.model &
+
+          echo "Sleep 10 seconds to wait for the server startup"
+          sleep 10
+
+      - name: Start client
+        shell: bash
+        run: |
+          ./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py \
+            --server-addr localhost \
+            --server-port 6006 \
+            ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/test_wavs/1221-135766-0001.wav
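
Note on the `Start server` step above: it sleeps for a fixed 10 seconds before the client runs. A minimal sketch of an alternative is to poll the port until it accepts TCP connections. `wait_for_port` is a hypothetical helper written for illustration; it is not part of sherpa, and it only checks that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False if `timeout` seconds pass first.

    Hypothetical helper for illustration only (not part of sherpa).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

In the workflow, a call such as `wait_for_port("localhost", 6006)` could stand in for the fixed `sleep 10`, with the step failing when it returns `False`.
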
diff --git a/.github/workflows/run-test.yaml b/.github/workflows/run-test.yaml
@@ -1,4 +1,3 @@
-
 # Copyright      2022  Xiaomi Corp.       (author: Fangjun Kuang)
 
 # See ../../LICENSE for clarification regarding multiple authors

diff --git a/.github/workflows/style_check.yml b/.github/workflows/style_check.yml
@@ -0,0 +1,48 @@
+# Copyright (c)  2022  Xiaomi Corporation (authors: Fangjun Kuang)
+#
+# See ../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: style_check
+
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+    branches:
+      - master
+
+jobs:
+  style_check:
+    runs-on: ubuntu-18.04
+    strategy:
+      matrix:
+        python-version: [3.8]
+      fail-fast: false
+
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v1
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Check style with cpplint
+        shell: bash
+        working-directory: ${{github.workspace}}
+        run: ./scripts/check_style_cpplint.sh
diff --git a/README.md b/README.md
@@ -1,24 +1,25 @@
 ## Introduction
 
-An ASR server framework in **Python**, aiming to support both streaming
+An ASR server framework in **Python**, supporting both streaming
 and non-streaming recognition.
 
-**Note**: Only non-streaming recognition is implemented at present. We
-will add streaming recognition later.
-
 CPU-bound tasks, such as neural network computation, are implemented in
 C++; while IO-bound tasks, such as socket communication, are implemented
 in Python.
 
-**Caution**: We assume the model is trained using pruned stateless RNN-T
-from [icefall][icefall] and it is from a directory like
-`pruned_transducer_statelessX` where `X` >=2.
+**Caution**: For offline ASR, we assume the model is trained using pruned
+stateless RNN-T from [icefall][icefall] and it is from a directory like
+`pruned_transducer_statelessX` where `X` >= 2. For streaming ASR, we
+assume the model is trained using `pruned_stateless_emformer_rnnt2`.
 
-We provide a Colab notebook, containing how to start the server, how to
-start the client, and how to decode `test-clean` of LibriSpeech.
+For offline ASR, we provide a Colab notebook that shows how to start the
+server, how to start the client, and how to decode `test-clean` of LibriSpeech.
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JX5Ph2onYm1ZjNP_94eGqZ-DIRMLlIca?usp=sharing)
 
+For streaming ASR, we provide a YouTube demo that shows how to use it.
+See <https://www.youtube.com/watch?v=z7HgaZv5W0U>
+
 ## Installation
 
 First, you have to install `PyTorch` and `torchaudio`. PyTorch 1.10 is known
@@ -63,7 +64,6 @@ make -j
 export PYTHONPATH=$PWD/../sherpa/python:$PWD/lib:$PYTHONPATH
 ```
 
-
 ## Usage
 
 First, check that `sherpa` has been installed successfully:
@@ -74,7 +74,103 @@ python3 -c "import sherpa; print(sherpa.__version__)"
 
 It should print the version of `sherpa`.
 
-### Start the server
+### Streaming ASR with pruned stateless Emformer RNN-T
+
+#### Start the server
+
+To start the server, you need to first generate two files:
+
+- (1) The TorchScript model file. You can use `export.py --jit=1` in
+`pruned_stateless_emformer_rnnt2` from [icefall][icefall].
+
+- (2) The BPE model file. You can find it in `data/lang_bpe_XXX/bpe.model`
+in [icefall][icefall], where `XXX` is the number of BPE tokens used in
+the training.
+
+With the above two files ready, you can start the server with the
+following command:
+
+```bash
+./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_server.py \
+  --port 6006 \
+  --max-batch-size 50 \
+  --max-wait-ms 5 \
+  --nn-pool-size 1 \
+  --nn-model-filename ./path/to/exp/cpu_jit.pt \
+  --bpe-model-filename ./path/to/data/lang_bpe_500/bpe.model
+```
+
+You can use `./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_server.py --help`
+to view the help message.
+
+We provide a pretrained model using the LibriSpeech dataset at
+<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01>
+
+The following shows how to use the above pretrained model to start the server.
+
+```bash
+git lfs install
+git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01
+
+./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_server.py \
+  --port 6006 \
+  --max-batch-size 50 \
+  --max-wait-ms 5 \
+  --nn-pool-size 1 \
+  --nn-model-filename ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/exp/cpu_jit-epoch-39-avg-6-use-averaged-model-1.pt \
+  --bpe-model-filename ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/data/lang_bpe_500/bpe.model
+```
+
+#### Start the client
+
+We provide two clients at present:
+
+ - (1) [./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py](./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py)
+   It shows how to decode a single sound file.
+
+ - (2) [./sherpa/bin/pruned_stateless_emformer_rnnt2/web](./sherpa/bin/pruned_stateless_emformer_rnnt2/web)
+   You can record your speech in real time within a browser and send it to the server for recognition.
+
+##### streaming_client.py
+
+```bash
+./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py --help
+
+./sherpa/bin/pruned_stateless_emformer_rnnt2/streaming_client.py \
+  --server-addr localhost \
+  --server-port 6006 \
+  ./icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01/test_wavs/1221-135766-0001.wav
+```
+
+##### Web client
+
+```bash
+cd ./sherpa/bin/pruned_stateless_emformer_rnnt2/web
+python3 -m http.server 6008
+```
+
+Then open your browser and go to `http://localhost:6008/record.html`. You will
+see a UI like the following screenshot.
+
+![web client screenshot](./pic/emformer-streaming-asr-web-client.png)
+
+Click the button `Record`.
+
+Now you can speak, and you will get recognition results from the
+server in real time.
+
+**Caution**: For the web client, we hard-code the server port to `6006`.
+You can edit the file [./sherpa/bin/pruned_stateless_emformer_rnnt2/web/record.js](./sherpa/bin/pruned_stateless_emformer_rnnt2/web/record.js)
+and replace `6006` in it with whatever port the server is using.
+
+**Caution**: `http://0.0.0.0:6008/record.html` or `http://127.0.0.1:6008/record.html`
+won't work. You have to use `localhost`. Otherwise, you won't be able to use
+your microphone in your browser since we are not using `https` which requires
+a certificate.
+
+### Offline ASR
+
+#### Start the server
 
 To start the server, you need to first generate two files:
 
@@ -97,7 +193,7 @@ sherpa/bin/offline_server.py \
   --feature-extractor-pool-size 5 \
   --nn-pool-size 1 \
   --nn-model-filename ./path/to/exp/cpu_jit.pt \
-  --bpe-model-filename ./path/to/data/lang_bpe_500/bpe.model &
+  --bpe-model-filename ./path/to/data/lang_bpe_500/bpe.model
 ```
 
 You can use `./sherpa/bin/offline_server.py --help` to view the help message.
@@ -122,7 +218,7 @@ sherpa/bin/offline_server.py \
   --bpe-model-filename ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500/bpe.model
 ```
 
-### Start the client
+#### Start the client
 After starting the server, you can use the following command to start the client:
 
 ```bash
@@ -147,7 +243,7 @@ sherpa/bin/offline_client.py \
   icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
 ```
 
-### RTF test
+#### RTF test
 
 We provide a demo [./sherpa/bin/decode_manifest.py](./sherpa/bin/decode_manifest.py)
 to decode the `test-clean` dataset from the LibriSpeech corpus.
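
Note on the server commands in this README: they pass `--max-batch-size 50` and `--max-wait-ms 5` without describing them. Assuming, from the flag names alone, that they bound how many queued requests go into one inference call and how long the server waits for more requests to arrive, the idea can be sketched as follows. `collect_batch` is a hypothetical helper for illustration and is not sherpa's implementation.

```python
import asyncio
from typing import Any, List


async def collect_batch(queue: asyncio.Queue, max_batch_size: int, max_wait_ms: float) -> List[Any]:
    """Gather up to `max_batch_size` items, waiting at most `max_wait_ms` after the first one.

    Hypothetical sketch of size- and time-bounded batching (not sherpa code).
    """
    # Block until at least one request is available.
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # wait budget exhausted: run with what we have
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # no more requests arrived in time
    return batch
```

Under this reading, a larger `--max-wait-ms` trades latency for fuller batches, while `--max-batch-size` caps the work done in a single forward pass.
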

diff --git a/pic/emformer-streaming-asr-web-client.png b/pic/emformer-streaming-asr-web-client.png