Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also compare across forks or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also compare across forks. Learn more about diff comparisons here.
base repository: sovaai/sova-asr
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: master
Choose a base ref
...
head repository: sxdxfan/sova-asr
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: master
Choose a head ref
Able to merge. These branches can be automatically merged.
  • 4 commits
  • 148 files changed
  • 1 contributor

Commits on Oct 23, 2021

  1. Unverified

    This user has not yet uploaded their public signing key.
    Copy the full SHA
    709bd86 View commit details
  2. Unverified

    This user has not yet uploaded their public signing key.
    Copy the full SHA
    323ae01 View commit details
  3. Unverified

    This user has not yet uploaded their public signing key.
    Copy the full SHA
    152a1b4 View commit details
  4. Unverified

    This user has not yet uploaded their public signing key.
    Copy the full SHA
    19b662d View commit details
Showing with 6,112 additions and 180 deletions.
  1. +87 −94 README.md
  2. +7 −3 app.py
  3. +12 −3 config.ini
  4. +2 −3 data_loader.py
  5. +13 −8 decoder.py
  6. +31 −0 decoder_app.py
  7. +27 −5 docker-compose.yml
  8. +103 −14 file_handler.py
  9. +60 −0 frontend/README.md
  10. +6 −0 frontend/build.sh
  11. +67 −0 frontend/package.json
  12. +41 −0 frontend/public/index.html
  13. +27 −0 frontend/public/js/content-scroll.js
  14. +261 −0 frontend/public/js/main.js
  15. +1 −0 frontend/public/js/main.min.js
  16. +25 −0 frontend/public/manifest.json
  17. +1 −0 frontend/public/media/elisa.json
  18. +1 −0 frontend/public/media/wave.json
  19. +3 −0 frontend/public/robots.txt
  20. +31 −0 frontend/readme.txt
  21. +2,354 −0 frontend/src/App.css
  22. +56 −0 frontend/src/App.tsx
  23. +27 −0 frontend/src/api.tsx
  24. BIN frontend/src/assets/Union.png
  25. +1 −0 frontend/src/assets/Union.svg
  26. BIN frontend/src/assets/Union.webp
  27. BIN frontend/src/assets/down.png
  28. +1 −0 frontend/src/assets/down.svg
  29. BIN frontend/src/assets/down.webp
  30. BIN frontend/src/assets/download.png
  31. +1 −0 frontend/src/assets/download.svg
  32. BIN frontend/src/assets/download.webp
  33. BIN frontend/src/assets/error_big.png
  34. BIN frontend/src/assets/error_big.webp
  35. BIN frontend/src/assets/error_small.png
  36. BIN frontend/src/assets/error_small.webp
  37. BIN frontend/src/assets/face_1.png
  38. BIN frontend/src/assets/face_1.webp
  39. BIN frontend/src/assets/face_2.png
  40. BIN frontend/src/assets/face_2.webp
  41. BIN frontend/src/assets/face_3.png
  42. BIN frontend/src/assets/face_3.webp
  43. BIN frontend/src/assets/fonts/Jost/Jost-Italic-VariableFont_wght.ttf
  44. BIN frontend/src/assets/fonts/Jost/Jost-VariableFont_wght.ttf
  45. +93 −0 frontend/src/assets/fonts/Jost/OFL.txt
  46. +81 −0 frontend/src/assets/fonts/Jost/README.txt
  47. BIN frontend/src/assets/fonts/Jost/static/Jost-Black.ttf
  48. BIN frontend/src/assets/fonts/Jost/static/Jost-BlackItalic.ttf
  49. BIN frontend/src/assets/fonts/Jost/static/Jost-Bold.ttf
  50. BIN frontend/src/assets/fonts/Jost/static/Jost-BoldItalic.ttf
  51. BIN frontend/src/assets/fonts/Jost/static/Jost-ExtraBold.ttf
  52. BIN frontend/src/assets/fonts/Jost/static/Jost-ExtraBoldItalic.ttf
  53. BIN frontend/src/assets/fonts/Jost/static/Jost-ExtraLight.ttf
  54. BIN frontend/src/assets/fonts/Jost/static/Jost-ExtraLightItalic.ttf
  55. BIN frontend/src/assets/fonts/Jost/static/Jost-Italic.ttf
  56. BIN frontend/src/assets/fonts/Jost/static/Jost-Light.ttf
  57. BIN frontend/src/assets/fonts/Jost/static/Jost-LightItalic.ttf
  58. BIN frontend/src/assets/fonts/Jost/static/Jost-Medium.ttf
  59. BIN frontend/src/assets/fonts/Jost/static/Jost-MediumItalic.ttf
  60. BIN frontend/src/assets/fonts/Jost/static/Jost-Regular.ttf
  61. BIN frontend/src/assets/fonts/Jost/static/Jost-SemiBold.ttf
  62. BIN frontend/src/assets/fonts/Jost/static/Jost-SemiBoldItalic.ttf
  63. BIN frontend/src/assets/fonts/Jost/static/Jost-Thin.ttf
  64. BIN frontend/src/assets/fonts/Jost/static/Jost-ThinItalic.ttf
  65. BIN frontend/src/assets/girl.png
  66. BIN frontend/src/assets/girl.webp
  67. BIN frontend/src/assets/heands.png
  68. BIN frontend/src/assets/heands.webp
  69. +1 −0 frontend/src/assets/loader.svg
  70. BIN frontend/src/assets/logo.png
  71. +16 −0 frontend/src/assets/logo.svg
  72. BIN frontend/src/assets/logo.webp
  73. +39 −0 frontend/src/assets/mos_logo.svg
  74. +1 −0 frontend/src/assets/mute.svg
  75. +1 −0 frontend/src/assets/newlogo.svg
  76. BIN frontend/src/assets/pause.png
  77. +1 −0 frontend/src/assets/pause.svg
  78. BIN frontend/src/assets/pause.webp
  79. +1 −0 frontend/src/assets/pause_big.svg
  80. +1 −0 frontend/src/assets/pause_small.svg
  81. BIN frontend/src/assets/pause_white.png
  82. +1 −0 frontend/src/assets/pause_white.svg
  83. BIN frontend/src/assets/pause_white.webp
  84. BIN frontend/src/assets/pin.gif
  85. BIN frontend/src/assets/play.png
  86. +1 −0 frontend/src/assets/play.svg
  87. BIN frontend/src/assets/play.webp
  88. BIN frontend/src/assets/track.png
  89. +1 −0 frontend/src/assets/track.svg
  90. BIN frontend/src/assets/track.webp
  91. BIN frontend/src/assets/track_big.png
  92. +1 −0 frontend/src/assets/track_big.svg
  93. BIN frontend/src/assets/track_big.webp
  94. +1 −0 frontend/src/assets/track_big_fill.svg
  95. +1 −0 frontend/src/assets/track_fill.svg
  96. BIN frontend/src/assets/union_white.png
  97. +1 −0 frontend/src/assets/union_white.svg
  98. BIN frontend/src/assets/union_white.webp
  99. BIN frontend/src/assets/up.png
  100. +1 −0 frontend/src/assets/up.svg
  101. BIN frontend/src/assets/up.webp
  102. BIN frontend/src/assets/upload.png
  103. +1 −0 frontend/src/assets/upload.svg
  104. BIN frontend/src/assets/upload.webp
  105. BIN frontend/src/assets/volume.png
  106. +1 −0 frontend/src/assets/volume.svg
  107. BIN frontend/src/assets/volume.webp
  108. BIN frontend/src/assets/wind_1.png
  109. BIN frontend/src/assets/wind_1.webp
  110. BIN frontend/src/assets/wind_2.png
  111. BIN frontend/src/assets/wind_2.webp
  112. +265 −0 frontend/src/components/player/PlayerNew.tsx
  113. +118 −0 frontend/src/components/player/Slider.tsx
  114. +4 −0 frontend/src/config.tsx
  115. +13 −0 frontend/src/index.css
  116. +17 −0 frontend/src/index.tsx
  117. +32 −0 frontend/src/layouts/Header.tsx
  118. +31 −0 frontend/src/layouts/Menu.tsx
  119. +19 −0 frontend/src/layouts/Title.tsx
  120. +156 −0 frontend/src/layouts/asr/Main.tsx
  121. +63 −0 frontend/src/layouts/asr/Message.tsx
  122. +305 −0 frontend/src/layouts/asr/Sidebar.tsx
  123. +67 −0 frontend/src/layouts/asr/ToggleText.tsx
  124. +59 −0 frontend/src/layouts/tts/Message.tsx
  125. +342 −0 frontend/src/layouts/tts/Sidebar.tsx
  126. +1 −0 frontend/src/logo.svg
  127. +54 −0 frontend/src/pages/Asr/Asr.tsx
  128. +108 −0 frontend/src/pages/Asr/actions.tsx
  129. +352 −0 frontend/src/pages/Asr/reducer.tsx
  130. +32 −0 frontend/src/pages/Asr/selectors.tsx
  131. +8 −0 frontend/src/pages/Documentaton.tsx
  132. +9 −0 frontend/src/pages/NoMatch.tsx
  133. +58 −0 frontend/src/pages/Tts/Tts.tsx
  134. +61 −0 frontend/src/pages/Tts/actions.tsx
  135. +81 −0 frontend/src/pages/Tts/reducer.tsx
  136. +16 −0 frontend/src/pages/Tts/selectors.tsx
  137. +1 −0 frontend/src/react-app-env.d.ts
  138. +15 −0 frontend/src/reportWebVitals.ts
  139. +23 −0 frontend/src/setupProxy.js
  140. +5 −0 frontend/src/setupTests.ts
  141. +20 −0 frontend/src/store.jsx
  142. +26 −0 frontend/tsconfig.json
  143. +6 −0 number_utils/russian_numbers.py
  144. +28 −28 number_utils/text2numbers.py
  145. +20 −0 punctuator_app.py
  146. +4 −1 requirements.txt
  147. +95 −21 speech_recognizer.py
  148. +107 −0 vad.py
181 changes: 87 additions & 94 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,115 +1,108 @@
# SOVA ASR
# Система автопротоколирования конференций в онлайн режиме

SOVA ASR is a fast speech recognition solution based on [Wav2Letter](https://arxiv.org/abs/1609.03193) architecture. It is designed as a REST API service and it can be customized (both code and models) for your needs.
## Системные требования:
Операционная система, поддерживающая работу с Docker, предпочтительно Ubuntu 20.04, минимум 16 GB RAM, минимум 4 ядра, процессор с тактовой частотой не ниже 2.50 GHz, видеокарта NVIDIA с объёмом графической памяти не меньше 8 GB, 15 GB свободного места на SSD.

## Installation
Рекомендуемая конфигурация: инстанс типа g4dn.2xlarge в AWS с Ubuntu 20.04 и 50 GB SSD.

The easiest way to deploy the service is via docker-compose, so you have to install Docker and docker-compose first. Here's a brief instruction for Ubuntu:

#### Docker installation
## Инструкция по разворачиванию:
Клонируем репозиторий и переходим в папку проекта:
```
git clone https://github.com/sxdxfan/sova-asr
cd sova-asr
```

* Install Docker:
```bash
$ sudo apt-get update
$ sudo apt-get install \
Устанавливаем Docker и docker-compose с поддержкой NVIDIA:
```
sudo apt-get update
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
$ sudo usermod -aG docker $(whoami)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $(whoami)
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install nvidia-container-runtime
sudo echo -e '{\n "runtimes": {\n "nvidia": {\n "path": "nvidia-container-runtime",\n "runtimeArgs": []\n }\n },\n "default-runtime": "nvidia"\n}' >> /etc/docker/daemon.json
sudo systemctl restart docker.service
```
In order to run docker commands without sudo you might need to relogin.
* Install docker-compose:

Скачиваем и разворачиваем веса моделей:
```
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
wget http://dataset.sova.ai/SOVA-ASR/data.tar.gz
tar -xvf data.tar.gz && rm data.tar.gz
```

* (Optional) If you're planning on using CUDA run these commands:
Запускаем бэкенд часть (поднимутся сервисы на портах 8888, 8889, 8890):
```
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
sudo apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
$ sudo apt-get update
$ sudo apt-get install nvidia-container-runtime
sudo docker-compose build
sudo docker-compose up -d sova-asr sova-asr-decoder sova-asr-punctuator
```
Add the following content to the file **/etc/docker/daemon.json**:
```json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
```
Restart the service:
```bash
$ sudo systemctl restart docker.service
```

#### Build and deploy

**In order to run service with pretrained models you will have to download http://dataset.sova.ai/SOVA-ASR/data.tar.gz.**

* Clone the repository, download the pretrained models archive and extract the contents into the project folder:
```bash
$ git clone --recursive https://github.com/sovaai/sova-asr.git
$ cd sova-asr/
$ wget http://dataset.sova.ai/SOVA-ASR/data.tar.gz
$ tar -xvf data.tar.gz && rm data.tar.gz
Переходим в подпапку с фронтендом и устанавливаем зависимости:
```

* Build docker image
* If you're planning on using GPU (it is required for training and can be used for inference): build *sova-asr* image using the following command:
```bash
$ sudo docker-compose build
```
* If you're planning on using CPU only: modify `Dockerfile`, `docker-compose.yml` (remove the runtime and environment sections) and `config.ini` (*cpu* should be set to 0) and build *sova-asr* image:
```bash
$ sudo docker-compose build
```
* Run web service in a docker container
```bash
$ sudo docker-compose up -d sova-asr
```
## Testing
To test the service you can send a POST request:
```bash
$ curl --request POST 'http://localhost:8888/asr' --form 'audio_blob=@"data/test.wav"'
cd frontend
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt-get update
sudo apt-get install -y --upgrade npm node-gyp nodejs-dev libssl1.0-dev yarn
sudo npm install -g n
sudo n stable
yarn install
```

## Finetuning acoustic model
If you want to finetune the acoustic model you can set hyperparameters and paths to your own train and validation manifest files and run the training service.
* Set training options in *Train* section of **config.ini**. Train and validation csv manifest files should contain comma-separated audio file paths and reference texts in each line. For instance:
```bash
data/audio/000000.wav,добрый день
data/audio/000001.wav,как ваши дела
...
```
* Run training in docker container:
```bash
$ sudo docker-compose up -d sova-asr-train
```
Производим билд:
```
yarn build
```

## Customizations
После билда в папке фронтенда появится подпапка build, к которой необходимо указать путь в конфигурации веб сервера (например, nginx). Также необходимо сконфигурировать пути обращений к API бэкенда. Пример конфигурации nginx:

If you want to train your own acoustic model refer to [PuzzleLib tutorials](https://puzzlelib.org/tutorials/Wav2Letter/). Check [KenLM documentation](https://kheafield.com/code/kenlm/) for building your own language model. This repository was tested on Ubuntu 18.04 and has pre-built .so Trie decoder files for Python 3.6 running inside the Docker container, for modifications you can get your own .so files using [Wav2Letter++](https://github.com/facebookresearch/wav2letter) code for building Python bindings. Otherwise you can use a standard Greedy decoder (set in config.ini).
```
server {
index index.html index.php index.htm index.php;
add_header X-Frame-Options "SAMEORIGIN";
add_header X-Content-Type-Options "nosniff";
client_max_body_size 700M;
proxy_connect_timeout 600;
proxy_send_timeout 600;
proxy_read_timeout 600;
send_timeout 600;
location = /robots.txt {
add_header Content-Type text/plain;
return 200 "User-agent: *\nDisallow: /\n";
}
location / {
index index.html index.php index.htm index.php;
root /var/www/sova-asr/frontend/build;
client_max_body_size 256M;
try_files $uri $uri/ /index.html;
}
location /asr {
proxy_pass http://localhost:8888;
client_max_body_size 700M;
}
server_name SERVER_NAME;
listen 443 ssl http2;
ssl_certificate SSL_CERTIFICATE;
ssl_certificate_key SSL_CERTIFICATE_KEY;
access_log /var/log/nginx/asr-access.log;
error_log /var/log/nginx/asr-error.log;
}
```
10 changes: 7 additions & 3 deletions app.py
Original file line number Diff line number Diff line change
@@ -13,19 +13,23 @@ def index():

@app.route('/asr', methods=['POST'])
def asr():
host_url = "https://asr-contest.nanosemantics.ai"
res = []
for f in request.files:
if f.startswith('audio_blob') and FileHandler.check_format(request.files[f]):

response_code, filename, response = FileHandler.get_recognized_text(request.files[f])
response_code, audio_file, docx_file, response = FileHandler.get_recognized_text(request.files[f])

if response_code == 0:
response_audio_url = url_for('media_file', filename=filename)
response_audio_url = url_for('media_file', filename=audio_file)
response_docx_url = url_for('media_file', filename=docx_file)
else:
response_audio_url = None
response_docx_url = None

res.append({
'response_audio_url': response_audio_url,
'response_docx_url': host_url + response_docx_url if response_docx_url else '',
'response_audio_url': host_url + response_audio_url if response_audio_url else '',
'response_code': response_code,
'response': response,
})
15 changes: 12 additions & 3 deletions config.ini
Original file line number Diff line number Diff line change
@@ -6,10 +6,10 @@ labels = [_-абвгдеёжзийклмнопрстуфхцчшщъыьэюя ]
model_path = data/w2l-16khz.hdf

# Path to language model
lm_path = data/vosk/lm.klm
lm_path = data/lm/lm.klm

# Path to the lexicon file
lexicon = data/vosk/lexicon.txt
lexicon = data/lm/lexicon.txt

# Path to prediction tokens file
tokens = data/tokens.txt
@@ -32,6 +32,15 @@ window_size = 0.02
# Window stride in seconds for acoustic model samples
window_stride = 0.01

# Voice Activity Detector aggressiveness mode (0-3)
vad_aggressiveness_mode = 3

# Voice Activity Detector frame duration in milliseconds
vad_frame_duration_ms = 10

# Voice Activity Detector maximum pause duration in milliseconds
vad_max_pause_ms = 500


[Train]
# Path to train manifest csv
@@ -62,4 +71,4 @@ checkpoint_per_batch = 1000
save_folder = Checkpoints/

# Continue from checkpoint model
continue_from = data/w2l-16khz.hdf
continue_from = data/w2l-16khz.hdf
5 changes: 2 additions & 3 deletions data_loader.py
Original file line number Diff line number Diff line change
@@ -18,11 +18,10 @@ def load_audio(path, sample_rate):
sound = sound.set_channels(1)
sound = sound.set_sample_width(2)

return np.array(sound.get_array_of_samples()).astype(float)
return sound


def preprocess(audio_path, sample_rate=16000, window_size=0.02, window_stride=0.01, window='hamming'):
audio = load_audio(audio_path, sample_rate)
def preprocess(audio, sample_rate=16000, window_size=0.02, window_stride=0.01, window='hamming'):
nfft = int(sample_rate * window_size)
win_length = nfft
hop_length = int(sample_rate * window_stride)
21 changes: 13 additions & 8 deletions decoder.py
Original file line number Diff line number Diff line change
@@ -59,7 +59,7 @@ def decode(self, output, start_timestamp=0, frame_time=0.02):


class TrieDecoder:
def __init__(self, lexicon, tokens, lm_path, beam_threshold=30):
def __init__(self, lexicon, tokens, lm_path, beam_threshold=10):
from trie_decoder.common import Dictionary, create_word_dict, load_words
from trie_decoder.decoder import CriterionType, DecoderOptions, KenLM, LexiconDecoder
lexicon = load_words(lexicon)
@@ -101,12 +101,16 @@ def get_trie(self, lexicon):

return trie, sil_idx, blank_idx, unk_idx

def decode(self, output, start_timestamp=0, frame_time=0.02):
def decode(self, output, start_timestamp=0, frame_time=0.02, max_decoder_len=500):
output = np.log(softmax(output[:, :].astype(np.float32, copy=False), axis=-1))

t, n = output.shape
result = self.trieDecoder.decode(output.ctypes.data, t, n)[0]
tokens = result.tokens
results = []
for i in range(1 + output.shape[0] // max_decoder_len):
output_part = output[i * max_decoder_len:(i + 1) * max_decoder_len]
t, n = output_part.shape
results.append(self.trieDecoder.decode(output_part.ctypes.data, t, n)[0])

tokens = [token for result in results for token in result.tokens]

words, new_word = [], True
current_word, current_timestamp, start_idx, end_idx = None, start_timestamp, 0, 0
@@ -134,14 +138,15 @@ def decode(self, output, start_timestamp=0, frame_time=0.02):
words_len += end_idx - start_idx
words.append({
"word": current_word,
"start": np.round(current_timestamp, 2),
"timestamp": max(0.0, np.round(current_timestamp, 2) - 0.2),
"end": np.round(end_timestamp, 2),
"confidence": np.round(np.exp(word_lm_score / max(1, end_idx - start_idx)) * 100, 2)
"confidence": np.round(np.exp(word_lm_score / max(10, end_idx - start_idx)) * 100, 2)
})

else:
current_word += self.tokenDict.get_entry(k)

score = np.round(np.exp(result.score / max(1, words_len)), 2)
score = np.mean([result.score for result in results])
score = np.round(np.exp(score / max(1, words_len)), 2)

return DecodeResult(score, words)
31 changes: 31 additions & 0 deletions decoder_app.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
import configparser
from decoder import TrieDecoder
from flask import Flask, request
import json
import numpy as np


config = configparser.ConfigParser()
config.read("config.ini", encoding="UTF-8")
lexicon = config["Wav2Letter"]["lexicon"]
tokens = config["Wav2Letter"]["tokens"]
lm_path = config["Wav2Letter"]["lm_path"]
beam_threshold = float(config["Wav2Letter"]["beam_threshold"])
decoder = TrieDecoder(lexicon, tokens, lm_path, beam_threshold)

app = Flask(__name__)


@app.route("/decode", methods=["POST"])
def decode():
    """Decode acoustic-model outputs into text with word timings.

    Expects a JSON body with:
      * "outputs" — 2-D list of per-frame token scores from the acoustic model,
      * "start_timestamp" — offset (seconds) added to the word timestamps.

    Returns a JSON string with "text", "score" and "words" keys.
    """
    data = request.json
    outputs = np.array(data["outputs"])
    result = decoder.decode(outputs, start_timestamp=data["start_timestamp"])

    results = {
        "text": result.text,
        "score": result.score,
        "words": result.words,
    }

    # The decoder builds its score/word values with np.round/np.mean, so the
    # result contains numpy scalars, which json.dumps cannot serialize.
    # default=float coerces any such scalar to a plain Python float.
    return json.dumps(results, ensure_ascii=False, default=float)
Loading