Metrify RawNet3/Resemblyzer as Keywords & Update READMEs #85

Merged · 6 commits · Jan 6, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -73,7 +73,7 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th
- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
- **Intelligibility**: Character/Word Error Rate, which can be calculated based on [Whisper](https://github.com/openai/whisper) and more.
- **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
- **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), and more.
- **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [Resemblyzer](https://github.com/resemble-ai/Resemblyzer), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), and more.

### Datasets

25 changes: 7 additions & 18 deletions bins/calc_metrics.py
@@ -50,7 +50,8 @@
"v_uv_f1": extract_f1_v_uv,
"cer": extract_cer,
"wer": extract_wer,
"speaker_similarity": extract_speaker_similarity,
"rawnet3_similarity": extract_speaker_similarity,
"resemblyzer_similarity": extract_resemblyzer_similarity,
"fad": extract_fad,
"mcd": extract_mcd,
"mstft": extract_mstft,
@@ -65,23 +66,11 @@ def calc_metric(ref_dir, deg_dir, dump_dir, metrics, fs=None):
result = defaultdict()

for metric in tqdm(metrics):
if metric == "speaker_similarity":
print("Select the model to use for speaker similarity:")
print("(1) RawNet3")
print("(2) Resemblyzer")
model_choice = input("Enter the number of your choice: ").strip()

if model_choice not in ["1", "2"]:
print("Invalid choice. Exiting the program.")
sys.exit(1)

if model_choice == "1":
result[metric] = str(METRIC_FUNC[metric](ref_dir, deg_dir))
elif model_choice == "2":
similarity_score = extract_resemblyzer_similarity(
deg_dir, ref_dir, dump_dir
)
result[metric] = str(similarity_score)
if metric in ["fad", "rawnet3_similarity"]:
> **Collaborator:** Why is "fad" here? It seems that the original code does not contain a conditional check for fad. Is there a bug in the old code?
>
> **Merakist (Author), Jan 1, 2024:** Yes. When modifying the input-selection part of the old code, the FAD part was mistakenly deleted. It should be here, as shown in the original commit, at line 64: 9682d0c#diff-4fa833e1c8dd8d05d182f8262a2cc5f727dc72a364db06f8acc5536eff3e6506
>
> **Merakist (Author):** Owing to the firewalled status of huggingface.co on the Aliyun server, this went undetected: before calc_metrics.py could be tested, all parts concerning FAD had to be commented out, or else the script could not initialize correctly.
>
> **Collaborator:** @VocodexElysium Please review this.
>
> **Collaborator:** Basically, if your internet environment cannot reach Hugging Face from the terminal, importing FAD will raise errors, since it tries to connect to Hugging Face. You can avoid this by setting up a suitable VPN environment, or by downloading the necessary files yourself and adjusting the FAD code to load the model from your machine rather than from Hugging Face (I think MingXuan did this successfully). So I don't think removing the FAD-related code is necessary, since the issue only concerns the internet environment and is solvable.

result[metric] = str(METRIC_FUNC[metric](ref_dir, deg_dir))
continue
elif metric in ["resemblyzer_similarity"]:
result[metric] = str(METRIC_FUNC[metric](deg_dir, ref_dir, dump_dir))
continue

audios_ref = []
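The new dispatch in `calc_metric` replaces the interactive model prompt with a keyword lookup. A minimal, dependency-free sketch of that pattern (the metric functions below are placeholders for illustration, not Amphion's real extractors):

```python
# Sketch of the keyword-dispatch pattern used in calc_metrics.py.
# The metric functions here are stand-ins, not Amphion's real extractors.

def fake_rawnet3_similarity(ref_dir, deg_dir):
    return 0.9  # placeholder score

def fake_resemblyzer_similarity(deg_dir, ref_dir, dump_dir):
    return 0.8  # placeholder score; note the different argument order

METRIC_FUNC = {
    "rawnet3_similarity": fake_rawnet3_similarity,
    "resemblyzer_similarity": fake_resemblyzer_similarity,
}

def calc_metric(ref_dir, deg_dir, dump_dir, metrics):
    result = {}
    for metric in metrics:
        if metric in ["fad", "rawnet3_similarity"]:
            # These metrics take (ref_dir, deg_dir) and load audio themselves.
            result[metric] = str(METRIC_FUNC[metric](ref_dir, deg_dir))
            continue
        elif metric in ["resemblyzer_similarity"]:
            # Resemblyzer takes (deg_dir, ref_dir, dump_dir).
            result[metric] = str(METRIC_FUNC[metric](deg_dir, ref_dir, dump_dir))
            continue
        # ... frame-level metrics would load audio pairs here ...
    return result

print(calc_metric("ref", "deg", "dump",
                  ["rawnet3_similarity", "resemblyzer_similarity"]))
```

Each keyword maps directly to a function, so adding a new backend only requires a new dictionary entry rather than another interactive prompt.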
87 changes: 68 additions & 19 deletions egs/metrics/README.md
@@ -25,6 +25,7 @@ Until now, Amphion Evaluation has supported the following objective metrics:
- Scale Invariant Signal to Noise Ratio (SISNR)
- **Speaker Similarity**:
- Cosine similarity based on [Rawnet3](https://github.com/Jungjee/RawNet)
- Cosine similarity based on [Resemblyzer](https://github.com/resemble-ai/Resemblyzer)
- Cosine similarity based on [WeSpeaker](https://github.com/wenet-e2e/wespeaker) (👨‍💻 developing)

We provide a recipe to demonstrate how to objectively evaluate your generated audios. There are three steps in total:
@@ -37,7 +38,7 @@ We provide a recipe to demonstrate how to objectively evaluate your generated au

If you want to calculate `RawNet3` based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).

## 2. Aduio Data Preparation
## 2. Audio Data Preparation

Prepare reference audios and generated audios in two folders: `ref_dir` contains the reference audio and `gen_dir` contains the generated audio. Here is an example.

@@ -74,21 +75,69 @@ As for the metrics, an example is provided below:

All currently available metrics keywords are listed below:

| Keys | Description |
| --------------------- | ------------------------------------------ |
| `fpc` | F0 Pearson Coefficients |
| `f0_periodicity_rmse` | F0 Periodicity Root Mean Square Error |
| `f0rmse` | F0 Root Mean Square Error |
| `v_uv_f1` | Voiced/Unvoiced F1 Score |
| `energy_rmse` | Energy Root Mean Square Error |
| `energy_pc` | Energy Pearson Coefficients |
| `cer` | Character Error Rate |
| `wer` | Word Error Rate |
| `speaker_similarity` | Cos Similarity based on RawNet3 |
| `fad` | Frechet Audio Distance |
| `mcd` | Mel Cepstral Distortion |
| `mstft` | Multi-Resolution STFT Distance |
| `pesq` | Perceptual Evaluation of Speech Quality |
| `si_sdr` | Scale Invariant Signal to Distortion Ratio |
| `si_snr` | Scale Invariant Signal to Noise Ratio |
| `stoi` | Short Time Objective Intelligibility |
| Keys | Description |
| ------------------------- | ------------------------------------------ |
| `fpc` | F0 Pearson Coefficients |
| `f0_periodicity_rmse` | F0 Periodicity Root Mean Square Error |
| `f0rmse` | F0 Root Mean Square Error |
| `v_uv_f1` | Voiced/Unvoiced F1 Score |
| `energy_rmse` | Energy Root Mean Square Error |
| `energy_pc` | Energy Pearson Coefficients |
| `cer` | Character Error Rate |
| `wer` | Word Error Rate |
| `rawnet3_similarity` | Cos Similarity based on RawNet3 |
| `resemblyzer_similarity` | Cos Similarity based on Resemblyzer |
| `fad` | Frechet Audio Distance |
| `mcd` | Mel Cepstral Distortion |
| `mstft` | Multi-Resolution STFT Distance |
| `pesq` | Perceptual Evaluation of Speech Quality |
| `si_sdr` | Scale Invariant Signal to Distortion Ratio |
| `si_snr` | Scale Invariant Signal to Noise Ratio |
| `stoi` | Short Time Objective Intelligibility |
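Since the old `speaker_similarity` keyword has been replaced by explicit backend keys, metrics lists written against the old scheme need updating. A hypothetical migration helper (not part of Amphion) that maps the removed key to `rawnet3_similarity` — the backend the old key dispatched to — and validates against the table above:

```python
# Hypothetical helper (not in Amphion) for migrating a metrics list from the
# old keyword scheme to the new explicit keys.

VALID_KEYS = {
    "fpc", "f0_periodicity_rmse", "f0rmse", "v_uv_f1", "energy_rmse",
    "energy_pc", "cer", "wer", "rawnet3_similarity", "resemblyzer_similarity",
    "fad", "mcd", "mstft", "pesq", "si_sdr", "si_snr", "stoi",
}

def migrate_metrics(metrics):
    """Map the removed `speaker_similarity` key to `rawnet3_similarity`
    (the backend it previously dispatched to) and reject unknown keywords."""
    migrated = ["rawnet3_similarity" if m == "speaker_similarity" else m
                for m in metrics]
    unknown = [m for m in migrated if m not in VALID_KEYS]
    if unknown:
        raise ValueError(f"Unknown metric keywords: {unknown}")
    return migrated

print(migrate_metrics(["speaker_similarity", "fad", "pesq"]))
```

To use Resemblyzer instead, replace the old key with `resemblyzer_similarity` explicitly, since the interactive model prompt no longer exists.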



## Troubleshooting
### FAD (Using Offline Models)
If your system is unable to access huggingface.co from the terminal, you might run into an error like "OSError: Can't load tokenizer for ...". To work around this, follow these steps to use local models:

1. Download the [bert-base-uncased](https://huggingface.co/bert-base-uncased), [roberta-base](https://huggingface.co/roberta-base), and [facebook/bart-base](https://huggingface.co/facebook/bart-base) models from `huggingface.co`. Ensure that the models are complete and uncorrupted. Place these directories within `Amphion/pretrained`. For a detailed file structure reference, see [This README](../../pretrained/README.md#optional-model-dependencies-for-evaluation) under `Amphion/pretrained`.
2. Inside the `Amphion/pretrained` directory, create a bash script with the content outlined below. This script will automatically update the tokenizer paths used by your system:
```bash
#!/bin/bash

BERT_DIR="bert-base-uncased"
ROBERTA_DIR="roberta-base"
BART_DIR="facebook/bart-base"
PYTHON_SCRIPT="[YOUR ENV PATH]/lib/python3.9/site-packages/laion_clap/training/data.py"

update_tokenizer_path() {
local dir_name=$1
local tokenizer_variable=$2
local full_path

if [ -d "$dir_name" ]; then
full_path=$(realpath "$dir_name")
if [ -f "$PYTHON_SCRIPT" ]; then
sed -i "s|${tokenizer_variable}.from_pretrained(\".*\")|${tokenizer_variable}.from_pretrained(\"$full_path\")|" "$PYTHON_SCRIPT"
echo "Updated ${tokenizer_variable} path to $full_path."
else
echo "Error: The specified Python script does not exist."
exit 1
fi
else
echo "Error: The directory $dir_name does not exist in the current directory."
exit 1
fi
}

update_tokenizer_path "$BERT_DIR" "BertTokenizer"
update_tokenizer_path "$ROBERTA_DIR" "RobertaTokenizer"
update_tokenizer_path "$BART_DIR" "BartTokenizer"

echo "BERT, BART and RoBERTa Python script paths have been updated."

```

3. The script provided is intended to adjust the tokenizer paths in the `data.py` file, found under `/lib/python3.9/site-packages/laion_clap/training/`, within your specific environment. For those utilizing conda, you can determine your environment path by running `conda info --envs`. Then, substitute `[YOUR ENV PATH]` in the script with this path. If your environment is configured differently, you'll need to update the `PYTHON_SCRIPT` variable to correctly point to the `data.py` file.
4. Run the script. If it executes successfully, the tokenizer paths will be updated, allowing them to be loaded locally.
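For readers who prefer to see the substitution logic without running `sed`, the same rewrite can be sketched in Python (the `sample` line is illustrative; the real target is the `from_pretrained` calls inside laion_clap's `data.py`):

```python
# Python equivalent of the bash script's sed substitution: rewrite a
# `from_pretrained("...")` argument to point at a local model directory.
import re

def update_tokenizer_path(source, tokenizer_variable, full_path):
    # Match e.g. BertTokenizer.from_pretrained("anything") and swap the path.
    pattern = re.escape(tokenizer_variable) + r'\.from_pretrained\(".*"\)'
    replacement = f'{tokenizer_variable}.from_pretrained("{full_path}")'
    return re.sub(pattern, replacement, source)

sample = 'tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")'
print(update_tokenizer_path(
    sample, "BertTokenizer", "/opt/Amphion/pretrained/bert-base-uncased"))
```

Like the `sed` expression, the pattern is greedy within one line, so it assumes each `from_pretrained` call sits on a single line with a quoted string argument.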
6 changes: 4 additions & 2 deletions evaluation/metrics/similarity/resemblyzer_similarity.py
@@ -7,8 +7,8 @@
import torch
import numpy as np
import pandas as pd
import torch.nn.functional as F
from resemblyzer import VoiceEncoder, preprocess_wav
from scipy.spatial.distance import cosine


def load_wavs(directory):
@@ -40,7 +40,9 @@ def calculate_cosine_similarity(embeddings1, embeddings2, names1, names2):
similarity_info = []
for i, emb1 in enumerate(embeddings1):
for j, emb2 in enumerate(embeddings2):
similarity = 1 - cosine(emb1, emb2)
emb1_tensor = torch.tensor(emb1).unsqueeze(0)
emb2_tensor = torch.tensor(emb2).unsqueeze(0)
similarity = F.cosine_similarity(emb1_tensor, emb2_tensor)
similarity_info.append(
{"Reference": names2[j], "Target": names1[i], "Similarity": similarity}
)
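The diff swaps `1 - scipy.spatial.distance.cosine(...)` for `torch.nn.functional.cosine_similarity(...)`; both compute the same quantity, since scipy returns the cosine *distance* (one minus the similarity). A dependency-free sketch of that quantity — note that `F.cosine_similarity` returns a one-element tensor, so downstream code that needs a plain float may want to call `.item()` on it:

```python
# The quantity both APIs compute: cos(a, b) = (a · b) / (|a| * |b|).
# scipy's `cosine` returns 1 - cos(a, b); F.cosine_similarity returns cos(a, b).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

emb1 = [0.1, 0.3, 0.5]
emb2 = [0.2, 0.1, 0.4]
sim = cosine_similarity(emb1, emb2)   # what F.cosine_similarity returns
dist = 1 - sim                        # what scipy's cosine distance returns
print(sim, dist)
```

The switch therefore changes the dependency and the return type, not the metric itself.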
84 changes: 84 additions & 0 deletions pretrained/README.md
@@ -128,3 +128,87 @@ Amphion
┃ ┃ ┣ model.pt
```


# (Optional) Model Dependencies for Evaluation
When utilizing Amphion's Evaluation Pipelines, terminals without access to `huggingface.co` may encounter error messages such as "OSError: Can't load tokenizer for ...". To work around this, the dependent models for evaluation can be prepared in advance and stored here, at `Amphion/pretrained`; then follow [this README](../egs/metrics/README.md#troubleshooting) to configure your environment to load local models.

The dependent models of Amphion's evaluation pipeline are as follows (sorted alphabetically):

- [Evaluation Pipeline Models Dependency](#optional-model-dependencies-for-evaluation)
- [bert-base-uncased](#bert-base-uncased)
- [facebook/bart-base](#facebookbart-base)
- [roberta-base](#roberta-base)

Instructions on how to download each of them are given below.

## bert-base-uncased

To load `bert-base-uncased` locally, follow [this link](https://huggingface.co/bert-base-uncased) to download all files of the `bert-base-uncased` model, and store them under `Amphion/pretrained/bert-base-uncased`, conforming to the following file structure tree:

```
Amphion
┣ pretrained
┃ ┣ bert-base-uncased
┃ ┃ ┣ config.json
┃ ┃ ┣ coreml
┃ ┃ ┃ ┣ fill-mask
┃ ┃ ┃ ┣ float32_model.mlpackage
┃ ┃ ┃ ┣ Data
┃ ┃ ┃ ┣ com.apple.CoreML
┃ ┃ ┃ ┣ model.mlmodel
┃ ┃ ┣ flax_model.msgpack
┃ ┃ ┣ LICENSE
┃ ┃ ┣ model.onnx
┃ ┃ ┣ model.safetensors
┃ ┃ ┣ pytorch_model.bin
┃ ┃ ┣ README.md
┃ ┃ ┣ rust_model.ot
┃ ┃ ┣ tf_model.h5
┃ ┃ ┣ tokenizer_config.json
┃ ┃ ┣ tokenizer.json
┃ ┃ ┣ vocab.txt
```

## facebook/bart-base

To load `facebook/bart-base` locally, follow [this link](https://huggingface.co/facebook/bart-base) to download all files of the `facebook/bart-base` model, and store them under `Amphion/pretrained/facebook/bart-base`, conforming to the following file structure tree:

```
Amphion
┣ pretrained
┃ ┣ facebook
┃ ┃ ┣ bart-base
┃ ┃ ┃ ┣ config.json
┃ ┃ ┃ ┣ flax_model.msgpack
┃ ┃ ┃ ┣ gitattributes.txt
┃ ┃ ┃ ┣ merges.txt
┃ ┃ ┃ ┣ model.safetensors
┃ ┃ ┃ ┣ pytorch_model.bin
┃ ┃ ┃ ┣ README.txt
┃ ┃ ┃ ┣ rust_model.ot
┃ ┃ ┃ ┣ tf_model.h5
┃ ┃ ┃ ┣ tokenizer.json
┃ ┃ ┃ ┣ vocab.json
```

## roberta-base

To load `roberta-base` locally, follow [this link](https://huggingface.co/roberta-base) to download all files of the `roberta-base` model, and store them under `Amphion/pretrained/roberta-base`, conforming to the following file structure tree:

```
Amphion
┣ pretrained
┃ ┣ roberta-base
┃ ┃ ┣ config.json
┃ ┃ ┣ dict.txt
┃ ┃ ┣ flax_model.msgpack
┃ ┃ ┣ gitattributes.txt
┃ ┃ ┣ merges.txt
┃ ┃ ┣ model.safetensors
┃ ┃ ┣ pytorch_model.bin
┃ ┃ ┣ README.txt
┃ ┃ ┣ rust_model.ot
┃ ┃ ┣ tf_model.h5
┃ ┃ ┣ tokenizer.json
┃ ┃ ┣ vocab.json
```
8 changes: 8 additions & 0 deletions pretrained/bert-base-uncased/README.md
@@ -0,0 +1,8 @@
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/bert-base-uncased)

# Download

- [Link](https://huggingface.co/bert-base-uncased)
- Model: `bert-base-uncased`
- Download the latest files under the `Files and versions` tab.
- Overwrite this file if necessary.
8 changes: 8 additions & 0 deletions pretrained/facebook/bart-base/README.md
@@ -0,0 +1,8 @@
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/facebook/bart-base)

# Download

- [Link](https://huggingface.co/facebook/bart-base)
- Model: `facebook/bart-base`
- Download the latest files under the `Files and versions` tab.
- Overwrite this file if necessary.
8 changes: 8 additions & 0 deletions pretrained/roberta-base/README.md
@@ -0,0 +1,8 @@
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/roberta-base)

# Download

- [Link](https://huggingface.co/roberta-base)
- Model: `roberta-base`
- Download the latest files under the `Files and versions` tab.
- Overwrite this file if necessary.