Commit 703167e
2.1.6 (#562)
mmcauliffe authored Feb 14, 2023
1 parent f6f5cc6
Showing 58 changed files with 2,272 additions and 203 deletions.
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -36,4 +36,5 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- id: check-added-large-files
args: ['--maxkb=2000']
- id: mixed-line-ending
265 changes: 265 additions & 0 deletions docs/source/_static/sound_files/english_t.svg
Binary file added docs/source/_static/sound_files/english_t.wav
124 changes: 124 additions & 0 deletions docs/source/_static/sound_files/english_t_it's.svg
125 changes: 125 additions & 0 deletions docs/source/_static/sound_files/english_t_it.svg
137 changes: 137 additions & 0 deletions docs/source/_static/sound_files/english_t_itself.svg
128 changes: 128 additions & 0 deletions docs/source/_static/sound_files/english_t_just.svg
127 changes: 127 additions & 0 deletions docs/source/_static/sound_files/english_t_onto.svg
131 changes: 131 additions & 0 deletions docs/source/_static/sound_files/english_t_righted.svg
120 changes: 120 additions & 0 deletions docs/source/_static/sound_files/english_t_stop.svg
128 changes: 128 additions & 0 deletions docs/source/_static/sound_files/english_t_tipped.svg
128 changes: 128 additions & 0 deletions docs/source/_static/sound_files/english_t_to.svg
133 changes: 133 additions & 0 deletions docs/source/_static/sound_files/english_t_top.svg
128 changes: 128 additions & 0 deletions docs/source/_static/sound_files/english_t_truck.svg
18 changes: 18 additions & 0 deletions docs/source/changelog/changelog_2.1.rst
@@ -5,6 +5,24 @@
2.1 Changelog
*************

2.1.6
=====

- Fixed an issue where the ``ignore_case`` flag was not respected
- Fixed a hang in speaker diarization
- Fixed an error related to paths ending in trailing slashes which caused MFA to try to connect to a database named after the local user
- Partial migration to using :class:`pathlib.Path` instead of :mod:`os.path`

2.1.5
=====

- Fix for improperly reset databases

2.1.4
=====

- Change how database connections are made to remove pooling

2.1.3
=====

4 changes: 1 addition & 3 deletions docs/source/changelog/index.md
@@ -1,7 +1,5 @@

.. _news:


(news)=
# News

## Roadmap
5 changes: 3 additions & 2 deletions docs/source/conf.py
@@ -344,14 +344,15 @@
"text": "Montreal Forced Aligner",
# "image_dark": "logo-dark.svg",
},
"google_analytics_id": "UA-73068199-4",
"analytics": {
"google_analytics_id": "UA-73068199-4",
},
# "show_nav_level": 1,
# "navigation_depth": 4,
# "show_toc_level": 2,
# "collapse_navigation": True,
}
html_context = {
# "github_url": "https://github.com", # or your GitHub Enterprise interprise
"github_user": "MontrealCorpusTools",
"github_repo": "Montreal-Forced-Aligner",
"github_version": "main",
3 changes: 1 addition & 2 deletions docs/source/user_guide/implementations/index.md
@@ -1,6 +1,5 @@

(tutorials_index)=
# Implementation details
# In depth guides

:::{warning}
This section is under construction!
165 changes: 165 additions & 0 deletions docs/source/user_guide/implementations/phone_groups.md
@@ -1,2 +1,167 @@

# Phone groups

When training an acoustic model, MFA begins by training a monophone model, where each phone is context-independent. Consider an English ARPABET model as an example. A {ipa_inline}`[T]` is modeled the same regardless of:
* Whether it's word initial
* Whether it follows an {ipa_inline}`[S]`
* Whether it's the onset of a stressed syllable
* Whether it's the onset of an unstressed syllable
* Whether it's word final
* Whether it's followed by {ipa_inline}`[R]`

For each of these cases, the acoustic model will proceed through the same HMM states with the same GMM PDFs (probability density functions).
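
As a rough sketch of what context-independence means (the class and names here are illustrative, not MFA's internals), a monophone model keys its HMM states and PDFs on the phone symbol alone:

:::{code} python
# Illustrative sketch only; MFA/Kaldi's actual data structures differ.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MonophoneModel:
    """Context-independent model: one set of PDFs per phone symbol."""

    num_states: int = 3  # typical 3-state left-to-right HMM topology
    pdfs: Dict[str, List[int]] = field(default_factory=dict)

    def pdf_ids(self, phone: str) -> List[int]:
        # Every [T] gets the same states/PDFs, whether it's in
        # "stop", "top", or "it" -- context is ignored entirely.
        if phone not in self.pdfs:
            start = len(self.pdfs) * self.num_states
            self.pdfs[phone] = list(range(start, start + self.num_states))
        return self.pdfs[phone]


model = MonophoneModel()
assert model.pdf_ids("T") == model.pdf_ids("T")  # identical in every context
:::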



:::::{tab-set}

::::{tab-item} Full utterance

The truck righted itself just before it tipped over onto it's top and came to a full stop.

:::{raw} html

<div class="align-center">
<audio controls="controls">
<source src="../../_static/sound_files/english_t.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.</audio>
</div>
:::

:::{figure} ../../_static/sound_files/english_t.svg
:align: center
Waveform, spectrogram, and aligned labels for the full reading of the English text
:::

::::

::::{tab-item} truck

:::{figure} ../../_static/sound_files/english_t_truck.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "truck", realized as {ipa_inline}`[tʃ]`
:::

::::

::::{tab-item} righted

:::{figure} ../../_static/sound_files/english_t_righted.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "righted", realized as {ipa_inline}`[ɾ]`
:::

::::

::::{tab-item} itself

:::{figure} ../../_static/sound_files/english_t_itself.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "itself", realized as {ipa_inline}`[t̚]`
:::

::::

::::{tab-item} just

:::{figure} ../../_static/sound_files/english_t_just.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "just"
:::

::::

::::{tab-item} it

:::{figure} ../../_static/sound_files/english_t_it.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "it", realized as {ipa_inline}`[t̚]`
:::

::::

::::{tab-item} tipped

:::{figure} ../../_static/sound_files/english_t_tipped.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "tipped", realized as {ipa_inline}`[tʰ]`
:::

::::

::::{tab-item} onto

:::{figure} ../../_static/sound_files/english_t_onto.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "onto", realized as {ipa_inline}`[tʰ]`
:::

::::

::::{tab-item} it's

:::{figure} ../../_static/sound_files/english_t_it's.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "it's", realized as {ipa_inline}`[t]`
:::

::::

::::{tab-item} top

:::{figure} ../../_static/sound_files/english_t_top.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "top", realized as {ipa_inline}`[tʰ]`
:::

::::

::::{tab-item} to

:::{figure} ../../_static/sound_files/english_t_to.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "to", realized as {ipa_inline}`[tʰ]`
:::

::::

::::{tab-item} stop

:::{figure} ../../_static/sound_files/english_t_stop.svg
:align: center
Waveform, spectrogram, and aligned labels for the word "stop", realized as {ipa_inline}`[t]`
:::

::::

:::::

Given the range of acoustic realizations of {ipa_inline}`[T]` in the utterance above, modeling all occurrences as the same sequence of three HMM states doesn't make a ton of sense. One aspect of the MFA ARPA model that accounts for some of this variation is the use of position-dependent phones: rather than a single {ipa_inline}`[T]`, you actually have {ipa_inline}`[T_B]` (at the beginnings of words), {ipa_inline}`[T_E]` (at the ends of words), {ipa_inline}`[T_I]` (in the middle of words), and {ipa_inline}`[T_S]` (for words consisting of just that phone, which doesn't really apply to {ipa_inline}`[T]` but is more relevant for vowels like {ipa_inline}`[AY1_S]`). So final realizations won't be modeled the same as initial realizations or those in the middle of words; each position has its own HMM states and GMM PDFs. This carries its own drawback: sometimes a final or intermediate {ipa_inline}`[T]` is realized the same as an initial {ipa_inline}`[T]` (i.e. {ipa_inline}`[tʰ]`), but there's no pooling across the positions, so the {ipa_inline}`[T_E]` and {ipa_inline}`[T_I]` HMM-GMMs do not contain any learned stats from {ipa_inline}`[T_B]`.
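
A minimal sketch of this positional relabeling (the function is hypothetical; MFA applies this internally when ``position_dependent_phones`` is enabled):

:::{code} python
from typing import List


def to_position_dependent(pron: List[str]) -> List[str]:
    """Relabel a word's phones with _B/_I/_E/_S positional suffixes."""
    if len(pron) == 1:
        return [pron[0] + "_S"]  # the word consists of a single phone
    labeled = [pron[0] + "_B"]  # word-initial
    labeled += [p + "_I" for p in pron[1:-1]]  # word-internal
    labeled.append(pron[-1] + "_E")  # word-final
    return labeled


print(to_position_dependent(["T", "AA1", "P"]))  # ['T_B', 'AA1_I', 'P_E']
print(to_position_dependent(["AY1"]))            # ['AY1_S']
:::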

Moving on from monophones, which by definition cannot account well for coarticulation and contextual variability, the next stage of MFA training uses triphones. A triphone represents a phone together with its immediate neighbors. So for a word like "stop", the monophone string would be {ipa_inline}`[S T AA1 P]`, but the corresponding triphone string would be {ipa_inline}`[_/S/T S/T/AA1 T/AA1/P AA1/P/_]`, where the original {ipa_inline}`[T]` is no longer the same as all other instances of {ipa_inline}`[T]`, but only the same as a {ipa_inline}`[T]` preceded by {ipa_inline}`[S]` and followed by {ipa_inline}`[AA1]`. As a result of taking the preceding and following context into account, you now have a huge number of different phone symbols, each modeled separately and with different amounts of data. A triphone like {ipa_inline}`[S/T/AA1]` might be decently common, but one like {ipa_inline}`[S/T/AA2]` would not have much data, given the rarity of {ipa_inline}`[AA2]` in transcriptions. However, we'd really like to pool the data across these and other triphones, as the key aspect for modeling the {ipa_inline}`[T]` in this case is that it is preceded by {ipa_inline}`[S]` and followed by a vowel, not so much what quality the vowel has.
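
The triphone expansion for "stop", as a quick sketch (hypothetical helper; `_` marks the word/utterance boundary, matching the notation above):

:::{code} python
from typing import List, Tuple


def to_triphones(pron: List[str]) -> List[Tuple[str, str, str]]:
    """Expand a phone string into (left, center, right) triphones."""
    padded = ["_"] + pron + ["_"]  # '_' marks the boundary context
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


for left, center, right in to_triphones(["S", "T", "AA1", "P"]):
    print(f"{left}/{center}/{right}", end=" ")
# _/S/T S/T/AA1 T/AA1/P AA1/P/_
:::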

So instead of treating each triphone as a separate phone, the triphones are clustered into a decision tree based on the preceding and following contexts. These decision trees should learn that if a {ipa_inline}`[T]` is preceded by {ipa_inline}`[S]`, it should use PDFs related to the unaspirated {ipa_inline}`[t]` realization; if it's at the beginning of a word followed by a vowel, it should use PDFs related to the {ipa_inline}`[tʰ]` realization; and so on. By clustering similar PDFs and building decision trees based on context, we can sidestep the sparsity issue that comes from blowing up the inventory of sounds with triphones, and we can explicitly group together phones that should be modeled in the same way.
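
To make the clustering idea concrete, here is a toy, hand-built decision tree (not MFA's actual tree, which is learned from data) that routes a [T] triphone to a shared PDF based on questions about its context:

:::{code} python
def t_pdf_for_context(left: str, right: str) -> str:
    """Toy decision tree for the center phone [T] (illustrative only)."""
    vowels = {"AA1", "AA2", "AE1", "IY1", "UW1"}  # abbreviated set
    if left == "S":
        return "pdf_t_unaspirated"  # s-cluster: [t]
    if left == "_" and right in vowels:
        return "pdf_t_aspirated"    # word-initial before a vowel: [tʰ]
    if right == "_":
        return "pdf_t_unreleased"   # word-final: often unreleased
    return "pdf_t_default"


# Sparse triphones pool their data: S/T/AA1 and S/T/AA2 share one PDF.
assert t_pdf_for_context("S", "AA1") == t_pdf_for_context("S", "AA2")
:::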

These phone groups specify which phone symbols should use the same decision tree root in modeling. For position-dependent phone modeling, it naturally follows that we should put all positions under the same root, so that {ipa_inline}`[T_B]`, {ipa_inline}`[T_E]`, {ipa_inline}`[T_I]` and {ipa_inline}`[T_S]` can benefit from data associated with the other positions, while still having some bias towards particular realizations (as the decision tree takes the central symbol into account, along with the preceding and following ones).

In MFA 2.1, you can now specify what phones should be grouped together, rather than specifying arbitrary phone sets like ``IPA`` or ``ARPA`` as in MFA 2.0. There are baseline versions of these phone groups available in [mfa-models/config/acoustic/phone_groups](https://github.com/MontrealCorpusTools/mfa-models/tree/main/config/acoustic/phone_groups). The [English US ARPA phone group](https://github.com/MontrealCorpusTools/mfa-models/blob/main/config/acoustic/phone_groups/english_arpa.yaml) gives the same phone groups that were used in training the [English (US) ARPA 2.0 models](https://mfa-models.readthedocs.io/en/latest/acoustic/English/English%20%28US%29%20ARPA%20acoustic%20model%20v2_0_0a.html#English%20(US)%20ARPA%20acoustic%20model%20v2_0_0a), while the MFA phone set ones are a bit more subject to change as I iterate on them.

A general rule of thumb that I follow is to keep phonetically similar-ish phones in the same group. For the [English MFA phone group](https://github.com/MontrealCorpusTools/mfa-models/blob/main/config/acoustic/phone_groups/english_mfa.yaml), I've added phonetic variants to the dictionary and specified [phonological rules](phonological_rules.md) for adding more variation, but most of these variants share phone groups with their root phone. So variants like {ipa_inline}`[t tʰ tʲ tʷ]` are grouped together, but less similar variants like {ipa_inline}`[ɾ]` and {ipa_inline}`[ʔ]` have their own phone groups (shown in the excerpt below). Similar dialectal variants like {ipa_inline}`[ow əw o]` are grouped together as well.

:::{code} yaml
-
  - t
  - tʰ
  - tʲ
  - tʷ
-
  - d
  - dʲ
-
  - ɾ
  - ɾʲ
-
  - ʔ
:::

The default, when no custom yaml file or phone set type is specified, is to treat each phone as its own phone group. Regardless of how phone groups are set up, if ``position_dependent_phones`` is specified, then each phone's phone group will contain all of its positional variants.
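
As a sketch of that last point (the helper is hypothetical), with ``position_dependent_phones`` each group is expanded so that all positional variants share a single decision-tree root:

:::{code} python
def expand_group(group):
    """Add the positional variants of every phone to its phone group."""
    return [p + suffix for p in group for suffix in ("_B", "_E", "_I", "_S")]


print(expand_group(["t", "tʰ"]))
# ['t_B', 't_E', 't_I', 't_S', 'tʰ_B', 'tʰ_E', 'tʰ_I', 'tʰ_S']
:::
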
55 changes: 55 additions & 0 deletions docs/source/user_guide/implementations/phonological_rules.md
@@ -1,2 +1,57 @@

(phonological_rules)=
# Phonological rules

MFA 2.1 has the ability to specify phonological rules as a separate input from the dictionary itself. The idea here is that we can over-generate a really large lexicon without having to manually specify variants for commonly applying rules. This lexicon is then pared down to just the attested forms.
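
As a sketch of the over-generate-then-pare-down idea (simplified to plain substring replacement; the real rules are regex-based, as described below):

:::{code} python
def expand_lexicon(lexicon, rules):
    """Over-generate pronunciation variants to be pared down later."""
    expanded = {}
    for word, prons in lexicon.items():
        variants = set(prons)
        for rule in rules:
            variants |= {
                p.replace(rule["segment"], rule["replacement"])
                for p in variants
            }
        expanded[word] = sorted(variants)
    return expanded


lexicon = {"thought": ["θ ɒː t"]}
rules = [{"segment": "ɒː", "replacement": "ɑː"}]
print(expand_lexicon(lexicon, rules))
# {'thought': ['θ ɑː t', 'θ ɒː t']} -- both variants kept until pruning
:::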

Rules for languages with MFA 2.1 models can be found in [mfa-models/config/acoustic/rules](https://github.com/MontrealCorpusTools/mfa-models/tree/main/config/acoustic/rules), though not all languages have been refreshed for 2.1.

Rules are specified via yaml dictionaries like the following example of cot-caught merger from the [English MFA phonological rules file](https://github.com/MontrealCorpusTools/mfa-models/blob/main/config/acoustic/rules/english_mfa.yaml):

:::{code} yaml
rules:
- following_context: $ # caught-cot merger
preceding_context: ''
replacement: ɑː
segment: ɒː
- following_context: '[^ɹ]' # caught-cot merger
preceding_context: ''
replacement: ɑː
segment: ɒː
- following_context: $ # caught-cot merger
preceding_context: ''
replacement: ɑ
segment: ɒ
- following_context: '[^ɹ]' # caught-cot merger
preceding_context: ''
replacement: ɑ
segment: ɒ
:::

For this merger, I've specified four rules to cover the long/short vowel variants and the two following contexts. Long and short vowels are both present in the dictionary and are correlated with stress, but note that long/short variants are modeled as part of the same [phone group](phone_groups.md). The following context captures whether the vowel occurs at the end of the word (``following_context: $``) or in the middle of the word not followed by a rhotic (``following_context: '[^ɹ]'``), as "stark" and "stork" have distinct pronunciations in r-ful dialects with the merger.

These rules are compiled to regular expressions and used to replace the ``segment`` with the ``replacement``. For deletions, the replacement field is empty and for insertions, the segment field is empty. Additionally, both the segment and replacement fields can be sequences of segments or regular expressions themselves. Some more complex examples:

:::{code} yaml
rules:
- following_context: '' # deleting d after n
preceding_context: 'n'
replacement: ''
segment: d
- following_context: '[^ʊɔɝaɛeoæɐɪəɚɑʉɒi].*' # syllabic l
preceding_context: ''
replacement: ɫ̩
segment: ə ɫ
- following_context: '' # schwa deletion
preceding_context: ''
replacement: ɹ ə
segment: 'ə ɹ ə'
- following_context: ''
preceding_context: ''
replacement: dʒ
segment: d[ʲʷ]? ɹ
- following_context: $ # ask metathesis
preceding_context: ''
replacement: k s
segment: s k
:::
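
A rough sketch of how one of these rules might be compiled and applied to a space-separated pronunciation (simplified relative to MFA's actual implementation):

:::{code} python
import re


def apply_rule(rule: dict, pron: str) -> str:
    """Apply one phonological rule to a space-separated pronunciation.

    Simplified sketch: contexts are regexes over the phone string,
    and '$' means the end of the pronunciation.
    """
    pre = rule["preceding_context"]
    post = rule["following_context"]
    pattern = (f"({pre} )" if pre else "(^|(?<= ))") + rule["segment"]
    pattern += "(?=$)" if post == "$" else (f"(?= {post})" if post else "")
    out = re.sub(pattern, lambda m: m.group(1) + rule["replacement"], pron)
    return re.sub(" +", " ", out).strip()  # tidy spacing after deletions


merger = {"following_context": "$", "preceding_context": "",
          "replacement": "ɑː", "segment": "ɒː"}
print(apply_rule(merger, "θ ɒː"))  # -> 'θ ɑː' (merged variant of "thaw")
:::
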
8 changes: 7 additions & 1 deletion docs/source/user_guide/workflows/train_acoustic_model.rst
@@ -8,6 +8,10 @@ You can train new :term:`acoustic models` from scratch using MFA, and export the
Phone set
=========

.. note::

See :doc:`phone groups <../implementations/phone_groups>` for how to customize phone groups to your specific needs rather than using the preset phone groups of the defined phone sets in this section.

The type of phone set can be specified through ``--phone_set``. Currently only ``IPA``, ``ARPA``, and ``PINYIN`` are supported, but I plan to make it more customizable in the future. The primary benefit of specifying the phone set is to create phone topologies that are more sensible than the defaults.

The default phone model uses 3 HMM states to represent phones, as that generally does a decent job of capturing the dynamic nature of phones. Something like an aspirated stop typically has three clear states: a closure, a burst, and an aspiration period. However, other phones like a tap, glottal stop, or unstressed schwa are so short that they can cause misalignment errors. For these, a single HMM state is more sensible, as it gives them a shorter minimum duration (each HMM state has a minimum 10ms duration). For vowels, 3 states generally make sense for monophthongs, where one state corresponds to the onset, one to the "steady state", and one to the offset. For diphthongs and triphthongs, three states don't map as clearly onto the phases, as you'll have an onset, a first target, a transition, a second target, and an offset (and a third target for triphthongs). Specifying a phone set will use preset stops, affricates, diphthongs, triphthongs and extra short segments. Certain diacritics (``ʱʼʰʲʷⁿˠ``) will result in one more state being added, as they represent quite different acoustics from the base phone.
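
The gist of the topology logic above, as a hedged sketch (the sets and counts here are illustrative; MFA derives its actual topologies from the phone set internally):

.. code-block:: python

   EXTRA_SHORT = {"ɾ", "ʔ", "ə"}  # taps, glottal stops, reduced vowels
   DIPHTHONGS = {"aɪ", "aʊ", "ɔɪ"}
   EXTRA_STATE_DIACRITICS = set("ʱʼʰʲʷⁿˠ")


   def num_hmm_states(phone: str) -> int:
       """Illustrative state count for a phone's HMM topology."""
       if phone in EXTRA_SHORT:
           states = 1  # one state = 10ms minimum duration
       elif phone in DIPHTHONGS:
           states = 5  # onset, target 1, transition, target 2, offset
       else:
           states = 3  # e.g. closure/burst/aspiration for an aspirated stop
       if any(d in phone for d in EXTRA_STATE_DIACRITICS):
           states += 1  # these differ acoustically from the base phone
       return states
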
@@ -200,14 +204,16 @@ An additional benefit is in guiding the decision tree clustering for triphone models
- All monophthongs, diphthongs, and triphthongs with tone 5
-


Pronunciation modeling
======================

For the default configuration, pronunciation probabilities are estimated following the second and third SAT blocks. See :ref:`training_dictionary` for more details.

A recent experimental feature for training acoustic models is the ``--train_g2p`` flag, which changes pronunciation probability estimation from a lexicon-based estimation to using a G2P model as the lexicon. The idea here is that we have pronunciations generated by the initial training blocks, much like the standard lexicon-based approach, but instead of estimating probabilities for individual word/pronunciation pairs and the likelihood of surrounding silence, it learns a mapping between the graphemes of the input texts and the phones.

.. note::

See :doc:`phonological rules <../implementations/phonological_rules>` for how to specify regular expression-like phonological rules so you don't have to code every form for a regular rule.


Command reference
23 changes: 13 additions & 10 deletions montreal_forced_aligner/abc.py
@@ -36,6 +36,7 @@
from montreal_forced_aligner.helper import comma_join, load_configuration, mfa_open

if TYPE_CHECKING:
from pathlib import Path

from montreal_forced_aligner.data import MfaArguments, WorkflowType

@@ -870,12 +871,12 @@ class MfaModel(abc.ABC):
model_type = "base_model"

@classmethod
def pretrained_directory(cls) -> str:
def pretrained_directory(cls) -> Path:
"""Directory that pretrained models are saved in"""
from .config import get_temporary_directory

path = os.path.join(get_temporary_directory(), "pretrained_models", cls.model_type)
os.makedirs(path, exist_ok=True)
path = get_temporary_directory().joinpath("pretrained_models", cls.model_type)
path.mkdir(parents=True, exist_ok=True)
return path

@classmethod
@@ -888,16 +889,16 @@ def get_available_models(cls) -> List[str]:
list[str]
List of model names
"""
if not os.path.exists(cls.pretrained_directory()):
if not cls.pretrained_directory().exists():
return []
available = []
for f in os.listdir(cls.pretrained_directory()):
for f in cls.pretrained_directory().iterdir():
if cls.valid_extension(f):
available.append(os.path.splitext(f)[0])
available.append(f.stem)
return available

@classmethod
def get_pretrained_path(cls, name: str, enforce_existence: bool = True) -> str:
def get_pretrained_path(cls, name: str, enforce_existence: bool = True) -> Path:
"""
Generate a path to a pretrained model based on its name and model type
@@ -910,20 +911,22 @@ def get_pretrained_path(cls, name: str, enforce_existence: bool = True) -> str:
Returns
-------
str
Path
Path to model
"""
return cls.generate_path(cls.pretrained_directory(), name, enforce_existence)

@classmethod
@abc.abstractmethod
def valid_extension(cls, filename: str) -> bool:
def valid_extension(cls, filename: Path) -> bool:
"""Check whether a file has a valid extension"""
...

@classmethod
@abc.abstractmethod
def generate_path(cls, root: str, name: str, enforce_existence: bool = True) -> Optional[str]:
def generate_path(
cls, root: Path, name: str, enforce_existence: bool = True
) -> Optional[Path]:
"""Generate a path from a root directory"""
...
