Recommended text or phoneme tokenizer to use #7

francislata · 2023-11-03T02:20:04Z

Hi @cantabile-kwok, the paper does not recommend a particular text or phoneme tokenizer. Do you have any recommendations for what to use?

Thank you.

cantabile-kwok · 2023-11-03T03:08:30Z

That is a tricky question, and I am still thinking about it. In our workflow, we used the Kaldi toolkit and a predefined lexicon file (based on the CMU dict) that maps each word to its phone sequence. This may be too cumbersome for users who are not familiar with Kaldi. As far as I know, the phonemizer g2p tool, when used with the eSpeak backend, can generate accurate IPA transcriptions of given sentences. There is also the g2p_en Python package, which outputs CMU dict phonemes.

These might be easier to use, so you could give them a try.
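For readers who want to see what the lexicon-based approach amounts to, here is a minimal sketch, not the actual Kaldi pipeline. It assumes a lexicon in CMU dict text format (one word followed by its phones per line). The names `TOY_LEXICON_TEXT`, `parse_lexicon`, `text_to_phones`, and the `<unk>` symbol are made up for illustration, and the two-word lexicon stands in for a real lexicon file.

```python
import re
from typing import Callable, Dict, List, Optional

# Tiny illustrative lexicon in CMU dict format ("WORD  PH1 PH2 ...").
# A real setup would read the full lexicon file instead (assumption).
TOY_LEXICON_TEXT = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
"""


def parse_lexicon(text: str) -> Dict[str, List[str]]:
    """Map each lowercased word to its first listed phone sequence."""
    lexicon: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # setdefault keeps the first pronunciation if a word repeats
        lexicon.setdefault(parts[0].lower(), parts[1:])
    return lexicon


def text_to_phones(
    text: str,
    lexicon: Dict[str, List[str]],
    oov_g2p: Optional[Callable[[str], List[str]]] = None,
    unk: str = "<unk>",
) -> List[str]:
    """Convert text to a flat phone list via lexicon lookup.

    Out-of-vocabulary words go to `oov_g2p` if one is given,
    otherwise they are replaced by the `unk` symbol.
    """
    phones: List[str] = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in lexicon:
            phones.extend(lexicon[word])
        elif oov_g2p is not None:
            phones.extend(oov_g2p(word))
        else:
            phones.append(unk)
    return phones


lexicon = parse_lexicon(TOY_LEXICON_TEXT)
print(text_to_phones("Hello, world!", lexicon))
```

If g2p_en is installed, an instance of its `G2p` class can be passed as `oov_g2p`, since calling it on a word returns a list of CMU dict phones. Whether the resulting phone set matches the one the model was trained on is something to check separately.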

francislata · 2023-11-03T03:25:26Z

Thanks for that. I’ve been looking at the Coqui TTS tokenizer class, which handles phonemization if needed. It takes care of almost everything, including text normalization, so I’ll give it a try first and go from there. I’ll also see if g2p_en can be integrated.

rishikksh20 · 2023-12-08T06:22:12Z

@francislata Have you used the Coqui TTS tokenizer with UniCATS? Is it working well?
