Currently, the spellcheck-autoencoder is trained only to reconstruct the lexical identity of its input tokens. Nothing in the objective steers it towards more human-like judgements of whether a reconstruction is correct.
Since the decoder works left-to-right, it currently learns mostly to represent string length, and then greedily reconstructs longer and longer matching prefixes. For example, decoder output over epochs currently looks like this:
| Original Token | Reconstruction at Epoch 1 | Reconstruction at Epoch 2 | Reconstruction at Epoch 3 |
| --- | --- | --- | --- |
| rue de la 24 di septiembre | rue de laiji97jfoefokp | rue de la 22 de septxcg | rue de la 24 de septiembdr |
| chicago | chicgxha | chicagxo | chicago |
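The prefix growth in these examples can be made explicit by counting how many leading characters each reconstruction shares with the original. The following is only a minimal sketch; the helper name `shared_prefix_len` is hypothetical and not part of the project.

```python
def shared_prefix_len(original: str, reconstruction: str) -> int:
    """Count the leading characters on which both strings agree."""
    count = 0
    for a, b in zip(original, reconstruction):
        if a != b:
            break
        count += 1
    return count


# Reconstructions of "chicago" at epochs 1 to 3, taken from the examples above.
chicago_epochs = ["chicgxha", "chicagxo", "chicago"]
chicago_prefixes = [shared_prefix_len("chicago", r) for r in chicago_epochs]

# Reconstructions of the street name at epochs 1 to 3.
street = "rue de la 24 di septiembre"
street_epochs = [
    "rue de laiji97jfoefokp",
    "rue de la 22 de septxcg",
    "rue de la 24 de septiembdr",
]
street_prefixes = [shared_prefix_len(street, r) for r in street_epochs]

print(chicago_prefixes)  # [4, 6, 7]
print(street_prefixes)   # [9, 11, 14]
```

In both cases the shared prefix grows monotonically from epoch to epoch, which is consistent with the greedy prefix behaviour described above.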
As such, latent coordinates currently represent mostly string length and prefix information. The decoding process should be changed as follows:
Make length-decoding independent from lexical decoding. In a preflight process, a decoder RNN should create a "decoding bed", which relieves the lexical decoder from having to learn string length decoding.
Instead of lexical decoding, try phonetic decoding. Which phonetic labels to use is still to be decided. As a start, spellfix1 transcription could be implemented. [2]
Make lexical decoding bidirectional. To prevent greedy prefix learning, lexical unrolling will be performed right-to-left first, and the resulting sequence is then concatenated with the decoding bed. [1]
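One possible reading of the first and third points, on the data-preparation side, is to give the model a separate length target and a reversed character target per token. The sketch below is an assumption about what that could look like, not the project's implementation: the function name `make_decoder_targets`, the `PAD` symbol, and the representation of the "decoding bed" as a 0/1 position mask are all hypothetical.

```python
PAD = "\0"  # hypothetical padding symbol, assumed not to occur in tokens


def make_decoder_targets(token: str, max_len: int):
    """Build separate length and right-to-left lexical targets for one token.

    Returns a tuple (length, bed, reversed_chars):
      length         -- number of characters in the token
      bed            -- 0/1 mask of size max_len marking the occupied positions
                        (an assumed stand-in for the "decoding bed")
      reversed_chars -- the token's characters in right-to-left order,
                        padded with PAD up to max_len
    """
    if len(token) > max_len:
        raise ValueError("token is longer than max_len")
    length = len(token)
    padding = max_len - length
    bed = [1] * length + [0] * padding
    reversed_chars = list(token[::-1]) + [PAD] * padding
    return length, bed, reversed_chars


length, bed, reversed_chars = make_decoder_targets("chicago", max_len=10)
print(length)                      # 7
print(bed)                         # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print("".join(reversed_chars[:length]))  # ogacihc
```

With targets split this way, a length predictor could be trained against `length` or `bed` alone, while the lexical decoder sees the last character first and so cannot rely on extending a left-anchored prefix.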
[1] Stacked bidirectional decoder architecture
[2] Spellfix1 Phonetic Hashing (spellfix1_phonehash)
Phonetic replacements:
Other rules (to be ignored/amended/selected):
Alternatively, try CMU Logios (http://www.speech.cs.cmu.edu/tools/lextool.html).
Alternatively, try CMU G2P (https://github.com/cmusphinx/g2p-seq2seq)
Notably, all these phonetic transcriptions are optimized for English/Latin languages.