Separate phonetic/lexical and length reconstruction goals for spellcheck-autoencoder #42

Open
josephbirkner opened this issue Dec 11, 2017 · 0 comments
josephbirkner commented Dec 11, 2017

Currently, the spellcheck-autoencoder is trained only to reconstruct the lexical identity of its input tokens. Nothing in the objective steers it towards a more human-like assessment of reconstruction correctness.

Because the decoder works left-to-right, it currently learns mostly to represent string length first, and then greedily reconstructs longer and longer identical prefixes. For example, decoder output over epochs currently looks like this:

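| Original Token             | Reconstruction at Epoch 1 | Reconstruction at Epoch 2 | Reconstruction at Epoch 3  |
|----------------------------|---------------------------|---------------------------|----------------------------|
| rue de la 24 di septiembre | rue de laiji97jfoefokp    | rue de la 22 de septxcg   | rue de la 24 de septiembdr |
| chicago                    | chicgxha                  | chicagxo                  | chicago                    |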
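
The prefix-first behaviour can be made measurable with a small helper. The following is only a sketch: the function name `prefix_match_ratio` is hypothetical and not part of the project; it simply reports how much of the original token is covered by the prefix it shares with a reconstruction.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def prefix_match_ratio(original: str, reconstruction: str) -> float:
    """Fraction of the original token covered by the shared prefix."""
    if not original:
        return 1.0
    shared = commonprefix([original, reconstruction])
    return len(shared) / len(original)


# Rows copied from the example above: original -> reconstructions at epochs 1..3.
rows = {
    "rue de la 24 di septiembre": [
        "rue de laiji97jfoefokp",
        "rue de la 22 de septxcg",
        "rue de la 24 de septiembdr",
    ],
    "chicago": ["chicgxha", "chicagxo", "chicago"],
}

for original, reconstructions in rows.items():
    ratios = [round(prefix_match_ratio(original, r), 2) for r in reconstructions]
    print(original, ratios)
```

For both rows the shared prefix never shrinks from one epoch to the next, which is the pattern described above.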

As such, latent coordinates currently represent mostly string length and prefix information. The decoding process should be changed as follows:

  1. Make length-decoding independent from lexical decoding. In a preflight process, a decoder RNN should create a "decoding bed", which relieves the lexical decoder from having to learn string length decoding.
  2. Instead of lexical decoding, try phonetic decoding. Which phonetic labels to use is still to be determined; as a start, spellfix1 transcription could be implemented. [2]
  3. Make lexical decoding bidirectional. To prevent greedy prefix learning, lexical unrolling is performed right-to-left first, and the resulting sequence is then concatenated with the decoding bed. [1]
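
As a rough illustration of items 1 and 3, the training targets for the preflight and backward decoders might be derived from a token as sketched below. The symbols `X`, `~` and `$` follow the diagram in [1]; the function names, and the choice to emit the end marker first in the right-to-left order, are assumptions made for this sketch.

```python
CHAR = "X"   # placeholder for any non-space character (as in diagram [1])
SPACE = "~"  # placeholder for a space
END = "$"    # end-of-sequence marker


def decoding_bed(token: str) -> str:
    """Preflight target: encodes only length and space positions."""
    return "".join(SPACE if ch == " " else CHAR for ch in token) + END


def backward_emission_order(token: str) -> str:
    """Characters in the order a right-to-left decoder would emit them.

    Assumption: the end marker is emitted first, then the token reversed.
    """
    return (token + END)[::-1]


print(decoding_bed("st louis"))
print(backward_emission_order("st louis"))
```

Because the bed carries no lexical information, the lexical decoders would no longer need to encode string length themselves.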

[1] Stacked bidirectional decoder architecture

Encoded Vector ' Decoder
---------------'------------------------------------------------------------------------------
              (Forward Decoder)     "s"    "t"    " "    "l"    "o"    "u"    "i"    "s"    "$"
               '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
               '                     |      |      |      |      |      |      |      |      |
               '                    [ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]
               '                    /|     /|     /|     /|     /|     /|     /|     /|     /|
               '                   / |    / |    / |    / |    / |    / |    / |    / |    / |
               '                  |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |
              (Backward Decoder)  | "?"  | "?"  | " "  | "?"  | "o"  | "u"  | "i"  | "s"  | "$"
               '                  |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧
               '                  |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |
               '                  | [ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]
            ___'__________________|__|___|__|___|__|___|__|___|__|___|__|___|__|___|__|___|__|
[0]        |   '                     |      |      |      |      |      |      |      |      |
[0]        |  (Preflight Decoder)   "X"    "X"    "~"    "X"    "X"    "X"    "X"    "X"    "$"
[0]        |   '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
[0] ------>|   '                     |      |      |      |      |      |      |      |      |
[0]        |   '                    [ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]
[0]        |   '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
[0]        |___'_____________________|______|______|______|______|______|______|______|______|
[0]            '
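
One way to read the diagram is that each forward-decoder step receives two aligned inputs: the preflight symbol and the backward decoder's output at the same position. The sketch below only pairs these up; `forward_decoder_inputs` is a hypothetical name, and reading "?" as a not-yet-confident backward output is an assumption.

```python
def forward_decoder_inputs(bed: str, backward_output: str) -> list[tuple[str, str]]:
    """Pair preflight and backward-decoder symbols position by position."""
    if len(bed) != len(backward_output):
        raise ValueError("bed and backward output must have equal length")
    return list(zip(bed, backward_output))


# Values read off the diagram for the token "st louis".
bed = "XX~XXXXX$"
backward_output = "?? ?ouis$"

for step, pair in enumerate(forward_decoder_inputs(bed, backward_output)):
    print(step, pair)
```

In a real model each pair would be embedded and concatenated before being fed to the forward RNN cell; that part is left out here.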

[2] Spellfix1 Phonetic Hashing (spellfix1_phonehash)

Phonetic replacements:

  • A, E, I, O, U, Y ⟶ A
  • B, F, P, V ⟶ B
  • C, G, J, K, Q, S, X, Z ⟶ C
  • D, T ⟶ D
  • L ⟶ L
  • M, N ⟶ M
  • R ⟶ R
  • H, W ⟶ _
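
The replacement classes above can be sketched as a plain character mapping. This is not the spellfix1 implementation itself; `phonetic_labels` is a hypothetical helper that applies only the class replacements listed above and leaves every other character (spaces, digits) unchanged.

```python
# Letter classes as listed above; "_" stands for the H/W class.
PHONETIC_CLASSES = {
    "AEIOUY": "A",
    "BFPV": "B",
    "CGJKQSXZ": "C",
    "DT": "D",
    "L": "L",
    "MN": "M",
    "R": "R",
    "HW": "_",
}

# Flatten into a per-character lookup table.
CHAR_TO_CLASS = {
    ch: label for chars, label in PHONETIC_CLASSES.items() for ch in chars
}


def phonetic_labels(token: str) -> str:
    """Map each letter to its class label; keep unknown characters as-is."""
    return "".join(CHAR_TO_CLASS.get(ch, ch) for ch in token.upper())


print(phonetic_labels("chicago"))   # original token
print(phonetic_labels("chicgxha"))  # epoch-1 reconstruction from the example
```

Since the mapping is one label per character, the label sequence has the same length as the token and would line up with the decoding bed.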

Other rules (to be ignored/amended/selected):

  • Omit double letters
  • Omit vowels beside L and R
  • Omit T before CH
  • Omit W before R
  • Omit D before J
  • Omit K or G before N at beginning of word
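
If some of these rules are selected, they could run as a preprocessing pass before the class replacement. The sketch below is an assumption about how that might look, not spellfix1's actual code: `apply_omission_rules` is a hypothetical name, the rule order is arbitrary, and the "vowels beside L and R" rule is left out because its exact scope is not specified here.

```python
import re


def apply_omission_rules(token: str) -> str:
    """Apply a subset of the omission rules to an upper-cased token."""
    word = token.upper()
    word = re.sub(r"\b[KG](?=N)", "", word)    # K or G before N at beginning of word
    word = re.sub(r"T(?=CH)", "", word)        # T before CH
    word = re.sub(r"W(?=R)", "", word)         # W before R
    word = re.sub(r"D(?=J)", "", word)         # D before J
    word = re.sub(r"([A-Z])\1+", r"\1", word)  # repeated letters collapse to one
    return word


for example in ["knight", "watch", "write", "adjust", "letter", "sign"]:
    print(example, "->", apply_omission_rules(example))
```

Note that, unlike the class replacement, these rules change the string length, so the phonetic targets would no longer align one-to-one with the original characters.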

Alternatively, try CMU Logios (http://www.speech.cs.cmu.edu/tools/lextool.html).
Alternatively, try CMU G2P (https://github.com/cmusphinx/g2p-seq2seq).
Notably, all of these phonetic transcriptions are optimized for English/Latin languages.

@josephbirkner josephbirkner added this to the Eindhoven18 milestone Dec 11, 2017
@josephbirkner josephbirkner self-assigned this Dec 11, 2017