EOS, SOS characters in dataloader and decoder #33

Open
paanguin opened this issue Feb 25, 2020 · 1 comment

paanguin commented Feb 25, 2020

I have a question about the decoder inputs. I think the following pre-processing adds the SOS and EOS tokens to the label y:

seq_in = [torch.cat([sos, y], dim=0) for y in seq]

seq_out = [torch.cat([y, eos], dim=0) for y in seq]

It seems SpectrogramDataset also contains a step that adds SOS and EOS to the label y:

transcript = constant.SOS_CHAR + transcript_file.read().replace('\n', '').lower() + constant.EOS_CHAR

I think SpectrogramDataset should not do this, because the tokens end up being added twice. As far as I can tell, the decoder currently processes the label like this:
y = HELLO

seq_in: SOS, SOS, H, E, L, L, O, EOS
seq_out: SOS, H, E, L, L, O, EOS, EOS

I'd be very grateful if you could confirm whether this is correct.
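The duplication described above can be reproduced with a minimal sketch. It uses plain Python lists instead of tensors, so list concatenation stands in for `torch.cat`, and the token values `<SOS>` and `<EOS>` are placeholders, since the actual values of `constant.SOS_CHAR` and `constant.EOS_CHAR` are not shown in this issue. The helper names `dataset_label` and `decoder_preprocess` are hypothetical and only mirror the two quoted lines; lowercasing is omitted to match the HELLO example.

```python
# Placeholder token values; the real constant.SOS_CHAR / constant.EOS_CHAR are not shown here.
SOS, EOS = "<SOS>", "<EOS>"

def dataset_label(text):
    """Mirrors the quoted SpectrogramDataset line: wraps the transcript in SOS/EOS."""
    return [SOS] + list(text) + [EOS]

def decoder_preprocess(y):
    """Mirrors the quoted decoder pre-processing, with lists in place of tensors."""
    seq_in = [SOS] + y   # corresponds to torch.cat([sos, y], dim=0)
    seq_out = y + [EOS]  # corresponds to torch.cat([y, eos], dim=0)
    return seq_in, seq_out

# The label already carries SOS/EOS when it reaches the decoder, so both get added again.
seq_in, seq_out = decoder_preprocess(dataset_label("HELLO"))
print(seq_in)
print(seq_out)
```

Under these assumptions, `seq_in` starts with two SOS tokens and `seq_out` ends with two EOS tokens, which matches the sequences listed above.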

@paanguin paanguin changed the title EOS, SOS characters in dataloader EOS, SOS characters in dataloader and decoder Feb 25, 2020
@gentaiscool
Owner

Thank you for reporting this issue, @paanguin. We are currently revamping the project and plan to release a newer version soon.

Yes, I agree we need to fix this part. Each sequence should contain exactly one SOS token and one EOS token, so SpectrogramDataset should not add them. We will fix this in the next update.
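One way the agreed fix could look is sketched below. This is not the project's actual patch: it assumes the dataset returns only the normalized transcript and the decoder side remains the single place where SOS and EOS are attached. The names `load_transcript` and `make_decoder_targets` and the token values are hypothetical, and plain lists stand in for tensors.

```python
# Placeholder token values; the real constants are not shown in this issue.
SOS, EOS = "<SOS>", "<EOS>"

def load_transcript(raw):
    """Dataset side (hypothetical): normalize the text only, without SOS/EOS."""
    return list(raw.replace("\n", "").lower())

def make_decoder_targets(y):
    """Decoder side (hypothetical): the only place where SOS and EOS are added."""
    seq_in = [SOS] + y   # decoder input, shifted right by one position
    seq_out = y + [EOS]  # prediction target, same length as seq_in
    return seq_in, seq_out

y = load_transcript("HELLO\n")
seq_in, seq_out = make_decoder_targets(y)
print(seq_in)
print(seq_out)
```

With this split, `seq_in` and `seq_out` have the same length and each position of `seq_out` is the token the decoder should predict after reading the corresponding prefix of `seq_in`.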
