
Tokenizer.transform ignores error handling argument #44

Open
Anaphory opened this issue Oct 10, 2019 · 6 comments

@Anaphory

The signature of Tokenizer.transform suggests that the function can take an argument describing how to deal with undefined segments:

def transform(self, word, column=Profile.GRAPHEME_COL, error=errors.replace):

but the actual fallback
target = self._errors['replace'](token)

is always the replace strategy.

(I came here looking for a keep strategy that would bounce the errors back to me for inspection instead of transforming them; even ignore just replaces them with nothing. But that's a different issue.)
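
For example, a minimal sketch of the behavior I mean (assuming error accepts a callable, as the errors.replace default in the signature suggests; the exact replacement marker may vary by version):

from segments import Tokenizer, Profile

t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}))
# Ask transform to keep unknown graphemes untouched in the segmentation step:
print(t.transform('ax', column='IPA', error=lambda s: s))
# If the error argument were honored throughout, one would expect 'b x';
# instead the mapping step applies the hard-coded replace strategy to 'x',
# producing something like 'b �'.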

@LinguList
Contributor

I agree, this is not that transparent; but I recommend you have a look at pylexibank first, where we use segments for exactly this behavior. That will show you a concrete tweak for how you can in fact trigger the behavior you want.

@LinguList
Contributor

That is to say, it is not impossible; it is just not as transparent and easy as the parameter suggests.

@Anaphory
Author

I am able to trigger the behaviour I want. I just think that the current implementation has a bug.

@LinguList
Contributor

If you play with this, please also submit a fix to pylexibank, to make sure we don't break compatibility there. I agree that it is better to refactor the current code.

@xrotwang
Contributor

@Anaphory I don't think this is necessarily a bug. The problem is that transforming, rather than just segmenting, is a two-step process, and both steps may require error handling. The error argument to Tokenizer.transform is honored by passing it to the first step:

word = self.op.tree.parse(word, error)

Now for the second step, mapping the output of the segmentation to a different symbol set according to the profile, it's not entirely clear what should be done. The default implementation of the replace strategy seems somewhat appropriate: Replace whatever isn't in the profile with the replacement marker. This is the rationale for using self._errors['replace'] in line 262.

I agree, though, that this is not particularly transparent, and also think that we might need two sets of error handling for the two processing steps, since using just one set may be somewhat surprising:

>>> from segments import Tokenizer, Profile
>>> t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}), errors_replace=lambda s: '<{0}>'.format(s))
>>> t('ab', column='IPA')
'b <<b>>'

In this example, 'b' is not in the profile: the segmentation step wraps it as '<b>' via the custom handler, and the mapping step, not finding '<b>' in the profile either, applies the hard-coded replace strategy with the same handler, yielding '<<b>>'. So, while I don't think the current code has a bug, I'm open to an enhancement that allows a second set of error handling functions for the mapping step.
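
To make the idea concrete, a sketch of what that could look like (the errors_transform argument below is hypothetical and does not exist in the current API):

from segments import Tokenizer, Profile

t = Tokenizer(
    profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
    errors_replace=lambda s: '<{0}>'.format(s),  # handlers for the segmentation step
    errors_transform=lambda s: s,                # hypothetical: handlers for the mapping step
)
# With such a split, t('ab', column='IPA') would yield 'b <b>'
# instead of 'b <<b>>'.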

@bambooforest
Collaborator

In an email thread we mentioned that we could have a default orthography profile (OP) for strict IPA as a data sanity check (the default when ipa=True), so the user doesn't have to create an OP.

And since the guys are in the process of assembling orthography profiles from lexibank for potential reuse, we could integrate segments with the collection of profiles, which would also be useful for tokenization and transformation.
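
For context, roughly how ipa=True is used today without a profile (a sketch from memory, not checked against a specific segments release; the exact output may differ):

from segments import Tokenizer

t = Tokenizer()                # no orthography profile
print(t('tʰɔxtər', ipa=True))  # IPA-aware tokenization, e.g. 'tʰ ɔ x t ə r'
# The proposal is for ipa=True to also validate against strict IPA,
# giving a data sanity check without a user-supplied profile.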
