
Tokenizer.transform ignores error handling argument #44

Open
Anaphory opened this issue Oct 10, 2019 · 6 comments

@Anaphory

The signature of Tokenizer.transform suggests that the function can take an argument describing how to deal with undefined segments:

def transform(self, word, column=Profile.GRAPHEME_COL, error=errors.replace):

but the actual fallback
target = self._errors['replace'](token)

is always the replace strategy.

(I came here looking for a keep strategy that would bounce the errors back to me for inspection instead of transforming them; even ignore just replaces them with nothing. But that's a different issue.)
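
For example, a minimal sketch of the behavior I mean (assuming error accepts a callable, as the errors.replace default in the signature suggests; the exact replacement marker may vary by version):

from segments import Tokenizer, Profile

t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}))
# Ask transform to keep unknown graphemes untouched in the segmentation step:
print(t.transform('ax', column='IPA', error=lambda s: s))
# If the error argument were honored throughout, one would expect 'b x';
# instead the mapping step applies the hard-coded replace strategy to 'x',
# producing something like 'b �'.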

@LinguList
Contributor

I agree, this is not that transparent; but I recommend you have a look at pylexibank first, where we use segments for exactly this behavior. That will show you a concrete tweak for how you can in fact trigger the behavior you want.

@LinguList
Contributor

That is to say, it is not impossible; it is just not as transparent and easy as the parameter suggests.

@Anaphory
Author

I am able to trigger the behaviour I want. I just think that the current implementation has a bug.

@LinguList
Contributor

If you play with this, please also submit a fix to pylexibank, to make sure we don't break compatibility there. I agree that it is better to refactor the current code.

@xrotwang
Contributor

@Anaphory I don't think this is necessarily a bug. The problem is that transforming, rather than just segmenting, is a two-step process, and both steps may require error handling. The error argument to Tokenizer.transform is honored by passing it to the first step:

word = self.op.tree.parse(word, error)

Now for the second step, mapping the output of the segmentation to a different symbol set according to the profile, it's not entirely clear what should be done. The default implementation of the replace strategy seems somewhat appropriate: Replace whatever isn't in the profile with the replacement marker. This is the rationale for using self._errors['replace'] in line 262.

I agree, though, that this is not particularly transparent, and also think that we might need two sets of error handling for the two processing steps, since using just one set may be somewhat surprising:

>>> from segments import Tokenizer, Profile
>>> t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}), errors_replace=lambda s: '<{0}>'.format(s))
>>> t('ab', column='IPA')
'b <<b>>'

In this example, 'b' is not in the profile: the segmentation step wraps it as '<b>' via the custom handler, and the mapping step, not finding '<b>' in the profile either, applies the hard-coded replace strategy with the same handler, yielding '<<b>>'. So, while I don't think the current code has a bug, I'm open to an enhancement that allows a second set of error handling functions for the mapping step.
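
To make the idea concrete, a sketch of what that could look like (the errors_transform argument below is hypothetical and does not exist in the current API):

from segments import Tokenizer, Profile

t = Tokenizer(
    profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
    errors_replace=lambda s: '<{0}>'.format(s),  # handlers for the segmentation step
    errors_transform=lambda s: s,                # hypothetical: handlers for the mapping step
)
# With such a split, t('ab', column='IPA') would yield 'b <b>'
# instead of 'b <<b>>'.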

@bambooforest
Collaborator

In an email thread we mentioned that we could have a default orthography profile (OP) for strict IPA as a data sanity check (the default when ipa=True), so the user doesn't have to create an OP.

And since the guys are in the process of assembling orthography profiles from lexibank for potential reuse, we could integrate segments with the collection of profiles, which would also be useful for tokenization and transformation.
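
For context, roughly how ipa=True is used today without a profile (a sketch from memory, not checked against a specific segments release; the exact output may differ):

from segments import Tokenizer

t = Tokenizer()                # no orthography profile
print(t('tʰɔxtər', ipa=True))  # IPA-aware tokenization, e.g. 'tʰ ɔ x t ə r'
# The proposal is for ipa=True to also validate against strict IPA,
# giving a data sanity check without a user-supplied profile.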
