Tokenizer.transform ignores error handling argument #44
I agree that this is not that transparent, but I recommend you have a look at pylexibank first, where we use segments for exactly this behavior; it will show you concretely how to trigger it.

That is to say: it is not impossible, it is just not as transparent and easy as the parameter suggests.

I am able to trigger the behaviour I want. I just think that the current implementation has a bug.

If you work on this, please also submit a fix to pylexibank, to make sure we don't break compatibility there. I agree that it is better to refactor the current code.
@Anaphory I don't think this is necessarily a bug. The problem is that transforming, rather than just segmenting, is a two-step process, and both steps may require error handling. The first step, the segmentation, applies the error handling as expected (see `segments/src/segments/tokenizer.py`, line 254 at 369e36d).

Now for the second step, mapping the output of the segmentation to a different symbol set according to the profile, it's not entirely clear what should be done; the current default falls back to the `replace` strategy. I agree, though, that this is not particularly transparent, and I also think that we might need two sets of error handling for the two processing steps, since using just one set may be somewhat surprising:

```python
>>> from segments import Tokenizer, Profile
>>> t = Tokenizer(profile=Profile({'Grapheme': 'a', 'IPA': 'b'}),
...               errors_replace=lambda s: '<{0}>'.format(s))
>>> t('ab', column='IPA')
'b <<b>>'
```

So, while I don't think the current code has a bug, I'm open to an enhancement that allows a second set of error handling functions for the mapping step.
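To make the surprise concrete, here is a toy sketch of a two-step transform with a separate error handler per step. This is not segments' actual implementation; `transform`, `seg_err`, and `map_err` are invented names for illustration only.

```python
def transform(text, mapping, seg_err, map_err):
    """Toy two-step transform (illustration only, not segments' code).

    Step 1 segments `text` against the known graphemes (here: the keys
    of `mapping`, one character each); unknown characters go through
    `seg_err`. Step 2 maps each segment to the target column; segments
    without a mapping go through `map_err`.
    """
    segmented = [ch if ch in mapping else seg_err(ch) for ch in text]
    return ' '.join(mapping.get(s, map_err(s)) for s in segmented)

wrap = lambda s: '<{0}>'.format(s)

# One shared handler wraps unknowns twice, like the 'b <<b>>' above:
print(transform('ab', {'a': 'b'}, wrap, wrap))         # b <<b>>
# Separate handlers: step 2 passes step-1 replacements through as-is:
print(transform('ab', {'a': 'b'}, wrap, lambda s: s))  # b <b>
```

With one shared handler the step-1 replacement `<b>` is itself unknown to the mapping, so it gets wrapped again; a second handler set for the mapping step avoids that.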
In an email thread we mentioned that we could ship a default orthography profile (OP) for strict IPA as a data sanity check (the default for ipa=True), so the user doesn't have to create an OP themselves. And since orthography profiles from lexibank are currently being assembled for potential reuse, we could integrate segments with that collection of profiles, which would also be useful for tokenization and transformation.
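A strict sanity check of that kind could, in spirit, raise on invalid input instead of substituting. A hypothetical sketch, not segments' API; `check_ipa` and the inventory argument are made up:

```python
def check_ipa(text, ipa_graphemes):
    """Raise on the first character outside the given (toy) IPA inventory.

    Hypothetical sketch of a strict sanity check; not segments' API.
    """
    for ch in text:
        if ch not in ipa_graphemes:
            raise ValueError('not in the IPA inventory: {0!r}'.format(ch))
    return True

check_ipa('pa', {'p', 'a'})      # passes silently
# check_ipa('p4', {'p', 'a'})    # would raise ValueError
```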
The signature of `Tokenizer.transform` suggests that the function can take an argument describing how to deal with undefined segments (`segments/src/segments/tokenizer.py`, line 229 at 369e36d), but the actual fallback (`segments/src/segments/tokenizer.py`, line 262 at 369e36d) is always the `replace` strategy.

(I came here looking for a `keep` strategy that would let me inspect the errors by bouncing them back to me instead of transforming them; even `ignore` replaces them with just nothing, but that's a different issue.)
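For what it's worth, a `keep` strategy amounts to an identity error handler. A toy sketch contrasting the three strategies for the segmentation step; this is not segments' API, and all names here are invented for illustration:

```python
def keep_err(ch):
    return ch            # bounce the offending character back unchanged

def ignore_err(ch):
    return ''            # drop it silently

def replace_err(ch):
    return '\ufffd'      # substitute a replacement marker

def segment(text, graphemes, on_error):
    """Toy segmenter: one character per grapheme, unknowns via on_error."""
    out = (ch if ch in graphemes else on_error(ch) for ch in text)
    return [s for s in out if s]   # 'ignore' yields '' -> dropped

print(segment('abc', {'a'}, keep_err))     # ['a', 'b', 'c']
print(segment('abc', {'a'}, ignore_err))   # ['a']
```

The `keep` variant preserves the offending material so the caller can find and fix it, rather than having it silently rewritten or dropped.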