Currently it does not seem possible to pass custom text flags to page.get_text() when using pymupdf4llm. This makes it impossible to:
Switch off TEXT_PRESERVE_LIGATURES
Switch on TEXT_DEHYPHENATE
Switch off TEXT_CID_FOR_UNKNOWN_UNICODE
Access to these flags could be useful for specific use cases. Would it be possible to make the text flags customizable during conversion, for example via a dedicated parameter of to_markdown()?
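To make the request concrete, here is a minimal sketch of the flag combination described above, applied by calling PyMuPDF directly rather than through pymupdf4llm. The helper names `adjust_flags` and `extract_plain_text` are made up for illustration, and the result is plain text, not markdown, so this is only a workaround under those assumptions.

```python
def adjust_flags(base: int, on: int = 0, off: int = 0) -> int:
    """Return `base` with the bits in `on` set and the bits in `off` cleared."""
    return (base | on) & ~off


def extract_plain_text(path: str, page_number: int = 0) -> str:
    """Hypothetical helper: plain-text extraction with the three flags changed.

    Assumes a PyMuPDF version that is importable as `pymupdf`
    (older releases use `import fitz` instead).
    """
    import pymupdf  # imported lazily so adjust_flags() works without PyMuPDF

    flags = adjust_flags(
        pymupdf.TEXTFLAGS_TEXT,  # PyMuPDF's default flags for "text" output
        on=pymupdf.TEXT_DEHYPHENATE,
        off=pymupdf.TEXT_PRESERVE_LIGATURES | pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE,
    )
    with pymupdf.open(path) as doc:
        return doc[page_number].get_text("text", flags=flags)
```

Something like `extract_plain_text("input.pdf")` would then return the page text with ligatures split, hyphenated line ends joined, and unknown characters reported as U+FFFD.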
TEXT_PRESERVE_LIGATURES - we plan to switch this off permanently, so it will not be exposed in the API. It makes no sense to expect every downstream markdown renderer to display these compound characters adequately. Separate characters are simply good enough.
TEXT_DEHYPHENATE - this is off and needs to stay off, because otherwise word fragments would be included in the bounding boxes of lines and spans, making compilation into markdown impossible. This flag cannot be exposed in the API.
TEXT_CID_FOR_UNKNOWN_UNICODE - we are considering making this accessible via the API. I think that switching it off (in contrast to what we have today) would even be the best choice, because only then is the invalid-Unicode character � (U+FFFD) returned, which could be used to decide dynamically whether to invoke OCR. An additional complication is that MuPDF is about to introduce a further TEXT_GID_FOR_UNKNOWN_UNICODE (glyph id) option. In addition, per-character flags will inform us about the origin of a character, like:
is this a synthesized space?
is this a glyph id or a cid?
So we need to give more thought to a meaningful representation within the API.
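As a rough illustration of the OCR idea mentioned above, the following sketch counts U+FFFD characters in extracted text and compares their share against a threshold. It assumes the text was extracted with TEXT_CID_FOR_UNKNOWN_UNICODE switched off; the function names and the 5% threshold are arbitrary choices for this example, not anything pymupdf4llm provides.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def replacement_ratio(text: str) -> float:
    """Share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return visible.count(REPLACEMENT_CHAR) / len(visible)


def should_ocr(text: str, threshold: float = 0.05) -> bool:
    """Suggest OCR when too many characters could not be mapped to Unicode.

    The default threshold is an assumption and would need tuning on real documents.
    """
    return replacement_ratio(text) > threshold
```

A caller could run this per page and fall back to OCR only for pages where `should_ocr()` returns True.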