Add support for customizing textflags in pymupdf4llm #214

ffracassi-reply · 2025-01-10T17:57:50Z

Hi,

currently it does not seem possible to pass custom text flags to page.get_text() when using pymupdf4llm. This makes it impossible to:

- Switch off TEXT_PRESERVE_LIGATURES
- Switch on TEXT_DEHYPHENATE
- Switch off TEXT_CID_FOR_UNKNOWN_UNICODE

Access to these flags could be beneficial for specific use cases. Would it be possible to allow customizing the text flags used during conversion via a dedicated parameter of to_markdown()?
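For illustration, the requested combination (ligatures off, dehyphenation on, CID replacement off) comes down to plain bit arithmetic on the flags value. The following is a minimal sketch, not pymupdf4llm code: `TextFlag`, `adjust_flags` and `default_flags` are hypothetical names, and the numeric values are stand-ins that should be replaced by the constants PyMuPDF exports.

```python
from enum import IntFlag


class TextFlag(IntFlag):
    """Stand-ins for the extraction flags named in this issue.

    Assumption: the numeric values are illustrative only. Real code should
    use the constants exported by pymupdf (e.g. pymupdf.TEXT_DEHYPHENATE).
    """
    PRESERVE_LIGATURES = 1
    DEHYPHENATE = 16
    CID_FOR_UNKNOWN_UNICODE = 128


def adjust_flags(base: int, on: int = 0, off: int = 0) -> int:
    """Return `base` with the bits in `on` set and the bits in `off` cleared."""
    return (int(base) | int(on)) & ~int(off)


# Hypothetical default: ligatures and CID replacement on, dehyphenation off.
default_flags = TextFlag.PRESERVE_LIGATURES | TextFlag.CID_FOR_UNKNOWN_UNICODE

# The combination asked for in this issue.
custom_flags = adjust_flags(
    default_flags,
    on=TextFlag.DEHYPHENATE,
    off=TextFlag.PRESERVE_LIGATURES | TextFlag.CID_FOR_UNKNOWN_UNICODE,
)
```

When calling PyMuPDF directly, a value built this way can be passed as page.get_text("text", flags=custom_flags). The request here is for an equivalent way to pass it through to_markdown().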

JorjMcKie · 2025-01-11T00:08:49Z

This is possible only in a limited way:

TEXT_PRESERVE_LIGATURES - we plan to switch this off permanently and will therefore not expose it in the API. It makes no sense to expect every downstream markdown renderer to display these compound characters adequately. Separate characters are simply good enough.
TEXT_DEHYPHENATE - this is off and must stay off, because otherwise word fragments would be included in the bounding boxes of lines and spans, which makes compiling them into markdown impossible. It cannot be exposed in the API.
TEXT_CID_FOR_UNKNOWN_UNICODE - we are considering making this accessible via the API. I think that switching it off (in contrast to what we have today) would even be the best choice, because only then is the invalid Unicode character � (U+FFFD) returned, which could be used to decide about dynamically invoking OCR. An additional complication is coming up here, as MuPDF is about to introduce an additional TEXT_GID_FOR_UNKNOWN_UNICODE (glyph id) option. In addition, per-character flags will inform us about the origin of a character, like:
- is this a synthesized space?
- is this a glyph id or a cid?

So we need to give more consideration to a meaningful representation of all this within the API.
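The idea of using the invalid Unicode character to decide about OCR could look roughly like the sketch below. It assumes the page text was extracted with TEXT_CID_FOR_UNKNOWN_UNICODE switched off, so that unmapped characters arrive as U+FFFD. The names `unknown_ratio` and `should_ocr` and the 5% threshold are hypothetical and not part of PyMuPDF or pymupdf4llm.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the invalid Unicode character


def unknown_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def should_ocr(text: str, threshold: float = 0.05) -> bool:
    """Hypothetical rule: OCR the page if too many characters are unmapped."""
    return unknown_ratio(text) > threshold
```

A caller would run `should_ocr()` on the extracted text of each page and fall back to OCR only for pages where it returns True. A suitable threshold depends on the documents at hand.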

JorjMcKie added the enhancement (New feature or request) label Jan 10, 2025
