Add support for customizing textflags in pymupdf4llm #214

Open
ffracassi-reply opened this issue Jan 10, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@ffracassi-reply

Hi,

Currently it does not seem possible to pass custom text flags to page.get_text() when using pymupdf4llm. This makes it impossible to:

  • Switch off TEXT_PRESERVE_LIGATURES
  • Switch on TEXT_DEHYPHENATE
  • Switch off TEXT_CID_FOR_UNKNOWN_UNICODE

Having access to these flags might be beneficial for specific use cases. Would it be possible to allow customizing the text flags used for the conversion through a dedicated parameter in to_markdown()?
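For illustration, here is a minimal sketch of the flag combination described above, applied by calling PyMuPDF directly rather than through pymupdf4llm. The helper names `adjust_flags` and `extract_plain_text` are hypothetical. The sketch assumes a PyMuPDF version that can be imported as `pymupdf` and that defines `TEXT_CID_FOR_UNKNOWN_UNICODE`. It does not show an existing `to_markdown()` parameter.

```python
from typing import Iterable


def adjust_flags(base: int, on: Iterable[int] = (), off: Iterable[int] = ()) -> int:
    """Return `base` with every bit in `on` set and every bit in `off` cleared."""
    flags = base
    for bit in on:
        flags |= bit
    for bit in off:
        flags &= ~bit
    return flags


def extract_plain_text(pdf_path: str, page_number: int = 0) -> str:
    """Hypothetical helper: plain-text extraction of one page with the three
    flags changed as requested in this issue."""
    # Imported lazily so that adjust_flags() is usable without PyMuPDF installed.
    import pymupdf

    flags = adjust_flags(
        pymupdf.TEXTFLAGS_TEXT,  # PyMuPDF's default flag set for "text" output
        on=(pymupdf.TEXT_DEHYPHENATE,),
        off=(
            pymupdf.TEXT_PRESERVE_LIGATURES,
            pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE,
        ),
    )
    with pymupdf.open(pdf_path) as doc:
        return doc[page_number].get_text("text", flags=flags)
```

This only covers plain text extraction. Whether the same flags are safe inside the markdown conversion is a separate question.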

@JorjMcKie JorjMcKie added the enhancement New feature or request label Jan 10, 2025
@JorjMcKie
Contributor

This is possible only to a limited extent:

  • TEXT_PRESERVE_LIGATURES: we plan to switch this off permanently, so it will likewise not be exposed in the API. It makes no sense to expect a downstream markdown renderer to display these compound characters adequately. Separate characters are good enough.
  • TEXT_DEHYPHENATE: this is off and has to stay off. Otherwise word fragments would be included in the bounding boxes of lines and spans, which would make the compilation into markdown impossible. It cannot be exposed in the API.
  • TEXT_CID_FOR_UNKNOWN_UNICODE: we are considering making this accessible via the API. I think that switching it off (in contrast to what we have today) would even be the best choice, because only then is the invalid Unicode character � returned, which could be used to decide whether to invoke OCR dynamically. A further complication is that MuPDF is about to introduce an additional TEXT_GID_FOR_UNKNOWN_UNICODE (glyph id) option. In addition, per-character flags will tell us about the origin of a character, for example:
    • is this a synthesized space?
    • is this a glyph id or a cid?
    So we need to think more about a meaningful representation within the API.
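As a rough sketch of the OCR idea mentioned for TEXT_CID_FOR_UNKNOWN_UNICODE: once extraction returns the replacement character U+FFFD for unknown glyphs, a caller could count those characters and fall back to OCR above some ratio. The function names and the 5% threshold below are assumptions made for illustration. This is not how pymupdf4llm decides today.

```python
REPLACEMENT_CHAR = "\ufffd"  # the invalid-Unicode marker shown as � above


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return visible.count(REPLACEMENT_CHAR) / len(visible)


def should_ocr(text: str, threshold: float = 0.05) -> bool:
    """Hypothetical heuristic: suggest OCR when too many characters are unknown.

    The default threshold is an arbitrary assumption and would need tuning.
    """
    return replacement_ratio(text) > threshold
```

A caller would pass the extracted page text to `should_ocr()` and only run OCR on pages where it returns True.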
