add native text rendering to muPDF backend #1159

Closed
wants to merge 1 commit into from


Contributor

@mbway mbway commented Sep 2, 2024

This is an investigation into the feasibility of rendering text to PDF in a way that lets a viewer application understand the text information (i.e. select and copy the text). This was requested in #1158.

This PR introduces the TextPolicy.NATIVE setting, which passes text information directly to the backend rather than 'baking' it into a series of glyph shapes, since baking loses the original text information.

If the backend supports text, it can implement the draw_text method, which gets called when TextPolicy.NATIVE is used. The PDF backend supports loading fonts and rendering text with arbitrary 2D transformations defined by a matrix.
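As an illustration of what "arbitrary 2D transformations defined by a matrix" can mean for text, here is a minimal sketch that composes one affine matrix from an insertion point, text height, rotation, width factor and oblique angle. The function names and the parameter set are assumptions made for this example and are not taken from the PR; the six-number layout follows the PDF convention x' = a*x + c*y + e, y' = b*x + d*y + f.

```python
import math


def text_matrix(insert, height, rotation_deg=0.0, width_factor=1.0, oblique_deg=0.0):
    """Hypothetical helper: map unit text space to drawing space.

    Returns (a, b, c, d, e, f) in the PDF convention:
        x' = a*x + c*y + e
        y' = b*x + d*y + f
    """
    sx = height * width_factor  # horizontal scale
    sy = height                 # vertical scale
    shear = math.tan(math.radians(oblique_deg))
    cos_r = math.cos(math.radians(rotation_deg))
    sin_r = math.sin(math.radians(rotation_deg))
    # Scale and shear first, then rotate, then translate to the insert point.
    a = cos_r * sx
    b = sin_r * sx
    c = cos_r * shear * sy - sin_r * sy
    d = sin_r * shear * sy + cos_r * sy
    e, f = insert
    return (a, b, c, d, e, f)


def apply_matrix(m, x, y):
    """Transform a single point with the matrix returned by text_matrix()."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)
```

A backend that receives such a matrix can hand it to its PDF library as a whole instead of decomposing it again into position, rotation and scale.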

The current implementation relies on a magic number to scale the font size so that it matches the baked text from the frontend (which I am treating as ground truth). For the files I have access to, a value of 1.375 results in almost identical text; however, other fonts may require a different scale factor to get exact results.
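A minimal sketch of how such a factor could be applied and re-estimated for another font. It assumes the factor multiplies the text height coming from the frontend to give the font size passed to the PDF library; the PR description does not state exactly which quantity is scaled, so treat the function names and this interpretation as hypothetical. The recalibration step relies only on glyph size being linear in font size.

```python
FONT_SCALE = 1.375  # value reported in this PR; other fonts may need another value


def native_font_size(text_height, scale=FONT_SCALE):
    """Hypothetical: font size to request so native text matches baked text."""
    return text_height * scale


def recalibrate(scale, baked_height, native_height):
    """Correct a trial scale factor from one measurement.

    baked_height:  measured height of the baked glyphs (treated as ground truth)
    native_height: measured height of the native text rendered with `scale`
    """
    if native_height <= 0:
        raise ValueError("native_height must be positive")
    # Rendered glyph height grows linearly with font size.
    return scale * baked_height / native_height
```

For example, if native text rendered with a trial factor of 1.0 measures 8 units where the baked glyphs measure 11, the corrected factor is 11 / 8 = 1.375.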

In the following screenshot, white is baked text and red is 'native' PDF text:
[image: screenshot comparing baked and native text]

Contributor Author

mbway commented Sep 2, 2024

Another possibility for the text rendering is to use render_mode=3, which makes the text invisible but still lets the area it occupies be selected. This is typically used for overlaying OCR text on scanned documents. This approach could be used if you prefer to always use the baked glyphs (because you can do advanced clipping etc.) but still want the option to select the text. I think using insert_text instead of baking glyphs should result in a smaller file size though, so it may be desirable in some situations even if it is slightly less accurate.
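To show what render mode 3 amounts to at the PDF level, here is a small sketch that builds the content-stream operators for one invisible text run. `3 Tr` is the PDF text rendering mode for text that is neither filled nor stroked, and `Tm` sets the text matrix. The helper names and the font resource name `F1` are assumptions for the example, and the simple string escaping only covers text the font's encoding can represent directly; a real backend would go through its PDF library (here, the render_mode argument of insert_text) rather than writing operators by hand.

```python
def pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def invisible_text_ops(text, matrix, font_resource="F1", font_size=1.0):
    """Hypothetical helper: content-stream operators for invisible text.

    matrix is (a, b, c, d, e, f); with font_size=1 the scale lives in the matrix.
    """
    numbers = " ".join(f"{value:.4f}" for value in matrix)
    return "\n".join(
        [
            "BT",                                   # begin text object
            f"/{font_resource} {font_size:g} Tf",   # select font and size
            "3 Tr",                                 # render mode 3: invisible
            f"{numbers} Tm",                        # place and orient the text
            f"{pdf_string(text)} Tj",               # show the string
            "ET",                                   # end text object
        ]
    )
```

Layering such a run over the baked glyphs keeps the visible result unchanged while giving the viewer something to select.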

Owner

mozman commented Sep 3, 2024

Sorry, but I will not add more complexity to the rendering process.

Contributor Author

mbway commented Sep 3, 2024

Are you able to elaborate on what you don't like about this solution? Maybe I can find something that you like better?

if bbox is None:
    return
abstract_font = self.text_engine.get_font(font_face)
self.backend.draw_text(
Owner

This seems to bypass the clipping stage, so text in viewports and clipped INSERTs will always be drawn, even where it should be clipped?

Contributor Author

Is clipping done in the pipeline? If so, I think you are correct. I didn't really handle clipping, so I can't comment on whether it would be difficult to add.
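One way the clipping concern could be approached, sketched under strong assumptions: the clipping region is an axis-aligned rectangle and the text's bounding box is known. Real VIEWPORT and clipped INSERT boundaries can be more general, and none of these names exist in the PR. The idea is to emit native text only when it is entirely inside the clip, skip it when it is entirely outside, and fall back to baked glyphs (which go through the existing clipping stage) when it is partially clipped.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def classify_text(bbox: Rect, clip: Rect) -> str:
    """Hypothetical decision: 'native', 'skip' or 'baked' for one text entity."""
    outside = (
        bbox.xmax <= clip.xmin
        or bbox.xmin >= clip.xmax
        or bbox.ymax <= clip.ymin
        or bbox.ymin >= clip.ymax
    )
    if outside:
        return "skip"    # nothing of the text is visible
    inside = (
        bbox.xmin >= clip.xmin
        and bbox.xmax <= clip.xmax
        and bbox.ymin >= clip.ymin
        and bbox.ymax <= clip.ymax
    )
    if inside:
        return "native"  # no clipping needed, text can be passed through
    return "baked"       # partially clipped: let the glyph path handle it
```

This keeps native text out of the cases where it would be drawn incorrectly, at the cost of mixing both text representations in one file.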

@@ -424,6 +429,46 @@ def draw_image(self, image_data: ImageData, properties: BackendProperties) -> No
oc=self.get_optional_content_group(properties.layer),
)

def register_font(self, font: AbstractFont) -> int:
Owner

I guess this implementation cannot handle SHX fonts.

Contributor Author

I would assume so. I wonder what AutoCAD and other CAD programs do when exporting if non-TTF fonts are used?

Hypothetically, if both exact rendering and selectable/interpretable text were desirable, invisible text could be layered on top of the baked glyphs.

Owner

mozman commented Sep 3, 2024

I liked the fact that the burden of rendering text in backends was removed. This feature is only optional, but questions will still arise as to why the text looks different with different backends.

This implementation skips the clipping stage and renders text outside of VIEWPORTs and clipped INSERT entities and of course cannot render SHX fonts.

I think this feature causes more problems than it solves.

Contributor Author

mbway commented Sep 3, 2024

I definitely see your viewpoint that users may be confused by the edge cases and that discarding text information before the backend results in simpler backends, and for that reason I wouldn't suggest making this text policy the default. But I would think that in some cases the ability to further process/analyze the output outweighs inaccuracies in rendering. I am not a user with this requirement though, so I don't mind if we skip this feature.

It is a shame that without giving the backend more access, a user with this requirement cannot easily maintain a custom backend that handles text differently. I think a large restructuring would be required to allow this flexibility.

For now, I suppose anyone who needs text information in the resulting PDF can use this branch and let me know if they want it rebased in future.

Owner

mozman commented Sep 3, 2024

I created this tool with my needs in mind (I have been working with CAD for civil engineers for over 25 years) and I don't understand why users would want to create, edit and extract text in DXF files when that's what CAD applications are for. I have always been interested only in geometry stored in DXF files and automating geometry creation - an application independent scripting tool.

However, if someone wants to select/extract text from a DXF file I recommend a tool called ezdxf😄:

import ezdxf

doc = ezdxf.readfile("your.dxf")
for entity in doc.query("MTEXT TEXT"):
    print(entity.dxf.text)  # or write it into a file

Contributor Author

mbway commented Sep 3, 2024

There is some benefit in being able to access the text in its rendered form (i.e. position, colour etc) as DXF does not store this information plainly (hence the need for the rendering frontend). However, the use case for this may be niche and I'll let users who have an actual use case advocate for it if there are any.

Another benefit of letting PDF handle the text is smaller file sizes, but again that may not be a priority.

@mbway mbway closed this Sep 3, 2024