Applying this patch: cant index any pdf file any more #1

Open
maxodoble opened this issue Apr 6, 2018 · 2 comments

@maxodoble

maxodoble commented Apr 6, 2018

Hi,
I tried this patch on a test repository running Alfresco 201707GA.

The high CPU usage is gone for the problematic test PDF page, but now no new PDF gets indexed any more.
The log shows:

2018-04-06 12:50:24,851 WARN [content.metadata.AbstractMappingMetadataExtracter] [catalina-exec-35] Metadata extraction failed (turn on DEBUG for full error): Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@24dfb72e Content: ContentAccessor[ contentUrl=store://2018/4/6/12/50/908747cc-a822-418a-87cb-4e79d8130a5f.bin, mimetype=application/pdf, size=364282, encoding=UTF-8, locale=en_US] Failure: org/apache/tika/parser/pdf/PDF2XHTMLnull
2018-04-06 12:50:45,094 DEBUG [content.metadata.MetadataExtracterConfigImpl] [catalina-exec-50] Tika metadata options passed to tika parser: TIKA_PARSER_PARSE_SHAPES=false
When I remove your patch, new PDF files are indexed correctly again.

@angelborroy, any idea why this is happening?
Cheers,

Max

@angelborroy-ks
Contributor

The Tika version is different in 201707-GA, so a different patch is probably required: this patch was developed for 201605-GA.

I know that Alfresco itself has fixed this issue in 201803-EA, but I don't know when a new "GA" release will be available.
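
One way to test the version-mismatch explanation is to check whether the class named in the failure message can be loaded from the jars the repository actually uses. The slash-separated name in the log (org/apache/tika/parser/pdf/PDF2XHTML) resembles the message of a NoClassDefFoundError, but that reading is an assumption, not something the log confirms. The sketch below is only a diagnostic aid: `TikaClassProbe` and `locate` are hypothetical names, and the classpath you run it with (for example the jars under the webapp's WEB-INF/lib directory) depends on your installation.

```java
import java.security.CodeSource;

public class TikaClassProbe {

    /**
     * Returns the location (jar or directory) a class was loaded from,
     * or a short explanation if the class could not be loaded.
     */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> c = Class.forName(className, false, TikaClassProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(JDK / bootstrap class loader)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError and similar binary-incompatibility failures
            return "NOT LOADABLE: " + e;
        }
    }

    public static void main(String[] args) {
        // Default to the class from the log plus Tika's PDF parser; other names can be passed as arguments.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.tika.parser.pdf.PDF2XHTML",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the output shows that PDF2XHTML is not loadable, or that it is loaded from a jar other than the Tika jar shipped with 201707-GA, that would be consistent with the patch having been built against a different Tika version.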

@sumitt

sumitt commented Dec 28, 2023

Hi @angelborroy-ks @maxodoble,

I am facing the same problem on Alfresco 201707GA: after applying the patch, no PDF content is indexed any more. PDFs can be found only through their metadata.

Is there any solution? @maxodoble Have you found anything to resolve content indexing?

Best regards,
Sumit Tomar
