
Running OCR gives no results and NS_ERROR_FILE_NOT_FOUND #88

Open
TrakJohnson opened this issue Dec 17, 2024 · 21 comments

@TrakJohnson

TrakJohnson commented Dec 17, 2024

Hi, I just installed the plugin. When trying to OCR my first file, I get the following error in the developer console:

NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]

I first thought that I had misconfigured tesseract/pdftoppm, but everything seems to look fine. Are there any ways to investigate this further? I read through #87, but it doesn't seem related. Thanks!

Here's my configuration:

  • Zotero 7.0.11
  • Fedora Linux 41 / Linux 6.11.10-300.fc41.x86_64
  • Libraries:
❯ /usr/bin/pdftoppm -v
pdftoppm version 24.08.0
Copyright 2005-2024 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011, 2022 Glyph & Cog, LLC
❯ /usr/bin/tesseract -v
tesseract 5.4.1
 leptonica-1.84.1
  libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 3.0.2) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.3.1.zlib-ng : libwebp 1.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX512VNNI
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libcurl/8.9.1 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng libidn2/2.3.7 nghttp2/1.62.1
  • Zotero-OCR settings:

image

@aborel
Collaborator

aborel commented Dec 19, 2024

First, we can check whether the problem happens at the pdftoppm stage or at the tesseract stage. Are the PNG images saved to the Zotero item folder?

@aborel aborel self-assigned this Dec 19, 2024
@zzyzx-dc

zzyzx-dc commented Jan 3, 2025

I am having the same issue and came here to see if anyone else was. Zotero 7.0.11 on Fedora Workstation 41.

Could not get children of file(/opt) because it does not exist
Error code: NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory] zotero-ocr.js:87

Additionally, like the original poster, I had to manually set the file paths to /usr/bin/tesseract and /usr/bin/pdftoppm; otherwise it returned OperationError: Could not parse path (tesseract): NS_ERROR_FILE_UNRECOGNIZED_PATH. The documentation helped me realize I needed to locate the file paths myself. Thanks!

@aborel
Collaborator

aborel commented Jan 3, 2025

Since the OP didn't answer, maybe you can check whether pdftoppm did its job?

@zzyzx-dc

zzyzx-dc commented Jan 3, 2025

Sure thing - I am not sure how to check so you might have to walk me through it. When I go to the item folder (Zotero item - right click - Show file) there is only the PDF file.

@aborel
Collaborator

aborel commented Jan 4, 2025

If you have selected "Save the intermediate PNGs as well in the folder" like the OP, then pdftoppm has not worked at all.
Can you post a screenshot of your Zotero-OCR settings?
What is the output in a shell if you run
/usr/bin/tesseract
and
/usr/bin/pdftoppm
?
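
For anyone following along, the same checks can be scripted. This is only a minimal sketch: check_tool is a made-up helper name, not part of Zotero-OCR, and the two paths are simply the ones reported in this thread. Standard input is redirected from /dev/null so that a tool that would otherwise wait for input returns immediately.

```shell
#!/usr/bin/env bash
# check_tool PATH: report whether PATH is a usable executable file.
# (check_tool is a hypothetical helper, not part of Zotero-OCR.)
check_tool() {
    local path="$1"
    if [ ! -e "$path" ]; then
        echo "missing: $path"; return 1
    elif [ -d "$path" ]; then
        echo "directory: $path"; return 1
    elif [ ! -x "$path" ]; then
        echo "not executable: $path"; return 1
    fi
    echo "ok: $path"
}

# Paths as configured in this thread; adjust them to your own settings.
for tool in /usr/bin/tesseract /usr/bin/pdftoppm; do
    # -v prints the version for both tools; stdin comes from /dev/null
    # so nothing can sit waiting for input.
    if check_tool "$tool"; then
        "$tool" -v </dev/null 2>&1 | head -n 1
    fi
done
```

If both lines start with "ok:", the executables exist for your shell. That does not prove that Zotero itself can see them, which is a separate question.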

@zzyzx-dc

zzyzx-dc commented Jan 4, 2025

Do you think it is a problem with Fedora or with the dnf package for pdftoppm? I can try a different installation method if you wish, or I can see if I still have an Ubuntu laptop... Never mind, I tried on an Ubuntu machine and pdftoppm also hung on that system.

If you have selected "Save the intermediate PNGs as well in the folder" like the OP, then pdftoppm has not worked at all. Can you post a screenshot of your Zotero-OCR settings?

image

What is the output in a shell if you run /usr/bin/tesseract and /usr/bin/pdftoppm ?

pdftoppm seems to hang:

$ /usr/bin/tesseract
Usage:
  /usr/bin/tesseract --help | --help-extra | --version
  /usr/bin/tesseract --list-langs
  /usr/bin/tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

$ /usr/bin/pdftoppm
^C

$ pdftoppm
^C

$ pdftoppm -h
pdftoppm version 24.08.0
Copyright 2005-2024 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011, 2022 Glyph & Cog, LLC
Usage: pdftoppm [options] [PDF-file [PPM-file-prefix]]
  -f <int>                                 : first page to print
  -l <int>                                 : last page to print
  -o                                       : print only odd pages
  -e                                       : print only even pages
  -singlefile                              : write only the first page and do not add digits
  -scale-dimension-before-rotation         : for rotated pdf, resize dimensions before the rotation
  -r <fp>                                  : resolution, in DPI (default is 150)

@zzyzx-dc

zzyzx-dc commented Jan 4, 2025

pdftoppm version 0.86.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

@zzyzx-dc

zzyzx-dc commented Jan 4, 2025

Some troubleshooting I tried:

  1. installing the newer version from source - I tried, but compiling this was a bit above my abilities.

  2. running poppler tools on the pdf myself, in a different directory. This worked. I have png files which are viewable and look good.

~/Downloads$ pdftoppm -png DeVries\ -\ 2006\ -\ Medieval\ Warfare\ and\ the\ Value\ of\ a\ Human\ Life.pdf testfile
~/Downloads$ ls testfile*
testfile-01.png  testfile-05.png  testfile-09.png  testfile-13.png  testfile-17.png  testfile-21.png  testfile-25.png  testfile-29.png  
testfile-02.png  testfile-06.png  testfile-10.png  testfile-14.png  testfile-18.png  testfile-22.png  testfile-26.png  
testfile-03.png  testfile-07.png  testfile-11.png  testfile-15.png  testfile-19.png  testfile-23.png  testfile-27.png  
testfile-04.png  testfile-08.png  testfile-12.png  testfile-16.png  testfile-20.png  testfile-24.png  testfile-28.png  

@TrakJohnson
Author

TrakJohnson commented Jan 5, 2025

Hi, sorry for the delay and thank you for your help! I have the same results as @zzyzx-dc. pdftoppm and tesseract work fine on their own (tested on the same PDF file), but when using the plugin, the PNGs don't get generated.

@aborel
Collaborator

aborel commented Jan 5, 2025

Thanks for the details!
pdftoppm being stuck (or waiting for some input) when run without any argument seems to be the normal behaviour. I wasn't aware of that.
I suspected an incorrect location for the tesseract executable, as I have recently noticed a bug in the location check code (it fails to display an error window in some cases), but @zzyzx-dc's report indicates that this is not the case.

pdftoppm manual execution: interesting. Are you sure that the command line is using the same executable? Without an explicit path there could be several versions on your system. Try to run
/usr/bin/pdftoppm -png DeVries\ -\ 2006\ -\ Medieval\ Warfare\ and\ the\ Value\ of\ a\ Human\ Life.pdf testfile
ls testfile*
and/or
which pdftoppm
to make sure it's the same one as for Zotero-OCR.
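
Since "which" only reports the first match, here is a small sketch that lists every executable of a given name on $PATH, together with the file each one resolves to. The function name list_candidates is hypothetical, and readlink -f is assumed to be available (it is part of GNU coreutils on typical Linux systems).

```shell
#!/usr/bin/env bash
# list_candidates NAME: print every executable called NAME found on $PATH,
# followed by the real file it resolves to after following symlinks.
# (list_candidates is a hypothetical helper name.)
list_candidates() {
    local name="$1" dir
    local IFS=':'          # split $PATH on colons inside this function only
    for dir in $PATH; do
        [ -n "$dir" ] || continue
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s -> %s\n' "$dir/$name" "$(readlink -f "$dir/$name")"
        fi
    done
}

list_candidates pdftoppm
```

More than one line of output pointing at different real files would mean several versions are installed, and the one the shell picks may not be the one configured in Zotero-OCR.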

@zzyzx-dc

zzyzx-dc commented Jan 5, 2025

Yeah it's the same one:

$ which pdftoppm
/usr/bin/pdftoppm

Running /usr/bin/pdftoppm yields viewable PNGs.

@aborel
Collaborator

aborel commented Jan 5, 2025

So apparently your pdftoppm installation is OK. That's a good data point, thank you.
The only suggestion I have right now is to set the OCR language to eng. I am not sure that it will help, but with the current settings you might have a problem later.
I'll spend some more time on this in the next few days. The code should probably be improved to make diagnostics easier, but I hope we can solve this case without a new release (which might introduce a few new bugs as well). Sorry about the inconvenience!

@aborel
Collaborator

aborel commented Jan 10, 2025

@zzyzx-dc I'm rewriting the pdftoppm/tesseract detection code for cases where no full path has been provided, so problems like this should happen less often in the future.
For the moment, though: if you set the OCR language to eng, does the plugin work for you or not? If not, what are the exact error messages you see in the console log?

@TrakJohnson
Author

Hi, I reran things with the OCR language set to eng in the Zotero settings. Sadly, I get the same results.

However, I've just discovered the debug output feature in Zotero, in case that is helpful:

[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

[JavaScript Error: "TypeError: this.gViewSourceUtils is undefined" {file: "resource://devtools/client/webconsole/webconsole.js" line: 223}]
viewSource@resource://devtools/client/webconsole/webconsole.js:223:5
onViewSource@resource://devtools/client/webconsole/service-container.js:43:35
onClick@resource://devtools/client/shared/components/Frame.js:265:18
invokeGuardedCallbackImpl@resource://devtools/client/shared/vendor/react-dom.js:74:10
invokeGuardedCallback@resource://devtools/client/shared/vendor/react-dom.js:111:29
invokeGuardedCallbackAndCatchFirstError@resource://devtools/client/shared/vendor/react-dom.js:125:25
executeDispatch@resource://devtools/client/shared/vendor/react-dom.js:346:42
executeDispatchesInOrder@resource://devtools/client/shared/vendor/react-dom.js:362:22
executeDispatchesAndRelease@resource://devtools/client/shared/vendor/react-dom.js:462:29
executeDispatchesAndReleaseTopLevel@resource://devtools/client/shared/vendor/react-dom.js:470:10
forEachAccumulated@resource://devtools/client/shared/vendor/react-dom.js:444:8
runEventsInBatch@resource://devtools/client/shared/vendor/react-dom.js:598:21
runExtractedEventsInBatch@resource://devtools/client/shared/vendor/react-dom.js:606:19
handleTopLevel@resource://devtools/client/shared/vendor/react-dom.js:4272:30
batchedUpdates$1@resource://devtools/client/shared/vendor/react-dom.js:15752:12
batchedUpdates@resource://devtools/client/shared/vendor/react-dom.js:1882:12
dispatchEvent@resource://devtools/client/shared/vendor/react-dom.js:4351:19
interactiveUpdates$1/<@resource://devtools/client/shared/vendor/react-dom.js:15803:14
unstable_runWithPriority@resource://devtools/client/shared/vendor/react.js:617:12
interactiveUpdates$1@resource://devtools/client/shared/vendor/react-dom.js:15802:12
interactiveUpdates@resource://devtools/client/shared/vendor/react-dom.js:1901:10
dispatchInteractiveEvent@resource://devtools/client/shared/vendor/react-dom.js:4328:21


[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

[JavaScript Error: "Could not get children of file(/opt) because it does not exist" {file: "chrome://zotero/content/xpcom/file.js" line: 339}]

[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory]" {file: "jar:file:///home/theo/.zotero/zotero/ujicwv30.default/extensions/[email protected]!/zotero-ocr.js" line: 87}]

appName => Zotero, version => 7.0.11 (x64), os => Linux 6.12.8-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jan  2 19:26:03 UTC 2025, locale => en-US, extensions => Zotero OCR (0.8.1, extension)

@aborel
Collaborator

aborel commented Jan 12, 2025

The error is happening while the plugin is checking your tesseract path preference. I don't understand why this is the case: your screenshot says that it is /usr/bin/tesseract, and it looks correct according to your shell tests.

However, I am not convinced that the failing code is really necessary, and I am prepared to remove it. Still, a similar error could happen in a more useful check that is executed a few steps later, so I'd really like to understand the underlying situation. Before I create a new pre-release, could you try to run the following?

In your normal shell:

ls -l /usr/bin/tesseract

In Zotero (menu Tools > Developer > Error Console):

let ocrEngine = '/usr/bin/tesseract';
let pathOrFile = FileUtils.File(ocrEngine);
pathOrFile.isDirectory()

@alex-ca1123

I am on Ubuntu 24.04.1, and I get the same result. This is because I was hit by the snap package, which isolates the application runtime. You should advise Linux users against snap packages or any containerized deployment.

https://forums.zotero.org/discussion/108471/installation-and-use-of-libreoffice-plugin-fails-on-ubuntu-22-04-3-using-zotero-snap

@alex-ca1123

By the way, there are probably some workarounds to let snap see paths of the base system, but I don't think it is worth the hassle. Advise users to use https://github.com/retorquere/zotero-deb

@aborel
Collaborator

aborel commented Jan 20, 2025

@alex-ca1123 While this could indeed be useful, I still wish the other users could provide the requested information.

@alex-ca1123

@aborel Fedora has Flatpak, which follows the same basic principle: evil vendorization efforts that fragment the open-source community. See https://discussion.fedoraproject.org/t/zotero-bibliography-manager-tarball-on-fedora-40-kde-how-i-got-it-working/132509. I also ran your instructions: as expected, the containerized app can't see raw host paths.

@aborel
Collaborator

aborel commented Jan 20, 2025

I get your point, it is certainly relevant, but it doesn't tell me what I wanted to know. The output of the requested commands is welcome.

@q-wertz

q-wertz commented Jan 29, 2025

I am having the same issue on Manjaro with the GNOME desktop, where Zotero is installed as a Flatpak.

I also get the following message in the Browser Console:

NS_ERROR_FILE_NOT_FOUND: Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.isDirectory] 2 zotero-ocr.js:87

Running the commands returns:

$ ls -l /usr/bin/tesseract
-rwxr-xr-x 1 root root 47256 11. Nov 09:22 /usr/bin/tesseract

Zotero:
Image

In my very limited understanding of Flatpak, it requires either bundling the binaries with the application or using flatpak-spawn.
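
To tell whether a running Zotero is sandboxed, one option is to look at the environment of its process. The sketch below rests on two assumptions: that Flatpak apps are started with FLATPAK_ID set and that snap apps are started with SNAP set. The function name classify_environ is made up. It reads a NUL-separated environment dump such as /proc/<pid>/environ; the demo uses a synthetic file with a placeholder application ID rather than a real process.

```shell
#!/usr/bin/env bash
# classify_environ FILE: print "flatpak", "snap" or "none" depending on the
# variables found in a NUL-separated environment dump (e.g. /proc/<pid>/environ).
# (classify_environ is a hypothetical helper name.)
classify_environ() {
    local file="$1"
    # tr turns the NUL separators into newlines so grep can anchor on
    # the start of each variable name.
    if tr '\0' '\n' < "$file" | grep -q '^FLATPAK_ID='; then
        echo flatpak
    elif tr '\0' '\n' < "$file" | grep -q '^SNAP='; then
        echo snap
    else
        echo none
    fi
}

# Demo on a synthetic dump; org.example.App is a placeholder ID.
demo="$(mktemp)"
printf 'HOME=/home/user\0FLATPAK_ID=org.example.App\0' > "$demo"
classify_environ "$demo"
rm -f "$demo"
```

On a real system the argument would be /proc/<pid>/environ for the Zotero process, with the PID taken from ps or a similar tool. A result of "flatpak" or "snap" would be consistent with the plugin being unable to see /usr/bin/tesseract on the host.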
