Remove headless-torch14 and headless-tf2 #314

Merged: 2 commits into OCR-D:master from stweil's headless-torch14 branch, May 4, 2022

Conversation

@stweil (Collaborator) commented Apr 27, 2022

No description provided.

stweil added 2 commits April 27, 2022 09:48
Move ocrd_detectron2 to headless-tf1 to avoid conflicts in the
main virtual environment.

Signed-off-by: Stefan Weil <[email protected]>
Move ocrd_cis to headless-tf1 to avoid a conflict with ocrd_calamari.

Signed-off-by: Stefan Weil <[email protected]>
@stweil (Collaborator, Author) commented Apr 27, 2022

The docker image size is reduced from 32.8 GB to 29.9 GB. I also expect a reduced build time.

@stweil (Collaborator, Author) commented Apr 27, 2022

Besides the main virtual environment, only a single second virtual environment, headless-tf1, is now used for all submodules that require TensorFlow 1. These additional modules were also moved into headless-tf1:

  • ocrd_cis. It still uses an old version of calamari_ocr. This dependency will be removed in the future.
  • ocrd_detectron2. It requires detectron2 from Facebook Research, which uses a special version of torch from pytorch.org and requires NVIDIA libraries. A local build on our OCR server tries to install for CUDA_VERSION 11.2, for which no matching packages exist, so the build fails unless 11.3 is requested explicitly. Should ocrd_all enforce CUDA_VERSION 11.3 (see the sketch below)? Would it be possible/desired to use the same torch for all submodules?
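
A minimal sketch of what explicit enforcement could look like, assuming CUDA_VERSION were exposed as a make variable of ocrd_all; the variable name is taken from this discussion and is not an established ocrd_all option:

# Hypothetical override: force the CUDA variant instead of relying on detection.
# CUDA_VERSION as a make variable is an assumption, not a documented ocrd_all knob.
make all CUDA_VERSION=11.3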

@stweil (Collaborator, Author) commented Apr 27, 2022

A test run with docker-maximum-cuda-git just finished successfully, see build log.

stweil requested review from bertsky and kba, April 27, 2022 08:38
@stweil (Collaborator, Author) commented Apr 27, 2022

CircleCI now finished successfully in 38:15 min. Previous CI runs for pull requests used more than 42 min, so it looks like the build time was reduced by about 10 %.

@stweil (Collaborator, Author) commented Apr 28, 2022

@bertsky, @kba, do you need more time for the review or can I merge this PR?

@bertsky (Collaborator) left a comment

Also, I think the clean rule needs an update. Especially since the paths have been renamed a few times recently. (So the update must include both the old defaults and the current variables.)
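
For illustration only, the kind of manual cleanup an updated rule would have to cover, assuming the sub-venvs live in a venv/ directory next to the main one (the actual locations are defined by the Makefile variables, and the legacy names below are examples, not an authoritative list):

# Illustrative sketch: remove the current sub-venv as well as legacy ones from
# earlier defaults; adjust the paths to the checkout's actual layout.
rm -rf venv/headless-tf1                          # current sub-venv (SUB_VENV_TF1)
rm -rf venv/headless-torch14 venv/headless-tf2    # former defaults removed by this PR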

@@ -432,9 +428,16 @@ OCRD_CIS += $(BIN)/ocrd-cis-ocropy-resegment
 OCRD_CIS += $(BIN)/ocrd-cis-ocropy-segment
 #OCRD_CIS += $(BIN)/ocrd-cis-ocropy-train
 OCRD_CIS += $(BIN)/ocrd-cis-postcorrect
-$(call multirule,$(OCRD_CIS)): ocrd_cis $(BIN)/ocrd
+$(call multirule,$(OCRD_CIS)): ocrd_cis $(BIN)/ocrd $(SUB_VENV_TF1)/bin/activate
bertsky (Collaborator):

If you pull the current cisocrgroup/ocrd_cis#87, this change is already obsolete.

Moving Ocropy to a sub-venv is something I would avoid at all cost.

stweil (Collaborator, Author):

That PR is from March 12. I don't want to wait for it with the PR here. We can easily revert the change for ocrd_cis as soon as it is updated in ocrd_all.

bertsky (Collaborator):

Nevertheless, it's the currently recommended version of ocrd_cis, and the same PR is already the basis of ocrd_all master.

I insist: changing venvs unnecessarily must be avoided. Users will likely then have two conflicting versions of ocrd_cis.

stweil (Collaborator, Author):

The current version of ocrd_cis in the latest ocrd_all still requires calamari_ocr==0.3.5. Therefore it cannot be in the same virtual environment as ocrd_calamari which is now in the main venv.

bertsky (Collaborator):

The current version of ocrd_cis in the latest ocrd_all still requires calamari_ocr==0.3.5. Therefore it cannot be in the same virtual environment as ocrd_calamari which is now in the main venv.

You still don't understand: the latest ocrd_all contains the same (PR) branch of ocrd_cis, but not that very last commit. That's why I said you needed to pull from that PR again.

bertsky (Collaborator):

No need to wait for merges. The whole purpose of the submodule logic is to be able to reference individual commits when needed. Development of ocrd_cis is slow and needs collaboration, hence the practice of using my feature branches. (This has also been done for core in the past, BTW.)

stweil (Collaborator, Author), Apr 29, 2022:

I was speaking of a pull request / merge for ocrd_all. That is required to get your changes for ocrd_cis.

bertsky (Collaborator):

I was speaking of a pull request / merge for ocrd_all. That is required to get your changes for ocrd_cis.

Ok, but the whole point was about updating ocrd_cis here in this PR.

stweil (Collaborator, Author):

Updating submodules should be done in separate pull requests which usually get merged only by @kba (at least that's how I have understood the workflow). I would like to merge the PR here and can prepare another PR which updates ocrd_cis and puts it back into the main venv later.

bertsky (Collaborator):

That does not make sense at all: placing it in a sub-venv in this PR, only to require a new PR where it can be removed again.

Once more: I am against placing ocrd_cis into a sub-venv, and there is no need for it.

Comment on lines -265 to +261
-$(call multirule,$(OCRD_DETECTRON2)): ocrd_detectron2 $(SUB_VENV_TORCH14)/bin/activate
+$(call multirule,$(OCRD_DETECTRON2)): ocrd_detectron2 $(SUB_VENV_TF1)/bin/activate
bertsky (Collaborator):

I fail to see the need to use the TF1 venv for Detectron2, though. If we already have a Pytorch in the main venv, then let's reuse it here. Or did you experience any conflicts with ocrd_typegroups_classifier?

stweil (Collaborator, Author):

The conflict comes from torch, as I have explained above (I updated that comment). The Python module detectron2 is not distributed via PyPI and requires a special torch module which depends on NVIDIA libraries and which we would have to use for all other OCR-D processors, too.

bertsky (Collaborator):

That's all covered by ocrd_detectron2's makefile though. I can run both modules fine in the same venv. I still don't know what Nvidia libraries you are referring to – please open an issue with ocrd_detectron2, as stated earlier.

stweil (Collaborator, Author), Apr 29, 2022:

I don't think that's an ocrd_detectron2 issue. Extract from https://github.com/stweil/ocrd_all/runs/6178981637:

ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
No broken requirements found.
make[1]: Leaving directory '/build'
Makefile:672: recipe for target '/usr/local/bin/ocrd-detectron2-segment-check' failed
make: *** [/usr/local/bin/ocrd-detectron2-segment-check] Error 1

libtorch_cuda_cu.so is only included in the torch module from pytorch.org.

bertsky (Collaborator):

How is that not an ocrd_detectron2 issue?

Perhaps the CUDA version detection failed. It found 11.3.1 earlier, but Pytorch seems to be linked against CUDA 10.

libtorch_cuda_cu.so is only included in the torch module from Facebook Research.

Again: our Pytorch is from Pytorch.org or PyPI, not Nvidia or Facebook – you are confusing this.

I also wonder why this does not surface in our CircleCI Docker builds (which use the same core-cuda base image)...
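
A quick diagnostic sketch for checking which CUDA build of torch actually ended up in a given venv (torch.version.cuda is standard Pytorch; run the commands inside the venv in question):

# Compare the system CUDA toolkit with the CUDA version the installed torch
# wheel was built against (prints None for CPU-only builds).
nvcc --version | grep release
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"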

stweil (Collaborator, Author):

I also wonder why this does not surface in our CircleCI Docker builds (which use the same core-cuda base image)...

It simply depends on the order of torch installations.

stweil (Collaborator, Author):

@bertsky, I now created an issue which shows the same details as the GitHub action log, see bertsky/ocrd_detectron2#10.

And you are right, the Python torch module was fetched from pytorch.org, not from Facebook Research. But the rest of my conclusion still applies: that version is not compatible with the one from PyPI, and we either have to use it for all OCR-D modules or use a separate venv for ocrd_detectron2, which requires it.

bertsky (Collaborator):

Note: with Pytorch 1.11 out, which is incompatible with Detectron2 v0.6 (built against 1.10.1), you need to update the ocrd_detectron2 module once again to avoid a runtime failure (std::bad_alloc during import).

I have tested a CPU-only installation with ocrd_detectron2 inside headless-tf1, and it does indeed work. (Also, the other modules there still seem to work.) 🤞
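
For illustration, the kind of version pin that avoids pulling in Pytorch 1.11 (a sketch only; the actual constraint used by ocrd_detectron2 lives in its own requirements and is not reproduced here):

# Keep torch below 1.11 so it matches Detectron2 v0.6, which was built against
# torch 1.10.1; torchvision 0.11.x is the matching companion release.
pip install "torch>=1.10,<1.11" "torchvision>=0.11,<0.12"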

@bertsky (Collaborator) commented Apr 28, 2022

  • ocrd_detectron2. It requires detectron2 from NVIDIA which uses a special version of torch from the same source. A local build on our OCR server tries to install CUDA_VERSION 11.2 which does not exist, so the build fails unless 11.3 is required explicitly. Should ocrd_all enforce CUDA_VERSION 11.3? Would it be possible/desired to use the same torch for all submodules?

First of all, Detectron2 is from Facebook Research, not Nvidia. Also, ocrd_detectron2 does not install a non-standard Pytorch distribution, but tries to fetch the wheels matching the system's CUDA version from Pytorch.org. (That's just a soft dependency – if no matching wheels are found there, then pip will fetch whatever it deems best from PyPI.)

Note also that this is not by accident: just as with TF, you must get a Pytorch matching your system's CUDA version, otherwise it won't run (not even in CPU mode). Because we want to support a wide range of systems (i.e. of CUDA versions, as well as CPU-only), we invest some effort here to determine the current CUDA version.

Lastly, wheels for Detectron2 depend on both the Pytorch and CUDA version. There simply is no PyPI release at all (by design). Building from source seems a waste of resources to me – but I might add that as a fallback option to ocrd_detectron2's Makefile in the future.
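
As an illustration of the pattern described above (a sketch, not the exact command from ocrd_detectron2's Makefile; the cu113 index URL and version tags are examples for a system with CUDA 11.3):

# Install a torch build matching CUDA 11.3 from pytorch.org, then a Detectron2
# wheel built for exactly that torch/CUDA combination.
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 \
    --extra-index-url https://download.pytorch.org/whl/cu113
pip install detectron2 \
    -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html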

Regarding your build failure, please open an issue with ocrd_detectron2, providing details like logs.

@stweil (Collaborator, Author) commented Apr 29, 2022

First of all, Detectron2 is from Facebook Research, not Nvidia. Also, ocrd_detectron2 does not install a non-standard Pytorch distribution, but tries to fetch the wheels matching the system's CUDA version from Pytorch.org. (That's just a soft dependency – if no matching wheels are found there, then pip will fetch whatever it deems best from PyPI.)

... and will fail because of a missing shared NVIDIA library as soon as you run the executable. The installation of ocrd_detectron2 sets additional package sources for pip and fetches not only detectron2 but also torch (and other modules) from there. A standard torch can be installed by other OCR-D processors, but that breaks ocrd_detectron2 (see the GitHub Actions build log). Thanks for the hint about Facebook Research; I fixed my comment above.

Note also that this is not by accident: just as with TF, you must get a Pytorch matching your system's CUDA version, otherwise it won't run (not even in CPU mode). Because we want to support a wide range of systems (i.e. of CUDA versions, as well as CPU-only), we invest some effort here to determine the current CUDA version.

Regarding your build failure, please open an issue with ocrd_detectron2, providing details like logs.

Then it is impossible to use the NVIDIA libraries provided with Debian / Ubuntu together with ocrd_detectron2, because they ship CUDA_VERSION 11.2, which is unsupported by Facebook Research. This looks like the beginning of a new nightmare similar to the TF 1 one, although here there is at least a possible fallback to CPU (not nice either). You already noticed that yourself in bertsky/ocrd_detectron2#7; I now added a comment there about the build failure with CUDA version 11.2.

@stweil (Collaborator, Author) commented Apr 29, 2022

Also, I think the clean rule needs an update. Especially since the paths have been renamed a few times recently. (So the update must include both the old defaults and the current variables.)

This is unrelated to the PR here.

We discussed already whether the clean target should also handle former defaults and obviously have different opinions. I don't think that it is necessary to handle old defaults and vote against blowing up the Makefile with rules for them. There is always a very simple solution for a real clean: just remove anything and make a fresh build.

@bertsky (Collaborator) commented Apr 29, 2022

Also, I think the clean rule needs an update. Especially since the paths have been renamed a few times recently. (So the update must include both the old defaults and the current variables.)

This is unrelated to the PR here.

It is perfectly related. You change venvs here. Thus,

  1. the rule does not clean the current VIRTUAL_ENV anymore.
  2. it does not catch the state which users had up to this point anymore.

We discussed already whether the clean target should also handle former defaults and obviously have different opinions.

I don't recall you were opposed to that notion. On what grounds? Obviously we previously decided to do it this way, nevertheless.

I don't think that it is necessary to handle old defaults and vote against blowing up the Makefile with rules for them.

You can hardly call a few more shell globs "blowing up the Makefile".

There is always a very simple solution for a real clean: just remove anything and make a fresh build.

The point is simply that users do not know what to delete, or forget about details. True even for advanced users (you cannot think of everything all the time).

@stweil (Collaborator, Author) commented Apr 29, 2022

It is perfectly related. You change venvs here.

I removed two virtual environments. make clean still removes any existing venv (also the old ones).

stweil mentioned this pull request, Apr 29, 2022
kba mentioned this pull request, May 3, 2022
kba merged commit 4afd5e8 into OCR-D:master, May 4, 2022
stweil deleted the headless-torch14 branch, May 4, 2022 08:41