Grobid docker container - location of grobid-trainer #1167
Comments
Hi @cboulanger, indeed good question. I would mount the volume linking the local |
Hi, thanks! I have downloaded the Grobid source and bound the training-related directories to the container (from the 0.8.1-full image). Now I run into the next problem. I want to create the training files from a bunch of PDFs in a directory on the host. I do this:
But I am getting this:
How do I need to rewrite the above command to make it succeed? |
Hi @cboulanger, I'm not sure what you're trying to do (and I have no experience with apptainer). What I had in mind when I wrote my previous comment was that you can share the directory between the container and the host machine via a mount, provided you have access to it. You can then work on the files regardless of whether the container is running. The Java command should be called from the host, but I'm not sure that's actually possible in your case. 🤔 |
What I am trying to do is to use Grobid on a High Performance Cluster which runs jobs only in containerized form. This means that I cannot do stuff involving a GPU unless it runs inside a container. Maybe that's not necessary for the "createTraining" batch job but it probably is for others. But to solve the problem at hand: the problem seems to be that I would need to build the project locally to have the compiled jar files on the host, wouldn't I? That would defeat the purpose of using the images in order not to have to set up a build environment. Or did I misunderstand something? |
ah, sorry, you would need to run the |
Thank you, sorry to be such a bother, where do I find that file :-) ? |
It should be under
|
Ok, I see: grobid-core is not part of the container. So I guess I won't be able to avoid setting up a development environment and building the project... Hope I'll manage! |
Yes, the Docker image was built to be efficient in terms of disk space, so we left out everything non-essential, but we might consider providing a Docker image that allows performing both evaluation and training. |
Hello! A training web API is already part of the Grobid service (typically run as a container with mounted paths). It can start a training, report progress, run an evaluation, and fetch the trained model. A simple addition to this API would be a "createTraining" call that takes a PDF as input, which should allow the full training workflow without the command line. |
That would be great. I am trying to build from source and am already running into problems with the Java version. An image that includes training would be most useful! |
For the record: I am sure you mention it somewhere in the docs but it wasn't immediately clear to me that I had to use Java 11 instead of the Java 21 that I had in my environment. Then it worked as described in the docs with
But as we agreed, it would be much better to have a special version of the image that includes all training-related commands as services that can be invoked via the API. If the API clients could make use of those new API methods, it would be even better. Until then, I now know how to run the commands from the built source. Thanks! |
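For anyone hitting the same Java mismatch, a small pre-build check can catch it early. The following is only a sketch: the helper names `parse_java_major` and `installed_java_major` are made up for illustration, and the required version (11) is taken from the comment above rather than from the Grobid docs, so adjust it if the project's requirement changes.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Legacy numbering: a string such as "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` and return the major version of the JVM on PATH."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)
```

Calling `installed_java_major()` before starting the build and comparing the result with 11 would flag a Java 21 environment before Gradle fails with a less obvious message.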
Even with an API to create training data from PDFs, you would still have to access the files somehow to correct them and move them to the respective directories. So a mounted volume is necessary. Having said that, not having to run things locally, find the right JVM, etc. could indeed improve the experience, yes. |
Certainly. One would have to use mounts to get the datasets and models in and out of the container.
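As a rough illustration of the mounted-volume setup, the sketch below builds the bind arguments for a container launch. Everything specific in it is an assumption to verify against your own image: the container root `/opt/grobid`, the subdirectory names, and the helper name `bind_args` are illustrative only. The `-v host:container` form is Docker's; apptainer uses `--bind` with the same `source:destination` pairing.

```python
from pathlib import PurePosixPath

# Assumption: the image keeps Grobid under /opt/grobid. Check your image.
CONTAINER_ROOT = PurePosixPath("/opt/grobid")


def bind_args(host_source: str, subdirs: list[str], flag: str = "-v") -> list[str]:
    """Build bind-mount arguments mapping host subdirectories into the container."""
    args: list[str] = []
    for sub in subdirs:
        host_path = PurePosixPath(host_source) / sub
        container_path = CONTAINER_ROOT / sub
        args += [flag, f"{host_path}:{container_path}"]
    return args


# Illustrative subdirectories of a local Grobid checkout (assumed names).
docker_args = bind_args("/data/grobid", ["grobid-home/models", "grobid-trainer/resources"])
apptainer_args = bind_args("/data/grobid", ["grobid-home/models"], flag="--bind")
```

With mounts like these, corrected training files and retrained models live on the host and survive the container, which is the point made above.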
It did not work. I first tried upgrading the version in build.gradle, but the dependencies between the Java, Gradle, and Kotlin versions were such that it could not be made to work. I think upgrading will involve some changes (there were also warnings about "deprecated gradle commands" or something). In any case, I have ~60 annotated articles involving footnotes (which are so far unsupported by Grobid) in the AnyStyle annotation format that I would love to contribute to the Grobid Ground Truth, so that Grobid can perform better in the domain of the Humanities and Social Sciences. I'll first convert the datasets for the citation model only, because that is the easiest and we want to compare our own LLM-based extraction method against Grobid's. |
Hi, following up on this - I am thinking of writing an apptainer build script from scratch instead of trying to work with the docker images, so that I can just use the source as it is and build and run it in the container. I am wondering: the Dockerfile including DL isn't just building the source with gradle, but performing very complex post-build operations. Why is that? Doesn't the repo if you run it with |
The DL image requires some additional libraries (Python-based, such as TensorFlow) to be installed in a specific way (usually troublesome). If you could build your apptainer image starting from a Docker image, it would probably be better, as it comes with the problematic libraries already in the right place. |
Hi, I am using the grobid/grobid/0.8.0 docker container converted to an apptainer image. I want to add new bibliographic training data and use the Training API. However, I cannot find the location of the training files. Are they omitted from the docker image?