Slim versions of TFX Docker images #6921
Here's what I'm seeing when I build them by hand:
If that TFX image is based on the latest (1.16dev) image, then that is quite a saving, almost half. Interesting. Did you find it hard to build, @pritamdodeja?
The build wasn't so hard; I've included it below. The other images are just subsets of what's below. To your point about the tfx image: in my mind, I see tfx as the control plane and beam/tensorflow as the data plane, so I'd imagine the control plane doesn't add as much heft. The reason I'm going down this rabbit hole is that I have beam/tfx code that runs on DirectRunner, and embedded as Docker with beam, but that doesn't run on DataflowRunner. I need to understand more about how the tfx image itself plays into the overall ecosystem of Vertex, TFX, Kubeflow, and Beam. Would appreciate any advice.
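As a concrete (hypothetical) illustration of the control-plane/data-plane split, rather than the actual build from this comment, a control-plane-only image can stay very small; the base image, Python version, and the ml-pipelines-sdk pin below are assumptions:

```dockerfile
# Hypothetical control-plane-only image: pipeline authoring/orchestration,
# no TensorFlow and no CUDA. Base image and version pin are assumptions.
FROM python:3.10-slim

# ml-pipelines-sdk is the TFX pipeline-authoring SDK published without the
# heavy TensorFlow-based component dependencies.
RUN pip install --no-cache-dir "ml-pipelines-sdk==1.15.1"
```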
@axeltidemann can you please clarify your particular use case? It would help me understand this better. For example, there's the container that executes each step in the Kubeflow pipeline, there's the beam container, there's the container that needs both CUDA and beam (e.g. the Transform component), and there are the Trainer and Tuner components, which I imagine need CUDA. I'm trying to understand the meaning/purpose of the tfx container itself. Appreciate your feedback!
One reason (of many) that these large images are problematic is that GCP Dataflow jobs take forever to spin up new workers - anywhere from 15 to 30 minutes, in my experience! I initially thought this might be due to lengthy dependency installation on worker startup (as described here), but I've confirmed that my dependencies are pre-installed in my custom docker image. Dataflow system logs confirm that the image pull itself is what takes so long:
^ This happens for every worker that Dataflow spins up, which makes these jobs very slow to scale. Anything that can be done to reduce the size of this image would be a big help.
@stefandominicus-takealot @axeltidemann Can you see if the below works to reduce the size of the container and meet your objectives? I've been using it and it has been working for me; the size appears to be ~6GB. My Transform job in Dataflow, kicked off by Vertex, ran successfully. Dataflow does show an “insight”, which I believe has to do with the transform code whl being installed at runtime. TensorFlow now includes CUDA, so I believe nothing extra needs to be done for that anymore; I still need to verify this part by comparing against local runs.
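A minimal sketch of this kind of image, assuming TensorFlow's `[and-cuda]` extra and a pinned Beam release; the base image, versions, and tags are illustrative rather than the exact Dockerfile referenced here:

```dockerfile
# Illustrative only: base image, Python minor version, and pins are assumptions.
FROM python:3.10-slim

# TensorFlow >= 2.14 can pull its CUDA runtime libraries from PyPI via the
# [and-cuda] extra, so a large CUDA base image is not required.
RUN pip install --no-cache-dir \
        "tensorflow[and-cuda]==2.15.1" \
        "tfx==1.15.1" \
        "apache-beam[gcp]==2.60.0"

# Standard Beam custom-container pattern: copy the SDK harness boot entrypoint
# so Dataflow workers can start the SDK harness from this same image. The SDK
# image tag must match the installed apache-beam version.
COPY --from=apache/beam_python3.10_sdk:2.60.0 /opt/apache/beam /opt/apache/beam
ENTRYPOINT ["/opt/apache/beam/boot"]
```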
You can build it like this:
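For example, with a placeholder Artifact Registry path and tag:

```bash
docker build -t us-central1-docker.pkg.dev/my-project/my-repo/tfx-slim:latest .
```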
and you can push it like this:
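assuming Docker is already authenticated against that registry (e.g. via `gcloud auth configure-docker us-central1-docker.pkg.dev`), with the same placeholder path:

```bash
docker push us-central1-docker.pkg.dev/my-project/my-repo/tfx-slim:latest
```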
Dear Users,
@stefandominicus-takealot was the Docker repo in the same region as the one where the pipeline was executing? Let me know if the above suggestion solves your issue. @janasangeetha I will try to make progress on this and track it here; it might result in some documentation updates or some commits to the repo. I believe this should be a solvable issue. My thinking is that the container should be able to execute all steps of a TFX pipeline both locally and in GCP. Beam 2.61 just came out; my 2.60 containers are working without issue in GCP. I'll look at how the tfx container is built and test against pipelines that I know are working.
@stefandominicus-takealot I tested this with a Beam 2.61 container using a Dockerfile like the one above, and my container is retrieved in 2 minutes 51 seconds. The tfx container you're using is 4.27x the size of the custom container I'm using. The only job with an insight about container image pre-building is the Transform one.
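For reference, the Dataflow steps that TFX launches pick up a custom worker container through the Beam pipeline args; a sketch with placeholder values:

```python
# Illustrative beam_pipeline_args for the Dataflow-backed steps of a TFX
# pipeline; every value below is a placeholder.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    # Point Dataflow workers at the custom SDK container built above.
    "--sdk_container_image="
    "us-central1-docker.pkg.dev/my-project/my-repo/tfx-slim:latest",
]
# Passed as beam_pipeline_args= when constructing the TFX pipeline object.
```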
I spent some more time understanding this in detail, so I'm going to summarize my understanding, what I've tried, and where things stand as I see them.

Requirements: TFX should work end to end, i.e. the various runners can execute components via the python -m approach.

Testing (local): In my mind, if a pipeline works with Docker embed mode for Beam, that pipeline should also work in Dataflow, since in both cases the Beam container is what gets executed. I can build and bring up these containers locally (Fedora 41 + Docker). For example, I have a 13GB container that I can get into; it uses pyenv to build the Python runtime, uses the NVIDIA base image, and has apache_beam[gcp] at 2.61.0 and tfx at 1.15.0. I can likely integrate the requirements.txt from the official container and see what it balloons up to. Speaking of the official tfx container: with a few modifications to requirements.txt, I can build a tfx 1.16dev image in line with what's in tfx/tools/docker/build_docker_image.sh (it's about 25GB). In local testing via gcloud ai custom-jobs local-run --executor-image-uri, I can verify the GPU is seen, and I can go in and check /opt/apache/beam/boot, tf, tfx, etc. In GCP, however, these containers don't work; I'm seeing errors that I'm going to try to address by giving the workers more space. What is surprising is that bigger containers have successfully executed Dataflow steps.

I believe we need better requirements/documentation/tests for the Docker container. Secondly, I believe the container, as set up, is doing too much: it needs to be tfx + tf + CUDA + apache_beam all at once. I've read there are issues with conda and Beam, yet some of the components are conda based, and I see differences across v1.14 and v1.15. The container build is also very complex (e.g. a wheel-builder stage building the local tfx and ml-pipelines-sdk wheels and then installing them). I understand that makes sense from a CI/CD perspective, but the container that ships should be simpler, in my opinion, and maybe it can be separate from the one that builds the wheels, installs them, and tests against pipelines. In any case, the upstream dependency on (and opaqueness of) the deep learning containers, coupled with some of the other complexity, makes this tricky to solve.

It would be great if we could establish a contract of sorts that these containers can be verified against - for example, using gcloud ai custom-jobs local-run to verify that a container meets the contract locally or via Cloud Shell, so that container troubleshooting is not as complex. Lastly, I believe having these containers work in GCP for a particular release is critical, because large-scale training with TFX depends on it. For example, the 1.15 container does not work, and I've seen issues where people are staying on 1.14 because of it. I'll post my updates here; I appreciate any guidance and direction that others can provide.
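As a sketch of the kind of contract check meant here, with a placeholder image URI and a hypothetical check script:

```bash
# Illustrative smoke test of a candidate image before pushing it to a registry;
# the image URI and check_env.py are placeholders, and --gpu needs a local GPU.
gcloud ai custom-jobs local-run \
  --executor-image-uri=us-central1-docker.pkg.dev/my-project/my-repo/tfx-slim:latest \
  --gpu \
  --script=check_env.py

# check_env.py (hypothetical) could simply print versions and visible devices:
#   import tensorflow as tf, tfx, apache_beam as beam
#   print(tf.__version__, tfx.__version__, beam.__version__)
#   print(tf.config.list_physical_devices("GPU"))
```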
Here's a container that I built that appears to be working in terms of tfx, tf, beam, and cuda. The last point I'm inferring indirectly, because the TensorBoard profiling I'm doing appears to be producing data.
The container appears to be a little over 9GB, so it loads fairly quickly. The biggest contributor to this is the 7GB layer created when TensorFlow is installed with CUDA. It includes the ml-pipelines-sdk as well. I'm not seeing any of the layers-not-loading issues that I was seeing before.
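A more direct way to confirm the tf/beam/cuda pieces than inferring from profiler output might be something like the following; the image name is a placeholder, and the GPU check assumes the NVIDIA container toolkit on the host:

```bash
# Print versions and visible GPUs inside the image (placeholder image name).
docker run --rm --gpus all my-tfx-image:latest \
  python -c "import tensorflow as tf, tfx, apache_beam as beam; \
print(tf.__version__, tfx.__version__, beam.__version__); \
print(tf.config.list_physical_devices('GPU'))"

# Confirm the Beam worker boot entrypoint is present where Dataflow expects it.
docker run --rm --entrypoint ls my-tfx-image:latest -l /opt/apache/beam/boot
```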
This issue has been marked stale because it has had no activity in the last 7 days. It will be closed if no further activity occurs. Thank you.
Hello! 👋 We've created a Docker image that significantly reduces the size compared to standard TFX Docker images.
It has been tested successfully on a Vertex AI pipeline. Here is the Dockerfile:
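A sketch in the same spirit, rather than the actual Dockerfile from this comment: a builder stage keeps build tools and pip caches out of the final image, and the base images, versions, and layout below are assumptions:

```dockerfile
# Illustrative multi-stage layout; all versions and base images are assumptions.
FROM python:3.10-slim AS builder
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir "tfx==1.15.1" "apache-beam[gcp]==2.60.0"

FROM python:3.10-slim
# Only the populated virtualenv is copied; compilers and pip caches stay behind.
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Beam SDK harness entrypoint so the same image can serve as the Dataflow
# worker container.
COPY --from=apache/beam_python3.10_sdk:2.60.0 /opt/apache/beam /opt/apache/beam
ENTRYPOINT ["/opt/apache/beam/boot"]
```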
Excellent work, @KholofeloPhahlamohlaka-TAL! (Full disclosure: we work together, but I think he deserves praise on the world wide web as well!)
Hi @KholofeloPhahlamohlaka-TAL |
We would be very interested in anyone's experience building TFX images with NVIDIA GPU support. In our experience, this can easily double the size, and it's not easy to get TF/TFX/CUDA etc. versions to align and be 'found' by TF.
@adriangay can you see if the solution I provided above on Dec 6th solves your issue? I still need to figure out the TPU side, but I believe it should address the NVIDIA GPU + tensorflow (2.15.1) + tfx (1.15.1) situation.
Could we have Docker images that are slimmer? Some examples of TFX Docker image sizes (compressed, even):
TFX 1.0: 5.67GB
TFX 1.5: 6.65GB
TFX 1.10: 8.53GB
TFX 1.15: 11.4GB
At least an explanation of why the image sizes keep growing would be great.
Or is the recommended approach to build a Docker image yourself off a slim Python or Ubuntu image?
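For anyone comparing sizes themselves, the uncompressed size of a published tag can be checked locally; note that registries report the smaller compressed size, so the two numbers will differ:

```bash
# Uncompressed local size of a published TFX image (the tag is an example).
docker pull tensorflow/tfx:1.15.0
docker images tensorflow/tfx:1.15.0 --format '{{.Repository}}:{{.Tag}}  {{.Size}}'
```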