Retire libtorch dependency #159

Open

shindavid opened this issue Feb 5, 2025 · 0 comments

shindavid (Owner) commented Feb 5, 2025

Currently, the Python side exports PyTorch models to disk in a PyTorch-specific format via torch.jit.trace(), and the C++ side loads them via torch::jit::load() (this flow is sketched below, after the list). I have many gripes with this setup:

  1. torch.jit.trace() has a memory leak, forcing me to use a clunky workaround.
  2. On the C++ side, libtorch also has a memory leak, forcing me to use an even clunkier workaround (restarting the self-play process every hour). This workaround may cause issues if/when we start tackling games that last a long time (such as 2048).
  3. The libtorch installation is a pain, requiring you to download a version that matches your machine's CUDA version. This reduces our docker-setup flexibility. Furthermore, baking the installation step into the Dockerfile empirically leads to unacceptably slow docker image load times when running on runpod.io, which motivates this clunky workaround.
  4. There are some dynamic-library load issues due to a clash between the Python pytorch package and the C++ libtorch library. As a result, we are unable to use a debug build of the C++ FFI library, which limits our debugging options. There may be a workaround, but I have not been able to figure one out.

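For reference, the current export path boils down to something like the following (a minimal sketch; the stand-in network and tensor shape are illustrative placeholders, not the project's actual code):

```python
import torch
import torch.nn as nn

# Stand-in network for illustration; the project's real model is more elaborate.
model = nn.Sequential(nn.Conv2d(2, 8, kernel_size=3, padding=1)).eval()
example_input = torch.randn(1, 2, 8, 8)

# torch.jit.trace() runs the model once on the example input and records
# the executed ops into a serialized TorchScript module.
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")  # the C++ side then loads this via torch::jit::load()
```
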
Due to these issues, I would like to retire our dependence on the libtorch library. We can do this by having PyTorch export the models in the interoperable, open ONNX format. On the C++ side, we can use Microsoft's open-source onnxruntime library. Besides holding the promise of addressing the above issues, this could bring additional benefits (a sketch of the proposed ONNX flow follows the list):

A. onnxruntime will likely be faster than libtorch (although this needs to be tested).
B. There are many tools out there to inspect/visualize ONNX model files, such as Netron.
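
Concretely, the proposed flow would look roughly like this on the Python side (a minimal sketch; the tensor names, shapes, and stand-in network are illustrative assumptions, and the C++ side would go through onnxruntime's Ort::Session API rather than the Python InferenceSession shown here):

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Same stand-in network as in the earlier sketch.
model = nn.Sequential(nn.Conv2d(2, 8, kernel_size=3, padding=1)).eval()
example_input = torch.randn(1, 2, 8, 8)

# Export to the interoperable ONNX format instead of TorchScript.
# The tensor names and dynamic batch axis are illustrative choices.
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Load and run with onnxruntime, preferring CUDA and falling back to CPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
(output,) = session.run(None, {"input": example_input.numpy()})
```

A model exported this way can also be opened directly in a tool like Netron for inspection.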
