TorchData or Torch.Utils.Data ? #580
Replies: 8 comments 2 replies
-
Adding @michaelgsharp and @tarekgh for visibility. @GeorgeS2019 -- good timing! We're just now starting to internally discuss better coverage of NLP scenarios for both ML.NET and TorchSharp. We're first discussing what to do about tokenizers, which is what torch.data.utils is primarily about. Haven't gotten very far in our discussion yet.
-
@NiklasGustafsson @michaelgsharp @tarekgh I have had a preliminary look at how pytorch and ONNX approach deep NLP. I personally feel that Onnxruntime's C++ code is far ahead of pytorch's (torchtext, torch.utils.data). Both the Python and .NET communities are generally leaning in the onnxruntime direction when it comes to deep NLP. One of the key drivers is the easy availability of deep NLP ONNX models provided through e.g. Onnxruntime, the ONNX Model Zoo, and HuggingFace.

I personally feel that the priority of TorchSharp and ML.NET has always been supporting the .NET community in doing deep ML within the .NET framework. The decision to adhere to pytorch's Python syntax was correct; the .NET community's commitment to TorchSharp is testimony to that decision. For deep NLP, however, I am skeptical that TorchSharp needs to adhere to what TorchText and TorchData are doing. Even in the Python community, it seems preferable to adopt what has been implemented in onnxruntime for pre- and post-inference operators: e.g. instead of using the tokenizers provided by TorchText, it is more effective to use the SAME tokenizers provided by onnxruntime. In other words, when it comes to deep NLP, onnxruntime is a more complete environment than TorchText and TorchData.

I suggest TorchSharp and ML.NET take the onnxruntime path: TorchSharp provides the bridging and utility code typically performed by pytorch when it comes to building and manipulating models (e.g. ONNX) and training and inference (e.g. Onnxruntime). For deep NLP, reusing what has been developed in onnxruntime is key to delivering deep NLP through TorchSharp and ML.NET. Looking for feedback!
-
@NiklasGustafsson @michaelgsharp @tarekgh @luisquintanilla The following README from TorchText explains it: only the basics are provided, while onnxruntime provides not just the basics but a more complete environment, driven by contributions from industry users. This is why I think onnxruntime is a better choice. The key: organize what onnxruntime provides in a folder and namespace structure consistent with TorchText. This suggestion does not include the workflow and NLP pipeline planned for TorchData. The suggestion would have multiple impacts, not just for TorchSharp and ML.NET, but also for the Microsoft Bot Framework.
-
@SergeiAlonichau I included you here because the same tokenizers provided by TorchText and Onnxruntime are, to some extent, also addressed by BlingFire.
-
For ML.NET, we prefer having the tokenizers designed like the HuggingFace tokenizers. That design is sophisticated and more flexible. We are doing some prototyping to have the full implementation done in C#, to avoid the complexity that comes with native dependencies (producing bits for all supported architectures x86, x64, arm, etc.), to avoid packaging issues, and to avoid problems if the tokenizers need to be used inside WASM. Note that we want the tokenizers for global scenarios, not necessarily restricted to any specific runtime (ONNX, TorchSharp, ...).
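To make the design goal above concrete, here is a minimal, purely illustrative sketch (plain Python, names hypothetical) of what a runtime-agnostic tokenizer interface in the spirit of the HuggingFace `tokenizers` design looks like: `encode()` returns both vocabulary ids and character offsets, and the component has no dependency on any inference runtime, so the same shape could back ONNX Runtime, TorchSharp, or a WASM host:

```python
# Hypothetical sketch of a runtime-agnostic tokenizer interface, in the
# spirit of the HuggingFace `tokenizers` design: encode() returns surface
# tokens, vocabulary ids, and (start, end) character offsets. The vocab
# and class names here are illustrative, not any library's actual API.
from dataclasses import dataclass

@dataclass
class Encoding:
    tokens: list   # surface forms
    ids: list      # vocabulary ids
    offsets: list  # (start, end) character spans in the input

class WhitespaceTokenizer:
    def __init__(self, vocab, unk_id=0):
        self.vocab = vocab
        self.unk_id = unk_id

    def encode(self, text):
        tokens, ids, offsets = [], [], []
        pos = 0
        for tok in text.split():
            start = text.index(tok, pos)  # locate token to record offsets
            end = start + len(tok)
            pos = end
            tokens.append(tok)
            ids.append(self.vocab.get(tok, self.unk_id))
            offsets.append((start, end))
        return Encoding(tokens, ids, offsets)

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
enc = WhitespaceTokenizer(vocab).encode("hello brave world")
print(enc.ids)      # → [1, 0, 2]
print(enc.offsets)  # → [(0, 5), (6, 11), (12, 17)]
```

Returning offsets alongside ids is what makes a tokenizer usable for post-processing tasks like NER span alignment, which is part of why the HuggingFace design is described as more flexible.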
-
The onnxruntime project, which involves many industry partners, has turned its use-case analysis into REUSABLE pre- and post-processing algorithms. That reusability (in both pre- and post-processing, the result of consensus feedback from those industry partners) is what we hope for in the C# approach you describe. With NativeAOT, the C# approach can deliver high performance without compromising the ease of maintainability that comes with C#.
-
@GeorgeS2019 -- as always, thank you for your enthusiasm and interest in making DL in .NET the best it can be. As far as TorchSharp goes, I want to make sure that whatever we do, we stick with the simple value prop -- it is .NET bindings to the native engine underlying Pytorch, i.e. libtorch. It is no more, no less. We can build other libraries and integrate other technologies into ML.NET, or SciSharp, or other things we haven't come up with yet, but TorchSharp should remain simple and focused. So, in one sense, if it's documented at pytorch.org, then it belongs in TorchSharp. If not, then it doesn't.
-
TorchData or Torch.Utils.Data ?
In TorchSharp, there are
I am posting this here more as a reflection on the TorchSharp NLP NER support issue.
It is an NLP use case, and yet it is not part of TorchText. The data-loading functions used in pytorch-ner are more related to torch.utils.data.
Question: is there a need for TorchSharp's data utility functions to align more closely with those of pytorch? Should TorchSharp.Data align with TorchData, or should TorchSharp.Utils align with torch.utils.data?
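For context on what "aligning with torch.utils.data" would mean in practice: the map-style dataset contract that pytorch's `torch.utils.data.DataLoader` consumes is just `__len__` plus `__getitem__`. Below is a minimal sketch of an NER dataset in that shape, written in plain Python with no torch dependency (the sentences, tags, and class name are illustrative placeholders, not from pytorch-ner):

```python
# Minimal sketch of the map-style dataset protocol that torch.utils.data
# expects: __len__ and __getitem__. An object of this shape can be handed
# to torch.utils.data.DataLoader for batching/shuffling. All data below
# is an illustrative placeholder.
class NerDataset:
    def __init__(self, sentences, labels, tag2id):
        assert len(sentences) == len(labels)
        self.sentences = sentences  # each item: list of tokens
        self.labels = labels        # each item: list of tag strings
        self.tag2id = tag2id        # tag string -> integer id

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx]
        tag_ids = [self.tag2id[t] for t in self.labels[idx]]
        return tokens, tag_ids

tag2id = {"O": 0, "B-PER": 1, "I-PER": 2}
ds = NerDataset([["John", "Smith", "slept"]],
                [["B-PER", "I-PER", "O"]],
                tag2id)
print(len(ds))   # → 1
print(ds[0][1])  # → [1, 2, 0]
```

A TorchSharp.Utils aligned with torch.utils.data would presumably mirror this same two-method contract in C#, which is why the question of which pytorch library to track matters.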
Other discussion related to TorchData