DOCS-2494: Create custom ML training script, upload to registry, and submit training job #3071

sguequierre · 2024-06-25T17:22:43Z

Add ML training scripts to table in registry page
Add "Use a Custom Training Script" page to services/ML
Add applicable CLI commands to /cli

docs/registry/_index.md

sguequierre · 2024-07-02T19:44:34Z

docs/registry/_index.md


-You can search the available ML models from the Viam registry here:
+**Training scripts** to train machine learning models in the Viam cloud from custom Python scripts.


@tahiyasalam Is this accurate or would you know of a more descriptive way to phrase it?

Maybe Python source distributions used to train and produce models in the Viam cloud for custom machine learning?

Python source distributions sounds to me like the source code for python 3.9 o or other versions...

That's fair, we can just say Python scripts

tahiyasalam

Hi Sierra, thanks so much for working on this! I have some very high-level comments about the uploading training script portion. Let me know if you want to talk through this at all. Once those changes are consolidating, I can do a more in-depth review.

docs/services/ml/upload-training-script.md

tahiyasalam · 2024-07-03T13:43:32Z

docs/services/ml/upload-training-script.md

+   This script must take as command line inputs a `dataset_file` and `model_output_directory` and output the model artifacts it generates to the `model_output_directory`.
+
+- The `dataset_file` is a file containing the GCS blob paths and annotations associated with the dataset ID on which the user is training.
+- The `model_output_directory` is the location where the user saves the model artifacts (the TFLite, PyTorch, TF, ONNX files) they produce from training.


Here, I would provide an example of how to save the model.

So, it would be something like this

# This allows us to save model artifact to Viam, which will be viewable as a registry item for the ML model name and version specified def save_tflite_classification( model: Model, model_dir: str, model_name: str, target_shape: ty.Tuple[int, int, int], ) -> None: # Convert the model to tflite, with batch size 1 so the graph does not have dynamic-sized tensors. input = tf.keras.Input(target_shape, batch_size=1, dtype=tf.uint8) output = model(input, training=False) wrapped_model = tf.keras.Model(inputs=input, outputs=output) converter = tf.lite.TFLiteConverter.from_keras_model(wrapped_model) converter.target_spec.supported_ops = TFLITE_OPS tflite_model = converter.convert() # Save the model to Viam filename = os.path.join(model_dir, f"{model_name}.tflite") with open(filename, "wb") as f: f.write(tflite_model)

What does it mean by save "to Viam" or save "to GCS" here-- I can't tell what's happening from code besides that it looks like its saving the model locally (?) with f.write? @tahiyasalam

It's saving the model to the output directory that's parsed in the arguments. Then everything that's written to that directory converted into a package/registry item. That's what it means by save the model to Viam.

sorry this might be a stupid question but where is the filesystem that the output directory is expected to be on? like is it the user's computer? @tahiyasalam

Not a stupid question. This script runs a training job in the cloud. So the user writes a training scripts, submits it, and then we run it in the cloud. In the backend, we provide a Viam produced dataset file and then a model output directory to the script; these are the arguments that the user needs to parse within their own training script. Then, the user can save the model outputs to this model output directory in the "cloud". (Similarly within the script, the user needs to parse the dataset file and then feed it into the model) Once the job is complete, Viam looks at that directory and creates a package and registry item with all of the contents of the directory. Then, the user can interact with the model artifacts in the registry (e.g. deploy the model). I feel like it would be helpful to have this information in the docs somewhere. Let me know if you have any questions or want to chat. As I've mentioned, happy to talk about this at any level since it's a complex project with a significant dev X component.

ok, tried to update with some of this info-- lmk what you think

tahiyasalam · 2024-07-03T13:45:58Z

docs/services/ml/upload-training-script.md

+- The `dataset_file` is a file containing the GCS blob paths and annotations associated with the dataset ID on which the user is training.
+- The `model_output_directory` is the location where the user saves the model artifacts (the TFLite, PyTorch, TF, ONNX files) they produce from training.
+
+  Parse these within your training script.


I think this is super confusing to include as a whole and also has some Viam-implementation specific comments that won't make sense to the user, we should work to pair this down to a minimally working example that basically include the above code snippets and then training a small model, with comments for what exactly to train.

There will be example training scripts in the registry with more in-depth examples.

Sounds good but do you have any more examples/guidance currently for minimally working example training a small model? I don't know what that would look like @tahiyasalam

That would be the most recent registry item (ML training script) I sent you on Friday. Please look through and let me know if that makes sense to you. It is in our viam-dev org in prod, so once it's ready, we can make it public for others to view.

tahiyasalam · 2024-07-03T13:46:59Z

docs/services/ml/upload-training-script.md

+
+3. Follow the instructions to [create a `tar.gz` gzip'd tar file from your files](https://docs.python.org/3.10/distutils/sourcedist).
+
+You can reference the directory structure of this [example classification training script](https://app.viam.dev/packages/0fbe951e-d4c6-427f-985f-784b7b85842c/manual_testing_byots_classification/ml_training/latest/0fbe951e-d4c6-427f-985f-784b7b85842c).


We should not be referencing a package in staging/dev

npentrel · 2024-07-08T14:31:02Z

.github/workflows/get_modular_resources.py

@@ -55,7 +55,6 @@ async def main():
                }
                insert_resp = typesense_client.collections['resources'].documents.upsert(
        json_m)
-                print(insert_resp)


Please keep this - it's here for logging purposes for the github action, thanks!

docs/registry/_index.md

npentrel · 2024-07-08T14:40:02Z

docs/registry/_index.md

+The Viam registry hosts trained ML training scripts that users have made public, which you can use to train custom machine learning models from Python source distribution.
+You can also upload your own training script.
+
+<!-- TODO: link to new machine learning custom training page -->


npentrel · 2024-07-08T14:42:00Z

docs/services/ml/upload-training-script.md

+# SME: Tahiya S.
+---
+
+You can upload a custom training script to the [Viam registry](https://app.viam.com/registry/) to train your ML model.


This is missing more of an introduction. I think we need to explain why you would be using it instead of the regular training flow.

@tahiyasalam Would you be able to help provide any more context here-- why have customers been asking for this feature? Why would a regular user use it?

Here's a tidbit from the scope: Machine learning training is currently limited to classification and object detection, where the trained model is output as a TFLite model. A flexible training interface will allow more generic training by allowing users to upload any machine learning training code for their use-case. The model output type can be PyTorch, Tensorflow, TFLite, ONNX, or any other model artifacts.

tahiyasalam

This is a huge improvement and overall looks good, pending any outstanding comments from Naomi. Once merged, we can update the links in the frontend code in app. While testing, we can continue to make tickets for improvements to the docs.

tahiyasalam · 2024-07-10T20:16:39Z

It could be one of those model types or it could be unspecified/other. It’s just a way of flagging if it is classification/detection to use certain features (like the vision service) but it could be any type of ML model. On Jul 10, 2024, at 4:13 PM, Sierra Guequierre ***@***.***> wrote: @sguequierre commented on this pull request. In docs/services/ml/upload-training-script.md:

+You can train your classification and object detection models on the [Viam app](https://app.viam.com), where the trained model is output as a TFLite model.

+See [Train a Model](/services/ml/train-model/) for more information. +However, you can now also upload a custom training script to the [Viam registry](https://app.viam.com/registry/) to train your ML model with an output type of PyTorch, Tensorflow, TFLite, ONNX, or any other model artifacts. +A flexible training interface like this allows you to upload any machine learning training code for your use-case. ahh, what is --model-type cli flag used to indicate or set then? @tahiyasalam —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

docs/registry/_index.md

docs/services/ml/upload-training-script.md

npentrel · 2024-07-10T19:32:08Z

docs/services/ml/upload-training-script.md

+```json {class="line-numbers linkable-line-numbers"}
+{
+    "image_path": "/path/to/data/data1.jpeg",
+    "bounding_box_annotations":[


I assume you manually did this. Look at vs code json formatting tools, they'll make it pretty and consistent without you having to do this manually

Suggested change

"bounding_box_annotations":[

"bounding_box_annotations": [

npentrel · 2024-07-10T19:32:28Z

docs/services/ml/upload-training-script.md

+    "image_path": "/path/to/data/data2.jpeg",
+    "bounding_box_annotations":[
+        {
+            "annotation_label":"blue_star",


Spaces missing after colons

npentrel · 2024-07-10T19:39:50Z

docs/services/ml/upload-training-script.md

+
+For example, you can reference the logic from <file>model/training.py</file> from this [example classification training script](https://app.viam.com/packages/e76d1b3b-0468-4efd-bb7f-fb1d2b352fcb/custom-training-classification/ml_training/latest/e76d1b3b-0468-4efd-bb7f-fb1d2b352fcb) that trains a classification model using TensorFlow and Keras.
+
+Here, the logic to build and compile the classification model looks like this:


Do not start sentences with here

docs/services/ml/upload-training-script.md

npentrel · 2024-07-10T21:04:39Z

docs/services/ml/upload-training-script.md

+
+## Submit a training job
+
+You can use the Viam CLI's [`viam train submit`](/cli/#positional-arguments-submit) command to submit a training job.


Suggested change

You can use the Viam CLI's [`viam train submit`](/cli/#positional-arguments-submit) command to submit a training job.

docs/services/ml/upload-training-script.md

npentrel · 2024-07-10T21:09:24Z

docs/services/ml/upload-training-script.md

+    }
+```
+
+With classification annotations like this:


I don't know what happened with the comment on this but I was suggesting to remove these? Why do we need them in addition to the above?

oh I think I asked Tahiya a question about the first example-- it doesn't make sense to me as to why there's both types of annotations at once in the first example when according to docs "If you want to train an image classifier, use image tags. For an object detector, use bounding boxes."
is it valid dataset metadata?
she wanted to include the two types distinctly separate and also the example though I think @tahiyasalam

It is a valid dataset. We provide a dataset with all possible annotations and the user needs to parse the file and use the annotations that are relevant to them (classifications, bounding boxes, or neither). Since we are not managing the training for them, they need to do this themselves. It's helpful to see the two types, since they're formatted differently and need to be parsed differently (which is why I sent two parse functions originallt).

@sguequierre you might use the same dataset for two different training purposes. So you could add annotations for both and just ignore bounding boxes when you want to train a classifier. Or you could use the bounding boxes to train a classifier. If I am training a classifier to see if there is a dog in the picture I could use bounding boxes as input for that. The code would just look different

I think including both functions makes sense. I was more wondering why we need to have the json bits of the classification and bounding boxes again since they're also included above. We can keep them, I just think they're duplicating the info

I don't feel strongly about keeping them (the json bits) twice! I think originally it was difficult to read, but with the new formatting, it comes across as much clearer. But +1 to including both functions.

docs/services/ml/upload-training-script.md

sguequierre · 2024-07-10T21:53:28Z

docs/services/ml/upload-training-script.md

+```
+
+When you submit a training job with this training script, this function saves the model outputs to the `model_output_directory` in the cloud.
+Once the training job is complete, Viam looks at that directory and creates a package with all of the contents of the directory, creating or updating a registry item for the ML model.


@tahiyasalam Am I understanding this correctly?

docs/services/ml/upload-training-script.md

Co-authored-by: Naomi Pentrel <[email protected]>

npentrel

Let's merge this. I think it might be useful to still add a template. Readers probably shouldn't need to copy paste each bit individually, they can be provided with the bits that will remain the same for everyone as a template, we can explain what each bit does, and then we can tell them about the bit they need to implement themselves.

There's another thing that is missing entirely here and that is the main logic. Following the instructions as is, nothing would work.

Let's iterate on that separately though please and merge this now.

tahiyasalam · 2024-07-11T12:54:49Z

@npentrel @sguequierre Would you all mind making a follow-up ticket for improvements so that I can follow along?

docs/services/ml/upload-training-script.md

npentrel

This currently entirely misses the main logic that calls all these methods - we should fix that before merge. Then we can do other changes after but that needs to go in before merge.

docs/services/ml/upload-training-script.md

Co-authored-by: Naomi Pentrel <[email protected]>

viambot · 2024-07-11T16:27:12Z

You can view a rendered version of the docs from this PR at https://docs-test.viam.dev/3071

Test workflow

ecaca27

viambot added the safe to build This pull request is marked safe to build from a trusted zone label Jun 25, 2024

sguequierre added 7 commits June 25, 2024 15:09

Edit javascript for custom training script table

9d8d88b

Fix url

04f6308

Add proper api key and fix script

990ebfa

Finish upload script page

06f6d16

Fix from testing

eba2378

fixup

f2ff28e

fixup

f7fdd55

sguequierre marked this pull request as ready for review July 2, 2024 16:33

sguequierre added 3 commits July 2, 2024 15:41

python linter

b47cf27

Fix broken link

2b200ae

linter

3de2d8e

sguequierre force-pushed the DOCS-2494/ml-training-scripts branch from 8c8ae62 to 3de2d8e Compare July 2, 2024 19:41

sguequierre commented Jul 2, 2024

View reviewed changes

docs/registry/_index.md Outdated Show resolved Hide resolved

Apply suggestions from code review

01dc807

sguequierre commented Jul 2, 2024

View reviewed changes

Merge branch 'main' into DOCS-2494/ml-training-scripts

10f8447

sguequierre requested a review from tahiyasalam July 2, 2024 19:45

sguequierre changed the title ~~DOCS-2494: Document ML training scripts in registry~~ DOCS-2494: Create custom ML training scripts, upload to registry, and submit training job Jul 2, 2024

sguequierre changed the title ~~DOCS-2494: Create custom ML training scripts, upload to registry, and submit training job~~ DOCS-2494: Create custom ML training script, upload to registry, and submit training job Jul 2, 2024

tahiyasalam requested changes Jul 3, 2024

View reviewed changes

npentrel reviewed Jul 8, 2024

View reviewed changes

docs/registry/_index.md Show resolved Hide resolved

npentrel reviewed Jul 8, 2024

View reviewed changes

sguequierre added 3 commits July 9, 2024 14:48

Apply suggestions from code review

0cde168

Apply suggestions from code review

78e9a4b

Merge branch 'main' into DOCS-2494/ml-training-scripts

4acbd66

sguequierre requested a review from tahiyasalam July 9, 2024 18:59

sguequierre added 2 commits July 10, 2024 14:54

Apply suggestions from code review

0f2b7e5

fixup

52d2170

tahiyasalam approved these changes Jul 10, 2024

View reviewed changes

format

2660c6d

npentrel reviewed Jul 10, 2024

View reviewed changes

sguequierre commented Jul 10, 2024

View reviewed changes

docs/services/ml/upload-training-script.md Outdated Show resolved Hide resolved