-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOCS-2494: Create custom ML training script, upload to registry, and submit training job #3071
DOCS-2494: Create custom ML training script, upload to registry, and submit training job #3071
Conversation
sguequierre
commented
Jun 25, 2024
•
edited
Loading
edited
- Add ML training scripts to table in registry page
- Add "Use a Custom Training Script" page to services/ML
- Add applicable CLI commands to /cli
8c8ae62
to
3de2d8e
Compare
docs/registry/_index.md
Outdated
|
||
You can search the available ML models from the Viam registry here: | ||
**Training scripts** to train machine learning models in the Viam cloud from custom Python scripts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tahiyasalam Is this accurate or would you know of a more descriptive way to phrase it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Python source distributions used to train and produce models in the Viam cloud for custom machine learning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python source distributions sounds to me like the source code for python 3.9 o or other versions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, we can just say Python scripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Sierra, thanks so much for working on this! I have some very high-level comments about the uploading training script portion. Let me know if you want to talk through this at all. Once those changes are consolidating, I can do a more in-depth review.
This script must take as command line inputs a `dataset_file` and `model_output_directory` and output the model artifacts it generates to the `model_output_directory`. | ||
|
||
- The `dataset_file` is a file containing the GCS blob paths and annotations associated with the dataset ID on which the user is training. | ||
- The `model_output_directory` is the location where the user saves the model artifacts (the TFLite, PyTorch, TF, ONNX files) they produce from training. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I would provide an example of how to save the model.
So, it would be something like this
# This allows us to save model artifact to Viam, which will be viewable as a registry item for the ML model name and version specified
def save_tflite_classification(
model: Model,
model_dir: str,
model_name: str,
target_shape: ty.Tuple[int, int, int],
) -> None:
# Convert the model to tflite, with batch size 1 so the graph does not have dynamic-sized tensors.
input = tf.keras.Input(target_shape, batch_size=1, dtype=tf.uint8)
output = model(input, training=False)
wrapped_model = tf.keras.Model(inputs=input, outputs=output)
converter = tf.lite.TFLiteConverter.from_keras_model(wrapped_model)
converter.target_spec.supported_ops = TFLITE_OPS
tflite_model = converter.convert()
# Save the model to Viam
filename = os.path.join(model_dir, f"{model_name}.tflite")
with open(filename, "wb") as f:
f.write(tflite_model)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean by save "to Viam" or save "to GCS" here-- I can't tell what's happening from code besides that it looks like its saving the model locally (?) with f.write? @tahiyasalam
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's saving the model to the output directory that's parsed in the arguments. Then everything that's written to that directory converted into a package/registry item. That's what it means by save the model to Viam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry this might be a stupid question but where is the filesystem that the output directory is expected to be on? like is it the user's computer? @tahiyasalam
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a stupid question. This script runs a training job in the cloud. So the user writes a training scripts, submits it, and then we run it in the cloud. In the backend, we provide a Viam produced dataset file and then a model output directory to the script; these are the arguments that the user needs to parse within their own training script. Then, the user can save the model outputs to this model output directory in the "cloud". (Similarly within the script, the user needs to parse the dataset file and then feed it into the model) Once the job is complete, Viam looks at that directory and creates a package and registry item with all of the contents of the directory. Then, the user can interact with the model artifacts in the registry (e.g. deploy the model). I feel like it would be helpful to have this information in the docs somewhere. Let me know if you have any questions or want to chat. As I've mentioned, happy to talk about this at any level since it's a complex project with a significant dev X component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, tried to update with some of this info-- lmk what you think
- The `dataset_file` is a file containing the GCS blob paths and annotations associated with the dataset ID on which the user is training. | ||
- The `model_output_directory` is the location where the user saves the model artifacts (the TFLite, PyTorch, TF, ONNX files) they produce from training. | ||
|
||
Parse these within your training script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is super confusing to include as a whole and also has some Viam-implementation specific comments that won't make sense to the user, we should work to pair this down to a minimally working example that basically include the above code snippets and then training a small model, with comments for what exactly to train.
There will be example training scripts in the registry with more in-depth examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good but do you have any more examples/guidance currently for minimally working example training a small model? I don't know what that would look like @tahiyasalam
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the most recent registry item (ML training script) I sent you on Friday. Please look through and let me know if that makes sense to you. It is in our viam-dev org in prod, so once it's ready, we can make it public for others to view.
|
||
3. Follow the instructions to [create a `tar.gz` gzip'd tar file from your files](https://docs.python.org/3.10/distutils/sourcedist). | ||
|
||
You can reference the directory structure of this [example classification training script](https://app.viam.dev/packages/0fbe951e-d4c6-427f-985f-784b7b85842c/manual_testing_byots_classification/ml_training/latest/0fbe951e-d4c6-427f-985f-784b7b85842c). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be referencing a package in staging/dev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
@@ -55,7 +55,6 @@ async def main(): | |||
} | |||
insert_resp = typesense_client.collections['resources'].documents.upsert( | |||
json_m) | |||
print(insert_resp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep this - it's here for logging purposes for the github action, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops
docs/registry/_index.md
Outdated
The Viam registry hosts trained ML training scripts that users have made public, which you can use to train custom machine learning models from Python source distribution. | ||
You can also upload your own training script. | ||
|
||
<!-- TODO: link to new machine learning custom training page --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
# SME: Tahiya S. | ||
--- | ||
|
||
You can upload a custom training script to the [Viam registry](https://app.viam.com/registry/) to train your ML model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing more of an introduction. I think we need to explain why you would be using it instead of the regular training flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tahiyasalam Would you be able to help provide any more context here-- why have customers been asking for this feature? Why would a regular user use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a tidbit from the scope: Machine learning training is currently limited to classification and object detection, where the trained model is output as a TFLite model. A flexible training interface will allow more generic training by allowing users to upload any machine learning training code for their use-case. The model output type can be PyTorch, Tensorflow, TFLite, ONNX, or any other model artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a huge improvement and overall looks good, pending any outstanding comments from Naomi. Once merged, we can update the links in the frontend code in app. While testing, we can continue to make tickets for improvements to the docs.
It could be one of those model types or it could be unspecified/other. It’s just a way of flagging if it is classification/detection to use certain features (like the vision service) but it could be any type of ML model. On Jul 10, 2024, at 4:13 PM, Sierra Guequierre ***@***.***> wrote:
@sguequierre commented on this pull request.
In docs/services/ml/upload-training-script.md:
+You can train your classification and object detection models on the [Viam app](https://app.viam.com), where the trained model is output as a TFLite model.
+See [Train a Model](/services/ml/train-model/) for more information.
+However, you can now also upload a custom training script to the [Viam registry](https://app.viam.com/registry/) to train your ML model with an output type of PyTorch, Tensorflow, TFLite, ONNX, or any other model artifacts.
+A flexible training interface like this allows you to upload any machine learning training code for your use-case.
ahh, what is --model-type cli flag used to indicate or set then? @tahiyasalam
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>
|
```json {class="line-numbers linkable-line-numbers"} | ||
{ | ||
"image_path": "/path/to/data/data1.jpeg", | ||
"bounding_box_annotations":[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you manually did this. Look at vs code json formatting tools, they'll make it pretty and consistent without you having to do this manually
"bounding_box_annotations":[ | |
"bounding_box_annotations": [ |
"image_path": "/path/to/data/data2.jpeg", | ||
"bounding_box_annotations":[ | ||
{ | ||
"annotation_label":"blue_star", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces missing after colons
|
||
For example, you can reference the logic from <file>model/training.py</file> from this [example classification training script](https://app.viam.com/packages/e76d1b3b-0468-4efd-bb7f-fb1d2b352fcb/custom-training-classification/ml_training/latest/e76d1b3b-0468-4efd-bb7f-fb1d2b352fcb) that trains a classification model using TensorFlow and Keras. | ||
|
||
Here, the logic to build and compile the classification model looks like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not start sentences with here
|
||
## Submit a training job | ||
|
||
You can use the Viam CLI's [`viam train submit`](/cli/#positional-arguments-submit) command to submit a training job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the Viam CLI's [`viam train submit`](/cli/#positional-arguments-submit) command to submit a training job. |
} | ||
``` | ||
|
||
With classification annotations like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what happened with the comment on this but I was suggesting to remove these? Why do we need them in addition to the above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I think I asked Tahiya a question about the first example-- it doesn't make sense to me as to why there's both types of annotations at once in the first example when according to docs "If you want to train an image classifier, use image tags. For an object detector, use bounding boxes."
is it valid dataset metadata?
she wanted to include the two types distinctly separate and also the example though I think @tahiyasalam
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a valid dataset. We provide a dataset with all possible annotations and the user needs to parse the file and use the annotations that are relevant to them (classifications, bounding boxes, or neither). Since we are not managing the training for them, they need to do this themselves. It's helpful to see the two types, since they're formatted differently and need to be parsed differently (which is why I sent two parse functions originallt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sguequierre you might use the same dataset for two different training purposes. So you could add annotations for both and just ignore bounding boxes when you want to train a classifier. Or you could use the bounding boxes to train a classifier. If I am training a classifier to see if there is a dog in the picture I could use bounding boxes as input for that. The code would just look different
I think including both functions makes sense. I was more wondering why we need to have the json bits of the classification and bounding boxes again since they're also included above. We can keep them, I just think they're duplicating the info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about keeping them (the json bits) twice! I think originally it was difficult to read, but with the new formatting, it comes across as much clearer. But +1 to including both functions.
``` | ||
|
||
When you submit a training job with this training script, this function saves the model outputs to the `model_output_directory` in the cloud. | ||
Once the training job is complete, Viam looks at that directory and creates a package with all of the contents of the directory, creating or updating a registry item for the ML model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tahiyasalam Am I understanding this correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
Co-authored-by: Naomi Pentrel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this. I think it might be useful to still add a template. Readers probably shouldn't need to copy paste each bit individually, they can be provided with the bits that will remain the same for everyone as a template, we can explain what each bit does, and then we can tell them about the bit they need to implement themselves.
There's another thing that is missing entirely here and that is the main logic. Following the instructions as is, nothing would work.
Let's iterate on that separately though please and merge this now.
@npentrel @sguequierre Would you all mind making a follow-up ticket for improvements so that I can follow along? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently entirely misses the main logic that calls all these methods - we should fix that before merge. Then we can do other changes after but that needs to go in before merge.
Co-authored-by: Naomi Pentrel <[email protected]>
You can view a rendered version of the docs from this PR at https://docs-test.viam.dev/3071 |