
Counter Component #7700

Open
pritamdodeja opened this issue Nov 26, 2024 · 3 comments

Comments

@pritamdodeja
Contributor

If the feature is related to a specific library below, please raise an issue in
the respective repo directly:

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • TFX Version (you are using): 1.15.1
  • Environment in which you plan to use the feature (e.g., Local
    (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc..): Local, GCP
  • Are you willing to contribute it (Yes/No): Yes, partially written

Describe the feature and the current behavior/state.

Knowing how many examples you have is a very useful thing. The Counter Component would count the number of examples in the input data and provide this information to downstream components (e.g. Tuner, Trainer) that may want to use that information.

Will this change the current API? How?

This would introduce a new component and could also add new inputs to the Trainer and Tuner components. Those inputs would naturally be made optional so as not to break the existing API.

Who will benefit with this feature?

Users who would benefit from the pipeline knowing how many examples there are in input data.

Do you have a workaround or are you completely blocked by this?

The current workaround is to count rows in the original CSV (or TFRecord) files, which introduces additional code to be maintained. Without a shared component, everybody ends up solving the same problem in many different ways.
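For the TFRecord case, the per-pipeline counting code can at least be centralized in a small helper. Below is a sketch (not part of the proposed component) that counts records by walking the TFRecord on-disk framing with only the Python standard library; it assumes an uncompressed file and skips CRC verification:

```python
import struct


def count_tfrecord_examples(path):
    """Count records in an uncompressed TFRecord file.

    Each record is framed as: an 8-byte little-endian length, a 4-byte
    masked CRC of the length, the record payload, and a 4-byte masked CRC
    of the payload. Only the lengths are needed to count, so the CRCs and
    payload are skipped rather than read or verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 record length
            if len(header) < 8:
                break  # clean end of file
            (length,) = struct.unpack("<Q", header)
            # Skip length CRC (4), payload (length), and data CRC (4).
            f.seek(4 + length + 4, 1)
            count += 1
    return count
```

This avoids pulling in TensorFlow just to count examples, but for production use something like `tf.data.TFRecordDataset(...).reduce(...)` would also validate the records it reads.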

Name of your Organization (Optional)

Intuitive Cloud (GCP Partner).

Any Other info.

I have this partially written. I need advice on data formats (e.g. how to store this on disk) and how to deal with spans. I don't need help writing it per se, but rather advice on how to make it fit well into the TFX ecosystem. I've been using StatisticsGen as the component to model this after, since it is the most similar (e.g. it produces numbers and works across splits).

@pritamdodeja
Contributor Author

While studying protocol buffers I could use as a template for this, I found that the proto produced by StatisticsGen already contains num_examples. The problem is that, as far as I know, this cannot be fed to the Trainer component via custom_config. A similar problem in a different setting is discussed at https://stackoverflow.com/questions/74935155/how-to-get-the-list-of-features-along-side-their-schema-and-stats-using-tfx/79229561#79229561

As such, I don't believe a dedicated Counter Component is needed. However, how does one go about using this num_examples in the rest of the pipeline? From my understanding, this would require updating the component spec, executor, and component for the Trainer, and passing this information downstream. I would appreciate guidance on how to make Trainer and Tuner aware of the amount of data they're training on without hard-coding it.

@pritamdodeja
Contributor Author

I have this implemented locally. Based on this implementation, in run_fn(fn_args: tfx.components.FnArgs) we can use the following code, which serves the same purpose as proto.TrainArgs:

splits = fn_args.num_examples.keys()
training_examples = fn_args.num_examples['train']
validation_examples = fn_args.num_examples['eval']
model.fit(
    train_dataset,
    steps_per_epoch=f(training_examples, BATCH_SIZE),
    validation_data=eval_dataset,
    validation_steps=f(validation_examples, BATCH_SIZE),
    callbacks=callback_list,
    epochs=EPOCHS,
)
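The helper f is not defined in the snippet above; a plausible minimal implementation (the name steps_from_examples is my own) is ceiling division, so that a final partial batch still counts as a step:

```python
import math


def steps_from_examples(num_examples, batch_size):
    """Steps needed for one full pass over the data.

    Ceiling division ensures a trailing partial batch is not dropped:
    e.g. 1000 examples at batch size 32 yields 32 steps, not 31.
    """
    return math.ceil(num_examples / batch_size)
```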

The Trainer would be configured as follows:

trainer = Trainer(
    module_file=trainer_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=transform.outputs['post_transform_schema'],
    hyperparameters=tuner.outputs['best_hyperparameters'],
    statistics=statistics_gen.outputs['statistics'],  # new input
)

Please let me know if there is interest in integrating this into TFX. If there is, I will write the test cases and submit a PR. I would recommend we try this out with the vanilla Trainer component first, make sure everything works, then implement Tuner. After that, we can update the cloud versions of Trainer and Tuner.

pritamdodeja pushed a commit to pritamdodeja/tfx that referenced this issue Dec 17, 2024
Enables the user to use the number-of-examples information computed by
StatisticsGen in their training code. Passing statistics to the Trainer
enables the use of

fn_args.num_examples['train'] etc., in run_fn

More details at:
tensorflow#7700
@pritamdodeja
Contributor Author

Hello,

Could you please provide feedback on the PR mentioned above? Thank you!

Pritam
