
Counter Component #7700

Open
pritamdodeja opened this issue Nov 26, 2024 · 3 comments

Comments

@pritamdodeja
Contributor

If the feature is related to a specific library below, please raise an issue in
the respective repo directly:

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • TFX Version (you are using): 1.15.1
  • Environment in which you plan to use the feature (e.g., Local
    (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc..): Local, GCP
  • Are you willing to contribute it (Yes/No): Yes, partially written

Describe the feature and the current behavior/state.

Knowing how many examples you have is a very useful thing. The Counter Component would count the number of examples in the input data and provide this information to downstream components (e.g. Tuner, Trainer) that may want to use that information.

Will this change the current API? How?

This would introduce a new component and could also add new inputs to the Trainer and Tuner components. Those inputs would naturally be made optional so as not to break the existing API.

Who will benefit with this feature?

Users who would benefit from the pipeline knowing how many examples there are in input data.

Do you have a workaround or are you completely blocked by this?

The current workaround is to count rows in the original CSV (or TFRecord) files, which introduces additional code to be maintained. Without a shared component, everybody ends up solving the same problem in many different ways.
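For the TFRecord case, the per-pipeline counting code can at least be centralized in a small helper. Below is a sketch (not part of the proposed component) that counts records by walking the TFRecord on-disk framing with only the Python standard library; it assumes an uncompressed file and skips CRC verification:

```python
import struct


def count_tfrecord_examples(path):
    """Count records in an uncompressed TFRecord file.

    Each record is framed as: an 8-byte little-endian length, a 4-byte
    masked CRC of the length, the record payload, and a 4-byte masked CRC
    of the payload. Only the lengths are needed to count, so the CRCs and
    payload are skipped rather than read or verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 record length
            if len(header) < 8:
                break  # clean end of file
            (length,) = struct.unpack("<Q", header)
            # Skip length CRC (4), payload (length), and data CRC (4).
            f.seek(4 + length + 4, 1)
            count += 1
    return count
```

This avoids pulling in TensorFlow just to count examples, but for production use something like `tf.data.TFRecordDataset(...).reduce(...)` would also validate the records it reads.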

Name of your Organization (Optional)

Intuitive Cloud (GCP Partner).

Any Other info.

I have this partially written. I need advice on data formats (e.g. how to store this on disk) and how to deal with spans. I don't need help writing it per se, but rather advice on how to make it fit well into the TFX ecosystem. I've been using StatisticsGen as the component to model this after, since it is the most similar (e.g. it produces numbers and works across splits).

@pritamdodeja
Contributor Author

While studying protocol buffers I could use as a template for this, I found that the proto produced by StatisticsGen already contains num_examples. The problem is that, as far as I know, this cannot be fed to the Trainer component via custom_config. A similar problem in a different setting is discussed at https://stackoverflow.com/questions/74935155/how-to-get-the-list-of-features-along-side-their-schema-and-stats-using-tfx/79229561#79229561

As such, I don't believe a dedicated Counter Component is needed. However, how does one go about using this num_examples in the rest of the pipeline? From my understanding, this would require updating the component spec, executor, and component for the Trainer, and passing this information downstream. I would appreciate guidance on how to make Trainer and Tuner aware of the amount of data they're training on without hard-coding it.

@pritamdodeja
Contributor Author

I have this implemented locally. Based on this implementation, in run_fn(fn_args: tfx.components.FnArgs) we can use the following code, which serves the same purpose as proto.TrainArgs:

splits = fn_args.num_examples.keys()
training_examples = fn_args.num_examples['train']
validation_examples = fn_args.num_examples['eval']
model.fit(
    train_dataset,
    steps_per_epoch=f(training_examples, BATCH_SIZE),
    validation_data=eval_dataset,
    validation_steps=f(validation_examples, BATCH_SIZE),
    callbacks=callback_list,
    epochs=EPOCHS,
)
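The helper f is not defined in the snippet above; a plausible minimal implementation (the name steps_from_examples is my own) is ceiling division, so that a final partial batch still counts as a step:

```python
import math


def steps_from_examples(num_examples, batch_size):
    """Steps needed for one full pass over the data.

    Ceiling division ensures a trailing partial batch is not dropped:
    e.g. 1000 examples at batch size 32 yields 32 steps, not 31.
    """
    return math.ceil(num_examples / batch_size)
```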

The Trainer would be configured as follows:

trainer = Trainer(
    module_file=trainer_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=transform.outputs['post_transform_schema'],
    hyperparameters=tuner.outputs['best_hyperparameters'],
    statistics=statistics_gen.outputs['statistics'],  # new input
)

Please let me know if there is interest in integrating this into TFX. If there is, I will write the test cases and submit a PR. I would recommend we try this out with the vanilla Trainer component first, make sure everything works, then implement Tuner. After that, we can update the cloud versions of Trainer and Tuner.

pritamdodeja pushed a commit to pritamdodeja/tfx that referenced this issue Dec 17, 2024
Enables the user to use the number-of-examples information computed by
StatisticsGen in their training code. Passing statistics to the Trainer
enables the use of

fn_args.num_examples['train'] etc., in run_fn

More details at:
tensorflow#7700
@pritamdodeja
Contributor Author

Hello,

Could you please provide feedback on the PR mentioned above? Thank you!

Pritam
