Model testing with Datasets #168
We should make this work with test_runner.py. A good start would be to enable 2-3 datasets with 1-2 models.
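As a starting point, the dataset/model pairing could be expressed as a small table that test_runner.py iterates over. The sketch below is only an illustration; the names `DATASET_MODELS` and `run_model_on_dataset` do not exist in the runner yet and are assumptions.

```python
# Hypothetical sketch of pairing 2-3 datasets with a few models for test_runner.py.
# None of these names exist in the repository; they only illustrate the idea.

DATASET_MODELS = {
    "imagenet-2012": ["resnet50_v1.5", "timm-mobilenetv3-large"],
    "squad-v1.1": ["distilbert-base-cased-distilled-squad", "roberta-base-squad2"],
    "librispeech-asr": ["wav2vec2-base-960h"],
}

def run_all(run_model_on_dataset):
    """Run every (dataset, model) pair and collect the per-pair results."""
    results = {}
    for dataset, models in DATASET_MODELS.items():
        for model in models:
            results[(dataset, model)] = run_model_on_dataset(dataset, model)
    return results
```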
The current work is here, covering all 3 major data sources: ImageNet2012 (images), SQuADv1.1 (text), LibriSpeechASR (sound). Next step: enable multi-model scenarios, e.g. Whisper (encoder-decoder), Stable Diffusion (text encoder, UNet, VAE decoder), etc.
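For the Whisper encoder-decoder case, a multi-model scenario would chain the exported sub-models. The sketch below is only an illustration: the ONNX file names and the input/output tensor names are assumptions and would have to match whatever the actual export produces.

```python
# Hypothetical sketch of a multi-model (encoder-decoder) Whisper scenario.
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession("whisper-small-en_encoder.onnx")   # assumed file name
decoder = ort.InferenceSession("whisper-small-en_decoder.onnx")   # assumed file name

# Shapes taken from the logs below: (1, 80, 3000) mel features, (1, 448) token ids.
input_features = np.zeros((1, 80, 3000), dtype=np.float32)
decoder_input_ids = np.zeros((1, 448), dtype=np.int64)

# Stage 1: run the encoder, then feed its hidden states into the decoder.
encoder_hidden_states = encoder.run(None, {"input_features": input_features})[0]
logits = decoder.run(None, {
    "input_ids": decoder_input_ids,                    # assumed input name
    "encoder_hidden_states": encoder_hidden_states,    # assumed input name
})[0]
print(logits.shape)
```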
Current state

Note: The default atol/rtol values were used. For fp16, looking at the logs, the numbers look correct, but less precise than the tolerance. It might be better to also report the actual difference to make the comparison easier (a sketch of such a report follows the logs below).

We can download and generate actual test data for the following models with their datasets:

Imagenet dataset (image)

Input: pixel_values, shape: (1, 3, 224, 224)
Test "resnet50_v1.5" has 7 cases:
Passed: 0
Failed: 7

Input: pixel_values, shape: (1, 3, 224, 224)
Test "resnet50_v1.5" has 7 cases:
Passed: 7
Failed: 0

Input: input_tensor:0, shape: (1, 3, 224, 224)
Test "resnet50_v1" has 7 cases:
Passed: 1
Failed: 6

Input: input_tensor:0, shape: (1, 3, 224, 224)
Test "resnet50_v1" has 7 cases:
Passed: 7
Failed: 0

timm-mobilenetv3-large_fp16.log
Input: pixel_values, shape: (1, 3, 224, 224)
Test "timm-mobilenetv3-large" has 7 cases:
Passed: 0
Failed: 7

timm-mobilenetv3-large_fp32.log
Input: pixel_values, shape: (1, 3, 224, 224)
Test "timm-mobilenetv3-large" has 7 cases:
Passed: 7
Failed: 0

Input: pixel_values, shape: (1, 3, 224, 224)
Test "vit-base-patch16-224" has 7 cases:
Passed: 0
Failed: 7

Input: pixel_values, shape: (1, 3, 224, 224)
Test "vit-base-patch16-224" has 7 cases:
Passed: 7
Failed: 0

SQuAD dataset (text)

distilbert-base-cased-distilled-squad_fp16.log
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "distilbert-base-cased-distilled-squad" has 7 cases:
Passed: 0
Failed: 7

distilbert-base-cased-distilled-squad_fp32.log
Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "distilbert-base-cased-distilled-squad" has 7 cases:
Passed: 7
Failed: 0

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gpt-j" has 42 cases:
Passed: 0
Failed: 42

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gpt-j" has 42 cases:
Passed: 42
Failed: 0

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "roberta-base-squad2" has 7 cases:
Passed: 0
Failed: 7

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "roberta-base-squad2" has 7 cases:
Passed: 7
Failed: 0

LibriSpeech dataset (audio)

Input: input_values, shape: (1, 105440)
Test "wav2vec2-base-960h" has 7 cases:
Passed: 0
Failed: 7

Input: input_values, shape: (1, 105440)
Test "wav2vec2-base-960h" has 7 cases:
Passed: 1
Failed: 6

Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)
Test "whisper-small-en" has 21 cases:
Passed: 0
Failed: 21

Input: input_features, shape: (1, 80, 3000)
Input: decoder_input_ids, shape: (1, 448)
Test "whisper-small-en" has 21 cases:
Passed: 21
Failed: 0
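Regarding the note above about reporting the difference alongside the atol/rtol check: a minimal sketch of such a report is below. The helper name and how it would plug into test_runner.py are assumptions.

```python
# Minimal sketch of reporting the observed error on top of the atol/rtol check,
# so fp16 near-misses are easy to judge against the tolerance.
import numpy as np

def compare_outputs(expected, actual, rtol=1e-3, atol=1e-5):
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    abs_diff = np.abs(expected - actual)
    rel_diff = abs_diff / np.maximum(np.abs(expected), np.finfo(np.float64).tiny)
    passed = np.allclose(actual, expected, rtol=rtol, atol=atol)
    print(f"max abs diff: {abs_diff.max():.3e} (atol={atol:.1e}), "
          f"max rel diff: {rel_diff.max():.3e} (rtol={rtol:.1e}), passed={passed}")
    return passed
```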
Current State pt2

Imagenet dataset (image)

Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)
Test "clip-vit-large-patch14" has 7 cases:
Passed: 0
Failed: 7

Input: input_ids, shape: (10, 77)
Input: attention_mask, shape: (10, 77)
Input: pixel_values, shape: (1, 3, 224, 224)
Test "clip-vit-large-patch14" has 7 cases:
Passed: 7
Failed: 0

SQuAD dataset (text)

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gemma-2b-it" has 30 cases:
Passed: 0
Failed: 30

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "gemma-2b-it" has 30 cases:
Passed: 30
Failed: 0

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)
Test "t5-base" has 30 cases:
Passed: 0
Failed: 30

Note: All

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: decoder_input_ids, shape: (1, 384)
Test "t5-base" has 30 cases:
Passed: 30
Failed: 0
Current State pt3

SQuAD dataset (text)

Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "bert-large-uncased" has 5 cases:
Passed: 0
Failed: 5

Input: input_ids, shape: (1, 384)
Input: token_type_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Test "bert-large-uncased" has 5 cases:
Passed: 4
Failed: 1

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama2-7b-chat-hf" has 17 cases:
Passed: 0
Failed: 17

Note:

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama2-7b-chat-hf" has 17 cases:
Passed: 17
Failed: 0

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama3-8b-instruct" has 25 cases:
Passed: 0
Failed: 25

Note:

Input: input_ids, shape: (1, 384)
Input: attention_mask, shape: (1, 384)
Input: position_ids, shape: (1, 384)
Test "llama3-8b-instruct" has 25 cases:
Passed: 25
Failed: 0
Add the following models:
DLRM-DCNv2

Can't be exported to ONNX due to

But it could be created with these:

To create the dataset: https://github.com/facebookresearch/dlrm/blob/main/torchrec_dlrm/scripts/process_Criteo_1TB_Click_Logs_dataset.sh
Current State pt4

COCO dataset (image) + Style prompts (text)

stable-diffusion-2-1_text_encoder_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 0
Failed: 5

stable-diffusion-2-1_text_encoder_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 5
Failed: 0

stable-diffusion-2-1_unet_fp16.log
Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)
Test "unet" has 25 cases:
Passed: 0
Failed: 25

Note: Outputs are

stable-diffusion-2-1_unet_fp32.log
Input: sample, shape: (2, 4, 64, 64)
Input: encoder_hidden_states, shape: (2, 77, 1024)
Input: timestep, shape: (1,)
Test "unet" has 25 cases:
Passed: 25
Failed: 0

stable-diffusion-2-1_vae_decoder_fp16.log
Input: latent_sample, shape: (1, 4, 64, 64)
Test "vae_decoder" has 5 cases:
Passed: 0
Failed: 5

stable-diffusion-2-1_vae_decoder_fp32.log
Input: latent_sample, shape: (1, 4, 64, 64)
Test "vae_decoder" has 5 cases:
Passed: 5
Failed: 0

stable-diffusion-2-1_vae_encoder_fp16.log
Input: sample, shape: (1, 3, 512, 512)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5

Note: Some of the outputs are

stable-diffusion-2-1_vae_encoder_fp32.log
Input: sample, shape: (1, 3, 512, 512)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5

stable-diffusion-xl_text_encoder_2_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder_2" has 5 cases:
Passed: 0
Failed: 5

stable-diffusion-xl_text_encoder_2_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder_2" has 5 cases:
Passed: 5
Failed: 0

stable-diffusion-xl_text_encoder_fp16.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 0
Failed: 5

stable-diffusion-xl_text_encoder_fp32.log
Input: input_ids, shape: (2, 77)
Test "text_encoder" has 5 cases:
Passed: 5
Failed: 0

stable-diffusion-xl_unet_fp16.log
Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)
Test "unet" has 25 cases:
Passed: 0
Failed: 25

stable-diffusion-xl_unet_fp32.log
Input: sample, shape: (2, 4, 128, 128)
Input: encoder_hidden_states, shape: (2, 77, 2048)
Input: timestep, shape: (1,)
Input: text_embeds, shape: (2, 1280)
Input: time_ids, shape: (2, 6)
Test "unet" has 25 cases:
Passed: 25
Failed: 0

stable-diffusion-xl_vae_decoder_fp16.log
Input: latent_sample, shape: (1, 4, 128, 128)
Test "vae_decoder" has 5 cases:
Passed: 0
Failed: 5

Note: Outputs are

stable-diffusion-xl_vae_decoder_fp32.log
Input: latent_sample, shape: (1, 4, 128, 128)
Test "vae_decoder" has 5 cases:
Passed: 5
Failed: 0

stable-diffusion-xl_vae_encoder_fp16.log
Input: sample, shape: (1, 3, 1024, 1024)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5

Note: Outputs are

stable-diffusion-xl_vae_encoder_fp32.log
Input: sample, shape: (1, 3, 1024, 1024)
Test "vae_encoder" has 5 cases:
Passed: 0
Failed: 5
To properly test model accuracy, it is not enough to use random data, since it might not cover the realistic range of possible inputs.
We should collect candidate datasets and assign models to them.
The idea is to use public datasets. HuggingFace provides datasets, and also Python helpers to load them.
Downloading is not enough, since each model has pre- and post-processing steps, which can vary from model to model.
The whole process should be automatic and deterministic.
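As an illustration of what the automatic path could look like, the `datasets` and `transformers` helpers can be combined roughly as below. This is only a sketch: the dataset id, checkpoint name, and output file name are placeholders, and imagenet-1k is gated on the Hub, so it requires authentication.

```python
# Hypothetical sketch: fetch one ImageNet validation sample and run the
# model-specific pre-processing to produce the pixel_values tensor the logs
# above expect, shape (1, 3, 224, 224).
import numpy as np
from datasets import load_dataset
from transformers import AutoImageProcessor

ds = load_dataset("imagenet-1k", split="validation", streaming=True)  # gated: needs HF login
sample = next(iter(ds))  # deterministic for a fixed dataset revision

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")  # placeholder checkpoint
inputs = processor(images=sample["image"], return_tensors="np")

np.save("pixel_values_0.npy", inputs["pixel_values"])  # placeholder file name
print(inputs["pixel_values"].shape)  # (1, 3, 224, 224)
```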