
CUDA device-side runtime error when training on custom dataset for JSON outputs #170

Closed
SinclairHudson opened this issue May 12, 2024 · 4 comments · Fixed by #172

@SinclairHudson
Contributor

Describe the bug
Training on this dataset fails with a CUDA device-side runtime error: https://huggingface.co/datasets/azizshaw/text_to_json

To Reproduce
Steps to reproduce the behavior:
1. Check out the main branch.
2. Replace the data ingestion portion of llmtune/config.yml with:

```yaml
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "azizshaw/text_to_json"
  prompt:
    >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    {instruction}
    Now create a json object for the following scenario
    {input}
  prompt_stub:
    >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```

3. Run:

```bash
llmtune run llmtune/config.yml
```
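
For reference, here's roughly how I'd expect those two templates to render for a single row (a sketch assuming plain str.format-style substitution; the row values below are invented, not taken from the dataset):

```python
# Hypothetical illustration of how prompt/prompt_stub render for one row.
# Column names ({instruction}, {input}, {output}) come from the config above;
# the values are made up for illustration.
row = {
    "instruction": "You are a helpful assistant that writes JSON.",
    "input": "A user named Ada who is 36 years old.",
    "output": '{"name": "Ada", "age": 36}',
}

prompt = (
    "{instruction}\n"
    "Now create a json object for the following scenario\n"
    "{input}"
).format(**row)

prompt_stub = "{output}".format(**row)  # appended only during training

print(prompt)
print(prompt_stub)
```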

Expected behavior
To my knowledge, this should run without error.

Environment:

  • OS: Ubuntu 20.04
  • running locally on a 3090
  • using the developer poetry environment/shell

This bug doesn't occur on the default dataset, only on this one. So it could be something with a specific token or encoding in this dataset, or an issue with the JSON outputs interfering with the YAML syntax in the config.
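
A quick way to check the YAML half of that guess (a sketch with PyYAML; llmtune's actual config loader may differ) suggests braces inside a folded block scalar are just plain text to YAML:

```python
# Braces in a folded block scalar are not special to YAML, so the config
# itself should parse fine. Sketch only; llmtune may load configs differently.
import yaml

cfg = yaml.safe_load("""
prompt:
  >-
  {instruction}
  Now create a json object for the following scenario
  {input}
""")
print(cfg["prompt"])  # -> one folded string containing the literal braces
```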

@SinclairHudson
Contributor Author

FYI, this dataset also doesn't work; it fails for another reason, with a TypeError somewhere else. That could be unrelated, but I'm wondering whether the method for injecting prompts/responses is robust to stringified JSON (see the sketch after the config below).

```yaml
# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "growth-cadet/jobpost_signals-to-json_test_mistral01gen"
  prompt:
    >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Given the following job posting, convert the text into a JSON object, with relevant fields.
    ## Job posting
    {context}
    ## JSON
  prompt_stub:
    >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {mistral01_gen}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```
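
To illustrate the concern, a minimal sketch of the suspected failure mode, assuming the toolkit fills templates with Python's str.format (an assumption; I haven't traced the injection code):

```python
# Braces arriving via *values* are safe with str.format; literal JSON braces
# inside the *template* itself are not.
template = "## Job posting\n{context}\n## JSON"
print(template.format(context='{"title": "ML Engineer"}'))  # fine

bad_template = 'Return JSON like {"title": ...} given {context}'
try:
    bad_template.format(context="some posting")
except KeyError as err:
    # str.format treats {"title"...} as a replacement field -> KeyError
    print("KeyError:", err)
```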

@benjaminye
Contributor

@SinclairHudson Thanks for flagging this. For "azizshaw/text_to_json", can you attach the error message? It ran fine for me.

For "growth-cadet/jobpost_signals-to-json_test_mistral01gen", I've identified an issue with the table display where int and float values weren't converted to str properly. I've patched it in #172.

@SinclairHudson
Contributor Author

device-side.txt
I've included the whole output, both stdout and stderr, for the "azizshaw/text_to_json" case.

@benjaminye
Contributor

Can you run transformers-cli env and paste the output? Also, can you attach the config as well?

With the above info, I'll try to replicate and debug on my end.

Also, it could be an issue caused by using multiple GPUs (huggingface/transformers#22546). If the model is small enough, can you try pinning the weights to one GPU via device_map? Something like the sketch below.
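
```python
# Sketch of pinning all weights to a single GPU via transformers' device_map
# (requires accelerate; the model name is a placeholder, use whatever your
# config points at):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder model
    device_map={"": 0},           # "" maps the whole model onto cuda:0
)
```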
