
CUDA device-side runtime error when training on custom dataset for JSON outputs #170

Closed
SinclairHudson opened this issue May 12, 2024 · 4 comments · Fixed by #172

@SinclairHudson
Contributor

Describe the bug
Training on this dataset fails with a CUDA device-side runtime error: https://huggingface.co/datasets/azizshaw/text_to_json

To Reproduce
Steps to reproduce the behavior:
1. Check out the main branch.
2. Replace the data ingestion portion of llmtune/config.yml with:

```yaml
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "azizshaw/text_to_json"
  prompt:
    >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    {instruction}
    Now create a json object for the following scenario
    {input}
  prompt_stub:
    >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```

3. Run:

```bash
llmtune run llmtune/config.yml
```
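
For reference, here's roughly how I'd expect those two templates to render for a single row (a sketch assuming plain str.format-style substitution; the row values below are invented, not taken from the dataset):

```python
# Hypothetical illustration of how prompt/prompt_stub render for one row.
# Column names ({instruction}, {input}, {output}) come from the config above;
# the values are made up for illustration.
row = {
    "instruction": "You are a helpful assistant that writes JSON.",
    "input": "A user named Ada who is 36 years old.",
    "output": '{"name": "Ada", "age": 36}',
}

prompt = (
    "{instruction}\n"
    "Now create a json object for the following scenario\n"
    "{input}"
).format(**row)

prompt_stub = "{output}".format(**row)  # appended only during training

print(prompt)
print(prompt_stub)
```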

Expected behavior
To my knowledge, this should run without error.

Environment:

  • OS: Ubuntu 20.04
  • running locally on a 3090
  • using the developer poetry environment/shell

This bug doesn't occur on the default dataset, only on this one. So it could be something with a specific token or encoding in this dataset, or an issue with the JSON outputs interfering with the YAML syntax in the config.
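
A quick way to check the YAML half of that guess (a sketch with PyYAML; llmtune's actual config loader may differ) suggests braces inside a folded block scalar are just plain text to YAML:

```python
# Braces in a folded block scalar are not special to YAML, so the config
# itself should parse fine. Sketch only; llmtune may load configs differently.
import yaml

cfg = yaml.safe_load("""
prompt:
  >-
  {instruction}
  Now create a json object for the following scenario
  {input}
""")
print(cfg["prompt"])  # -> one folded string containing the literal braces
```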

@SinclairHudson
Contributor Author

FYI, this dataset also doesn't work; it fails for another reason, with a TypeError somewhere else. That could be unrelated, but I'm wondering whether the method for injecting prompts/responses is robust to stringified JSON (see the sketch after the config below).

```yaml
# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "growth-cadet/jobpost_signals-to-json_test_mistral01gen"
  prompt:
    >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Given the following job posting, convert the text into a JSON object, with relevant fields.
    ## Job posting
    {context}
    ## JSON
  prompt_stub:
    >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {mistral01_gen}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```
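
To illustrate the concern, a minimal sketch of the suspected failure mode, assuming the toolkit fills templates with Python's str.format (an assumption; I haven't traced the injection code):

```python
# Braces arriving via *values* are safe with str.format; literal JSON braces
# inside the *template* itself are not.
template = "## Job posting\n{context}\n## JSON"
print(template.format(context='{"title": "ML Engineer"}'))  # fine

bad_template = 'Return JSON like {"title": ...} given {context}'
try:
    bad_template.format(context="some posting")
except KeyError as err:
    # str.format treats {"title"...} as a replacement field -> KeyError
    print("KeyError:", err)
```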

@benjaminye
Contributor

@SinclairHudson Thanks for flagging this. For "azizshaw/text_to_json", can you attach the error message? It ran fine for me.

For "growth-cadet/jobpost_signals-to-json_test_mistral01gen", I've identified an issue with the table display where int and float values weren't converted to str properly. I've patched it in #172.

@SinclairHudson
Contributor Author

device-side.txt
I've included the whole output, both stdout and stderr, for the "azizshaw/text_to_json" case.

@benjaminye
Contributor

Can you run transformers-cli env and paste the output? Also, can you attach the config as well?

With the above info, I'll try to replicate and debug on my end.

Also, it could be an issue caused by using multiple GPUs (huggingface/transformers#22546). If the model is small enough, can you try pinning the weights to one GPU via device_map? Something like the sketch below.
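
```python
# Sketch of pinning all weights to a single GPU via transformers' device_map
# (requires accelerate; the model name is a placeholder, use whatever your
# config points at):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder model
    device_map={"": 0},           # "" maps the whole model onto cuda:0
)
```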
