data sequence and customized csv dataset #2733
-
Hi. When the CSV is very wide (say, 1000 columns), it is impractical to manually define a struct with all the columns as fields. Is there an easier way? Also, how can I construct a tensor from a numeric string? Here is my intention:

```rust
let s = String::from("1, 2, 3, 4, 5, 6");
let tensor = Tensor::<B, 1>::from_floats(s); // I want the tensor [1, 2, 3, 4, 5, 6], but this does not work
```

This would be useful when implementing an LSTM, since sequences are needed.
Replies: 2 comments 3 replies
-
Burn's `InMemDataset` already provides a CSV reader (see `burn/crates/burn-dataset/src/dataset/in_memory.rs`, lines 72 to 88 at commit 245fbcd). If that doesn't fit your needs, you can implement your own parsing to create the dataset.
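For a very wide CSV, one way to avoid a 1000-field struct is to parse each row into a `Vec<f32>` instead. Below is a minimal standard-library sketch of that idea (a real loader would likely use the `csv` crate with serde, which can also deserialize a record straight into a `Vec<f32>`); the function name is illustrative, not part of Burn's API:

```rust
use std::error::Error;

/// Parse one CSV line of numeric fields into a Vec<f32>.
/// Assumes simple comma-separated values without quoting or escapes.
fn parse_row(line: &str) -> Result<Vec<f32>, Box<dyn Error>> {
    line.split(',')
        .map(|field| field.trim().parse::<f32>().map_err(Into::into))
        .collect()
}

fn main() -> Result<(), Box<dyn Error>> {
    // A wide row: no per-column struct is needed, every row becomes a Vec<f32>.
    let row = parse_row("1.0, 2.0, 3.0, 4.0")?;
    println!("{:?}", row); // [1.0, 2.0, 3.0, 4.0]
    Ok(())
}
```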
You cannot construct a tensor directly from a string. For NLP tasks, you need to go from the string representation to tokens. This can be done in many different ways, so the implementation is up to the user. Modern techniques involve tokenization, where strings (e.g., sentences) are split into smaller units (e.g., words, subwords, or characters) called tokens, and these tokens are mapped to unique integers using a vocabulary. See, for example, the tokenizer in the text classification example.
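To make the string-to-tokens step concrete, here is a toy whitespace tokenizer with a growable vocabulary. It is only a sketch of the idea; real pipelines (including Burn's text-classification example) use proper subword tokenizers, and the `Vocab` type here is entirely hypothetical:

```rust
use std::collections::HashMap;

/// Toy vocabulary: maps each word to a unique integer id.
struct Vocab {
    ids: HashMap<String, usize>,
}

impl Vocab {
    fn new() -> Self {
        Self { ids: HashMap::new() }
    }

    /// Return the id for `word`, assigning a fresh one if unseen.
    fn id(&mut self, word: &str) -> usize {
        let next = self.ids.len();
        *self.ids.entry(word.to_string()).or_insert(next)
    }

    /// Split on whitespace and map every word to its id.
    fn encode(&mut self, sentence: &str) -> Vec<usize> {
        sentence.split_whitespace().map(|w| self.id(w)).collect()
    }
}

fn main() {
    let mut vocab = Vocab::new();
    let tokens = vocab.encode("the cat sat on the mat");
    println!("{:?}", tokens); // [0, 1, 2, 3, 0, 4] — "the" gets the same id both times
}
```

The resulting integer ids are what you would then feed into an embedding layer as a tensor.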
-
There is also the [`DataframeDataset`](https://burn.dev/burn-book/building-blocks/dataset.html#storage) option that you can use. It uses a Polars DataFrame underneath. You can use Polars to read and manipulate CSV, so if you can achieve your goal via Polars, you can then load the dataframe as a dataset.
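Coming back to the original question, the missing step is simply parsing the numeric string into a `Vec<f32>` first; Burn tensors are built from numeric data, not strings. A minimal sketch (the `from_floats` call is shown as a comment since it needs a Burn backend in scope, and its exact signature may differ across Burn versions):

```rust
/// Parse a comma-separated numeric string into a Vec<f32>.
fn parse_floats(s: &str) -> Vec<f32> {
    s.split(',')
        .map(|v| v.trim().parse::<f32>().expect("invalid number"))
        .collect()
}

fn main() {
    let s = String::from("1, 2, 3, 4, 5, 6");
    let data = parse_floats(&s);

    // With a Burn backend `B` and a device in scope, the parsed data
    // can then become a rank-1 tensor (hypothetical sketch):
    // let tensor = Tensor::<B, 1>::from_floats(data.as_slice(), &device);
    println!("{:?}", data); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
}
```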