Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A parameter is specified but not used in datasets.arrow_dataset.Dataset.from_pandas() #7365

Open
NourOM02 opened this issue Jan 10, 2025 · 0 comments

Comments

@NourOM02
Copy link

Describe the bug

I am interested in creating train, test and eval splits from a pandas Dataframe, therefore I was looking at the possibilities I can follow. I noticed the split parameter and was hopeful to use it in order to generate the 3 at once, however, while trying to understand the code, i noticed that it has no added value (correct me if I am wrong or misunderstood the code).

from_pandas function code :

    if info is not None and features is not None and info.features != features:
            raise ValueError(
                f"Features specified in `features` and `info.features` can't be different:\n{features}\n{info.features}"
            )
        features = features if features is not None else info.features if info is not None else None
        if info is None:
            info = DatasetInfo()
        info.features = features
        table = InMemoryTable.from_pandas(
            df=df,
            preserve_index=preserve_index,
        )
        if features is not None:
            # more expensive cast than InMemoryTable.from_pandas(..., schema=features.arrow_schema)
            # needed to support the str to Audio conversion for instance
            table = table.cast(features.arrow_schema)
        return cls(table, info=info, split=split)

Steps to reproduce the bug

from datasets import Dataset
# Filling the split parameter with whatever causes no harm at all
data = Dataset.from_pandas(self.raw_data, split='egiojegoierjgoiejgrefiergiuorenvuirgurthgi')

Expected behavior

Would be great if there is no split parameter (if it isn't working), or to add a concrete example of how it can be used.

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.15.0-127-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.27.1
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant