Releases: huggingface/datasets
0.3.0
New methods to transform a dataset:
- dataset.shuffle: create a shuffled dataset
- dataset.train_test_split: create a train and a test split (similar to sklearn)
- dataset.sort: create a dataset sorted according to a certain column
- dataset.select: create a dataset with rows selected following the given list of indices
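The behaviour of these four transforms can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: a list of row dicts stands in for the dataset object, and the function names, the `seed` and `test_size` parameters, and the returned dict with "train" and "test" keys are assumptions made for illustration only.

```python
import random


def shuffle(rows, seed=None):
    """Return a new list with the same rows in random order."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    return shuffled


def train_test_split(rows, test_size=0.25, seed=None):
    """Shuffle the rows, then cut them into a train part and a test part."""
    shuffled = shuffle(rows, seed=seed)
    n_test = int(round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


def sort(rows, column):
    """Return the rows ordered by the values of one column."""
    return sorted(rows, key=lambda row: row[column])


def select(rows, indices):
    """Return the rows at the given indices, in the given order."""
    return [rows[i] for i in indices]


# Toy data: 8 rows with a text column and a binary label column.
rows = [{"text": f"example {i}", "label": i % 2} for i in range(8)]

splits = train_test_split(rows, test_size=0.25, seed=0)
by_label = sort(rows, "label")
picked = select(rows, [0, 3, 5])
```

Each function returns a new collection and leaves its input unchanged, which mirrors the "create a ... dataset" wording of the release notes.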
Other features:
- Better instructions for datasets that require manual download
  Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised
- Better access to dataset information (for instance dataset.feature['label'] or dataset.dataset_size)
Datasets:
- New: cos_e v1.0
- New: rotten_tomatoes
- New: German and Italian Wikipedia
New docs:
- documentation about splitting a dataset
Bug fixes:
- fix metric.compute that couldn't write on file
- fix squad_v2 imports
0.2.1
New datasets:
- ELI5
- CompGuessWhat?!
- BookCorpus
- Piaf
- Allociné
- BlendedSkillTalk
New features:
- .filter method
- option to do batching for metrics
- make datasets deterministic
New commands:
- nlp-cli upload_dataset
- nlp-cli upload_metric
- nlp-cli s3_datasets {ls,rm}
- nlp-cli s3_metrics {ls,rm}
New datasets + Apache Beam, new metrics, bug fixes
Datasets changes
- New: germeval14
- New: wmt
- New: Ubuntu dialog corpus
- New: squad spanish
- New: Quanta
- New: arcd
- New: Natural Questions (needs to be processed using a beam pipeline)
- New: C4 (needs to be processed using a beam pipeline)
- Skip the processing: wikipedia (English and French versions are now already processed)
- Skip the processing: wiki40b (English version is now already processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add .filter() for arrow datasets
- Add instruction message for manual data when required
Metrics changes:
- New: BERTScore
- Allow adding examples one element at a time or by batch when computing a metric score
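The element-wise versus batched accumulation pattern described above can be sketched as follows. The `AccuracyAccumulator` class and its `add`, `add_batch` and `compute` methods are hypothetical stand-ins written for this illustration, not the library's metric classes; accuracy is used only because it is simple to check by hand.

```python
class AccuracyAccumulator:
    """Hypothetical metric object that collects examples before scoring."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add(self, prediction, reference):
        """Record a single (prediction, reference) pair."""
        self.predictions.append(prediction)
        self.references.append(reference)

    def add_batch(self, predictions, references):
        """Record several pairs at once; both sequences must align."""
        if len(predictions) != len(references):
            raise ValueError("predictions and references differ in length")
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        """Return the fraction of predictions that match their reference."""
        if not self.references:
            raise ValueError("no examples were added")
        correct = sum(p == r for p, r in zip(self.predictions, self.references))
        return correct / len(self.references)


metric = AccuracyAccumulator()
metric.add(1, 1)                        # one example at a time
metric.add_batch([0, 1, 1], [0, 0, 1])  # or a whole batch
score = metric.compute()
```

Both entry points feed the same internal buffers, so the two styles can be mixed freely before the final score is computed.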
Commands:
- New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
- New: nlp-cli run_beam: to run an apache beam pipeline to process a dataset in the cloud
Bug fixes:
- .map now returns the right values when run on different splits of the same dataset
- Fix input of the squad metric format to fit the format of the squad dataset
- Fix download from google drive for small files
- For datasets like glue or scientific paper, force the user to pick one sub-dataset to make things less confusing
More tests
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods
First release
First release of the nlp library.
Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md
Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb
This is a beta release and the API is not expected to be stabilized yet (in particular the API for the metrics).
Documentation and tests are also still sparse.