More consistent naming (#2611)
* More consistent naming

* Update datasets/norne/README.md

Co-authored-by: Stas Bekman <[email protected]>

* Fix anchor

Co-authored-by: Stas Bekman <[email protected]>

* Remove backticks in name

* Remove backticks in anchor

* Replace Tensorflow with TensorFlow

* more 🤗

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
3 people authored Jul 13, 2021
1 parent dcc2cf1 commit 4aff493
Showing 15 changed files with 81 additions and 80 deletions.
3 changes: 2 additions & 1 deletion CONTRIBUTING.md
@@ -126,7 +126,8 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_

6. Finally, take some time to document your dataset for other users. Each dataset should be accompanied by a `README.md` dataset card in its directory which describes the data and contains tags representing languages and tasks supported to be easily discoverable. You can find information on how to fill out the card either manually or by using our [web app](https://huggingface.co/datasets/card-creator/) in the following [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md).

7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗Datasets?*](#how-to-contribute-to-🤗Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.
7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗 Datasets?*](#how-to-contribute-to-Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below.



### Help for dummy data tests
38 changes: 19 additions & 19 deletions README.md
@@ -25,7 +25,7 @@
<a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>

`🤗Datasets` is a lightweight library providing **two** main features:
🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
@@ -38,29 +38,29 @@
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
</h3>

`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
🤗 Datasets also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

`🤗Datasets` has many additional interesting features:
- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
🤗 Datasets has many additional interesting features:
- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
- Smart caching: never wait for your data to process several times.
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.

`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-datasets-and-tfds).

# Installation

## With pip

`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
🤗 Datasets can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)

```bash
pip install datasets
```

## With conda

`🤗Datasets` can be installed using conda as follows:
🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets
@@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati

## Installation to use with PyTorch/TensorFlow/pandas

If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html

# Usage

`🤗Datasets` is made to be very simple to use. The main methods are:
🤗 Datasets is made to be very simple to use. The main methods are:

- `datasets.list_datasets()` to list the available datasets
- `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
@@ -106,7 +106,7 @@ squad_metric = load_metric('squad')
# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

@@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document

- Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
- What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html
- Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
- etc.

Another introduction to `🤗Datasets` is the tutorial on Google Colab here:
Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

# Add a new dataset to the Hub
@@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).

# Main differences between `🤗Datasets` and `tfds`
# Main differences between 🤗 Datasets and `tfds`

If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`:
- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`:
- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
- 🤗 Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
- the backend serialization of 🤗 Datasets is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.

# Disclaimers

Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!

2 changes: 1 addition & 1 deletion datasets/newsroom/README.md
@@ -61,7 +61,7 @@ And additional features:
- compression_bin: low, medium, high.

This dataset can be downloaded upon requests. Unzip all the contents
"train.jsonl, dev.josnl, test.jsonl" to the tfds folder.
"train.jsonl, dev.josnl, test.jsonl" to the `tfds` folder.

### Supported Tasks and Leaderboards

2 changes: 1 addition & 1 deletion datasets/norne/README.md
@@ -238,7 +238,7 @@ To access these reduced versions of the dataset, you can use the configs `bokmaa

NorNE was created as a collaboration between [Schibsted Media Group](https://schibsted.com/), [Språkbanken](https://www.nb.no/forskning/sprakbanken/) at the [National Library of Norway](https://www.nb.no) and the [Language Technology Group](https://www.mn.uio.no/ifi/english/research/groups/ltg/) at the University of Oslo.

NorNE was added to Huggingface Datasets by the AI-Lab at the National Library of Norway.
NorNE was added to 🤗 Datasets by the AI-Lab at the National Library of Norway.

### Licensing Information

2 changes: 1 addition & 1 deletion docs/source/exploring.rst
@@ -190,7 +190,7 @@ Up to now, the rows/batches/columns returned when querying the elements of the d

Sometimes we would like to have more sophisticated objects returned by our dataset, for instance NumPy arrays or PyTorch tensors instead of python lists.

🤗Datasets provides a way to do that through what is called a ``format``.
🤗 Datasets provides a way to do that through what is called a ``format``.

While the internal storage of the dataset is always the Apache Arrow format, by setting a specific format on a dataset, you can filter some columns and cast the output of :func:`datasets.Dataset.__getitem__` in NumPy/pandas/PyTorch/TensorFlow, on-the-fly.

10 changes: 5 additions & 5 deletions docs/source/index.rst
@@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing

Compatible with NumPy, Pandas, PyTorch and TensorFlow

🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).

🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):

Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
Lightweight and fast with a transparent and pythonic API
Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
Smart caching: never wait for your data to process several times
🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗Datasets viewer.
🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.

🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.
🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.

Contents
---------------------------------
16 changes: 8 additions & 8 deletions docs/source/installation.md
@@ -1,21 +1,21 @@
# Installation

🤗Datasets is tested on Python 3.6+.
🤗 Datasets is tested on Python 3.6+.

You should install 🤗Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
You should install 🤗 Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going to use and activate it.

Now, if you want to use 🤗Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.
Now, if you want to use 🤗 Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source.

## Installation with pip

🤗Datasets can be installed using pip as follows:
🤗 Datasets can be installed using pip as follows:

```bash
pip install datasets
```

To check 🤗Datasets is properly installed, run the following command:
To check 🤗 Datasets is properly installed, run the following command:

```bash
python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
@@ -27,7 +27,7 @@ It should download version 1 of the [Stanford Question Answering Dataset](https:
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
```

If you want to use the 🤗Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately.
If you want to use the 🤗 Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately.
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available)
and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.

@@ -48,11 +48,11 @@ Again, you can run:
python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
```

to check 🤗Datasets is properly installed.
to check 🤗 Datasets is properly installed.

## With conda

🤗Datasets can be installed using conda as follows:
🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets

1 comment on commit 4aff493

@github-actions

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.012591 / 0.011353 (0.001238) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004914 / 0.011008 (-0.006094) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037272 / 0.038508 (-0.001236) |
| read_batch_unformated after write_array2d | 0.043233 / 0.023109 (0.020124) |
| read_batch_unformated after write_flattened_sequence | 0.385063 / 0.275898 (0.109165) |
| read_batch_unformated after write_nested_sequence | 0.416782 / 0.323480 (0.093302) |
| read_col_formatted_as_numpy after write_array2d | 0.010309 / 0.007986 (0.002323) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005689 / 0.004328 (0.001361) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011463 / 0.004250 (0.007213) |
| read_col_unformated after write_array2d | 0.050220 / 0.037052 (0.013168) |
| read_col_unformated after write_flattened_sequence | 0.371986 / 0.258489 (0.113497) |
| read_col_unformated after write_nested_sequence | 0.424239 / 0.293841 (0.130398) |
| read_formatted_as_numpy after write_array2d | 0.036030 / 0.128546 (-0.092517) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011690 / 0.075646 (-0.063956) |
| read_formatted_as_numpy after write_nested_sequence | 0.329907 / 0.419271 (-0.089364) |
| read_unformated after write_array2d | 0.058576 / 0.043533 (0.015043) |
| read_unformated after write_flattened_sequence | 0.365940 / 0.255139 (0.110801) |
| read_unformated after write_nested_sequence | 0.421020 / 0.283200 (0.137820) |
| write_array2d | 0.101049 / 0.141683 (-0.040634) |
| write_flattened_sequence | 2.017071 / 1.452155 (0.564916) |
| write_nested_sequence | 1.991758 / 1.492716 (0.499042) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.016820 / 0.018006 (-0.001186) |
| get_batch_of_1024_rows | 0.533649 / 0.000490 (0.533159) |
| get_first_row | 0.002529 / 0.000200 (0.002329) |
| get_last_row | 0.000415 / 0.000054 (0.000360) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.047123 / 0.037411 (0.009712) |
| shard | 0.030232 / 0.014526 (0.015706) |
| shuffle | 0.033804 / 0.176557 (-0.142753) |
| sort | 0.154769 / 0.737135 (-0.582366) |
| train_test_split | 0.035212 / 0.296338 (-0.261126) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.492761 / 0.215209 (0.277552) |
| read 50000 | 5.064150 / 2.077655 (2.986496) |
| read_batch 50000 10 | 2.337621 / 1.504120 (0.833501) |
| read_batch 50000 100 | 2.021944 / 1.541195 (0.480749) |
| read_batch 50000 1000 | 2.123902 / 1.468490 (0.655412) |
| read_formatted numpy 5000 | 0.496727 / 4.584777 (-4.088050) |
| read_formatted pandas 5000 | 6.544778 / 3.745712 (2.799066) |
| read_formatted tensorflow 5000 | 4.040534 / 5.269862 (-1.229328) |
| read_formatted torch 5000 | 1.633948 / 4.565676 (-2.931729) |
| read_formatted_batch numpy 5000 10 | 0.060896 / 0.424275 (-0.363379) |
| read_formatted_batch numpy 5000 1000 | 0.007066 / 0.007607 (-0.000541) |
| shuffled read 5000 | 0.661245 / 0.226044 (0.435201) |
| shuffled read 50000 | 6.492883 / 2.268929 (4.223955) |
| shuffled read_batch 50000 10 | 3.028333 / 55.444624 (-52.416291) |
| shuffled read_batch 50000 100 | 2.477792 / 6.876477 (-4.398685) |
| shuffled read_batch 50000 1000 | 2.469152 / 2.142072 (0.327080) |
| shuffled read_formatted numpy 5000 | 0.713606 / 4.805227 (-4.091622) |
| shuffled read_formatted_batch numpy 5000 10 | 0.158330 / 6.500664 (-6.342334) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.067693 / 0.075469 (-0.007776) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 15.871547 / 1.841788 (14.029760) |
| map fast-tokenizer batched | 16.163419 / 8.074308 (8.089111) |
| map identity | 43.385822 / 10.191392 (33.194430) |
| map identity batched | 0.931739 / 0.680424 (0.251315) |
| map no-op batched | 0.714504 / 0.534201 (0.180303) |
| map no-op batched numpy | 0.303416 / 0.579283 (-0.275868) |
| map no-op batched pandas | 0.724150 / 0.434364 (0.289786) |
| map no-op batched pytorch | 0.238152 / 0.540337 (-0.302185) |
| map no-op batched tensorflow | 1.160244 / 1.386936 (-0.226692) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011637 / 0.011353 (0.000285) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004598 / 0.011008 (-0.006410) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.041047 / 0.038508 (0.002539) |
| read_batch_unformated after write_array2d | 0.039521 / 0.023109 (0.016412) |
| read_batch_unformated after write_flattened_sequence | 0.434860 / 0.275898 (0.158962) |
| read_batch_unformated after write_nested_sequence | 0.467137 / 0.323480 (0.143657) |
| read_col_formatted_as_numpy after write_array2d | 0.008909 / 0.007986 (0.000924) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006151 / 0.004328 (0.001823) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011240 / 0.004250 (0.006990) |
| read_col_unformated after write_array2d | 0.043557 / 0.037052 (0.006505) |
| read_col_unformated after write_flattened_sequence | 0.421536 / 0.258489 (0.163047) |
| read_col_unformated after write_nested_sequence | 0.463900 / 0.293841 (0.170059) |
| read_formatted_as_numpy after write_array2d | 0.035659 / 0.128546 (-0.092887) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012200 / 0.075646 (-0.063446) |
| read_formatted_as_numpy after write_nested_sequence | 0.326896 / 0.419271 (-0.092376) |
| read_unformated after write_array2d | 0.057641 / 0.043533 (0.014109) |
| read_unformated after write_flattened_sequence | 0.429042 / 0.255139 (0.173903) |
| read_unformated after write_nested_sequence | 0.473112 / 0.283200 (0.189913) |
| write_array2d | 0.091974 / 0.141683 (-0.049709) |
| write_flattened_sequence | 1.969643 / 1.452155 (0.517489) |
| write_nested_sequence | 2.001315 / 1.492716 (0.508598) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.076216 / 0.018006 (0.058210) |
| get_batch_of_1024_rows | 0.547000 / 0.000490 (0.546511) |
| get_first_row | 0.030521 / 0.000200 (0.030321) |
| get_last_row | 0.005602 / 0.000054 (0.005547) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.050171 / 0.037411 (0.012759) |
| shard | 0.029524 / 0.014526 (0.014998) |
| shuffle | 0.033417 / 0.176557 (-0.143140) |
| sort | 0.155654 / 0.737135 (-0.581481) |
| train_test_split | 0.033441 / 0.296338 (-0.262897) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.526240 / 0.215209 (0.311031) |
| read 50000 | 5.182572 / 2.077655 (3.104917) |
| read_batch 50000 10 | 2.463106 / 1.504120 (0.958986) |
| read_batch 50000 100 | 2.171269 / 1.541195 (0.630075) |
| read_batch 50000 1000 | 2.123773 / 1.468490 (0.655283) |
| read_formatted numpy 5000 | 0.506553 / 4.584777 (-4.078223) |
| read_formatted pandas 5000 | 6.647253 / 3.745712 (2.901541) |
| read_formatted tensorflow 5000 | 4.033713 / 5.269862 (-1.236149) |
| read_formatted torch 5000 | 1.630333 / 4.565676 (-2.935343) |
| read_formatted_batch numpy 5000 10 | 0.060997 / 0.424275 (-0.363278) |
| read_formatted_batch numpy 5000 1000 | 0.006455 / 0.007607 (-0.001152) |
| shuffled read 5000 | 0.677295 / 0.226044 (0.451251) |
| shuffled read 50000 | 6.837012 / 2.268929 (4.568084) |
| shuffled read_batch 50000 10 | 3.203865 / 55.444624 (-52.240759) |
| shuffled read_batch 50000 100 | 2.503126 / 6.876477 (-4.373351) |
| shuffled read_batch 50000 1000 | 2.614579 / 2.142072 (0.472506) |
| shuffled read_formatted numpy 5000 | 0.701083 / 4.805227 (-4.104145) |
| shuffled read_formatted_batch numpy 5000 10 | 0.153609 / 6.500664 (-6.347055) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.070260 / 0.075469 (-0.005209) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 16.088633 / 1.841788 (14.246845) |
| map fast-tokenizer batched | 16.372860 / 8.074308 (8.298552) |
| map identity | 41.657975 / 10.191392 (31.466583) |
| map identity batched | 0.993223 / 0.680424 (0.312799) |
| map no-op batched | 0.754343 / 0.534201 (0.220142) |
| map no-op batched numpy | 0.287690 / 0.579283 (-0.291593) |
| map no-op batched pandas | 0.717119 / 0.434364 (0.282755) |
| map no-op batched pytorch | 0.237212 / 0.540337 (-0.303125) |
| map no-op batched tensorflow | 1.174768 / 1.386936 (-0.212168) |
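
Each benchmark entry above is reported as `new / old (diff)`. The comment does not say how the diff is derived; the numbers are consistent with `new - old`, with an occasional mismatch in the last printed digit that may come from rounding. Under that assumption, a minimal Python sketch for parsing a row of entries and checking the relationship could look like this (the `Timing` class, `parse_cells` helper, and tolerance are hypothetical, not part of the benchmark tooling):

```python
import re
from dataclasses import dataclass
from typing import List

# Matches one benchmark entry such as "0.016820 / 0.018006 (-0.001186)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


@dataclass
class Timing:
    new: float
    old: float
    diff: float

    def is_consistent(self, tol: float = 2e-6) -> bool:
        # Assumption: diff == new - old, up to rounding of the printed digits.
        return abs((self.new - self.old) - self.diff) <= tol


def parse_cells(row: str) -> List[Timing]:
    """Extract every `new / old (diff)` entry from one report row."""
    return [Timing(float(n), float(o), float(d)) for n, o, d in CELL.findall(row)]


# Entries copied from the benchmark_getitem_100B.json row for PyArrow==3.0.0.
row = (
    "0.016820 / 0.018006 (-0.001186) 0.533649 / 0.000490 (0.533159) "
    "0.002529 / 0.000200 (0.002329) 0.000415 / 0.000054 (0.000360)"
)
timings = parse_cells(row)
regressions = [t for t in timings if t.diff > 0]
```

A positive diff means the new run took a larger value than the old one; whether that counts as a regression depends on how noisy the benchmark environment is, which the comment does not state.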
