diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a8f94355154..becf2417dba 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -126,7 +126,8 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_ 6. Finally, take some time to document your dataset for other users. Each dataset should be accompanied by a `README.md` dataset card in its directory which describes the data and contains tags representing languages and tasks supported to be easily discoverable. You can find information on how to fill out the card either manually or by using our [web app](https://huggingface.co/datasets/card-creator/) in the following [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md). -7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗Datasets?*](#how-to-contribute-to-🤗Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below. +7. If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section [*How to contribute to 🤗 Datasets?*](#how-to-contribute-to-Datasets). If you experience problems with the dummy data tests, you might want to take a look at the section *Help for dummy data tests* below. + ### Help for dummy data tests diff --git a/README.md b/README.md index e5450b6a5ab..d7f62b11afb 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ DOI

-`🤗Datasets` is a lightweight library providing **two** main features: +🤗 Datasets is a lightweight library providing **two** main features: - **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX), - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training. @@ -38,21 +38,21 @@ -`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. +🤗 Datasets also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. -`🤗Datasets` has many additional interesting features: -- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow). +🤗 Datasets has many additional interesting features: +- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow). - Smart caching: never wait for your data to process several times. - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping). - Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX. -`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds). +🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between-datasets-and-tfds). 
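To make the two features above concrete, here is a minimal sketch (assuming `datasets` is installed; the helper name `add_context_length` is purely illustrative):

```python
from datasets import load_dataset

# One-line dataloader: download and prepare SQuAD, ready to feed into a training loop
squad_dataset = load_dataset("squad")

# Efficient pre-processing: derive a new column without loading everything into RAM
def add_context_length(example):
    return {"context_length": len(example["context"])}

squad_with_length = squad_dataset.map(add_context_length)
print(squad_with_length["train"][0]["context_length"])
```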
# Installation ## With pip -`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance) +🤗 Datasets can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance) ```bash pip install datasets @@ -60,7 +60,7 @@ pip install datasets ## With conda -`🤗Datasets` can be installed using conda as follows: +🤗 Datasets can be installed using conda as follows: ```bash conda install -c huggingface -c conda-forge datasets @@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati ## Installation to use with PyTorch/TensorFlow/pandas -If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas. +If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas. For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html # Usage -`🤗Datasets` is made to be very simple to use. The main methods are: +🤗 Datasets is made to be very simple to use. The main methods are: - `datasets.list_datasets()` to list the available datasets - `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset @@ -106,7 +106,7 @@ squad_metric = load_metric('squad') # Process the dataset - add a column with the length of the context texts dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])}) -# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library) +# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library) from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') @@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document - Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html - What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html -- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html +- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/processing.html - Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html - etc. -Another introduction to `🤗Datasets` is the tutorial on Google Colab here: +Another introduction to 🤗 Datasets is the tutorial on Google Colab here: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb) # Add a new dataset to the Hub @@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html). 
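As a small illustration of the metric API used in the usage example above, here is a sketch of computing the SQuAD metric (feeding the gold answers back as predictions only to keep the example self-contained):

```python
from datasets import load_dataset, load_metric

squad_metric = load_metric("squad")
validation = load_dataset("squad", split="validation")

# The SQuAD metric expects ids plus a prediction_text / answers field
predictions = [{"id": ex["id"], "prediction_text": ex["answers"]["text"][0]} for ex in validation]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in validation]

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0} because the gold answers were used as predictions
```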
-# Main differences between `🤗Datasets` and `tfds` +# Main differences between 🤗 Datasets and `tfds` -If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`: -- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request -- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/). -- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache). -- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache. +If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`: +- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request +- 🤗 Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/). +- the backend serialization of 🤗 Datasets is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache). +- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache. # Disclaimers -Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. +Similar to TensorFlow Datasets, 🤗 Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community! 
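A short sketch of the framework-agnostic, Arrow-backed dataset object described in the bullet points above (cache file paths will differ from machine to machine):

```python
from datasets import load_dataset

squad = load_dataset("squad", split="train")

# The dataset wraps a memory-mapped Arrow table cached on disk
print(squad)              # column names and number of rows
print(squad.cache_files)  # the Arrow file(s) backing the dataset

# map() works on this object much like tf.data's map, without tying you to a framework
upper = squad.map(lambda example: {"title": example["title"].upper()})
print(upper[0]["title"])
```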
diff --git a/datasets/newsroom/README.md b/datasets/newsroom/README.md index 9d80091ed9b..66b89d40cdf 100644 --- a/datasets/newsroom/README.md +++ b/datasets/newsroom/README.md @@ -61,7 +61,7 @@ And additional features: - compression_bin: low, medium, high. This dataset can be downloaded upon requests. Unzip all the contents -"train.jsonl, dev.josnl, test.jsonl" to the tfds folder. +"train.jsonl, dev.jsonl, test.jsonl" to the `tfds` folder. ### Supported Tasks and Leaderboards diff --git a/datasets/norne/README.md b/datasets/norne/README.md index b17b90565f3..32eed38cd8c 100644 --- a/datasets/norne/README.md +++ b/datasets/norne/README.md @@ -238,7 +238,7 @@ To access these reduced versions of the dataset, you can use the configs `bokmaa NorNE was created as a collaboration between [Schibsted Media Group](https://schibsted.com/), [Språkbanken](https://www.nb.no/forskning/sprakbanken/) at the [National Library of Norway](https://www.nb.no) and the [Language Technology Group](https://www.mn.uio.no/ifi/english/research/groups/ltg/) at the University of Oslo. -NorNE was added to Huggingface Datasets by the AI-Lab at the National Library of Norway. +NorNE was added to 🤗 Datasets by the AI-Lab at the National Library of Norway. ### Licensing Information diff --git a/docs/source/exploring.rst b/docs/source/exploring.rst index 736fd04aa20..ff3174a59bd 100644 --- a/docs/source/exploring.rst +++ b/docs/source/exploring.rst @@ -190,7 +190,7 @@ Up to now, the rows/batches/columns returned when querying the elements of the d Sometimes we would like to have more sophisticated objects returned by our dataset, for instance NumPy arrays or PyTorch tensors instead of python lists. -🤗Datasets provides a way to do that through what is called a ``format``. +🤗 Datasets provides a way to do that through what is called a ``format``. While the internal storage of the dataset is always the Apache Arrow format, by setting a specific format on a dataset, you can filter some columns and cast the output of :func:`datasets.Dataset.__getitem__` in NumPy/pandas/PyTorch/TensorFlow, on-the-fly. diff --git a/docs/source/index.rst b/docs/source/index.rst index 39d1bc41d05..68d345bcad2 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing Compatible with NumPy, Pandas, PyTorch and TensorFlow -🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP). +🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP). -🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics): +🤗 Datasets has many interesting features (besides easy sharing and accessing datasets/metrics): Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2 Lightweight and fast with a transparent and pythonic API -Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default. +Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default. Smart caching: never wait for your data to process several times -🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
You can browse the full set of datasets with the live 🤗Datasets viewer. +🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer. -🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds. +🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`. Contents --------------------------------- diff --git a/docs/source/installation.md b/docs/source/installation.md index aec9f0871bd..61574b6121d 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,21 +1,21 @@ # Installation -🤗Datasets is tested on Python 3.6+. +🤗 Datasets is tested on Python 3.6+. -You should install 🤗Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're +You should install 🤗 Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going to use and activate it. -Now, if you want to use 🤗Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source. +Now, if you want to use 🤗 Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source. ## Installation with pip -🤗Datasets can be installed using pip as follows: +🤗 Datasets can be installed using pip as follows: ```bash pip install datasets ``` -To check 🤗Datasets is properly installed, run the following command: +To check 🤗 Datasets is properly installed, run the following command: ```bash python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" @@ -27,7 +27,7 @@ It should download version 1 of the [Stanford Question Answering Dataset](https: {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. 
At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'} ``` -If you want to use the 🤗Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately. +If you want to use the 🤗 Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these separately. Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform. @@ -48,11 +48,11 @@ Again, you can run: python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" ``` -to check 🤗Datasets is properly installed. +to check 🤗 Datasets is properly installed. ## With conda -🤗Datasets can be installed using conda as follows: +🤗 Datasets can be installed using conda as follows: ```bash conda install -c huggingface -c conda-forge datasets diff --git a/docs/source/loading_datasets.rst b/docs/source/loading_datasets.rst index a9f805bc6b8..4d3543bbe38 100644 --- a/docs/source/loading_datasets.rst +++ b/docs/source/loading_datasets.rst @@ -12,7 +12,7 @@ In this section we study each option. From the HuggingFace Hub ------------------------------------------------- -Over 135 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `HuggingFace Hub `__ and can be viewed and explored online with the `🤗Datasets viewer `__. +Over 135 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `HuggingFace Hub `__ and can be viewed and explored online with the `🤗 Datasets viewer `__. .. note:: @@ -61,12 +61,12 @@ This call to :func:`datasets.load_dataset` does the following steps under the ho .. note:: - An Apache Arrow Table is the internal storing format for 🤗Datasets. It allows to store arbitrarily long dataframe, + An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows to store arbitrarily long dataframe, typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory (RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_dataset` to ``True``. - The default in 🤗Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE`` + The default in 🤗 Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE`` different from ``0`` bytes (default). In that case, the dataset will be copied in-memory if its size is smaller than ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped otherwise.
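For illustration, a sketch of the memory-mapped vs. in-memory options described in this note (the 250 MB threshold is an arbitrary example value):

```python
import datasets
from datasets import load_dataset

# Default: the dataset is memory-mapped from its Arrow cache on disk
squad_mmap = load_dataset("squad", split="train")

# Copy this particular dataset into RAM instead
squad_in_ram = load_dataset("squad", split="train", keep_in_memory=True)

# Or copy any dataset smaller than ~250 MB into RAM automatically
datasets.config.IN_MEMORY_MAX_SIZE = 250 * 1024 ** 2
```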
This behavior can be enabled by setting either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the environment @@ -187,7 +187,7 @@ Let's see an example of all the various ways you can provide files to :func:`dat CSV files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -🤗Datasets can read a dataset made of on or several CSV files. +🤗 Datasets can read a dataset made of one or several CSV files. All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns. @@ -224,7 +224,7 @@ If you want more control, the ``csv`` script provide full control on reading, pa JSON files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -🤗Datasets supports building a dataset from JSON files in various format. +🤗 Datasets supports building a dataset from JSON files in various formats. The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows: @@ -268,7 +268,7 @@ In this case you will need to specify which field contains the dataset using the Text files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -🤗Datasets also supports building a dataset from text files read line by line (each line will be a row in the dataset). +🤗 Datasets also supports building a dataset from text files read line by line (each line will be a row in the dataset). This is simply done using the ``text`` loading script which will generate a dataset with a single column called ``text`` containing all the text lines of the input files as strings. @@ -430,7 +430,7 @@ For example, you can run the following if you want to force the re-download of t Integrity verifications ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -When downloading a dataset from the 🤗 dataset hub, the :func:`datasets.load_dataset` function performs by default a number of verifications on the downloaded files. These verifications include: +When downloading a dataset from the 🤗 Datasets Hub, the :func:`datasets.load_dataset` function performs by default a number of verifications on the downloaded files. These verifications include: - Verifying the list of downloaded files - Verifying the number of bytes of the downloaded files @@ -453,7 +453,7 @@ For example, run the following to skip integrity verifications when loading the Loading datasets offline ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗Datasets GitHub repository or from the `HuggingFace Hub `__. +Each dataset builder (e.g. "squad") is a python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub `__. Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads. Therefore if you don't have an internet connection you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached. diff --git a/docs/source/loading_metrics.rst b/docs/source/loading_metrics.rst index a98de22dedd..9945eaa1bcf 100644 --- a/docs/source/loading_metrics.rst +++ b/docs/source/loading_metrics.rst @@ -71,7 +71,7 @@ This call to :func:`datasets.load_metric` does the following steps under the hoo .. note:: - The :class:`datasets.Metric` object uses Apache Arrow Tables as the internal storing format for predictions and references.
It allows to store predictions and references directly on disk with memory-mapping and thus do lazy computation of the metrics, in particular to easily gather the predictions in a distributed setup. The default in 🤗Datasets is to always memory-map metrics data on drive. + The :class:`datasets.Metric` object uses Apache Arrow Tables as the internal storing format for predictions and references. It allows to store predictions and references directly on disk with memory-mapping and thus do lazy computation of the metrics, in particular to easily gather the predictions in a distributed setup. The default in 🤗 Datasets is to always memory-map metrics data on drive. Using a custom metric script ----------------------------------------------------------- diff --git a/docs/source/package_reference/logging_methods.rst b/docs/source/package_reference/logging_methods.rst index cde8631f958..7ca4cb757b3 100644 --- a/docs/source/package_reference/logging_methods.rst +++ b/docs/source/package_reference/logging_methods.rst @@ -1,7 +1,7 @@ Logging methods ---------------------------------------------------- -🤗Datasets tries to be very transparent and explicit about its inner working, but this can be quite verbose at times. +🤗 Datasets tries to be very transparent and explicit about its inner working, but this can be quite verbose at times. A series of logging methods let you easily adjust the level of verbosity of the whole library. diff --git a/docs/source/processing.rst b/docs/source/processing.rst index efa06eba45d..856b9882ab0 100644 --- a/docs/source/processing.rst +++ b/docs/source/processing.rst @@ -1,7 +1,7 @@ Processing data in a Dataset ============================================================== -🤗Datasets provides many methods to modify a Dataset, be it to reorder, split or shuffle the dataset or to apply data processing functions or evaluation functions to its elements. +🤗 Datasets provides many methods to modify a Dataset, be it to reorder, split or shuffle the dataset or to apply data processing functions or evaluation functions to its elements. We'll start by presenting the methods which change the order or number of elements before presenting methods which access and can change the content of the elements themselves. @@ -22,7 +22,7 @@ As always, let's start by loading a small dataset for our demonstrations: A subsequent call to any of the methods detailed here (like :func:`datasets.Dataset.sort`, :func:`datasets.Dataset.map`, etc) will thus **reuse the cached file instead of recomputing the operation** (even in another python session). - This usually makes it very efficient to process data with 🤗Datasets. + This usually makes it very efficient to process data with 🤗 Datasets. If the disk space is critical, these methods can be called with arguments to avoid this behavior (see the last section), or the cache files can be cleaned using the method :func:`datasets.Dataset.cleanup_cache_files`. @@ -391,7 +391,7 @@ To operate on batch of example, just set :obj:`batched=True` when calling :func: In other words, the mapped function should accept an input with the format of a slice of the dataset: :obj:`function(dataset[:10])`. -Let's take an example with a fast tokenizer of the 🤗Transformers library. +Let's take an example with a fast tokenizer of the 🤗 Transformers library. 
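Condensed, the batched tokenization walked through in the next paragraphs looks roughly like the sketch below (assuming the GLUE/MRPC dataset used in this guide):

```python
from datasets import load_dataset
from transformers import BertTokenizerFast

dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# With batched=True the mapped function receives a slice of the dataset (lists of values)
encoded = dataset.map(lambda batch: tokenizer(batch["sentence1"]), batched=True)
print(encoded.column_names)  # original columns plus input_ids, token_type_ids, attention_mask
```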
First install this library if you haven't already done it: @@ -406,9 +406,9 @@ Then we will import a fast tokenizer, for instance the tokenizer of the Bert mod >>> from transformers import BertTokenizerFast >>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased') -Now let's batch tokenize the ``sentence1`` fields of our dataset. The tokenizers of the 🤗Transformers library can accept lists of texts as inputs and tokenize them efficiently in batch (for the fast tokenizers in particular). +Now let's batch tokenize the ``sentence1`` fields of our dataset. The tokenizers of the 🤗 Transformers library can accept lists of texts as inputs and tokenize them efficiently in batch (for the fast tokenizers in particular). -For more details on the tokenizers of the 🤗Transformers library please refer to its `guide on processing data `__. +For more details on the tokenizers of the 🤗 Transformers library please refer to its `guide on processing data `__. This tokenizer will output a dictionary-like object with three fields: ``input_ids``, ``token_type_ids``, ``attention_mask`` corresponding to Bert model's required inputs. Each field contains a list (batch) of samples. @@ -496,7 +496,7 @@ As we can see, our dataset is now much longer (10470 row) and contains a single Now let's finish with the other example and try to do some data augmentation. We will use a Roberta model to sample some masked tokens. -Here we can use the `FillMaskPipeline of 🤗Transformers `__ to generate options for a masked token in a sentence. +Here we can use the `FillMaskPipeline of 🤗 Transformers `__ to generate options for a masked token in a sentence. We will randomly select a word to mask in the sentence and return the original sentence plus the two top replacements by Roberta. @@ -564,7 +564,7 @@ You can directly call map, filter, shuffle, and sort directly on a :obj:`dataset 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] } -This concludes our chapter on data processing with 🤗Datasets (and 🤗Transformers). +This concludes our chapter on data processing with 🤗 Datasets (and 🤗 Transformers). Concatenate several datasets ---------------------------- diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst index 664e769ac18..d74f5962842 100644 --- a/docs/source/quicktour.rst +++ b/docs/source/quicktour.rst @@ -1,13 +1,13 @@ Quick tour ========== -Let's have a quick look at the 🤗Datasets library. This library has three main features: +Let's have a quick look at the 🤗 Datasets library. This library has three main features: - It provides a very **efficient way to load and process data** from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficiency and speed. As a matter of example, loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python. - It provides a very **simple way to access and share datasets** with the research and practitioner communities (over 130 NLP datasets are already accessible in one line with the library as we'll see below). - It was designed with a particular focus on interoperabilty with frameworks like **pandas, NumPy, PyTorch and TensorFlow**. -🤗Datasets provides datasets for many NLP tasks like text classification, question answering, language modeling, etc., and obviously these datasets can always be used for other tasks than their originally assigned task. 
Let's list all the currently provided datasets using :func:`datasets.list_datasets`: +🤗 Datasets provides datasets for many NLP tasks like text classification, question answering, language modeling, etc., and obviously these datasets can always be used for other tasks than their originally assigned task. Let's list all the currently provided datasets using :func:`datasets.list_datasets`: .. code-block:: @@ -28,7 +28,7 @@ Let's have a quick look at the 🤗Datasets library. This library has three main wikihow, wikipedia, wikisql, wikitext, winogrande, wiqa, wmt14, wmt15, wmt16, wmt17, wmt18, wmt19, wmt_t2t, wnut_17, x_stance, xcopa, xnli, xquad, xsum, xtreme, yelp_polarity -All these datasets can also be browsed on the `HuggingFace Hub `__ and can be viewed and explored online with the `🤗Datasets viewer `__. +All these datasets can also be browsed on the `HuggingFace Hub `__ and can be viewed and explored online with the `🤗 Datasets viewer `__. Loading a dataset -------------------- @@ -107,7 +107,7 @@ We can print one example of each class using :func:`datasets.Dataset.filter` and Now our goal will be to train a model which can predict the correct label (``not_equivalent`` or ``equivalent``) from a pair of sentences. -Let's import a pretrained Bert model and its tokenizer using 🤗Transformers. +Let's import a pretrained Bert model and its tokenizer using 🤗 Transformers. .. code-block:: @@ -130,8 +130,8 @@ Let's import a pretrained Bert model and its tokenizer using 🤗Transformers. You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') -🤗Transformers warns us that we should probably train this model on a downstream task before using it which is exactly what we are going to do. -If you want more details on the models and tokenizers of 🤗Transformers, you should refer to the documentation and tutorials of this library `which are available here `__. +🤗 Transformers warns us that we should probably train this model on a downstream task before using it which is exactly what we are going to do. +If you want more details on the models and tokenizers of 🤗 Transformers, you should refer to the documentation and tutorials of this library `which are available here `__. Tokenizing the dataset ^^^^^^^^^^^^^^^^^^^^^^ @@ -148,7 +148,7 @@ The first step is to tokenize our sentences in order to build sequences of integ >>> tokenizer.decode(tokenizer(dataset[0]['sentence1'], dataset[0]['sentence2'])['input_ids']) '[CLS] Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence. [SEP] Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence. [SEP]' -As you can see, the tokenizer has merged the pair of sequences in a single input separating them by some special tokens ``[CLS]`` and ``[SEP]`` expected by Bert. For more details on this, you can refer to `🤗Transformers's documentation on data processing `__. +As you can see, the tokenizer has merged the pair of sequences in a single input separating them by some special tokens ``[CLS]`` and ``[SEP]`` expected by Bert. For more details on this, you can refer to `🤗 Transformers's documentation on data processing `__. In our case, we want to tokenize our full dataset, so we will use a method called :func:`datasets.Dataset.map` to apply the encoding process to the whole dataset. 
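Roughly, the encoding and formatting steps described around here come down to the following sketch (MRPC column names as in this quick tour; the padding/truncation settings and the renamed `labels` column are illustrative choices):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Encode the sentence pairs for the whole dataset; truncate/pad so batches stack cleanly
def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, padding="max_length")

encoded = dataset.map(encode, batched=True)

# Expose PyTorch tensors for the columns the model needs
encoded = encoded.rename_column("label", "labels")
encoded.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
print(encoded[0]["input_ids"].shape)
```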
To be sure we can easily build tensor batches for our model, we will truncate and pad the inputs to the max length of our model. @@ -188,7 +188,7 @@ To be able to train our model with this dataset and PyTorch, we will need to do .. note:: - We don't want the columns `sentence1` or `sentence2` as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model. 🤗Datasets let you control the output format of :func:`datasets.Dataset.__getitem__` to just mask them as detailed in :doc:`exploring <./exploring>`. + We don't want the columns `sentence1` or `sentence2` as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model. 🤗 Datasets let you control the output format of :func:`datasets.Dataset.__getitem__` to just mask them as detailed in :doc:`exploring <./exploring>`. The first modification is just a matter of renaming the column as follows (we could have done it during the tokenization process as well): diff --git a/docs/source/share_dataset.rst b/docs/source/share_dataset.rst index 65afdeea971..c03f30c19d9 100644 --- a/docs/source/share_dataset.rst +++ b/docs/source/share_dataset.rst @@ -3,7 +3,7 @@ Sharing your dataset Once you've written a new dataset loading script as detailed on the :doc:`add_dataset` page, you may want to share it with the community for instance on the `HuggingFace Hub `__. There are two options to do that: -- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗Datasets `__, +- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets `__, - directly upload it on the Hub as a community provided dataset. Here are the main differences between these two options. @@ -31,7 +31,7 @@ Sharing a "canonical" dataset To add a "canonical" dataset to the library, you need to go through the following steps: -**1. Fork the** `🤗Datasets repository `__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account. +**1. Fork the** `🤗 Datasets repository `__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account. **2. Clone your fork** to your local disk, and add the base repository as a remote: @@ -60,7 +60,7 @@ To add a "canonical" dataset to the library, you need to go through the followin .. note:: - If 🤗Datasets was already installed in the virtual environment, remove + If 🤗 Datasets was already installed in the virtual environment, remove it with ``pip uninstall datasets`` before reinstalling it in editable mode with the ``-e`` flag. @@ -103,7 +103,7 @@ Push the changes to your account using: Sharing a "community provided" dataset ----------------------------------------- -In this page, we will show you how to share a dataset with the community on the `datasets hub `__. +In this page, we will show you how to share a dataset with the community on the `🤗 Datasets Hub `__. .. note:: @@ -115,12 +115,12 @@ Prepare your dataset for uploading ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We have seen in the :doc:`dataset script tutorial `: how to write a dataset loading script. Let's see how you can share it on the -`datasets hub `__. +`🤗 Datasets Hub `__. 
Dataset versioning ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Since version 2.0, the datasets hub has built-in dataset versioning based on git and git-lfs. It is based on the paradigm +Since version 2.0, the 🤗 Datasets Hub has built-in dataset versioning based on git and git-lfs. It is based on the paradigm that one dataset *is* one repo. This allows: @@ -144,7 +144,7 @@ For instance: Basic steps ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -In order to upload a dataset, you'll need to first create a git repo. This repo will live on the datasets hub, allowing +In order to upload a dataset, you'll need to first create a git repo. This repo will live on the 🤗 Datasets Hub, allowing users to clone it and you (and your organization members) to push to it. You can create a dataset repo directly from `the /new-dataset page on the website `__. @@ -159,7 +159,7 @@ Datasets, since that command :obj:`huggingface-cli` comes from the library. huggingface-cli login -Once you are logged in with your datasets hub credentials, you can start building your repositories. To create a repo: +Once you are logged in with your 🤗 Datasets Hub credentials, you can start building your repositories. To create a repo: .. code-block:: bash @@ -173,7 +173,7 @@ If you want to create a repo under a specific organization, you should add a `-- huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name -This creates a repo on the datasets hub, which can be cloned. +This creates a repo on the 🤗 Datasets Hub, which can be cloned. .. code-block:: bash @@ -204,7 +204,7 @@ Additionally, if you want to change multiple repos at once, the `change_config.p `__ can probably save you some time. -Check the directory before pushing to the datasets hub. +Check the directory before pushing to the 🤗 Datasets Hub. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Make sure there are no garbage files in the directory you'll upload. It should only have: diff --git a/docs/source/torch_tensorflow.rst b/docs/source/torch_tensorflow.rst index abb24900646..9154b23e417 100644 --- a/docs/source/torch_tensorflow.rst +++ b/docs/source/torch_tensorflow.rst @@ -3,7 +3,7 @@ Using a Dataset with PyTorch/Tensorflow Once your dataset is processed, you often want to use it with a framework such as PyTorch, Tensorflow, Numpy or Pandas. For instance we may want to use our dataset in a ``torch.Dataloader`` or a ``tf.data.Dataset`` and train a model with it. -🤗Datasets provides a simple way to do this through what is called the format of a dataset. +🤗 Datasets provides a simple way to do this through what is called the format of a dataset. The format of a :class:`datasets.Dataset` instance defines which columns of the dataset are returned by the :func:`datasets.Dataset.__getitem__` method and cast them in PyTorch, Tensorflow, Numpy or Pandas types. diff --git a/notebooks/Overview.ipynb b/notebooks/Overview.ipynb index fc140bcc30a..6a7de539e4d 100644 --- a/notebooks/Overview.ipynb +++ b/notebooks/Overview.ipynb @@ -2503,11 +2503,11 @@ "id": "zNp6kK7OvSUg" }, "source": [ - "# HuggingFace `🤗Datasets` library - Quick overview\n", + "# HuggingFace 🤗 Datasets library - Quick overview\n", "\n", "Models come and go (linear models, LSTM, Transformers, ...) 
but two core elements have consistently been the beating heart of Natural Language Processing: Datasets & Metrics\n", "\n", - "`🤗Datasets` is a fast and efficient library to easily share and load dataset and evaluation metrics, already providing access to 150+ datasets and 12+ evaluation metrics.\n", + "🤗 Datasets is a fast and efficient library to easily share and load datasets and evaluation metrics, already providing access to 150+ datasets and 12+ evaluation metrics.\n", "\n", "The library has several interesting features (beside easy access to datasets/metrics):\n", "\n", @@ -2516,7 +2516,7 @@ "- Strive on large datasets: frees you from RAM memory limits, all datasets are memory-mapped on drive by default.\n", "- Smart caching with an intelligent `tf.data`-like cache: never wait for your data to process several times\n", "\n", - "`🤗Datasets` originated from a fork of the awesome Tensorflow-Datasets and the HuggingFace team want to deeply thank the team behind this amazing library and user API. We have tried to keep a layer of compatibility with `tfds` and a conversion can provide conversion from one format to the other." + "🤗 Datasets originated from a fork of the awesome Tensorflow-Datasets and the HuggingFace team want to deeply thank the team behind this amazing library and user API. We have tried to keep a layer of compatibility with `tfds` and a conversion script can convert from one format to the other." ] }, { @@ -3210,8 +3210,8 @@ "outputId": "5eb1eb3b-0f77-4935-c5f7-5f34747d0af7" }, "source": [ - "print(f\"👉Dataset len(dataset): {len(dataset)}\")\n", - "print(\"\\n👉First item 'dataset[0]':\")\n", + "print(f\"👉 Dataset len(dataset): {len(dataset)}\")\n", + "print(\"\\n👉 First item 'dataset[0]':\")\n", "pprint(dataset[0])" ], "execution_count": null, @@ -3219,9 +3219,9 @@ { "output_type": "stream", "text": [ - "👉Dataset len(dataset): 1057\n", + "👉 Dataset len(dataset): 1057\n", "\n", - "👉First item 'dataset[0]':\n", + "👉 First item 'dataset[0]':\n", "{'answers': {'answer_start': [177, 177, 177],\n", " 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},\n", " 'context': 'Super Bowl 50 was an American football game to determine the '\n", @@ -3587,7 +3587,7 @@ "id": "Z4Fjr0DJawuS" }, "source": [ - "The above examples was a bit verbose. We can control the logging level of `🤗Datasets` with it's logging module:\n" + "The above example was a bit verbose. We can control the logging level of 🤗 Datasets with its logging module:\n" ] }, {