1  HF Datasets

Hugging Face Datasets is a Python library for downloading and streaming datasets hosted on the Hugging Face Hub.

1.1 Installing

pip install datasets "huggingface_hub[hf_xet]"

1.3 Downloading a dataset

A Dataset object isn’t really a dataset. It’s a split of a dataset or a split of a subset of a dataset (Figure 1.1).

flowchart
    Dataset --has many--> Split
    Dataset --has many--> Subset
    Subset --has many--> Split
    Split == equivalent === d[Dataset object]
Figure 1.1
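
Passing split= to load_dataset returns that single split, and streaming=True reads it lazily instead of downloading it first:

from datasets import load_dataset
ds = load_dataset("openslr/librispeech_asr", split="validation.clean", streaming=True)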

This is shorthand for:

from datasets import load_dataset_builder
# Resolve the dataset builder first, then open one split as a stream.
builder = load_dataset_builder("openslr/librispeech_asr")
ds = builder.as_streaming_dataset(split="validation.clean")