List splits and subsets
Datasets typically have splits and may also have subsets. A split is a subset of the dataset, like train and test, that is used during a different stage of training and evaluating a model. A subset (also called a configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets, where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the conceptual guide on “Splits and subsets”!
This guide shows you how to use the dataset viewer’s /splits endpoint to retrieve a dataset’s splits and subsets programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.
The /splits endpoint accepts the dataset name as its query parameter:
import requests

# API_TOKEN must hold your Hugging Face access token; it is not defined in this snippet.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/splits?dataset=ibm/duorc"

def query():
    # Send a GET request to the /splits endpoint and decode the JSON body
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
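To query a different dataset, the dataset name can be filled in by a small helper instead of being hard-coded in the URL. The following is a minimal sketch under stated assumptions: the helper name `build_splits_url` is hypothetical and not part of the dataset viewer, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL of the /splits endpoint used in this guide
BASE_URL = "https://datasets-server.huggingface.co/splits"


def build_splits_url(dataset: str) -> str:
    """Return the /splits URL for a dataset name (hypothetical helper)."""
    # urlencode percent-encodes characters such as the "/" in "ibm/duorc"
    return f"{BASE_URL}?{urlencode({'dataset': dataset})}"


print(build_splits_url("ibm/duorc"))
```

The resulting string can be passed to `requests.get` in place of a fixed `API_URL`.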
The endpoint returns a JSON object containing a list of the dataset’s splits and subsets. For example, the ibm/duorc dataset has six splits and two subsets:
{
  "splits": [
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
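Each entry in the splits list pairs one subset (the config field) with one split, so it is often convenient to group the split names by subset. The sketch below does this on a copy of the ibm/duorc response; the function name `group_splits_by_config` is an illustrative assumption, not part of any library.

```python
from collections import defaultdict


def group_splits_by_config(payload: dict) -> dict:
    """Map each subset (config) name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in payload.get("splits", []):
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


# Sample payload copied from the ibm/duorc response
sample = {
    "splits": [
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}

grouped = group_splits_by_config(sample)
for config, splits in grouped.items():
    print(config, splits)
```

With a live response, the same function can be applied to the parsed JSON returned by the request instead of the hard-coded sample.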