ClickHouse
ClickHouse is a fast and efficient column-oriented database for analytical workloads, making it easy to analyze Hub-hosted datasets with SQL. To get started quickly, use clickhouse-local
to run SQL queries from the command line and avoid the need to fully install ClickHouse.
Check this blog for more details about how to analyze datasets on the Hub with ClickHouse.
To start, download and install clickhouse-local
:
curl https://clickhouse.com/ | sh
For this example, youβll analyze the maharshipandya/spotify-tracks-dataset which contains information about Spotify tracks. Datasets on the Hub are stored as Parquet files and you can access it with the /parquet
endpoint:
import requests
r = requests.get("/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fparquet%3Fdataset%3Dmaharshipandya%2Fspotify-tracks-dataset%26quot%3B%3C%2Fspan%3E)
j = r.json()
url = [f['url'] for f in j['parquet_files']]
url
['https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet']
Aggregate functions
Now you can begin to analyze the dataset. Use the -q
argument to specify the query to execute, and the url
function to create a table from the data in the Parquet file.
You should set enable_url_encoding
to 0 to ensure the escape characters in the URL are preserved as intended, and max_https_get_redirects
to 1 to redirect to the path of the Parquet file.
Letβs start by identifying the most popular artists:
./clickhouse local -q "
SELECT count() AS c, artists
FROM url('https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet')
GROUP BY artists
ORDER BY c
DESC LIMIT 5
SETTINGS enable_url_encoding=0, max_http_get_redirects=1"
ββββcββ¬βartistsββββββββββ
β 279 β The Beatles β
β 271 β George Jones β
β 236 β Stevie Wonder β
β 224 β Linkin Park β
β 222 β Ella Fitzgerald β
βββββββ΄ββββββββββββββββββ
ClickHouse also provides functions for visualizing your queries. For example, you can use the bar
function to create a bar chart of the danceability of songs:
./clickhouse local -q "
SELECT
round(danceability, 1) AS danceability,
bar(count(), 0, max(count()) OVER ()) AS dist
FROM url('https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet')
GROUP BY danceability
ORDER BY danceability ASC
SETTINGS enable_url_encoding=0, max_http_get_redirects=1"
ββdanceabilityββ¬βdistββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 0 β β β
β 0.1 β βββββ β
β 0.2 β ββββββββββββββ β
β 0.3 β ββββββββββββββββββββββββ β
β 0.4 β βββββββββββββββββββββββββββββββββββββββββββββ β
β 0.5 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 0.6 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 0.7 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 0.8 β ββββββββββββββββββββββββββββββββββββββββββ β
β 0.9 β βββββββββββ β
β 1 β β β
ββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
To get a deeper understanding about a dataset, ClickHouse provides statistical analysis functions for determining how your data is correlated, calculating statistical hypothesis tests, and more. Take a look at ClickHouseβs List of Aggregate Functions for a complete list of available aggregate functions.
User-defined function (UDFs)
A user-defined function (UDF) allows you to reuse custom logic. Many Hub datasets are often sharded into more than one Parquet file, so it can be easier and more efficient to create a UDF to list and query all the Parquet files of a given dataset from just the dataset name.
For this example, youβll need to run clickhouse-local
in console mode so the UDF persists between queries:
./clickhouse local
Remember to set enable_url_encoding
to 0 and max_https_get_redirects
to 1 to redirect to the path of the Parquet files:
SET max_http_get_redirects = 1, enable_url_encoding = 0
Letβs create a function to return a list of Parquet files from the tasksource/blog_authorship_corpus
:
CREATE OR REPLACE FUNCTION hugging_paths AS dataset -> (
SELECT arrayMap(x -> (x.1), JSONExtract(json, 'parquet_files', 'Array(Tuple(url String))'))
FROM url('/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fparquet%3Fdataset%3D%26%23x27%3B%3C%2Fspan%3E || dataset, 'JSONAsString')
);
SELECT hugging_paths('tasksource/blog_authorship_corpus') AS paths
['https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet','https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet']
You can make this even easier by creating another function that calls hugging_paths
and outputs all the files based on the dataset name:
CREATE OR REPLACE FUNCTION hf AS dataset -> (
WITH hugging_paths(dataset) as urls
SELECT multiIf(length(urls) = 0, '', length(urls) = 1, urls[1], 'https://huggingface.co/datasets/{' || arrayStringConcat(arrayMap(x -> replaceRegexpOne(replaceOne(x, 'https://huggingface.co/datasets/', ''), '\\.parquet$', ''), urls), ',') || '}.parquet')
);
SELECT hf('tasksource/blog_authorship_corpus') AS pattern
https://huggingface.co/datasets/{tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000,tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001}.parquet
Now use the hf
function to query any dataset by passing the dataset name:
SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length
FROM url(hf('tasksource/blog_authorship_corpus'))
GROUP BY sign
ORDER BY avg_blog_length
DESC LIMIT(5)
βββββββββββββ¬βββββββββ¬βββββββββββββββββββββ
β sign β count β avg_blog_length β
βββββββββββββΌβββββββββΌβββββββββββββββββββββ€
β Aquarius β 49687 β 1193.9523819107615 β
β Leo β 53811 β 1186.0665291483153 β
β Cancer β 65048 β 1160.8010392325666 β
β Gemini β 51985 β 1158.4132922958545 β
β Vurgi β 60399 β 1142.9977648636566 β
βββββββββββββ΄βββββββββ΄βββββββββββββββββββββ