Dataset viewer documentation

DuckDB

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

DuckDB

DuckDB is a database that supports reading and querying Parquet files really fast. Begin by creating a connection to DuckDB, and then install and load the httpfs extension to read and write remote files:

Python
JavaScript
import duckdb

url = "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

Now you can write and execute your SQL query on the Parquet file:

Python
JavaScript
con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   sign    β”‚ count_star() β”‚  avg_blog_length   β”‚
β”‚  varchar  β”‚    int64     β”‚       double       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Cancer    β”‚        38956 β”‚ 1206.5212034089743 β”‚
β”‚ Leo       β”‚        35487 β”‚ 1180.0673767858652 β”‚
β”‚ Aquarius  β”‚        32723 β”‚ 1152.1136815084192 β”‚
β”‚ Virgo     β”‚        36189 β”‚ 1117.1982094006466 β”‚
β”‚ Capricorn β”‚        31825 β”‚  1102.397360565593 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

To query multiple files - for example, if the dataset is sharded:

Python
JavaScript
urls = ["https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet"]

con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls}) GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   sign   β”‚ count_star() β”‚  avg_blog_length   β”‚
β”‚ varchar  β”‚    int64     β”‚       double       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Aquarius β”‚        49687 β”‚  1191.417211745527 β”‚
β”‚ Leo      β”‚        53811 β”‚ 1183.8782219248853 β”‚
β”‚ Cancer   β”‚        65048 β”‚ 1158.9691612347804 β”‚
β”‚ Gemini   β”‚        51985 β”‚ 1156.0693084543618 β”‚
β”‚ Virgo    β”‚        60399 β”‚ 1140.9584430205798 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB-Wasm, a package powered by WebAssembly, is also available for running DuckDB in any browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!

< > Update on GitHub