Parquet in Action: A Beginner's Guide
Showcasing the Parquet Format Efficiently via an Exercise
Today, we are going to learn about how Parquet files work through an interactive exercise. Our goal is to retrieve the schema of a large Parquet file while downloading as little data as possible.
We are going to analyze the file remotely from Hugging Face without downloading anything locally.
Let's take a look at the fineweb-edu dataset, which has 1.3T tokens of educational web pages used to train large language models.
This is a huge dataset! We can also see that there are lots of Parquet files in this dataset, partitioned by date. We are going to use the CC-MAIN-2013-20/train-00000-of-00014.parquet file, which can be found here.
This file is 2.37 GB, but we want to extract the metadata and schema without having to download the entire file.
Overview of Parquet
Let's take a brief look at the general format of a Parquet file.
- Row Groups: Horizontal partitions of the data that group rows together. They allow for efficient querying and parallel processing of large datasets.
- Column Chunks: Vertical slices of data within each row group, containing values from a specific column. This columnar storage enables efficient compression and query performance.
Image Credit: Clickhouse
- Schema: Metadata that describes the layout and types of the columns in the Parquet file.
- Magic Bytes: A sequence of 4 bytes at the beginning and end of a Parquet file that identifies it as Parquet. The bytes spell PAR1.
Pay close attention to the last bit of the file which is the footer metadata. That is the most important part for us to retrieve the schema.
There are three components of the footer.
- The file metadata (n bytes, immediately before the footer size field)
- The footer size (4 bytes, immediately before the trailing magic bytes)
- The magic bytes (the last 4 bytes of the file: PAR1)
Magic bytes are really neat. Essentially, they are a convention for identifying a file type very quickly. You can think of them as a signature for the file. Here's a really good list of the magic bytes used by different file types.
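To make the footer layout concrete, here is a minimal offline sketch. It builds a synthetic byte string that only mimics the layout described above (the `fake_metadata` payload is a placeholder, not real Parquet metadata) and then walks backwards from the end of the "file". The helper name `split_footer` is hypothetical, and the footer size is assumed to be stored as a little-endian 32-bit integer.

```python
import struct

MAGIC = b"PAR1"

def split_footer(data: bytes) -> tuple[int, bytes]:
    """Return (footer_size, metadata_bytes) for a Parquet-shaped byte string."""
    # Smallest possible layout: leading magic + footer size + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the metadata length.
    (footer_size,) = struct.unpack("<I", data[-8:-4])
    metadata_start = len(data) - 8 - footer_size
    if metadata_start < 4:
        raise ValueError("footer size is larger than the file allows")
    return footer_size, data[metadata_start:-8]

# Synthetic stand-in for a real file: magic, "row groups", metadata, size, magic.
fake_row_groups = b"\x00" * 32
fake_metadata = b"placeholder-metadata"
fake_file = (
    MAGIC
    + fake_row_groups
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

footer_size, metadata = split_footer(fake_file)
print(footer_size, metadata)
```

The same backwards walk is what we will do remotely: read the tail, learn the footer size, then fetch exactly that many bytes.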
Getting File Size Remotely with a HEAD Request
Let's start by sending a HEAD request to the URL. This should give us some metadata about the file without downloading the entire file.
import requests
url = "https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/resolve/main/data/CC-MAIN-2013-20/train-00000-of-00014.parquet?download=true"
# Get file content length with HEAD request
head_response = requests.head(url, allow_redirects=True)
file_size = int(head_response.headers['Content-Length'])
The HEAD request will only return the response headers with no actual content. You should see the response headers below:
Header | Value | Description |
---|---|---|
Content-Type | binary/octet-stream | Indicates the file is a binary file |
Content-Length | 2369456837 | Size of the file in bytes (2.37 GB) |
Accept-Ranges | bytes | File supports partial content requests; specific byte ranges can be requested |
What does this show?
It shows that we can use the Range header in our HTTP requests to read specific byte ranges. This gives us the power to request only the byte range of the footer.
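Before relying on range requests, it can be worth checking this header programmatically. The sketch below assumes the response headers are available as a mapping (with requests, `head_response.headers` should work as well); `supports_byte_ranges` is a hypothetical helper name, not part of any library.

```python
from collections.abc import Mapping

def supports_byte_ranges(headers: Mapping[str, str]) -> bool:
    """True if the server advertises byte-range support via Accept-Ranges."""
    # Header names are case-insensitive, so normalise keys before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get("accept-ranges", "none").strip().lower() == "bytes"

# Values copied from the table above.
example_headers = {
    "Content-Type": "binary/octet-stream",
    "Content-Length": "2369456837",
    "Accept-Ranges": "bytes",
}
print(supports_byte_ranges(example_headers))
```

A server may honour Range requests without advertising them, so a missing header is a hint rather than proof.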
HTTP Range Requests
Querying a Specific Range of Bytes of a File
Here's what a Range header looks like:
Range: bytes=0-100
This would read the first 101 bytes of the file, because both ends of the range are inclusive (bytes 0 through 100).
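Off-by-one errors are easy to make with byte ranges. The following sketch uses hypothetical helpers (`byte_range`, `range_length`, `last_n_bytes`) to build header values like the ones used in the rest of this exercise; the file size is the Content-Length value from the HEAD response.

```python
def byte_range(start: int, end: int) -> dict[str, str]:
    """Range header for bytes start..end, both ends inclusive."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return {"Range": f"bytes={start}-{end}"}

def range_length(start: int, end: int) -> int:
    """Number of bytes covered by an inclusive range."""
    return end - start + 1

def last_n_bytes(file_size: int, n: int) -> dict[str, str]:
    """Range header for the final n bytes of a file of known size."""
    if not 0 < n <= file_size:
        raise ValueError("need 0 < n <= file_size")
    return byte_range(file_size - n, file_size - 1)

file_size = 2369456837  # Content-Length from the HEAD response
print(byte_range(0, 99))           # exactly the first 100 bytes
print(last_n_bytes(file_size, 8))  # footer size field plus trailing magic bytes
```

HTTP also defines a suffix form, `Range: bytes=-8`, which asks for the last 8 bytes without knowing the file size in advance.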
Extracting the Footer Size
Now, let's use this Range header to read the footer size which will give us the correct byte range to read for the footer metadata.
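To follow along, you will need to install the requests and pyarrow packages.
import struct
head_response = requests.head(url, allow_redirects=True)
file_size = int(head_response.headers['Content-Length'])
print(f"File size: {file_size} bytes")
tail = requests.get(url, headers={"Range": f"bytes={file_size - 8}-{file_size - 1}"}).content  # last 8 bytes
footer_size = struct.unpack("<I", tail[:4])[0]  # 4-byte little-endian footer size
Here, we request the last 8 bytes of the file and decode the 4 bytes in front of the trailing magic bytes as a little-endian 32-bit integer, which gives us the full footer size (m).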
Reading Entire Footer
Now that we know all the variables:
- File Length
- Footer Length
We can use one last Range request to read the schema and metadata. We will use pyarrow to read the schema and metadata from the raw bytes.
import io
import pyarrow.parquet as pq
# The footer metadata sits just before the 4-byte footer size and the 4 magic bytes
footer_start = file_size - 8 - footer_size
footer_headers = {"Range": f"bytes={footer_start}-{file_size-1}"}
footer_response = requests.get(url, headers=footer_headers)
# use pyarrow to extract metadata from bytes buffer
footer_buffer = io.BytesIO(footer_response.content)
parquet_file = pq.ParquetFile(footer_buffer)
parquet_schema = parquet_file.schema
parquet_metadata = parquet_file.metadata
print(parquet_schema)
print(parquet_metadata)
This will output:
<pyarrow._parquet.ParquetSchema object at 0x107f87c80>
required group field_id=-1 schema {
optional binary field_id=-1 text (String);
optional binary field_id=-1 id (String);
optional binary field_id=-1 dump (String);
optional binary field_id=-1 url (String);
optional binary field_id=-1 file_path (String);
optional binary field_id=-1 language (String);
optional double field_id=-1 language_score;
optional int64 field_id=-1 token_count;
optional double field_id=-1 score;
optional int64 field_id=-1 int_score;
}
<pyarrow._parquet.FileMetaData object at 0x107f79210>
created_by: parquet-cpp-arrow version 15.0.0
num_columns: 10
num_rows: 785906
num_row_groups: 786
format_version: 2.6
serialized_size: 3255315
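As a rough back-of-the-envelope check on how little data was transferred, the numbers above can be plugged into a few lines of arithmetic. This sketch assumes that `serialized_size` equals the footer length we requested, and that two range requests were made: an 8-byte tail read and one read of the footer plus its 8 trailing bytes.

```python
file_size = 2_369_456_837   # Content-Length from the HEAD request
footer_size = 3_255_315     # serialized_size above (assumed equal to the footer length)
num_rows = 785_906
num_row_groups = 786

tail_request = 8                  # footer size field + trailing magic bytes
footer_request = footer_size + 8  # metadata + footer size field + magic bytes
downloaded = tail_request + footer_request

fraction = downloaded / file_size
rows_per_group = num_rows / num_row_groups

print(f"Downloaded {downloaded:,} of {file_size:,} bytes ({fraction:.4%})")
print(f"Average rows per row group: {rows_per_group:.1f}")
```

Under these assumptions, roughly 3.3 MB of a 2.37 GB file is transferred, which is well under 1% of the total.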
Now that you understand how to extract information from a Parquet file remotely, you can see how tools like DuckDB can query Parquet files very efficiently.
Hugging Face even has a built-in Parquet metadata viewer, powered by hyparquet, that uses an approach very similar to the one above.
Fun Fact: You can read datasets from the Hugging Face Hub and scan parquet files with a couple different libraries: