Parquet in Action: A Beginner's Guide

Community Article Published August 14, 2024

Showcasing the Parquet Format Efficiently via an Exercise

Today, we are going to learn about how Parquet files work through an interactive exercise. Our goal is to retrieve the schema of a large Parquet file while downloading as little data as possible.

We are going to analyze the file remotely from Hugging Face without downloading anything locally.

Let's take a look at the fineweb-edu dataset, which has 1.3T tokens of educational web pages used to train large language models.

This is a huge dataset! It contains many Parquet files, partitioned by date. We are going to use the CC-MAIN-2013-20/train-00000-of-00014.parquet file, which can be found here.

This file is 2.37 GB, but we want to extract the metadata and schema without having to download the entire file.

Overview of Parquet

Let's take a brief look at the general format of a Parquet file.

  • Row Groups: Horizontal partitions of the data that group rows together. They allow for efficient querying and parallel processing of large datasets.
  • Column Chunks: Vertical slices of data within each row group, containing values from a specific column. This columnar storage enables efficient compression and query performance.

Image credit: ClickHouse

  • Schema: Metadata that describes the layout and types of the columns in the Parquet file.
  • Magic Bytes: A sequence of bytes at the beginning and end of a Parquet file that identifies it as Parquet. For Parquet, these bytes are PAR1.

Pay close attention to the last part of the file, the footer metadata. It is the part we need in order to retrieve the schema.

There are three components of the footer.

  • The file metadata (n bytes, immediately before the footer size)
  • The footer size (4 bytes, immediately before the closing magic bytes)
  • The magic bytes (the last 4 bytes of the file: PAR1)
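
These three components translate directly into byte offsets counted from the end of the file. The sketch below is only an illustration: footer_layout is a hypothetical helper name, and the file and footer sizes in the usage line are made-up round numbers, not values from the real file.

```python
def footer_layout(file_size: int, footer_size: int) -> dict:
    """Return inclusive (start, end) byte offsets of the three footer parts."""
    magic_start = file_size - 4                   # last 4 bytes: b"PAR1"
    length_start = file_size - 8                  # 4 bytes before the magic bytes
    metadata_start = length_start - footer_size   # footer_size bytes before that
    if metadata_start < 0:
        raise ValueError("footer_size is larger than the file allows")
    return {
        "metadata": (metadata_start, length_start - 1),
        "footer_size": (length_start, magic_start - 1),
        "magic": (magic_start, file_size - 1),
    }


# Hypothetical sizes chosen for readability
layout = footer_layout(file_size=1000, footer_size=100)
print(layout)
```

Offsets are inclusive on both ends here so that they can be dropped straight into an HTTP Range header.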

Magic bytes are really neat. Essentially, they are a convention for identifying a file type very quickly. You can think of them as a signature for the file. Here's a really good list of the magic bytes used by different file types.
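
As a small illustration of the signature idea, the sketch below checks whether a file's first and last four bytes are PAR1. The function name looks_like_parquet and the sample byte strings are assumptions made up for this example; the PNG bytes are the standard 8-byte PNG signature, used only as a contrast.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(first_bytes: bytes, last_bytes: bytes) -> bool:
    """True if the data starts and ends with the Parquet magic bytes."""
    return first_bytes[:4] == PARQUET_MAGIC and last_bytes[-4:] == PARQUET_MAGIC


# Synthetic stand-ins for real files (not valid Parquet or PNG content)
fake_parquet = b"PAR1" + b"\x00" * 16 + b"PAR1"
png_signature = b"\x89PNG\r\n\x1a\n"

print(looks_like_parquet(fake_parquet[:4], fake_parquet[-4:]))
print(looks_like_parquet(png_signature[:4], png_signature[-4:]))
```

A matching signature only suggests the file type; it does not prove that the rest of the file is well formed.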

Getting File Size Remotely with a HEAD Request

Let's start by sending a HEAD request to the URL. This should give us some metadata about the file without downloading the entire file.

import requests

url = "https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/resolve/main/data/CC-MAIN-2013-20/train-00000-of-00014.parquet?download=true"

# Get file content length with HEAD request
head_response = requests.head(url, allow_redirects=True)
file_size = int(head_response.headers['Content-Length'])

The HEAD request will only return the response headers with no actual content. You should see the response headers below:

Header | Value | Description
Content-Type | binary/octet-stream | Indicates the file is a binary file
Content-Length | 2369456837 | Size of the file in bytes (2.37 GB)
Accept-Ranges | bytes | File supports partial content requests; specific byte ranges can be requested

What does this show?

It shows that we can use the Range header in our HTTP requests to read specific byte ranges. This lets us request only the byte range of the footer.
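
Before relying on range requests, it can be worth checking the Accept-Ranges value programmatically. The helper below is a minimal sketch: supports_byte_ranges is a hypothetical name, and it is exercised on a plain dictionary that mimics the headers shown above rather than on a live response.

```python
def supports_byte_ranges(headers) -> bool:
    """Check whether an Accept-Ranges header value advertises byte ranges."""
    value = headers.get("Accept-Ranges", "")
    units = [part.strip().lower() for part in value.split(",")]
    return "bytes" in units


# A plain dict standing in for head_response.headers
example_headers = {
    "Content-Type": "binary/octet-stream",
    "Content-Length": "2369456837",
    "Accept-Ranges": "bytes",
}

print(supports_byte_ranges(example_headers))
print(supports_byte_ranges({"Accept-Ranges": "none"}))
```

Note that a plain dict lookup is case-sensitive, whereas the headers object returned by requests is case-insensitive. When a server honors a range request, it answers with status 206 (Partial Content) instead of 200.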

HTTP Range Requests

Querying a Specific Range of Bytes of a File

Here's what a Range header looks like:

Range: bytes=0-99

This would read the first 100 bytes of the file, because both ends of the range are inclusive.
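
Because both ends of a byte range are inclusive, off-by-one mistakes are easy to make. The sketch below builds Range headers and counts the bytes they cover; range_header, suffix_range_header, and range_length are hypothetical helper names introduced only for this example. HTTP also defines a suffix form, bytes=-N, which asks for the last N bytes without needing the file size.

```python
def range_header(start: int, end: int) -> dict:
    """Header requesting bytes start..end, both inclusive."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return {"Range": f"bytes={start}-{end}"}


def suffix_range_header(n: int) -> dict:
    """Header requesting the last n bytes of the resource."""
    if n <= 0:
        raise ValueError("n must be positive")
    return {"Range": f"bytes=-{n}"}


def range_length(start: int, end: int) -> int:
    """Number of bytes covered by an inclusive range."""
    return end - start + 1


print(range_header(0, 99), range_length(0, 99))
print(suffix_range_header(8))
```

The suffix form is handy for the last 8 bytes of a Parquet file, although this walkthrough uses explicit offsets computed from Content-Length.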

Extracting the Footer Size

Now, let's use this Range header to read the footer size which will give us the correct byte range to read for the footer metadata.

To follow along, you will need to install the requests and pyarrow packages.

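# Read the last 8 bytes: a 4-byte little-endian footer size, then the magic bytes
tail_headers = {"Range": f"bytes={file_size - 8}-{file_size - 1}"}
tail_response = requests.get(url, headers=tail_headers)
footer_size = int.from_bytes(tail_response.content[:4], "little")
print(f"File size: {file_size} bytes")
print(f"Footer size: {footer_size} bytes")

Here, we retrieve the last 8 bytes of the file. The 4 bytes in front of the magic bytes give us the full footer size, that is, the n bytes of file metadata.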
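
The decoding step can be tried without any network access. The sketch below assumes the footer size is stored as a 4-byte little-endian unsigned integer, as described in the Parquet format; parse_footer_size is a hypothetical helper name, and the tail bytes are synthetic.

```python
import struct


def parse_footer_size(tail: bytes) -> int:
    """Decode the footer size from the last 8 bytes of a Parquet file."""
    if len(tail) != 8:
        raise ValueError("expected exactly 8 bytes")
    if tail[4:] != b"PAR1":
        raise ValueError("missing PAR1 magic bytes")
    # "<I" = little-endian unsigned 32-bit integer
    (footer_size,) = struct.unpack("<I", tail[:4])
    return footer_size


# Synthetic tail: an arbitrary footer size followed by the magic bytes
synthetic_tail = struct.pack("<I", 123456) + b"PAR1"
print(parse_footer_size(synthetic_tail))
```

Validating the magic bytes first is a cheap way to catch a wrong offset or a server that ignored the Range header and returned something else.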

Reading Entire Footer

Now that we know all the variables:

  • File Length
  • Footer Length

We can use one last Range request to read the schema and metadata. We will use pyarrow to read the schema and metadata from the raw bytes.

import io
import pyarrow.parquet as pq

# Request the footer metadata plus the trailing 8 bytes (footer size + magic bytes)
footer_start = file_size - 8 - footer_size
footer_headers = {"Range": f"bytes={footer_start}-{file_size - 1}"}
footer_response = requests.get(url, headers=footer_headers)

# Use pyarrow to extract the schema and metadata from the bytes buffer
footer_buffer = io.BytesIO(footer_response.content)

parquet_file = pq.ParquetFile(footer_buffer)
parquet_schema = parquet_file.schema
parquet_metadata = parquet_file.metadata

print(parquet_schema)
print(parquet_metadata)

This will output:

<pyarrow._parquet.ParquetSchema object at 0x107f87c80>
required group field_id=-1 schema {
  optional binary field_id=-1 text (String);
  optional binary field_id=-1 id (String);
  optional binary field_id=-1 dump (String);
  optional binary field_id=-1 url (String);
  optional binary field_id=-1 file_path (String);
  optional binary field_id=-1 language (String);
  optional double field_id=-1 language_score;
  optional int64 field_id=-1 token_count;
  optional double field_id=-1 score;
  optional int64 field_id=-1 int_score;
}

<pyarrow._parquet.FileMetaData object at 0x107f79210>
  created_by: parquet-cpp-arrow version 15.0.0
  num_columns: 10
  num_rows: 785906
  num_row_groups: 786
  format_version: 2.6
  serialized_size: 3255315
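
To put these numbers in perspective, the short calculation below estimates how much of the file the footer request covers. It assumes that serialized_size is the length of the footer metadata in bytes, and it reuses the Content-Length, num_rows, and num_row_groups values shown in this article.

```python
file_size = 2369456837     # Content-Length from the HEAD request
footer_size = 3255315      # serialized_size from the metadata (assumed to be the footer length)
num_rows = 785906
num_row_groups = 786

# The footer request covers the metadata plus the 4-byte size and 4-byte magic
downloaded = footer_size + 8
percent = 100 * downloaded / file_size
rows_per_group = num_rows / num_row_groups

print(f"Footer request: {downloaded} of {file_size} bytes ({percent:.3f}%)")
print(f"Average rows per row group: {rows_per_group:.1f}")
```

Under these assumptions, the schema and metadata cost well under 1% of the file, and the row groups hold about 1,000 rows each on average.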

Now that you understand how to extract information from a remote Parquet file, you can see how tools like DuckDB can query Parquet files so efficiently.

Hugging Face even has a built-in Parquet metadata viewer, powered by hyparquet, that uses an approach very similar to the one above.

Fun Fact: You can read datasets from the Hugging Face Hub and scan parquet files with a couple different libraries: