Demystifying the Parquet File Format

The default file format for any data science workflow

Michael Berk
Towards Data Science


Have you ever used pd.read_csv() in pandas? Well, that command could have run ~50x faster if you had used parquet instead of CSV.
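As a quick illustration, here’s a minimal pandas sketch of the swap (the file names are made up, and the actual speedup depends on your data and hardware):

```python
import pandas as pd

# Hypothetical file paths -- substitute your own data.
csv_path = "events.csv"
parquet_path = "events.parquet"

# The familiar CSV round trip.
df = pd.read_csv(csv_path)

# Write the same data once as parquet (requires pyarrow or fastparquet)...
df.to_parquet(parquet_path)

# ...and read it back: typically much faster and much smaller on disk.
df = pd.read_parquet(parquet_path)
```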


In this post we will discuss Apache Parquet, an extremely efficient and well-supported file format. The post is geared towards data practitioners (ML, DE, DS), so we’ll focus on high-level concepts and use SQL to talk through them; links to further resources can be found throughout the post and in the comments.

Without further ado, let’s dive in!

Technical TLDR

Apache Parquet is an open-source file format that provides efficient storage and fast read speeds. It uses a hybrid storage layout that sequentially stores chunks of columns, which lends itself to high performance when selecting and filtering data. On top of strong compression support (snappy, gzip, LZO), it also provides some clever tricks for reducing file scans and encoding repeated values.

If you care about speed, you should consider parquet.

But, what’s actually going on?

Ok, let’s slow down a bit and discuss parquet in plain English.

1 — The Data Storage Problem

Let’s say we are data engineers. We are looking to create a data architecture that facilitates Online Analytical Processing (OLAP), which boils down to selective queries geared towards data analysis. Exploratory data analysis and decision science are examples of workloads that an OLAP environment is optimized for.

But how should we store our data on disk?

Figure 1: example of converting a 2-d table to binary. Image by author.

Well, there are many considerations when deciding whether our conversion is good or bad, but for OLAP workflows we’re mainly concerned with two…

  • Read speed: how fast we can access and decode relevant information from binary files
  • Size on disk: how much space our file needs when in binary format

Note that there are other measures of success for a file compression algorithm, such as write speed and metadata support, but we’ll just focus on the above two for now.

So, how does parquet perform relative to a CSV file? In one benchmark on 1 TB of data stored in S3, it took 87% less space and queried 34x faster (source linked in the comments).

2 — Parquet Core Features

But how is parquet so much more effective than CSV and other popular file formats? The first answer is the storage layout…

2.1 — Hybrid Storage Layout

When we convert a 2-dimensional table to a sequence of 0’s and 1’s, we must think carefully about the optimal structure. Should the first column be written, then the second, then the third? Or should we sequentially store rows?

Traditionally there are three main layouts that convert our 2 dimensional table down to 1:

  1. Row-based: sequentially store rows (CSV).
  2. Column-based: sequentially store columns (ORC).
  3. Hybrid-based: sequentially store chunks of columns (Parquet).

We can see a graphical representation of each of these formats in figure 2.

Figure 2: the 3 main storage types. Image by author.

Now, hybrid layouts are really effective for OLAP workflows because they support both projection and predicates.

Projection is the process of selecting columns — you can think of it as the SELECT statement in a SQL query. Projection is best supported by a column-based layout. For instance, if we wanted to read the first column of a table using a column-based layout, we could just read the first n indices in our binary file, deserialize them, and present them to the user. Pretty efficient, right?

Predicates are the criteria used to select rows; you can think of them as the WHERE clause in a SQL query. Predicates are best supported by row-based storage. If we want all rows matching some criterion, such as Int >= 2, we can order our table by Int (descending), scan until the criterion is no longer satisfied, and return all rows above that point.

In both of these scenarios, we are looking to traverse as little of the file as possible. And, since data science often requires subsetting both rows and columns, a hybrid-based storage layout gives us a middle ground between columnar and row-based file formats.
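To make that concrete, here’s what projection and a predicate look like in a single pandas read (the column names mirror the toy table in the figures and are assumptions; with the pyarrow engine, the filters argument is pushed down so that non-matching row groups can be skipped):

```python
import pandas as pd

# Projection: only pull the columns we SELECT.
# Predicate: only pull row groups that could satisfy the WHERE clause.
df = pd.read_parquet(
    "table.parquet",              # hypothetical file
    engine="pyarrow",
    columns=["Int", "Str"],       # SELECT Int, Str
    filters=[("Int", ">=", 2)],   # WHERE Int >= 2
)
```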

Before moving on, it’s important to note that parquet is often described as a columnar format. However, because it stores chunks of columns, as shown at the bottom of figure 2, a hybrid storage layout is a more precise description.

Great! So if we’re looking to pull data, we can just store sequential chunks of columns and usually get very good performance. But does this method scale?

2.2 — Parquet Metadata

The answer is a resounding “yes.” Parquet leverages metadata to skip parts of our data that can be excluded according to our predicate.

Figure 3: hybrid-based storage layout conversion from 2 dimensions to 1. Image by author.

Taking the example table in figure 3, we can see that our row group size is 2, meaning we store 2 rows of a given column, 2 rows of the next column, 2 rows of the third column, and so on.

After we run out of columns, we move to the next set of rows. Note that we only have 3 rows in the above table, so the final row group would only have 1 row.

Now let’s say we are actually storing 100,000 values in each row group instead of 2. If we look to find all rows where our Int column has a given value (i.e. an equality predicate), the worst case would involve scanning every row in the table.

Figure 4: example of how parquet handles predicates for row groups. Image by author.

Parquet solves this intelligently by storing the max and min values of each row group, allowing us to skip entire row groups, as shown in figure 4. But that’s not all! Since parquet often writes many .parquet files to a single directory, we can also look at the column metadata for an entire file and decide whether it needs to be scanned at all.

By including some extra data, we are able to skip chunks of our data and dramatically increase query speeds. For more, see here.
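You can inspect this metadata yourself with pyarrow; here’s a small sketch (the file name is hypothetical):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("table.parquet")
print(pf.metadata.num_row_groups)

# Each row group stores per-column statistics that readers use to skip data.
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(0).statistics  # first column's chunk
    if stats is not None:
        print(rg, stats.min, stats.max, stats.null_count)
```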

2.3 — Parquet File Structure

Ok, so we’ve hinted at how data are converted from a 2-d format to a 1-d format, but how is the entire file system structured?

Well, as mentioned above, parquet can write many .parquet files in a single write. With small datasets this is a problem, and you should probably repartition your data before writing. For larger datasets, however, splitting the data across multiple files can dramatically improve performance.

Overall, parquet follows the structure below. Let’s take a look at each layer in turn…

Root > Parquet Files > Row Groups > Columns > Data Page

First, our file root is just a directory that holds everything. Inside the root we have many individual .parquet files, each containing a partition of our data. A single parquet file is composed of many row groups, and a single row group contains many columns. Finally, within each column are data pages, which hold the raw data along with some relevant metadata.

We can see a simplified representation of a single .parquet file in figure 5 below.

Figure 5: layers of the hierarchical structure of parquet in sequential order. Image by author.

If you’re curious about the details, here’s the documentation. Repetition and definition levels are also essential to fully understanding how the data page works, but we’ll leave those as a bonus.
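To make the hierarchy above concrete, here’s a hedged sketch of a partitioned write in pandas (directory and column names are made up, and the exact file names vary by engine and version). The partition column becomes sub-directories under the root, each holding one or more .parquet files, which in turn contain the row groups, columns, and data pages:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR", "FR"],
    "value": [1, 2, 3, 4],
})

# Writing with a partition column produces a root directory with one
# sub-directory per country, each containing one or more .parquet files.
df.to_parquet("sales_root", partition_cols=["country"], engine="pyarrow")

# Roughly:
# sales_root/
#   country=US/<some-name>.parquet
#   country=FR/<some-name>.parquet
```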

3 — Additional Optimizations

Since its inception in 2013, parquet has gotten a lot smarter. The underlying structure has remained largely unchanged from the format above, but lots of features that improve performance for certain types of data have been added.

Let’s look at some examples…

3.1 — My data has lots of duplicate values!

Solution: Run-Length Encoding (RLE)

Let’s say we have a column with 10,000,000 values, but all of them are 0. To store this information, we only need two numbers: the value (0) and the number of times it repeats (10,000,000).

RLE does just that. When many sequential duplicates are found, parquet can encode that information as a tuple corresponding to the value and the count. In our example, that would save us from storing 9,999,998 numbers.
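Here’s a toy Python sketch of the idea (this is not parquet’s actual on-disk encoding, just the concept):

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

print(run_length_encode([0, 0, 0, 0, 7, 7, 3]))
# [(0, 4), (7, 2), (3, 1)]
# A column of 10,000,000 zeros would collapse to the single pair (0, 10000000).
```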

3.2 — My data has very big types!

Solution: Dictionary Encoding with Bit-Packing

Let’s say we have a column that contains country names, some of which are very long. If we wanted to store “The Democratic Republic of Congo,” we would need a string column that can handle at least 32 characters.

Dictionary encoding replaces each value in our column with a small integer and stores the mapping in our data page’s metadata. When on disk, our encoded values are bit-packed to take up the least amount of space possible, but when we read the data we can still convert our column back to its original values.
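You can see the same trick in memory with pyarrow’s dictionary_encode() (the values here are just examples); parquet writers apply dictionary encoding by default, e.g. via the use_dictionary option in pyarrow.parquet.write_table:

```python
import pyarrow as pa

countries = pa.array([
    "The Democratic Republic of Congo",
    "France",
    "France",
    "The Democratic Republic of Congo",
])

# Each distinct string is stored once; the column becomes small integer
# indices into that dictionary.
encoded = countries.dictionary_encode()
print(encoded.indices)     # [0, 1, 1, 0]
print(encoded.dictionary)  # the two distinct strings
```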

3.3 — I’m using complex filters on my data!

Solution: Projection and Predicate Pushdown

In a Spark environment, we can avoid reading an entire table into memory through projection and predicate pushdown. Because Spark operations are lazily evaluated, meaning they aren’t executed until we actually query the data, Spark can pull the least amount of data into memory.

As a quick reminder, predicates subset rows and projection subsets columns. So, if we know we only need a few columns and a subset of the rows, we don’t actually have to read the full table into memory — it’s filtered during the read.
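A minimal PySpark sketch (the path and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Nothing is read yet -- Spark only builds a query plan.
df = (
    spark.read.parquet("s3://my-bucket/table/")  # hypothetical location
         .select("Int", "Str")                   # projection pushdown
         .filter(F.col("Int") >= 2)              # predicate pushdown
)

# Only when we act on the data does Spark scan the files, skipping the
# columns and row groups the plan ruled out.
df.show()
```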

3.4 — I have lots of different data!

Solution: Delta Lake

Finally, Delta Lake is an open-source “lakehouse” framework that combines the dynamic nature of a data lake with the structure of a data warehouse. If you plan on using parquet as the base of your organization’s data warehouse, additional features such as ACID guarantees and transaction logs are really beneficial.

If you want more, here’s a great beginner-friendly resource for Delta Lake.

Summary

Parquet is a really effective file format for real-world use. It minimizes table scans and compresses data down to small sizes. If you are a data scientist, parquet should probably be your go-to file type.

Thanks for reading! I’ll be writing 11 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.
