Ivo Stratev
Dec 27, 2022

--

There is a huge misconception in this post: it claims that Parquet is a columnar file format. This is simply not true. Parquet is actually a hybrid format, because the data is first split horizontally into row groups, and then, within each row group, each column chunk is compressed and written independently... This is a key design decision: it takes full advantage of modern CPUs, and it also lets a reader skip some of the data based on the statistics stored in the metadata in the footer of a Parquet file...

Failing to mention RowGroups and the Footer is a huge mistake when talking about the Parquet file format, even if the goal is to explain it in a simple way, because those two things are so central to the format.
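
To make the point concrete, here is a minimal sketch in plain Python of the hybrid layout described above. It is a toy model, not the real Parquet encoding: the names `write_row_groups`, `build_footer` and `scan_greater_than` are hypothetical, compression is left out, and the "footer" is just a list of per-chunk min/max statistics. It only shows the idea: split rows horizontally first, store each group column by column, and use the statistics to skip whole row groups.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ColumnChunk:
    """One column's values inside a single row group, plus its statistics."""
    name: str
    values: list
    min_value: Any
    max_value: Any


def write_row_groups(rows, row_group_size):
    """Split rows horizontally, then lay out each row group column by column."""
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        batch = rows[start:start + row_group_size]
        chunks = {}
        for name in batch[0]:
            values = [row[name] for row in batch]
            chunks[name] = ColumnChunk(name, values, min(values), max(values))
        row_groups.append(chunks)
    return row_groups


def build_footer(row_groups):
    """Collect per-chunk (min, max) pairs as a stand-in for footer metadata."""
    return [
        {name: (chunk.min_value, chunk.max_value) for name, chunk in group.items()}
        for group in row_groups
    ]


def scan_greater_than(row_groups, footer, column, threshold):
    """Return matching values and how many row groups actually had to be read."""
    matches = []
    groups_read = 0
    for group, stats in zip(row_groups, footer):
        _, max_value = stats[column]
        if max_value <= threshold:
            continue  # statistics prove nothing here can match, so skip the group
        groups_read += 1
        matches.extend(v for v in group[column].values if v > threshold)
    return matches, groups_read


rows = [{"id": i, "price": i * 10} for i in range(9)]
row_groups = write_row_groups(rows, row_group_size=3)
footer = build_footer(row_groups)
matches, groups_read = scan_greater_than(row_groups, footer, "price", 55)
print(matches, groups_read)
```

With nine rows and a row group size of three, only the last row group has a maximum `price` above 55, so the scan touches one row group out of three. Without the horizontal split there would be a single set of statistics for the whole file and nothing could be skipped.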

Written by Ivo Stratev

Passionate about Programming. Interested in Highly Distributed Systems and the Microservice Architecture. In love with Math and proving things.
