My name is Bartosz Konieczny. I'm a freelance data engineer and author of the Data Engineering Design Patterns (O'Reilly) book. When I'm not helping clients solve data engineering challenges to drive business value, I enjoy sharing what I've learned here.
This year, I'll be exploring various software engineering principles and their application to data engineering. Clean Architecture is a prime example, and it serves as the focus of this post.
Databricks has recently extended its natively supported data formats with Excel!
Even though eight years have passed since my blog post about various join types in Apache Spark SQL, I'm still learning new things about this apparently simple data operation that is the join. Recently, in "Building Machine Learning Systems with a Feature Store: Batch, Real-Time, and LLM Systems" by Jim Dowling, I read about ASOF joins and decided to dedicate some space to them on the blog.
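To give you a first idea before the deep dive, here is a minimal sketch of an ASOF join with the pandas API on Spark; the trades/quotes example, the column names, and the availability of pyspark.pandas.merge_asof in your Spark version (3.3+) are assumptions on my side:

```python
import pyspark.pandas as ps

# Trades and quotes: for each trade, pick the most recent quote at or before the trade time.
trades = ps.DataFrame({
    "time": [1, 5, 10],
    "symbol": ["AAPL", "AAPL", "AAPL"],
    "price": [100.1, 100.3, 100.2],
})
quotes = ps.DataFrame({
    "time": [0, 4, 9],
    "symbol": ["AAPL", "AAPL", "AAPL"],
    "bid": [100.0, 100.2, 100.1],
})

# direction="backward" matches each left row with the closest earlier (or equal) right row
matched = ps.merge_asof(trades, quotes, on="time", by="symbol", direction="backward")
print(matched.to_pandas().sort_values("time"))
```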
If you follow me, you know I'm an Apache Spark enthusiast. Despite that, I'm doing my best to keep my mind open to other technologies. The one that has caught my attention over the past few years is Apache Flink, and I found no better way to start than by comparing it with Apache Spark Structured Streaming.
Software applications, including the data engineering ones you're working on, may require flexible input parameters. These parameters are important because they often identify the tables or data stores the job interacts with, as well as the expected outputs. Despite their utility, they can also cause confusion within the code, especially when not managed properly. Let's see how to address them for PySpark jobs on Databricks.
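As a teaser, here is one common way to make such parameters explicit in a PySpark job; argparse is only one of the possible options, and the parameter and table names below are hypothetical:

```python
import argparse


def parse_arguments(args=None):
    # Declaring the job's inputs in one place makes them discoverable and validated,
    # instead of scattering raw parameter lookups across the code base.
    parser = argparse.ArgumentParser(description="Orders enrichment job")
    parser.add_argument("--input-table", required=True, help="Source table to read")
    parser.add_argument("--output-table", required=True, help="Target table to write")
    parser.add_argument("--execution-date", required=True, help="Partition to process")
    return parser.parse_args(args)


if __name__ == "__main__":
    arguments = parse_arguments()
    # ...pass `arguments` down to the reading/writing logic of the job
```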
Software engineering, and by extension data engineering, has many well-known acronyms. DRY, YAGNI, KISS, RTFM, SRP - they certainly ring a bell. But what about DAMP? Despite being less popular, this acronym also has a beneficial impact on your data engineering projects.
I discovered recursive CTEs during my in-depth SQL exploration back in 2018. However, I had never had an opportunity to implement them in production, until recently, when I was migrating workflows from SQL Server to Databricks and one of them was using recursive CTEs to build a hierarchy table. If this is the first time you've heard about recursive CTEs, let me share my findings with you!
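If you want a first taste, here is a hedged sketch of a recursive CTE building an employee hierarchy; the table and column names are made up, and both the availability of recursive CTEs and the exact WITH RECURSIVE syntax depend on your SQL engine and runtime version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hierarchy_query = """
WITH RECURSIVE employee_hierarchy AS (
    -- anchor member: the top of the hierarchy
    SELECT id, manager_id, name, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive member: attach direct reports to the rows found so far
    SELECT e.id, e.manager_id, e.name, h.level + 1 AS level
    FROM employees e
    JOIN employee_hierarchy h ON e.manager_id = h.id
)
SELECT * FROM employee_hierarchy
"""

spark.sql(hierarchy_query).show()
```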
Before I share the usual retrospective for the past year, I want to thank you for following along in 2025! Even though I'm primarily writing for "me-from-the-future", it's always great to know that people other than my future self find these posts helpful ;)
Databricks Jobs is still one of the best ways to run data processing code on Databricks. It supports a wide range of processing modes, from native Python and Scala jobs to framework-based dbt queries. It doesn't require installing anything on your own, as it's a fully serverless offering. Finally, it's also flexible enough to cover most common data engineering use cases. One of these flexible features is support for different input arguments via the For Each task.
Dealing with numbers can be easy and challenging at the same time. When you operate on integers, you can encounter integer overflow. When you deal with floating-point types, which are the topic of this blog post, you can encounter rounding issues.
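A quick, plain-Python illustration of the kind of rounding issue I mean; the values are just an example:

```python
from decimal import Decimal

# Binary floating-point numbers cannot represent 0.1 exactly, so small errors
# creep into the arithmetic:
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False

# Decimal keeps an exact base-10 representation, at the cost of performance:
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```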
Databricks Asset Bundles (DAB) greatly simplify managing Databricks jobs and resources. They are also flexible: besides the YAML-based declarative way, you can add some dynamic behavior with scripts.
Picture this: you get a list of values in a column and you need to combine each of them with another row. The simplest way to do that is to use the explode operation and create a dedicated row for each combined value. Unluckily for you, several rows in the input have nulls instead of lists.
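Here is a hedged sketch of the problem and of one possible way around it, explode_outer; the column names are mine and this is only one of the workarounds:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("order_1", ["item_a", "item_b"]), ("order_2", None)],
    ["order_id", "items"],
)

# explode() silently drops the rows where the array is null...
df.select("order_id", F.explode("items").alias("item")).show()

# ...while explode_outer() keeps them, with a null item instead
df.select("order_id", F.explode_outer("items").alias("item")).show()
```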
Some time ago, when I was analyzing the execution of my Apache Spark job in the Spark UI, I noticed a limit(...) action. It was weird, as I was actually only running the show(...) command to display the DataFrame locally. At the time I understood why, but hadn't found time to write a blog post. Recently, Antoni reminded me on LinkedIn that I should have blogged about show(...) back then to better answer his question :)
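You can observe it yourself with a sketch like the one below; show(n) only needs the first rows, so Spark plans a limit-like step instead of materializing the whole DataFrame (the exact number of fetched rows and the plan nodes may vary between Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)

# Displays only the first rows, hence the limit visible in the UI
df.show(5)

# An explicit limit produces a comparable plan (look for the CollectLimit node)
df.limit(6).explain()
```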
Your data won't always tell you everything. Often you will be missing some additional and important context. Did the data come from a backfilling run? Was the data generated by the most recent job version you deployed? All those questions can be answered with the Delta Lake feature called user metadata.
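A minimal sketch, assuming the spark.databricks.delta.commitInfo.userMetadata property available in Delta Lake; the table name and the metadata payload are made-up examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Attach a custom context to the next Delta commit(s)
spark.conf.set(
    "spark.databricks.delta.commitInfo.userMetadata",
    '{"job_version": "1.4.2", "run_type": "backfill"}',
)
spark.range(10).write.format("delta").mode("append").saveAsTable("events")

# The metadata is then visible in the table history, next to each commit
spark.sql("DESCRIBE HISTORY events").select("version", "userMetadata").show(truncate=False)
```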
There are two modes for removing data from a Delta Lake table: the data mode and the metadata mode. The first needs to identify the records to remove by explicitly querying the table. On the other hand, the metadata mode doesn't interact with the data at all. It's often faster, but due to its metadata-only character, it's also more limited.
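To illustrate the difference, here is a hedged sketch on a hypothetical visits table partitioned by event_date; whether a given DELETE can be resolved at the metadata level depends on the table layout and on your Delta Lake or Databricks version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Data mode: the predicate references a regular column, so the engine has to read
# the files to find the matching records and rewrite the affected files.
spark.sql("DELETE FROM visits WHERE user_id = 'user_123'")

# Metadata mode: when the predicate only touches partition columns, the matching
# files can be dropped from the transaction log without reading their content,
# which is usually much faster.
spark.sql("DELETE FROM visits WHERE event_date = '2024-01-01'")
```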