Welcome to the blog!

My name is Bartosz Konieczny. I'm a freelance data engineer and the author of the Data Engineering Design Patterns book (O'Reilly). When I'm not helping clients solve data engineering challenges to drive business value, I enjoy sharing what I've learned here.

Most recent blog posts

Commit log decorator with userMetadata property

Your data won't always tell you the whole story. Often you will be missing some additional and important context. Does the data come from a backfilling run? Was it generated by the most recent job version you deployed? All those questions can be answered with the Delta Lake feature called user metadata.
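
As a quick illustration, here is a minimal sketch of setting that metadata from PySpark, assuming the delta-spark package is available; the table path and metadata payload are made up:

```python
from pyspark.sql import SparkSession

# Local session with the Delta Lake extension enabled.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Attach free-form context to the commit via the userMetadata option.
(spark.range(10).write.format("delta")
    .option("userMetadata", "backfill=true; jobVersion=2.3.1")  # made-up payload
    .mode("append")
    .save("/tmp/events"))

# The metadata comes back in the table history.
(spark.sql("DESCRIBE HISTORY delta.`/tmp/events`")
    .select("version", "userMetadata").show(truncate=False))
```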

Continue Reading →

Truncating a Delta Lake table, aka metadata-only operations

There are two modes for removing data from a Delta Lake table: the data mode and the metadata mode. The first needs to identify the records to remove by running an explicit select query on the table. The metadata mode, on the other hand, doesn't interact with the data at all. It's often faster but, due to its metadata-only character, it's also more limited.
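
To make the difference concrete, a short sketch assuming an active Delta-enabled `spark` session and a Delta table called events (both illustrative):

```python
# Metadata mode: no predicate, so Delta can mark every data file as
# removed in the transaction log without reading a single record.
spark.sql("DELETE FROM events")

# Data mode: the predicate forces Delta to scan the files, identify
# the matching rows, and rewrite the affected files.
spark.sql("DELETE FROM events WHERE event_time < '2024-01-01'")
```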

Continue Reading →

Idempotent writer

If you have been in the field long enough, you should remember the Apache Spark Structured Streaming file sink, whose commit log records the already written files in a dedicated file. Delta Lake uses a similar concept to guarantee idempotent writes, but with less storage overhead.
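
In Delta Lake the deduplication relies on a transaction identifier pair passed with the write; a sketch with a made-up application id and run number, assuming an active Delta-enabled `spark` session:

```python
# The (txnAppId, txnVersion) pair is stored in the commit log; replaying
# a version already committed for this app id becomes a no-op, which
# makes restarted jobs idempotent.
(spark.range(100).write.format("delta")
    .option("txnAppId", "hourly-ingest")  # made-up application id
    .option("txnVersion", 42)             # monotonically increasing run number
    .mode("append")
    .save("/tmp/events"))
```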

Continue Reading →

Table cloning in Delta Lake

When I was writing the Data Engineering Design Patterns book, I had to leave some great suggestions aside. One of them was a code snippet for the Passthrough replicator pattern with Delta Lake's clone feature. But all is not lost, as my new Delta Lake blog post focuses on table cloning, which is the backbone for implementing the Passthrough replicator pattern!
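
As a teaser, the feature boils down to a single SQL statement; the table names are illustrative and the snippet assumes an active Delta-enabled `spark` session:

```python
# A shallow clone copies only the table metadata and references the
# source table's data files, which makes the replica cheap to create.
spark.sql("CREATE TABLE events_replica SHALLOW CLONE events")
```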

Continue Reading →

Constraints in Delta Lake

We can all agree that data quality is essential for building trustworthy dashboards or ML algorithms. For a long time, the only way to validate data stored in file formats before writing was inside the data processing jobs. Thankfully, Delta Lake constraints made this validation possible at the data storage layer (technically it's still a compute layer, but at a very high level of abstraction).
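
For example, a CHECK constraint declared once on the table is enforced on every subsequent write, regardless of which job produced the data (the table and column names are illustrative, and an active Delta-enabled `spark` session is assumed):

```python
# Declared once at the storage layer...
spark.sql("ALTER TABLE events ADD CONSTRAINT positive_amount CHECK (amount > 0)")

# ...and enforced on every write: an INSERT violating the rule fails
# with a constraint violation instead of silently storing bad data.
spark.sql("INSERT INTO events VALUES (1, -5)")  # illustrative two-column table
```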

Continue Reading →

Apache Spark Structured Streaming UI patterns

When you start a Structured Streaming job, your Spark UI gets a new tab in the menu where you can follow the progress of the running queries. At first this part may appear a bit complex, but there are some visual detection patterns that can help you understand what's going on.

Continue Reading →

NULL in SQL, other traps

Last time I wrote about a special - but logical - behavior of NULLs in joins. Today it's time to look at other queries where NULLs behave differently from regular values.
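
A couple of classic illustrations of that three-valued logic in Spark SQL (assuming an active `spark` session):

```python
# NULL never equals anything, not even another NULL.
spark.sql("SELECT NULL = NULL AS eq").show()  # prints NULL, not true

# The infamous NOT IN trap: one NULL in the list makes the whole
# predicate NULL, so no row ever satisfies it.
spark.sql("SELECT 1 NOT IN (2, NULL) AS not_in").show()  # prints NULL, not true
```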

Continue Reading →

NULL is not a value - on joining nulls

If you know it, lucky you. If not, I bet you'll spend some time figuring out why two apparently identical rows don't match in your full outer join statement.
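
A quick way to reproduce the surprise, assuming an active `spark` session:

```python
left = spark.createDataFrame([(1, "a"), (None, "b")], "id INT, label STRING")
right = spark.createDataFrame([(1, "x"), (None, "y")], "id INT, label STRING")

# The two NULL ids look identical, but NULL = NULL evaluates to NULL,
# so those rows come out unmatched from the full outer join.
left.join(right, left["id"] == right["id"], "full_outer").show()
```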

Continue Reading →

Get it once: a few words on data deduplication patterns in data engineering

This blog post completes the data duplication coverage from my recent Data Engineering Design Patterns book by approaching the issue from a different angle.

Continue Reading →

Transactional patterns for Delta Lake before Catalog-managed tables

Dual writes - backend engineers have been facing this challenge for many years. If you are a data engineer with projects running in production, you have certainly faced it too. If not, I hope this blog post will shed some light on the issue and provide you with a few solutions!

Continue Reading →

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - batch

To close the topic of the new arbitrary stateful processing API in Apache Spark Structured Streaming, let's focus on its... batch counterpart!

Continue Reading →

What's new in Apache Spark 4.0.0 - Arbitrary state API v2 - internals

Last week we discovered the new way to write arbitrary stateful transformations in Apache Spark 4 with the transformWithState API. Today it's time to delve into the implementation details and try to understand the internal logic a bit better.

Continue Reading →

What's new in Apache Spark 4.0 - Arbitrary state API v2 - introduction

Arbitrary stateful processing has been evolving a lot in Apache Spark. The initial version with updateStateByKey evolved into mapWithState in Apache Spark 1.6. When Structured Streaming was released, the framework got mapGroupsWithState and flatMapGroupsWithState. Now, Apache Spark 4 introduces a completely new way to write arbitrary stateful processing logic, the Arbitrary state API v2!
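
As a rough sketch of the new API's shape, here is a per-key counter based on the PySpark transformWithStateInPandas variant of the Spark 4.0 API; treat the exact imports, parameter names, and state schema strings as indicative rather than authoritative:

```python
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle

class CountProcessor(StatefulProcessor):
    def init(self, handle: StatefulProcessorHandle) -> None:
        # One long value kept per grouping key, managed by the state store.
        self.count = handle.getValueState("count", "value LONG")

    def handleInputRows(self, key, rows, timerValues):
        total = self.count.get()[0] if self.count.exists() else 0
        for batch in rows:  # input arrives as pandas DataFrames
            total += len(batch)
        self.count.update((total,))
        yield pd.DataFrame({"id": [key[0]], "count": [total]})

    def close(self) -> None:
        pass

# Hypothetical usage on a streaming DataFrame with an id column.
output = (events.groupBy("id")
          .transformWithStateInPandas(CountProcessor(),
                                      outputStructType="id STRING, count LONG",
                                      outputMode="Update",
                                      timeMode="None"))
```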

Continue Reading →

Alerts, guards, and data engineering

While I was writing about agnostic data quality alerts with ydata-profiling a few weeks ago, I had an idea for another blog post, which can generally be summarized as "what do alerts do in data engineering projects?". Since the answer is "it depends", let me share my thoughts on that.

Continue Reading →

Agnostic data alerts with ydata-profiling

Defining data quality rules and alerts is not an easy task. Thankfully, there are various ways to automate the work. One of them is data profiling, which we're going to focus on in this blog post!
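
As a taste of what's coming, generating a profile with its built-in data quality alerts takes only a few lines (the dataset path is made up):

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("orders.csv")  # made-up dataset

# The report computes per-column statistics and raises data quality
# alerts (missing values, high cardinality, skewness, ...) on its own.
profile = ProfileReport(df, title="Orders profiling")
profile.to_file("orders_profile.html")
```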

Continue Reading →