My name is Bartosz Konieczny. I'm a freelance data engineer and author of the Data Engineering Design Patterns (O'Reilly) book. When I'm not helping clients solve data engineering challenges to drive business value, I enjoy sharing what I've learned here.
This year, I'll be exploring various software engineering principles and their application to data engineering. Clean Architecture is a prime example, and it serves as the focus of this post.
Databricks has recently extended its natively supported data formats with Excel!
Even though eight years have passed since my blog post about various join types in Apache Spark SQL, I'm still learning new things about this apparently simple data operation that is the join. Recently, in "Building Machine Learning Systems with a Feature Store: Batch, Real-Time, and LLM Systems" by Jim Dowling, I read about ASOF joins and decided to dedicate some space to them on the blog.
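To give you a first idea before the deep dive, here is a minimal sketch of an ASOF join with the pandas API on Spark; the trades/quotes example, the column names, and the availability of pyspark.pandas.merge_asof in your Spark version (3.3+) are assumptions on my side:

```python
import pyspark.pandas as ps

# Trades and quotes: for each trade, pick the most recent quote at or before the trade time.
trades = ps.DataFrame({
    "time": [1, 5, 10],
    "symbol": ["AAPL", "AAPL", "AAPL"],
    "price": [100.1, 100.3, 100.2],
})
quotes = ps.DataFrame({
    "time": [0, 4, 9],
    "symbol": ["AAPL", "AAPL", "AAPL"],
    "bid": [100.0, 100.2, 100.1],
})

# direction="backward" matches each left row with the closest earlier (or equal) right row
matched = ps.merge_asof(trades, quotes, on="time", by="symbol", direction="backward")
print(matched.to_pandas().sort_values("time"))
```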
If you follow me, you know I'm an Apache Spark enthusiast. Despite that, I'm doing my best to keep my mind open to other technologies. The one that has caught my attention over the past few years is Apache Flink, and I found no better way to start than by comparing it with Apache Spark Structured Streaming.
Software applications, including the data engineering ones you're working on, may require flexible input parameters. These parameters are important because they often identify the tables or data stores the job interacts with, as well as the expected outputs. Despite their utility, they can also cause confusion within the code, especially when not managed properly. Let's see how to address them for PySpark jobs on Databricks.
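As a teaser, here is one common way to make such parameters explicit in a PySpark job; argparse is only one of the possible options, and the parameter and table names below are hypothetical:

```python
import argparse


def parse_arguments(args=None):
    # Declaring the job's inputs in one place makes them discoverable and validated,
    # instead of scattering raw parameter lookups across the code base.
    parser = argparse.ArgumentParser(description="Orders enrichment job")
    parser.add_argument("--input-table", required=True, help="Source table to read")
    parser.add_argument("--output-table", required=True, help="Target table to write")
    parser.add_argument("--execution-date", required=True, help="Partition to process")
    return parser.parse_args(args)


if __name__ == "__main__":
    arguments = parse_arguments()
    # ...pass `arguments` down to the reading/writing logic of the job
```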
Software engineering, and by extension data engineering, has many well-known acronyms. DRY, YAGNI, KISS, RTFM, SRP - they certainly ring a bell. But what about DAMP? Despite being less popular, this acronym also has a beneficial impact on your data engineering projects.
I discovered recursive CTEs during my in-depth SQL exploration back in 2018. However, I had never had an opportunity to implement them in production, until recently, when I was migrating workflows from SQL Server to Databricks and one of them was using recursive CTEs to build a hierarchy table. If this is the first time you've heard about recursive CTEs, let me share my findings with you!
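If you want a first taste, here is a hedged sketch of a recursive CTE building an employee hierarchy; the table and column names are made up, and both the availability of recursive CTEs and the exact WITH RECURSIVE syntax depend on your SQL engine and runtime version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hierarchy_query = """
WITH RECURSIVE employee_hierarchy AS (
    -- anchor member: the top of the hierarchy
    SELECT id, manager_id, name, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive member: attach direct reports to the rows found so far
    SELECT e.id, e.manager_id, e.name, h.level + 1 AS level
    FROM employees e
    JOIN employee_hierarchy h ON e.manager_id = h.id
)
SELECT * FROM employee_hierarchy
"""

spark.sql(hierarchy_query).show()
```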
Before I share the usual retrospective for the past year, I want to thank you for following along in 2025! Even though I'm primarily writing for "me-from-the-future", it's always great to know that people other than my future self find these posts helpful ;)
Databricks Jobs is still one of the best ways to run data processing code on Databricks. It supports a wide range of processing modes, from native Python and Scala jobs to framework-based dbt queries. It doesn't require installing anything on your own, as it's a fully serverless offering. Finally, it's also flexible enough to cover most common data engineering use cases. One of these flexible features is support for different input arguments via the For Each task.
Dealing with numbers can be easy and challenging at the same time. When you operate on integers, you can encounter integer overflow. When you deal with floating-point types, which are the topic of this blog post, you can encounter rounding issues.
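A quick, plain-Python illustration of the kind of rounding issue I mean; the values are just an example:

```python
from decimal import Decimal

# Binary floating-point numbers cannot represent 0.1 exactly, so small errors
# creep into the arithmetic:
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False

# Decimal keeps an exact base-10 representation, at the cost of performance:
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```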
Databricks Asset Bundles (DAB) greatly simplify managing Databricks jobs and resources. They are also flexible: besides the YAML-based declarative way, you can add some dynamic behavior with scripts.
Picture this: you get a list of values in a column and you need to combine each of them with another row. The simplest way to do that is to use the explode operation and create a dedicated row for each combined value. Unluckily for you, several rows in the input have nulls instead of lists.
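Here is a hedged sketch of the problem and of one possible way around it, explode_outer; the column names are mine and this is only one of the workarounds:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("order_1", ["item_a", "item_b"]), ("order_2", None)],
    ["order_id", "items"],
)

# explode() silently drops the rows where the array is null...
df.select("order_id", F.explode("items").alias("item")).show()

# ...while explode_outer() keeps them, with a null item instead
df.select("order_id", F.explode_outer("items").alias("item")).show()
```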
Some time ago, when I was analyzing the execution of my Apache Spark job in the Spark UI, I noticed a limit(...) action. It was weird, as I was actually only running the show(...) command to display the DataFrame locally. At the time I understood why, but hadn't found time to write a blog post. Recently, Antoni reminded me on LinkedIn that I should have blogged about show(...) back then to better answer his question :)
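You can observe it yourself with a sketch like the one below; show(n) only needs the first rows, so Spark plans a limit-like step instead of materializing the whole DataFrame (the exact number of fetched rows and the plan nodes may vary between Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)

# Displays only the first rows, hence the limit visible in the UI
df.show(5)

# An explicit limit produces a comparable plan (look for the CollectLimit node)
df.limit(6).explain()
```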
Your data won't always tell you everything. Often you will be missing some additional and important context. Did the data come from a backfilling run? Was the data generated by the most recent job version you deployed? All those questions can be answered with the Delta Lake feature called user metadata.
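A minimal sketch, assuming the spark.databricks.delta.commitInfo.userMetadata property available in Delta Lake; the table name and the metadata payload are made-up examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Attach a custom context to the next Delta commit(s)
spark.conf.set(
    "spark.databricks.delta.commitInfo.userMetadata",
    '{"job_version": "1.4.2", "run_type": "backfill"}',
)
spark.range(10).write.format("delta").mode("append").saveAsTable("events")

# The metadata is then visible in the table history, next to each commit
spark.sql("DESCRIBE HISTORY events").select("version", "userMetadata").show(truncate=False)
```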
There are two modes for removing data from a Delta Lake table: the data mode and the metadata mode. The first needs to identify the records to remove by explicitly querying the table. On the other hand, the metadata mode doesn't interact with the data at all. It's often faster, but due to its metadata-only character, it's also more limited.
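To illustrate the difference, here is a hedged sketch on a hypothetical visits table partitioned by event_date; whether a given DELETE can be resolved at the metadata level depends on the table layout and on your Delta Lake or Databricks version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Data mode: the predicate references a regular column, so the engine has to read
# the files to find the matching records and rewrite the affected files.
spark.sql("DELETE FROM visits WHERE user_id = 'user_123'")

# Metadata mode: when the predicate only touches partition columns, the matching
# files can be dropped from the transaction log without reading their content,
# which is usually much faster.
spark.sql("DELETE FROM visits WHERE event_date = '2024-01-01'")
```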