Stopping a Structured Streaming query

Streaming jobs are supposed to run continuously, but that applies only to the data processing logic. After all, sometimes you need to release a new job package with upgraded dependencies or improved business logic. What happens then?

Continue Reading →

Data enrichment strategies in Apache Flink

Data enrichment is a crucial step in making data more usable for business users. Doing that in a batch job is relatively easy thanks to the static nature of the dataset. When it comes to streaming, the task is more challenging.

Continue Reading →

Rolling history logs in Spark History UI

Stream processing is great, but it brings some gotchas that are not obvious. Logs are one of them.

Continue Reading →

Schema tracking in Delta Lake

Streaming from Delta tables is slightly different from streaming from native streaming sources, such as Apache Kafka topics. One of the significant differences is schema enforcement: the job fails when the schema of the streamed table changes.

Continue Reading →

StreamingQueryListener, from states to questions

Apache Spark leverages the observer design pattern for framework-to-code communication. One of the observers you can implement is StreamingQueryListener.

Continue Reading →

Processing time trigger, to be or not to be?

That's the question. Omitting the processing time trigger means more reactive micro-batch triggering, but it cannot be considered the one true best practice. Let's see why.

Continue Reading →

Apache Flink and the input data reading

I'm writing this unexpected blog post because I got stuck with watermarks and checkpoints and felt that I was missing some basics. Even though this introduction is a bit negative, exploring the data reading part led me to other discoveries.

Continue Reading →

Anatomy of a Structured Streaming job

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only a high-level view, though. Under the hood, many other interesting things happen.

Continue Reading →

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput of Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Did you know that you can also control the lower limit, at least for Apache Kafka?

Continue Reading →

What's new on the cloud for data engineers - part 12 (10.2023-02.2024)

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 5 months.

Continue Reading →

Table file formats - streaming writer: Delta Lake

In the previous blog post of the series, we discovered the streaming reader. However, an end-to-end streaming Delta Lake pipeline also requires a writer, which will be our focus today.

Continue Reading →

Apache Flink and cluster components deep dive

Previously you could read about the transformation of a user job definition into an executable stream graph. Since that explanation was relatively high-level, I decided to take a deep dive into the final step: executing the code.

Continue Reading →

Static enrichment dataset with Delta Lake

Data enrichment is one of the common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Continue Reading →

Table file formats - streaming reader: Delta Lake

Even though I'm into streaming these days, I haven't really covered streaming in Delta Lake yet. I only briefly blogged about Change Data Feed but completely missed the fundamentals. Hopefully, this and the next blog posts will change this!

Continue Reading →

Files streaming is quite a challenge

It's technically possible to process files continuously from a streaming job. However, if you are expecting a latency-sensitive job, this will always be slower than processing data directly from a streaming broker. Why?

Continue Reading →