Data Engineering · Apache Kafka · Airflow · Architecture

Batch vs. Streaming Processing: How to Choose the Right Model for Your Data

7 April 2026 · 8 min read

Your CEO wants to see last night's revenue on a dashboard when she arrives at work. Your fraud team needs to block a suspicious transaction within 2 seconds of it occurring. Both are data problems — but they require completely different solutions. The choice between batch and streaming is fundamentally about one question: how stale can your data be before it stops being useful?

The analogy: doing laundry

Imagine you work at a laundromat. Batch processing is like waiting until all the machines are full before running them — you collect dirty clothes all day and do one big wash at 2am. Streaming is like washing each item the moment a customer drops it off.

Both approaches get clothes clean. But they require completely different equipment, workflows, staffing, and costs. A 2am overnight wash cycle is far cheaper and simpler to run than a 24/7 instant-wash operation. The question is which one your customers actually need.

What is batch processing?

In a batch pipeline, data accumulates over a period of time and is processed in one scheduled run — typically hourly, nightly, or weekly. The pipeline starts, does its work, and stops.

Common tools: Apache Airflow (orchestration), dbt (SQL transformations), Apache Spark in batch mode, AWS Glue, Google Cloud Dataflow in batch mode.
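The batch model itself is simple enough to fit in a few lines. Here is a toy Python sketch of a nightly revenue job — the kind of function an Airflow DAG would trigger at 2am. The data and field names are illustrative, not from any real system:

```python
from collections import defaultdict

def nightly_revenue(orders):
    """Batch job: read everything accumulated since the last run,
    aggregate it in one pass, emit a summary, then stop."""
    totals = defaultdict(float)
    for order in orders:  # the whole day's data, all at once
        totals[order["product"]] += order["amount"]
    return dict(totals)

# A day's accumulated orders, processed in a single scheduled run
orders = [
    {"product": "basic", "amount": 10.0},
    {"product": "pro", "amount": 30.0},
    {"product": "basic", "amount": 10.0},
]
print(nightly_revenue(orders))  # {'basic': 20.0, 'pro': 30.0}
```

Note the shape: the job has a beginning and an end. Everything hard about scheduling, retries, and dependencies lives in the orchestrator, not in the transformation itself.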

Batch is the right choice when you need:

  • Nightly revenue reports and financial summaries
  • Monthly invoicing or billing runs
  • Historical data backfills or migrations
  • Training machine learning models on past data
  • Any reporting where data from a few hours ago is “fresh enough”

The vast majority of analytics workloads at growing companies fall into this category. A dashboard showing yesterday’s sales is usually enough — nobody needs to know the exact revenue figure from 4 minutes ago to make a business decision.

What is stream processing?

In a streaming pipeline, each event is processed the moment it arrives — or within seconds. Instead of a scheduled job that starts and stops, a streaming system runs continuously, consuming events one by one (or in tiny micro-batches of a few seconds).

Common tools: Apache Kafka (message broker/stream backbone), Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis.
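The shape of a streaming consumer is different: an open-ended loop that makes a decision per event. This is a pure-Python stand-in for a Kafka consumer — the generator plays the role of the topic, and the threshold and field names are made up for illustration:

```python
def event_stream():
    """Stand-in for a Kafka topic: yields events as they 'arrive'."""
    yield {"user": "alice", "amount": 120.0}
    yield {"user": "bob", "amount": 9500.0}
    yield {"user": "carol", "amount": 42.0}

alerts = []
for event in event_stream():    # in a real system this loop never ends
    if event["amount"] > 5000:  # decide per event, within milliseconds
        alerts.append(event["user"])

print(alerts)  # ['bob']
```

The decision happens the moment the event arrives, not hours later when a scheduled job wakes up — that immediacy is the whole point, and the source of all the cost discussed below.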

Streaming is genuinely necessary when you need:

  • Fraud detection — block a bad transaction within 2 seconds
  • Real-time dashboards — live user counts, live order tracking
  • IoT sensor monitoring — alert when a temperature exceeds a threshold
  • Operational alerting — notify on-call when error rate spikes
  • Live personalisation — recommend based on what a user did 30 seconds ago

The tradeoff: why streaming costs more

Here’s the part most engineers don’t explain clearly enough: streaming is significantly harder and more expensive to build and operate than batch. Not a little — quite a lot.

With batch, your pipeline runs once, does its work, and turns off. With streaming, you’re running a system 24/7. You must now handle:

  • Message ordering — events can arrive out of sequence. Does it matter?
  • Late-arriving data — what if an event from 2 hours ago shows up now?
  • Exactly-once processing — what happens when your consumer crashes mid-way? Do you reprocess?
  • Consumer group management — multiple consumers reading the same Kafka topic
  • Stateful computations — “count unique users in the last 5 minutes” requires maintaining state across events
  • Schema evolution — what if the event format changes?

None of these problems exist in batch. Batch is simple: read data, transform it, write it. Done. Every one of these streaming concerns requires additional engineering and ongoing operational overhead.
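To make the stateful-computation point concrete, here is a minimal sketch of "count unique users in the last 5 minutes" as a sliding window. The class name and timestamps are invented for illustration; real engines (Flink, Kafka Streams) also have to checkpoint this state so it survives restarts:

```python
from collections import deque

WINDOW = 300  # seconds

class UniqueUserWindow:
    """Stateful streaming operator: unique users over the last 5 minutes.
    The deque is state that must persist across events -- and, in a real
    system, across crashes, which is what makes streaming hard."""
    def __init__(self):
        self.events = deque()  # (timestamp, user) pairs inside the window

    def add(self, ts, user):
        self.events.append((ts, user))
        while self.events and self.events[0][0] <= ts - WINDOW:
            self.events.popleft()  # expire events older than the window
        return len({u for _, u in self.events})

w = UniqueUserWindow()
print(w.add(0, "alice"))    # 1
print(w.add(100, "bob"))    # 2
print(w.add(350, "carol"))  # 2  (alice@0 has expired; bob and carol remain)
```

In batch, the same question is a one-line `COUNT(DISTINCT user)` over a table. In streaming, it is a long-lived data structure you own, operate, and recover.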

The middle ground: micro-batching

Most teams don’t actually need true streaming. They need near real-time — data that’s 5 minutes old instead of 12 hours old. That’s micro-batching: running your batch job every 5–10 minutes rather than nightly.

Spark Structured Streaming and dbt incremental models can run in micro-batch mode. You get most of the freshness benefit of streaming at a fraction of the operational complexity. No Kafka, no consumer groups, no watermarks. Just a very fast batch job.

The micro-batch sweet spot: If your stakeholders ask for “near real-time” data, try 15-minute batch first. In most cases, nobody can tell the difference — and you’ve saved yourself months of streaming complexity.
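The core trick behind micro-batching (and dbt incremental models) is a high-water mark: each run picks up only the rows newer than the last mark, then advances it. A toy sketch, with invented field names:

```python
def micro_batch(records, last_processed_ts):
    """One micro-batch run: process only rows newer than the high-water
    mark, then advance the mark. A fast batch job -- not a 24/7 consumer."""
    new = [r for r in records if r["ts"] > last_processed_ts]
    total = sum(r["amount"] for r in new)
    new_mark = max((r["ts"] for r in new), default=last_processed_ts)
    return total, new_mark

records = [{"ts": 1, "amount": 10.0},
           {"ts": 2, "amount": 5.0},
           {"ts": 3, "amount": 7.0}]
# A scheduler runs this every 5-10 minutes instead of nightly
total, mark = micro_batch(records, last_processed_ts=1)
print(total, mark)  # 12.0 3
```

The only state you carry between runs is a single timestamp — no consumer offsets, no windowed state, no watermarks.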

Going deeper: watermarks and exactly-once

If you do go streaming, two concepts you’ll encounter constantly:

Watermarks tell your system how late is too late. If a mobile app event was generated at 10:00am but arrives at your consumer at 12:00pm (because the user was offline), do you still process it? A watermark is the cutoff — events past it are dropped or handled as exceptions. Get this wrong and you’ll see mysterious gaps in your data where late events were silently discarded.
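The cutoff logic is simple to sketch. This toy version (timestamps in seconds since midnight, one-hour lateness allowance — both invented for illustration) routes too-late events to a side channel rather than silently dropping them:

```python
WATERMARK_DELAY = 3600  # accept events up to 1 hour late

def handle(event, current_watermark):
    """Route each event: process it if it falls within the allowed
    lateness, otherwise divert it instead of silently discarding it."""
    if event["event_ts"] >= current_watermark - WATERMARK_DELAY:
        return "process"
    return "late"  # log it / send to a dead-letter queue for inspection

# Event generated at 10:00 (t=36000), arriving when the stream has
# advanced to 12:00 (t=43200) -- two hours late, past the cutoff
print(handle({"event_ts": 36000}, current_watermark=43200))  # 'late'
print(handle({"event_ts": 42000}, current_watermark=43200))  # 'process'
```

Routing late events somewhere visible is what saves you from those mysterious gaps: you can at least count what you dropped.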

Exactly-once semantics means each event is processed precisely once, even if your system crashes mid-run. Both Flink and Kafka Streams support this, but it adds overhead. If your use case can tolerate duplicates (“at-least-once”) — like counting page views where a few extra doesn’t matter — you can skip this complexity. If you’re processing financial transactions, you cannot.
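A common practical pattern sits between the two: keep at-least-once delivery but make processing idempotent, so redelivered duplicates are recognised and skipped. A minimal sketch — in production the seen-ID set would be a durable store, not in-memory:

```python
processed_ids = set()  # in production: a durable store (e.g. a DB table)

def process_once(event):
    """At-least-once delivery + idempotent handling = effectively-once.
    A redelivered event is recognised by its ID and skipped."""
    if event["id"] in processed_ids:
        return False  # duplicate after a crash or retry: ignore it
    processed_ids.add(event["id"])
    # ... apply the side effect here (charge the card, write the row) ...
    return True

print(process_once({"id": "txn-1"}))  # True  (first delivery)
print(process_once({"id": "txn-1"}))  # False (redelivered duplicate)
```

The catch is that the ID check and the side effect must commit atomically — which is exactly the overhead that engine-level exactly-once support is buying you.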

How to decide: the three-question framework

Ask yourself:

  1. How stale can this data be before it stops being useful? If the answer is “a day is fine,” nightly batch wins.
  2. What’s the real cost of being 15 minutes late? For most reporting use cases, nothing bad happens. For fraud detection or live trading, everything breaks.
  3. Does delayed or incorrect data cause actual harm? Financial, legal, or safety-critical harm justifies streaming. A slightly stale KPI dashboard does not.

Most growing companies will answer “batch” or “micro-batch” for 90% of their use cases. Reserve streaming for the 10% where freshness is genuinely business-critical. Start simple. Upgrade when the evidence is clear.
