Streaming-First Data Architecture: Why We Replaced Batch ETL with CDC + Iceberg
The Batch ETL Tax
Here’s a conversation that happened every Monday morning at our client’s company: “The numbers in the dashboard don’t match what I’m seeing in SAP.” The answer was always the same — the dashboard shows yesterday’s data, and somebody updated SAP this morning.
This client, a Fortune 500 CPG company, had a classic enterprise data problem. Sales data in SAP, supply chain metrics in custom tools, marketing spend in yet another system. A nightly batch ETL job would extract everything, transform it, and load it into a Snowflake warehouse. By the time executives saw the dashboards on Monday, the data was 24-48 hours stale.
For years, that was acceptable. But when the company started running real-time promotional campaigns and needed to see the impact within hours (not days), the batch pipeline became a bottleneck. They didn’t need faster batch — they needed a different architecture.
The Three-Tier Lakehouse
The 2026 data lakehouse isn’t a single product. It’s a set of interoperating open components arranged in three tiers based on data freshness:
Hot tier — Sub-second queries against continuously updated data. This is where operational dashboards live. We used Apache Flink with materialized views that update as events arrive.
Warm tier — Analytical queries against data that’s minutes to hours old. This is the bread and butter of the lakehouse. Apache Iceberg tables on S3, queryable by Trino and Snowflake.
Cold tier — Historical data for compliance, backfilling, and ad-hoc deep analysis. Same Iceberg tables, but with aggressive compaction and cheaper storage tiers.
The key insight: you don’t stream everything to the hot tier. Most analytical queries are fine with 5-minute-old data. The hot tier is expensive to operate and only justified for operational use cases. We streamed to the warm tier by default and promoted specific tables to the hot tier only when the business case demanded it.
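The default-to-warm rule is simple enough to write down as a routing function. This is an illustrative sketch only: `assign_tier` and its two flags are hypothetical names, not part of any tool in the stack.

```python
def assign_tier(operational_use_case: bool = False,
                is_historical: bool = False) -> str:
    """Return the tier a table should land in.

    Hypothetical policy helper: warm is the default, hot requires an
    explicit operational business case, cold holds historical data.
    """
    if operational_use_case:
        return "hot"
    if is_historical:
        return "cold"
    return "warm"


if __name__ == "__main__":
    print(assign_tier())                           # default path
    print(assign_tier(operational_use_case=True))  # promoted table
```

The point of encoding it is that promotion to the hot tier becomes an explicit, reviewable decision rather than a default.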
CDC: The ETL Killer
Change Data Capture with Debezium was the backbone of our migration. Instead of running nightly SELECT * queries against production databases (which hammered performance), Debezium reads the database transaction log and emits every insert, update, and delete as an event on a Kafka topic.
The setup was straightforward:
PostgreSQL (source) → Debezium → Kafka → Flink → Iceberg (S3)
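For the first hop, a Debezium PostgreSQL connector is registered with Kafka Connect as a JSON payload. The sketch below builds such a payload. The host, user, table names, and topic prefix are placeholders we made up for illustration, and `topic.prefix` assumes Debezium 2.x (older releases used `database.server.name`).

```python
import json


def debezium_postgres_connector(name, host, dbname, tables, topic_prefix):
    """Build a Kafka Connect payload for a Debezium PostgreSQL source.

    All argument values are supplied by the caller; nothing here talks
    to a real cluster.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",           # logical decoding plugin built into PostgreSQL 10+
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_reader",       # placeholder account name
            "database.password": "<from-secret-store>",  # placeholder, never hard-code
            "database.dbname": dbname,
            "topic.prefix": topic_prefix,        # topics become <prefix>.<schema>.<table>
            "table.include.list": ",".join(tables),
        },
    }


if __name__ == "__main__":
    payload = debezium_postgres_connector(
        name="sales-cdc",
        host="pg.internal.example",
        dbname="sales",
        tables=["public.orders", "public.shipments"],
        topic_prefix="sales",
    )
    # This JSON would be POSTed to the Kafka Connect REST endpoint /connectors.
    print(json.dumps(payload, indent=2))
```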
But “straightforward” doesn’t mean “easy.” Three things bit us:
Schema evolution. When the source team added a column to the SAP integration table, the CDC stream started producing events with a new field. Our Flink job expected the old schema and crashed. We added an Avro schema registry with compatibility checks: new fields are allowed, removed fields are not, and type changes require a migration plan.
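The compatibility policy itself fits in a few lines. The sketch below is a simplified stand-in for what a schema registry enforces, not the registry's API: schemas are plain `name -> type` dictionaries and `check_compatibility` is a hypothetical helper.

```python
from typing import Dict, List


def check_compatibility(old_fields: Dict[str, str],
                        new_fields: Dict[str, str]) -> List[str]:
    """Return policy violations when moving from old_fields to new_fields.

    Policy from the text: added fields are fine, removed fields are
    rejected, type changes need a migration plan.
    """
    violations = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            violations.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            violations.append(
                f"type change on {name}: {old_type} -> {new_fields[name]} "
                "(needs migration plan)"
            )
    return violations


if __name__ == "__main__":
    old = {"order_id": "long", "amount": "double"}
    added = {"order_id": "long", "amount": "double", "channel": "string"}
    removed = {"order_id": "long"}
    print(check_compatibility(old, added))    # no violations: new field only
    print(check_compatibility(old, removed))  # flags the dropped column
```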
Tombstone events. Debezium emits tombstone events (null value, non-null key) for hard deletes. Our initial Flink job treated these as malformed events and dropped them. That meant deleted records persisted in the lakehouse forever. We added explicit tombstone handling that maps to Iceberg equality deletes.
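Here is a minimal sketch of explicit tombstone handling. It assumes the standard Debezium change-event envelope (`op`, `before`, `after`) and uses an in-memory dictionary as a stand-in for the Iceberg table; a real job would issue equality deletes instead of `dict.pop`. The function name is hypothetical.

```python
def apply_change(table: dict, key, value) -> str:
    """Apply one CDC record to `table` and report how it was handled."""
    if value is None:
        # Tombstone: non-null key, null value. Treat it as a delete
        # instead of discarding it as malformed.
        table.pop(key, None)
        return "tombstone"
    op = value.get("op")
    if op == "d":
        table.pop(key, None)
        return "delete"
    if op in ("c", "u", "r"):  # create, update, snapshot read
        table[key] = value["after"]
        return "upsert"
    raise ValueError(f"unexpected op: {op!r}")


if __name__ == "__main__":
    table = {}
    apply_change(table, 1, {"op": "c", "before": None, "after": {"id": 1, "qty": 5}})
    apply_change(table, 1, {"op": "u", "before": {"id": 1, "qty": 5}, "after": {"id": 1, "qty": 7}})
    apply_change(table, 1, None)  # tombstone removes the row
    print(table)
```

Making the delete idempotent (`pop` with a default) matters because a delete event and its tombstone can both arrive for the same key.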
Backfill vs. streaming. When you first connect Debezium to a database, it needs to snapshot the entire table before it can start streaming changes. For a 500M-row table, that snapshot took 6 hours and produced a spike that overwhelmed our Kafka cluster. We ended up doing the initial backfill with Spark (bulk load into Iceberg) and only switched to Debezium streaming after the baseline was in place.
Apache Iceberg: Why We Picked It
We evaluated Delta Lake, Hudi, and Iceberg. All three solve similar problems — ACID transactions on object storage, schema evolution, time travel. We chose Iceberg for three reasons:
- Engine independence. Iceberg works with Spark, Flink, Trino, Snowflake, and DuckDB. Delta Lake has better Spark integration but weaker support elsewhere. Since our client uses both Snowflake and Trino, Iceberg was the pragmatic choice.
- Hidden partitioning. Iceberg derives partition values from transforms on ordinary columns (for example, the day of a timestamp) and tracks them in table metadata, so users never see or maintain separate partition columns. Queries filter on the original column and the engine prunes partitions automatically. This matters when non-technical users are writing Trino queries.
- Community momentum. As of early 2026, Iceberg has won the table format debate for new projects. The tooling ecosystem is growing faster than the alternatives. We’re building for the next 5 years, not the next 5 months.
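To make the hidden-partitioning point concrete, here is a toy model of a day transform and the pruning it enables. It mirrors the idea (days since the Unix epoch as the partition value) but is not Iceberg code; the file dictionaries and function names are hypothetical.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Map a timezone-aware timestamp to days since the Unix epoch (UTC)."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def prune_files(files, start: datetime, end: datetime):
    """Keep only files whose partition day overlaps [start, end].

    The caller filters on the timestamp; the partition value is derived
    here, which is the part the user never has to think about.
    """
    lo, hi = day_transform(start), day_transform(end)
    return [f for f in files if lo <= f["partition_day"] <= hi]


if __name__ == "__main__":
    utc = timezone.utc
    files = [
        {"path": f"d{d}.parquet",
         "partition_day": day_transform(datetime(2026, 1, d, tzinfo=utc))}
        for d in (1, 2, 3)
    ]
    hits = prune_files(files,
                       datetime(2026, 1, 2, 0, 0, tzinfo=utc),
                       datetime(2026, 1, 2, 23, 59, tzinfo=utc))
    print([f["path"] for f in hits])
```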
The Compaction Problem
Here’s the thing nobody tells you about streaming to Iceberg: it creates a lot of small files. If you commit every 30 seconds, each commit creates a new set of data files. After a day, you have thousands of tiny files. Query performance craters because the query engine has to open and read each one.
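The arithmetic is worth spelling out. The 30-second commit interval comes from the example above; the number of files written per commit is an assumption that depends on writer parallelism and partition count.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(commit_interval_s: int, files_per_commit: int) -> int:
    """Data files created per table per day at a fixed commit interval."""
    commits = SECONDS_PER_DAY // commit_interval_s
    return commits * files_per_commit


if __name__ == "__main__":
    # One file per commit is the best case.
    print(files_per_day(30, 1))   # 2880
    # Assumed: four parallel writers, each producing one file per commit.
    print(files_per_day(30, 4))   # 11520
```

Even the best case produces 2,880 files per table per day, and that is before multiplying by the number of tables.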
The fix is compaction — periodically merging small files into larger ones. We run compaction as a scheduled Flink job every 15 minutes for hot tables and every 6 hours for warm tables. The trade-off: during compaction, queries see slightly stale metadata. We accept this because the alternative — no compaction — makes queries 10-50x slower.
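Conceptually, compaction is a bin-packing problem: group small files so each group lands near a target size. The sketch below uses first-fit decreasing on file sizes in MB. It illustrates the idea only and is not the Iceberg or Flink implementation; the function name and thresholds are assumptions chosen for the example.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int],
                    target_mb: int = 512,
                    small_threshold_mb: int = 128) -> List[List[int]]:
    """Group small files into rewrite batches no larger than target_mb.

    Files at or above small_threshold_mb are left alone. Groups with a
    single file are dropped, since rewriting them gains nothing.
    """
    small = sorted((s for s in file_sizes_mb if s < small_threshold_mb),
                   reverse=True)
    bins: List[List[int]] = []
    for size in small:
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            bins.append([size])
    return [group for group in bins if len(group) > 1]


if __name__ == "__main__":
    sizes = [600, 100, 100, 100, 50]
    print(plan_compaction(sizes, target_mb=256))  # [[100, 100, 50]]
```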
We also learned to set snapshot expiration aggressively. Iceberg’s time-travel feature is useful for debugging but creates metadata bloat. We keep 7 days of snapshots for hot tables and 30 days for cold tables. Anything older gets expired.
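The retention rule can be sketched as a filter over snapshot timestamps. This is a simplified model, not the Iceberg maintenance API: snapshots are `(id, committed_at)` tuples, and the helper always keeps the most recent snapshot so the table stays readable.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now: datetime, retention: timedelta):
    """Return ids of snapshots older than `retention`, oldest first.

    The newest snapshot is never expired, even if it is past the cutoff.
    """
    if not snapshots:
        return []
    ordered = sorted(snapshots, key=lambda s: s[1])
    current_id = ordered[-1][0]
    cutoff = now - retention
    return [sid for sid, committed_at in ordered
            if committed_at < cutoff and sid != current_id]


if __name__ == "__main__":
    now = datetime(2026, 1, 31, tzinfo=timezone.utc)
    snaps = [
        (1, now - timedelta(days=40)),
        (2, now - timedelta(days=10)),
        (3, now - timedelta(days=1)),
    ]
    print(snapshots_to_expire(snaps, now, timedelta(days=7)))   # [1, 2]
    print(snapshots_to_expire(snaps, now, timedelta(days=30)))  # [1]
```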
Results
After 16 weeks:
- 47 data sources consolidated from 3 business units
- 5-minute data freshness replacing 24-hour batch cycles
- 85% faster executive reporting (queries that took minutes now take seconds)
- $6.2B portfolio visible in real-time for the first time
- Zero nightly batch failures (because there are no nightly batches)
The last point matters more than it sounds. The old batch pipeline failed 2-3 times per month — a corrupted file, a source system timeout, a schema change. Each failure meant stale dashboards until someone noticed and fixed it. Streaming with CDC is more resilient because it processes changes incrementally. A 5-minute blip in the source system means 5 minutes of delayed data, not a full-day reprocessing job.
What We’d Do Differently
Start with fewer sources. We migrated all 47 sources simultaneously. It would have been faster (and less stressful) to migrate the 5 highest-priority sources first, prove the architecture, and then onboard the rest in waves.
Invest in a data quality layer earlier. We added Great Expectations checks after week 10. By then, we’d already shipped bad data twice — once from a Debezium tombstone bug, once from a schema evolution we didn’t catch. Data quality gates should be in the pipeline from day one.
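A day-one quality gate does not need to be elaborate. The sketch below is plain Python rather than the Great Expectations API, and the column names are made up; it checks the two failure modes that are cheapest to catch, nulls in required columns and duplicate keys.

```python
from typing import Dict, Iterable, List


def quality_gate(rows: Iterable[Dict], key: str,
                 required: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the batch passes."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: {col} is null")
        k = row.get(key)
        if k in seen:
            problems.append(f"row {i}: duplicate {key}={k}")
        seen.add(k)
    return problems


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 12.0},   # duplicate key
        {"order_id": 2, "amount": None},   # missing amount
    ]
    for problem in quality_gate(batch, key="order_id", required=["order_id", "amount"]):
        print(problem)
```

A gate like this would block the commit and alert when the list is non-empty, rather than letting the batch through and checking afterwards.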
Use dbt for the warm tier from the start. We initially wrote Flink SQL for all transformations. Flink is great for streaming, but for the warm tier (where data is already in Iceberg), dbt’s SQL-based transformations are simpler to write, test, and maintain. We migrated warm-tier transforms to dbt in week 12 and wished we’d done it in week 2.