Predict Failures. Before They Happen.
One compressor alert, 68 hours early, saved $340K of lost production. We wired up 12,000 sensors across four plants into Kafka, Flink and PyTorch, and trained federated anomaly models that respect each plant’s data.
IoT predictive maintenance pipeline
We built predictive maintenance across 12,000+ sensors and four plants. The models flag failures up to 72 hours out, unplanned downtime fell 34%, and the client banked $8.6M a year in avoided outages.
Sensor data flows through AWS IoT Core into Kafka, and Flink builds rolling and spectral features in real time. Per-plant PyTorch models train with Flower, with Bonawitz secure aggregation and DP-SGD on the gradients, so the shared patterns travel without the raw data leaving the facility. Amazon Chronos and Google TimesFM 2.0 sit underneath as the foundation-model baselines that the bespoke models have to beat per failure mode, and conformal prediction sets the alert thresholds.
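To make the last step concrete, here is a minimal sketch of how a split-conformal alert threshold can be derived from anomaly scores collected on known-healthy data. It is an illustration only, not the production code: the function names, the choice of `alpha`, and the idea of scoring each reading with a single scalar are assumptions.

```python
import math
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.05):
    """Split-conformal threshold from anomaly scores on healthy data.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If new healthy scores are exchangeable with the calibration scores,
    they exceed this value with probability at most alpha.
    """
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to support this alpha.
        return math.inf
    return float(scores[k - 1])

def conformal_p_value(calibration_scores, new_score):
    """Fraction of calibration scores at least as extreme as new_score."""
    scores = np.asarray(calibration_scores, dtype=float)
    return float((1 + np.sum(scores >= new_score)) / (scores.size + 1))

def should_alert(new_score, threshold):
    """Hypothetical alert rule: fire when the score exceeds the threshold."""
    return new_score > threshold
```

A smaller `alpha` means fewer false alarms but a higher bar before an alert fires. Calibrating separately per failure mode would give each mode its own threshold.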
ML Engineering Approach
- Ingestion at the floor: 12,000+ sensors hit AWS IoT Core and land in Kafka. We normalised MQTT, OPC-UA and Modbus into one event schema so the downstream pipelines only see one shape.
- Streaming features: Flink computes rolling stats, spectral features and degradation curves live, and we store them in TimescaleDB for both training and serving so training/serving skew stays bounded.
- Federated training, securely: Each plant trains its own PyTorch model on-site, and we aggregate the shared patterns with Flower (or NVIDIA FLARE on the OT side) under Bonawitz secure aggregation and DP-SGD, so raw data never leaves a facility and gradient leakage is bounded.
- Alerts operators trust: Predictions land in Grafana with a conformal-interval-derived severity score, a per-failure-mode horizon, and a pointer to the sensor that triggered them. The maintenance team acts on ranked alerts, not a wall of red.
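As a rough illustration of the streaming-features step, the sketch below computes rolling statistics and two simple spectral features over sliding windows of one sensor's signal. It uses plain NumPy on an in-memory array rather than a Flink job, and the function names, feature set and window parameters are assumptions chosen for the example.

```python
import numpy as np

def window_features(signal, sample_rate_hz):
    """Summary statistics and simple spectral features for one window."""
    x = np.asarray(signal, dtype=float)
    centered = x - x.mean()  # remove the DC offset before the FFT
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate_hz)
    power = spectrum ** 2
    total_power = power.sum()
    centroid = float((freqs * power).sum() / total_power) if total_power > 0 else 0.0
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "rms": float(np.sqrt(np.mean(x ** 2))),
        "dominant_freq_hz": float(freqs[np.argmax(spectrum)]),
        "spectral_centroid_hz": centroid,
    }

def rolling_features(signal, sample_rate_hz, window, step):
    """Apply window_features to each full window, advancing by `step` samples."""
    return [
        window_features(signal[start:start + window], sample_rate_hz)
        for start in range(0, len(signal) - window + 1, step)
    ]

# Example: a synthetic 50 Hz vibration sampled at 1 kHz.
t = np.arange(1000) / 1000.0
vibration = np.sin(2 * np.pi * 50 * t)
features = rolling_features(vibration, sample_rate_hz=1000, window=200, step=100)
```

Using the same feature function for training and for serving is one way to keep the two paths from drifting apart.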
What was actually hard
Each plant runs different equipment, different cycles and different environments. A model trained on Plant A was worse than useless at Plant B. We needed local learning without centralising sensitive production data, and — harder — we needed the operators to trust a prediction enough to schedule a repair 72 hours before anything on their panel looked wrong.
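The "local learning without centralising" idea can be shown with a toy version of masked aggregation. In the sketch below, each plant adds pairwise random masks to its model update before sending it, and the masks cancel when the server sums the updates, so the server sees only the average. This is a simplification: the full Bonawitz protocol also covers key agreement, dropped clients and finite-field arithmetic, and the DP-SGD clipping and noise are left out. All names and sizes are hypothetical.

```python
import numpy as np

def pairwise_masks(num_clients, dim, seed=0):
    """Random masks that sum to zero across clients.

    For every pair (i, j) with i < j, client i adds a shared random
    vector and client j subtracts the same vector.
    """
    rng = np.random.default_rng(seed)
    masks = np.zeros((num_clients, dim))
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = rng.normal(size=dim)
            masks[i] += shared
            masks[j] -= shared
    return masks

def masked_updates(updates, masks):
    """What each plant would transmit: its update plus its mask."""
    return [np.asarray(u, dtype=float) + m for u, m in zip(updates, masks)]

def aggregate(transmitted):
    """Server-side mean; the pairwise masks cancel in the sum."""
    return np.sum(transmitted, axis=0) / len(transmitted)

# Example: four plants, each with a three-parameter model update.
plant_updates = [np.array([0.1, 0.2, 0.3]) * (k + 1) for k in range(4)]
masks = pairwise_masks(num_clients=4, dim=3)
sent = masked_updates(plant_updates, masks)
global_update = aggregate(sent)
```

The server recovers the same average it would get from the raw updates, while any single transmitted vector looks like noise on its own.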

Project Outcome
Unplanned downtime dropped 34% across all four plants. The 72-hour window gave maintenance crews enough lead time to slot repairs into planned windows instead of shutting a line down at 3am, and the first compressor catch alone paid for the pipeline.
12,000+ sensors monitored
72hr prediction window
34% downtime reduction
$8.6M annual savings


“We caught a compressor failure 68 hours before it would have shut down the line. That single alert saved us $340K in lost production.”
@ Frank B.
VP Manufacturing — Industrial Equipment Maker



