
Predict Failures.
Before They Happen.

One compressor alert, 68 hours early, saved $340K of
lost production. We wired 12,000 sensors across
four plants into Kafka, Flink and PyTorch, and
trained federated anomaly models that keep each
plant’s raw data on site.


IoT predictive maintenance pipeline

Industry: Industrial Manufacturing
Timeline: 16 weeks
Key result: 72-hour predictions, $8.6M savings
Tech stack: Apache Kafka, Apache Flink, TimescaleDB, Chronos / TimesFM 2.0 baselines, PyTorch, Flower federated learning + Bonawitz secure aggregation + DP-SGD, conformal prediction, Grafana, AWS IoT Core

We built predictive maintenance across 12,000+ sensors and four plants. The models flag failures up to 72 hours ahead, unplanned downtime fell 34%, and the client saves $8.6M a year in avoided outages.

Sensor data flows through AWS IoT Core into Kafka, and Flink builds rolling and spectral features in real time. Per-plant PyTorch models train with Flower, with Bonawitz secure aggregation and DP-SGD applied to the gradients, so the shared patterns travel without the raw data leaving the facility. Chronos and TimesFM 2.0 serve as the foundation-model baselines that the bespoke models have to beat on each failure mode, and conformal prediction sets the alert thresholds.
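The aggregation step can be illustrated with a minimal sketch of sample-weighted federated averaging, the basic idea behind combining per-plant models. This is an illustration under stated assumptions, not the project's code: the function name `federated_average`, the parameter name `layer.weight` and the sample counts are hypothetical, plain Python lists stand in for PyTorch tensors, and secure aggregation and DP-SGD are deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average per-plant parameters, weighting each plant by its sample count.

    Only parameters and counts are exchanged; raw sensor data stays on site.
    """
    if not updates:
        raise ValueError("need at least one plant update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        acc = [0.0] * len(values)
        for weights, n in updates:
            for i, v in enumerate(weights[name]):
                acc[i] += v * n / total
        averaged[name] = acc
    return averaged


# Two hypothetical plants with different amounts of local training data.
plant_a = ({"layer.weight": [1.0, 2.0]}, 100)
plant_b = ({"layer.weight": [3.0, 6.0]}, 300)
global_weights = federated_average([plant_a, plant_b])
```

Here the plant with 300 samples pulls the shared parameters three times as hard as the plant with 100. In a real deployment a framework such as Flower would handle this step, with secure aggregation hiding each individual plant's update from the server.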

ML Engineering Approach
  • Ingestion at the floor — 12,000+ sensors hit AWS IoT Core and land in Kafka. We normalised MQTT, OPC-UA and Modbus into one event schema so the downstream pipelines only see one shape.

  • Streaming features — Flink computes rolling stats, spectral features and degradation curves live, and we store them in TimescaleDB for both training and serving so training/serving skew stays bounded.

  • Federated training, securely — Each plant trains its own PyTorch model on-site, and we aggregate the shared patterns with Flower (or NVIDIA FLARE on the OT side) under Bonawitz secure aggregation and DP-SGD, so raw data never leaves a facility and gradient leakage is bounded.

  • Alerts operators trust — Predictions land in Grafana with a conformal-interval-derived severity score, a per-failure-mode horizon, and a pointer to the sensor that triggered them. The maintenance team acts on ranked alerts, not a wall of red.
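How a conformal threshold and severity score might be derived can be sketched as follows. The source does not say how the severity score is computed, so this is one plausible reading: split conformal prediction over anomaly scores collected during healthy operation. The function names, the calibration scores and the miscoverage level `alpha` are all hypothetical.

```python
import math
from typing import Sequence


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold for anomaly scores.

    Assuming new healthy readings are exchangeable with the calibration set,
    a healthy score exceeds this threshold with probability at most alpha.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: never alert.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def conformal_p_value(calibration_scores: Sequence[float], score: float) -> float:
    """Share of calibration scores at least as extreme as the new score."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= score)
    return (at_least + 1) / (n + 1)


# Hypothetical anomaly scores from healthy operation (calibration set).
calibration = [float(i) for i in range(1, 20)]   # 1.0 .. 19.0
threshold = conformal_threshold(calibration, alpha=0.1)

new_score = 25.0                                  # hypothetical live reading
p_value = conformal_p_value(calibration, new_score)
severity = 1.0 - p_value                          # higher means more unusual
should_alert = new_score > threshold
```

Ranking alerts by `severity` rather than by raw anomaly score gives operators a number with a stated false-alarm interpretation, which fits the goal of ranked alerts instead of a wall of red. A per-failure-mode setup would keep a separate calibration set for each mode.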

What was actually hard

Each plant runs different equipment, different cycles and different environments. A model trained on Plant A was worse than useless at Plant B. We needed local learning without centralising sensitive production data, and — harder — we needed the operators to trust a prediction enough to schedule a repair 72 hours before anything on their panel looked wrong.


Project Outcome

Unplanned downtime dropped 34% across all four plants. The 72-hour window gave maintenance crews enough lead time to slot repairs into planned windows instead of shutting a line down at 3am, and the first compressor catch alone paid for the pipeline.

> 12K+ sensors monitored
> 72hr prediction window
> 34% downtime reduction
> $8.6M annual savings

“We caught a compressor failure 68 hours before it would have shut down the line. That single alert saved us $340K in lost production.”

@ Frank B.

VP Manufacturing — Industrial Equipment Maker
