Change for the Better: Improving Predictions by Automating Drift Detection


Change for the Better: Improving Predictions by Automating Drift Detection by Peter Webb & Gokhan Atinc at Big Things Conference 2021
A machine learning solution is only as good as its data. But real-world data does not always stay within the bounds of the training set, posing a significant challenge for the data scientist: how to detect and respond to drifting data? Drifting data poses three problems: detecting and assessing drift-related model performance degradation; generating a more accurate model from the new data; and deploying a new model into an existing machine learning pipeline.
Using a real-world predictive maintenance problem, we demonstrate a solution that addresses each of these challenges: data drift detection algorithms periodically evaluate observation variability and model prediction accuracy; high-fidelity physics-based simulation models precisely label new data; and integration with industry-standard machine learning pipelines supports continuous integration and deployment. We reduce the level of expertise required to operate the system by automating both drift detection and data labeling.
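As a rough illustration of what evaluating observation variability can look like, the sketch below compares a reference sample of one feature (for example, temperatures seen during training) against a recent sample using the two-sample Kolmogorov-Smirnov statistic. This is an assumption made for illustration: the talk does not say which drift detection algorithms it uses, and the function names `ks_statistic` and `drift_detected` and the `alpha` parameter are hypothetical.

```python
import math


def ks_statistic(reference, current):
    """Return the two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the empirical cumulative
    distribution functions of the two samples.
    """
    ref = sorted(reference)
    cur = sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(ref) and ref[i] == x:
            i += 1
        while j < len(cur) and cur[j] == x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value.

    Hypothetical helper: alpha is the significance level of the test.
    """
    n, m = len(reference), len(current)
    # Asymptotic critical value: c(alpha) * sqrt((n + m) / (n * m)),
    # where c(alpha) = sqrt(-0.5 * ln(alpha / 2)).
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold
```

A monitoring job could call `drift_detected(training_temperatures, last_hour_temperatures)` on a schedule, once per feature. A distribution test of this kind only covers the input side; prediction accuracy needs a separate check against labeled data.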
Process automation reduces costs and increases reliability. The lockdowns and social distancing of the last two years reveal another advantage: minimizing human intervention and interaction to reduce risk while supporting essential social services. As we emerge from the worst of this pandemic, accelerating adoption of machine autonomy increases the demand for the automation of human expertise.
Consider a fleet of electric vehicles used for autonomous package delivery. Their batteries degrade over time, increasing charging time and diminishing vehicle range. The batteries are large and expensive to replace, and relying on a statistical estimate of battery lifetime inevitably results in replacing some batteries too soon and some too late. A more cost-effective approach collects battery health and performance data from each vehicle and uses machine learning models to predict the remaining useful lifetime of each battery. But changes in the operating environment may introduce drift into the health and performance data: external temperature, for example, affects a battery’s maximum charge and discharge rate. When the data drifts away from the training set, the model’s predictions become less accurate.
Our solution streams battery data through Kafka to production and training subsystems: a MATLAB Production Server-deployed model that predicts each battery’s remaining useful lifetime and a thermodynamically accurate physical Simulink model of the battery that automatically labels the data for use in training new models. Since simulation-based labeling is much slower than model-based prediction, the simulation cannot be used in production. The production subsystem monitors the deployed model and the streaming data to detect drift. Drift-induced model accuracy degradation triggers the training system to create new models from the most current training sets. Newly trained models are uploaded to a model registry where the production system can retrieve and integrate them into the deployed machine learning pipeline.
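The control flow described above (monitor accuracy, retrain on degradation, publish to a registry, redeploy) can be sketched in a few lines. Everything here is a simplified stand-in rather than the authors' implementation: a plain Python iterable replaces the Kafka stream, `ModelRegistry` is an in-memory list, labels arrive together with the observations instead of coming from a slower simulation, and `train_mean_offset_model` is a toy trainer. All class, function, and parameter names are hypothetical.

```python
from collections import deque


class ModelRegistry:
    """In-memory stand-in for a model registry (hypothetical)."""

    def __init__(self):
        self._versions = []

    def upload(self, model):
        self._versions.append(model)
        return len(self._versions)  # 1-based version number

    def latest(self):
        return self._versions[-1]


class AccuracyMonitor:
    """Tracks mean absolute error over a sliding window of labeled predictions."""

    def __init__(self, window, threshold):
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one error; return True when the full window exceeds the threshold."""
        self.errors.append(abs(predicted - actual))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error > self.threshold

    def reset(self):
        self.errors.clear()


def train_mean_offset_model(samples):
    """Toy trainer: predict label = feature + mean observed offset."""
    offset = sum(label - x for x, label in samples) / len(samples)
    return lambda x: x + offset


def run_pipeline(stream, registry, monitor, train, history=20):
    """Serve predictions; retrain and redeploy when accuracy degrades.

    Returns the number of retraining runs that were triggered.
    """
    model = registry.latest()
    recent = deque(maxlen=history)  # most current training set
    retrain_count = 0
    for x, label in stream:
        recent.append((x, label))
        if monitor.update(model(x), label):
            registry.upload(train(list(recent)))
            model = registry.latest()  # production picks up the new model
            monitor.reset()
            retrain_count += 1
    return retrain_count
```

In this sketch a retrained model may still be fit partly on pre-drift samples, so the monitor can fire again until the training window contains only post-drift data. Resetting the monitor after each deployment keeps errors from the old model from counting against the new one.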