Implementing Artificial Intelligence to solve a practical business problem for a software client.
This project was one of my personal favorites among the recent work we completed. The problem uncovered a few unexpected challenges, and the solution was less straightforward than we initially estimated. But ultimately the anomaly detector produced results that beat our expectations, once again validating the interest in neurocomputing that is overtaking the industry.
Problem
Our client is a software company specializing in remote data acquisition, asset tracking and telematics. The system is a mission-critical component for their customers and must satisfy high availability and uptime requirements. To maintain the service level, our client’s devops team has implemented over 500 alerts to identify negative performance trends and address developing problems before service fails. Their effort was very successful and decreased customer service incidents relating to service degradation by 90%. Excellent.
However, two problems remained. One, because of the distributed nature of the data collection devices, a localized service degradation could occur. Two, customer-specific issues arose that were not reflected in general trends. The general variability in data throughput and volume masked those problems. As a result, the team discovered a degradation only when a customer called to report the problem. Not the best outcome.
The team considered customer-specific alerts. But the high rate of customer acquisition and the variability of usage trends made customer-specific alerts impractical to set up and manage. They generated a high rate of false positives, and the devops team lost confidence in the system.
Solution
Our team agreed the solution had to include a machine learning system able to learn the trends over time on a per-customer and per-region basis. The system also had to learn the variability on an hourly, daily, weekly and monthly basis.
We considered a few options but settled on a Long Short-Term Memory (LSTM) neural network implementation. These types of networks excel at finding complex relationships in multivariate time series data. A perfect fit.
The basic idea of anomaly detection with an LSTM neural network is this: the system looks at the previous values over hours or days and predicts the behavior for the next minute. If the actual value a minute later is within, let’s say, one standard deviation of the predicted value, then there is no problem. If it deviates by more than that, it is flagged as an anomaly.
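To make the thresholding idea concrete, here is a minimal sketch in Python. It assumes the model's predictions are already available as an array and that the standard deviation is measured on the prediction errors from a validation period; the function name, the sample numbers, and that choice of standard deviation are illustrative assumptions, not the production code.

```python
import numpy as np

def flag_anomalies(actual, predicted, residual_std, n_sigmas=1.0):
    """Return a boolean mask that is True where the actual value deviates
    from the prediction by more than n_sigmas * residual_std."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted) > n_sigmas * residual_std

# Hypothetical per-minute query counts for one customer.
predicted = np.array([100.0, 102.0, 98.0, 101.0])
actual = np.array([101.0, 99.0, 60.0, 103.0])

# Assumed: standard deviation of prediction errors seen during validation.
residual_std = 5.0

mask = flag_anomalies(actual, predicted, residual_std)
print(mask)  # only the third minute (60 vs. 98) exceeds the threshold
```

Raising `n_sigmas` trades sensitivity for fewer false positives, which matters when the alerts feed a team that has already been burned by noisy alerting.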
Results
The results met our expectations: the system predicted behaviour with 98.8% accuracy during two months of parallel testing. Although only one qualified service degradation event occurred during the test period, that anomaly was successfully detected. Further validating the efficacy of using Artificial Intelligence for performance monitoring, this anomaly was missed by the established threshold methods.
I will dive into details on the approach and the winning design in the sections that follow.
Conclusions
Adoption of neurocomputing across many industries is on the rise, and with it comes a growing number of success stories. Neural networks are helping to turn previously intractable problems into solvable challenges. They also improve on the outcomes of previous solutions by a margin that warrants investment.
We encourage all devops teams and business leaders to revisit problems that were once shelved for the lack of available solutions, and all problems where results are suboptimal. There are novel methods that can elevate the business to the next level of cognitive computing.
Our Approach
We had access to six months of usable data as a training dataset. The team identified 18 parameters that could help the LSTM learn the sequences. Examples include the number of reporting devices, number of queries, number of failed queries, average query execution time, company IDs, and geo-regions. The LSTM was designed to predict 5 output values for the next minute, such as the number of queries and the number of reporting devices.
Since LSTM networks analyse the previous values in time steps, we chose three tensor configurations: 16, 64, and 256 time steps. We built tensors with both non-overlapping and overlapping time windows. We resampled the data to a 1-minute frequency, using sum and mean aggregation for the values. The team planned to test these tensor shapes and identify the best configuration empirically.
Similarly, the team built three LSTM architectures that varied in the number of LSTM layers, hidden neurons, and fully connected layers. We wanted to identify the best performers, along with the proper hyperparameters, through experimentation.
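The resampling and windowing steps can be sketched as follows. This is an illustration under assumptions: the column names (`queries`, `query_time_ms`), the sample timestamps, and the choice of which column is summed and which is averaged are hypothetical, and only two features are shown rather than the 18 used in the project.

```python
import numpy as np
import pandas as pd

def resample_to_minutes(events):
    """Aggregate raw events to a 1-minute frequency: counts are summed,
    timings are averaged. `events` must have a DatetimeIndex."""
    return events.resample("1min").agg({"queries": "sum", "query_time_ms": "mean"})

def make_windows(values, time_steps, stride):
    """Cut a (minutes, features) array into a 3-D tensor of shape
    (windows, time_steps, features). stride == time_steps gives
    non-overlapping windows; stride == 1 gives fully overlapping ones.
    Requires at least `time_steps` rows."""
    values = np.asarray(values)
    starts = range(0, len(values) - time_steps + 1, stride)
    return np.stack([values[s:s + time_steps] for s in starts])

# Hypothetical raw events at irregular timestamps.
events = pd.DataFrame(
    {"queries": [3, 2, 4], "query_time_ms": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:00:10", "2024-01-01 00:00:40", "2024-01-01 00:01:05"]
    ),
)
per_minute = resample_to_minutes(events)

# Synthetic series: 20 minutes, 2 features.
series = np.arange(40).reshape(20, 2)
non_overlapping = make_windows(series, time_steps=4, stride=4)
overlapping = make_windows(series, time_steps=4, stride=1)
print(non_overlapping.shape, overlapping.shape)
```

Overlapping windows produce many more training samples from the same data, at the cost of highly correlated samples; non-overlapping windows are cheaper to train on but use the history less densely.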
Then we readied for some fun and began training the neural nets.
Challenges
At first, things did not go so well. During network training the loss values were higher than expected and stopped decreasing after only 4 epochs. The validation accuracy froze after 2 epochs at a measly 0.08.
After a minor panic and unnecessary doubts about the system design, we traced the problem to an arithmetic error in the data scaling. The fix ensured that all tensor values were scaled to a mean of 0 and a standard deviation of 1.
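For reference, scaling to zero mean and unit standard deviation can be sketched like this. It assumes the statistics are computed per feature on the training tensors only and then reused for validation and live data; that per-feature choice and the function names are assumptions made for illustration.

```python
import numpy as np

def fit_scaler(train):
    """Compute per-feature mean and standard deviation from a training
    tensor of shape (samples, time_steps, features)."""
    mean = train.mean(axis=(0, 1))
    std = train.std(axis=(0, 1))
    # Guard against constant features to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_scaler(x, mean, std):
    """Standardize a tensor with previously fitted statistics."""
    return (x - mean) / std

# Synthetic training tensor: 32 samples, 16 time steps, 2 features
# on very different scales.
rng = np.random.default_rng(0)
train = rng.normal(loc=[50.0, 1000.0], scale=[5.0, 200.0], size=(32, 16, 2))

mean, std = fit_scaler(train)
scaled = apply_scaler(train, mean, std)

# A cheap sanity check that would catch a scaling bug early.
print(scaled.mean(axis=(0, 1)), scaled.std(axis=(0, 1)))
```

Printing or asserting these two statistics right after preprocessing is an inexpensive safeguard against this kind of error.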
(A good lesson here for all AI practitioners - if a model does not work as expected, check your training and label data, then check your data again. Don’t give up on a design too quickly.)
On the second run we reached a validation accuracy of 0.72 on the test set. Better, but far from acceptable considering the goal of 98.5% accuracy.
We traced the problem to false positives generated by the introduction of new customers into the time series. Our design accounted for this, but it caused the network to overfit when many new customers were introduced at once. We solved the problem by introducing a binary flag identifying new customers. We also removed the first 30 days of each new customer’s data from the training datasets, as it proved too erratic.
Ultimately, we reached an accuracy of 98.8%, exceeding the 98.5% goal.
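The two fixes could look roughly like the sketch below. The column names (`company_id`, `timestamp`), the 90-day window used to define a "new" customer, and the use of the first timestamp in the data as the onboarding date are all assumptions for illustration; only the 30-day cut-off comes from the project itself.

```python
import pandas as pd

def prepare_training_frame(df, drop_days=30, new_customer_days=90):
    """Add a binary is_new_customer flag and drop each customer's first
    `drop_days` days of data.

    Assumes the first timestamp seen for a company marks its onboarding,
    and that a customer counts as "new" for `new_customer_days` days
    (a hypothetical value)."""
    first_seen = df.groupby("company_id")["timestamp"].transform("min")
    age = df["timestamp"] - first_seen
    out = df.copy()
    out["is_new_customer"] = (age < pd.Timedelta(days=new_customer_days)).astype(int)
    # Remove the erratic onboarding period from the training data.
    return out[age >= pd.Timedelta(days=drop_days)].reset_index(drop=True)

# Hypothetical data: company A starts on day 0, company B on day 100.
base = pd.Timestamp("2024-01-01")
df = pd.DataFrame(
    {
        "company_id": ["A", "A", "A", "B", "B"],
        "timestamp": [base + pd.Timedelta(days=d) for d in (0, 40, 120, 100, 110)],
        "queries": [5, 50, 60, 3, 4],
    }
)
training = prepare_training_frame(df)
print(training[["company_id", "queries", "is_new_customer"]])
```

In this example both of company B's rows fall inside its first 30 days and are dropped, while company A keeps its day-40 row (still flagged as new) and its day-120 row (no longer flagged).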
Winning Configuration
The winning configuration was the simplest design:
2 x LSTM layers with 64 hidden neurons each, 64 time steps, and batch size of 64.
2 x fully connected layers
optimizer='adam', loss='mae', learning rate = 0.001
We trained the network over 100 epochs using Tesla M60 GPU cards.
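The configuration listed above can be sketched in Keras as follows. Keras is an assumption here; the framework is not named in this post, although the optimizer and loss strings resemble Keras arguments. The width and activation of the first fully connected layer (32 units, ReLU) are also illustrative guesses, while the 18 input features and 5 outputs come from the approach described earlier.

```python
from tensorflow import keras

TIME_STEPS = 64   # from the winning configuration
N_FEATURES = 18   # input parameters identified by the team
N_OUTPUTS = 5     # values predicted for the next minute

def build_model():
    """Two stacked LSTM layers followed by two fully connected layers."""
    model = keras.Sequential(
        [
            keras.Input(shape=(TIME_STEPS, N_FEATURES)),
            # return_sequences=True passes the full sequence to the second LSTM.
            keras.layers.LSTM(64, return_sequences=True),
            keras.layers.LSTM(64),
            # Assumed width and activation; the post gives only the layer count.
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(N_OUTPUTS),
        ]
    )
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="mae",
    )
    return model

model = build_model()

# Training would then look like this, with x_train of shape
# (samples, 64, 18) and y_train of shape (samples, 5):
# model.fit(x_train, y_train, epochs=100, batch_size=64,
#           validation_data=(x_val, y_val))
```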
We were surprised by the excellent performance of the simplest design, having expected a more complex model to outperform it. Another design that included more LSTM and GRU units attained equal performance but required 20% longer training time and 18% longer prediction time. Thus the simpler design was the one chosen for production.
In the next post I will discuss the data pipeline for integrating the LSTM anomaly detector into production. It will cover real-time tensor generation, Kafka streaming and the online retraining methodology.