
If you’re an IT service provider, systems reliability is one of the concerns high on your list. Back in my earlier days of development, I worked for a medium-sized electronics parts company. They had the unique business model of selling very small parts, like transistors and capacitors, individually. Before the whole Maker movement took hold, it was nearly impossible to buy these small parts without buying a giant spool. The manufacturers of these parts wanted them sold by the hundreds and thousands, not by the ten or twenty.
So the website we maintained sold thousands of such electronic parts and tools all over the world. We would even buy in bulk from China, and then turn around and sell the individual parts back to China at a significant markup. All that to say, site reliability was a huge concern. Enough so that one day when the site was down, the president of the company came to our little development corner and paced back and forth while we did all we could to get our beast of a monolithic .NET version 2 web application to behave under a heavy traffic load.
I learned the true value of systems reliability that day. The president and CEO of the company raced Porsches. He loved Porsches. We had pictures of Porsches all over the executive area of the company. I think he had at least two or three that were his regular drives and likely the same number modified for racing. Porsches were his thing. So when he told us that every half hour the site was down was costing the company as much as a brand new Porsche (I think it was more like, “you guys are literally crashing a brand new Porsche into the wall every thirty minutes this site is down”), we understood the severity of our failing application.

It was then that the value and importance of site reliability really solidified in my mind. As a solutions developer, it’s something I regularly take into account in my designs. But can an AI help me do a better job? Can I make a system even better by adding AI to it?
What is AIOps? Isn’t it the same as MLOps or DevOps or Triceratops?
AIOps is Artificial Intelligence for IT operations. Enterprise systems create a lot of data with all their various logs and system events. These logs are sometimes centralized, if you have a unified logging strategy, but most of the time they are spread across different servers, in the cloud, on-premises, and even on IoT and Edge devices. The goal of AIOps is to use that data to produce a view of each asset that shows its dependencies, its processes, and its failures, and gives an overall idea of how the asset’s performance could be improved.
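To make that "view of each asset" a little more concrete, here is a minimal sketch in Python. It assumes the logs from every location have already been normalized into a common record shape. The `LogEvent` fields and the `build_asset_view` function are hypothetical names I made up for illustration, not part of any AIOps product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEvent:
    """One normalized log record. All field names are hypothetical."""
    source: str                       # where the log lives: "cloud", "on-prem", "edge", ...
    asset: str                        # the system the event describes
    level: str                        # "INFO", "WARN", or "ERROR"
    depends_on: Optional[str] = None  # another asset this event mentions calling


def build_asset_view(events):
    """Group events by asset and tally failures, dependencies, and log sources."""
    view = defaultdict(lambda: {"total": 0, "errors": 0,
                                "dependencies": set(), "sources": set()})
    for event in events:
        entry = view[event.asset]
        entry["total"] += 1
        entry["sources"].add(event.source)
        if event.level == "ERROR":
            entry["errors"] += 1
        if event.depends_on is not None:
            entry["dependencies"].add(event.depends_on)
    return dict(view)
```

A real AIOps pipeline would do far more than count, but even this simple roll-up shows the idea: scattered events become one picture per asset that a model (or a human) can reason about.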
AIOps can help by automating common tasks, recognizing serious issues, and streamlining communication between the different areas of responsibility within organizations. Sounds magical? Where do I get an AIOps? Can I just plug it in and start getting these benefits?
Well, not quite. Like many of the best solutions in IT, it’s not a switch that you can just turn on or a box that you can just add to your network. Just like DevOps, AIOps is a journey. It’s a discipline. It’s a process. I know, nothing is ever as easy as it seems it should be, but for some organizations the value of AIOps does outweigh the drawbacks.

Where does AIOps fit within an operations team?
AIOps can help out in the following areas of your organization:
- Systems
- DevOps
- Customer Support
For systems, the most common use is hardware failure prediction. For most of us in the cloud this is something we don’t generally consider. But if you’re using a hybrid model and still have some of those old rack-mounted servers running important mission-critical jobs, using AIOps for hardware failure prediction is likely something you’ll care about. AIOps can also help with device and systems provisioning. Managing VM pools and container clusters based on website traffic or workloads is well within the grasp of a machine learning algorithm.
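As a rough illustration of traffic-based provisioning, here is a small Python sketch. It uses an exponentially weighted moving average as a deliberately simple stand-in for a trained machine learning model, and every name and number in it (`forecast_next`, `instances_needed`, the headroom factor) is an assumption of mine rather than anything from a real autoscaler.

```python
import math


def forecast_next(traffic, alpha=0.5):
    """Forecast the next value with an exponentially weighted moving average.

    `traffic` is a list of recent requests-per-second readings, oldest first.
    A higher `alpha` reacts faster to recent changes.
    """
    if not traffic:
        raise ValueError("need at least one traffic reading")
    estimate = traffic[0]
    for reading in traffic[1:]:
        estimate = alpha * reading + (1 - alpha) * estimate
    return estimate


def instances_needed(forecast_rps, rps_per_instance, headroom=1.2, min_instances=1):
    """Turn a traffic forecast into an instance count, with spare headroom."""
    required = math.ceil(forecast_rps * headroom / rps_per_instance)
    return max(min_instances, required)
```

For example, readings of 100 and then 200 requests per second give a forecast of 150, and at 50 requests per second per instance with 20% headroom that works out to 4 instances. Swapping the moving average for a proper forecasting model is where the "AI" part would come in.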
DevOps is probably one of the first places to start experimenting with AIOps. Using AI to aid in deployments, especially if you have hundreds of rollouts of software a day, can help detect anomalies and catch latent issues. Anomaly detection comes into play for your monitoring strategy, and AI is the perfect partner to help with incident management. If any of these are your pain points, you might need to add an AI to your DevOps team.
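Anomaly detection sounds fancier than it has to be. Here is a minimal sketch, assuming you have a list of metric samples such as response times in milliseconds. It flags any sample that sits more than a few standard deviations away from the samples just before it. The function name, window size, and threshold are all illustrative choices, and a production system would likely use something more robust.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of samples that deviate sharply from recent history.

    Each sample is compared against the `window` samples before it using a
    z-score: (value - mean) / standard deviation of that window.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to judge against; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Given latencies of `[100, 102, 98, 101, 99, 100, 450, 101]`, this returns `[6]`, the index of the 450 ms spike. Notice that the spike also inflates the baseline for the samples that follow it, which is one reason real monitoring tools tend to reach for sturdier statistics.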
And of course, for customer experience issues around site and system failures, there are bots, decision support systems, and automated communication options that provide greater detail than a simple alert.
This is just a high-level overview of some of the possible solutions. There are hundreds of ways that AI can not only help your IT operations teams, but also reach deeper into your business operations. AI can help monitor industrial equipment for failures and retail systems for security and compliance, and it can help with supply chain optimization.