Leveraging AI for operational efficiency in IT
Omanakuttan Namboodiri
See the challenges, priorities, and expectations of IT operations teams in modernizing IT incident detection and response.
Enterprises worldwide are increasingly embracing digital business models and workflows, making IT the central driver of availability, agility, and reliability for their products and services. IT now plays a critical role in expanding market share and enhancing brand visibility. Businesses are placing higher demands on IT, expecting greater responsiveness, operational reliability, and the flexibility to adapt to rapid changes. Here's a quick look at how AI is transforming the landscape of enterprise IT operations, addressing key challenges with innovative solutions.
Challenges in large-scale enterprise IT operations
Hybrid IT landscape: A diverse ecosystem of legacy systems, on-premise infrastructure, cloud platforms, and SaaS solutions presents a complex IT environment. To address this, organizations require a unified IT operations solution layer that enhances observability, drives automation, and delivers actionable insights at scale.
Siloed monitoring: Fragmented, effort-intensive monitoring makes proactive remediation difficult. Intelligent, automatic analysis of events, logs, and metrics can enable proactive remediation in large-scale IT operations that require high levels of reliability and responsiveness.
Freeing up more bandwidth from routine and repetitive operational and maintenance tasks: More automation is needed to free up bandwidth for transformational and innovation initiatives.
Providing insights from operational data for adaptive and proactive planning: There is a demand for unified data and insights to enable better planning and decisions.
The main goals in large-scale enterprise IT operations:
- Prevent or avoid critical outages.
- Reduce the Mean Time To Detect (MTTD), along with monitoring noise and the effort and fatigue it causes.
- Reduce the Mean Time To Resolve (MTTR), along with the effort and time needed to restore services.
- Provide accurate insights from the huge volume of structured and unstructured data available to IT operations.
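The two time-based goals above can be made concrete with a small calculation. The sketch below computes MTTD and MTTR from a handful of incident records. The field names (`started`, `detected`, `resolved`) and the sample timestamps are hypothetical, and MTTR is measured here from detection to resolution; some teams measure it from the start of the incident instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the timestamps are illustrative only.
incidents = [
    {"started": "2024-01-01T10:00", "detected": "2024-01-01T10:12", "resolved": "2024-01-01T11:00"},
    {"started": "2024-01-02T08:30", "detected": "2024-01-02T08:34", "resolved": "2024-01-02T09:10"},
    {"started": "2024-01-03T22:00", "detected": "2024-01-03T22:20", "resolved": "2024-01-03T23:50"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mttd(records) -> float:
    """Mean time from the start of an incident to its detection, in minutes."""
    return mean(minutes_between(r["started"], r["detected"]) for r in records)


def mttr(records) -> float:
    """Mean time from detection to resolution, in minutes (one common convention)."""
    return mean(minutes_between(r["detected"], r["resolved"]) for r in records)


print(f"MTTD: {mttd(incidents):.0f} min, MTTR: {mttr(incidents):.0f} min")
# prints "MTTD: 12 min, MTTR: 58 min"
```

Tracking both numbers over time, rather than as one-off values, is what shows whether noise reduction and automation are actually shortening detection and restoration.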
Let's explore how IT operations function without AI support compared to the transformative impact AI can have in addressing the challenges of a high-scale, demanding business environment.
Without AI
Monitoring, incident management, service request fulfillment, and operations leadership are the four key functions and teams that run effective IT operations in a large enterprise. The infrastructure can become complex in global, 24x7, distributed operations at scale. Typically, the functions are organized in this manner:
Monitoring:
Several native and third-party tools monitor the health of the infrastructure components, networks, and applications that run business services. Monitoring tools generate huge volumes of data: alerts, logs, metrics, and traces. Monitoring teams that do not rely on AI solutions can end up dealing with high noise levels, poor observability, and alert fatigue. With so much noisy monitoring input, the people who have the expertise to detect critical outages and remediate proactively can be spread thin.
Incident management:
Incident management teams spring into action when a critical event occurs, such as a significant performance degradation or a slowdown in a vital customer-facing business service. Engineers analyze relevant metrics and logs to make an informed guess about potential root causes. They then examine distributed tracing to pinpoint the specific component or microservice in the chain that might be causing the issue. The IT topology linked to the affected business services helps the team assess the cascading impact of the performance degradation. They also draw on historical remediation data and experiences to determine the most effective resolution. This process can be both stressful and labor-intensive for IT operations teams tasked with maintaining critical business services.
Service requests:
The service desk is another function where significant time and effort go into IT operations. While most organizations have task-level and SOP-level automation, there is considerable room for deploying AI to increase self-service levels. Advancements in generative and conversational AI can enable end-to-end automation, from identifying the intent of end-user requests to fully automating the runbook tasks. Auto-triaging and routing can also save effort and time in the service request fulfillment process.
Insights:
Unified insights from IT operations are hard to obtain in most large enterprises because the data is siloed and distributed. Analyzing data and deriving insights for decision-making requires significant manual, ad hoc data preparation and reporting. A unified AIOps layer and generative AI-based solutions can bring about a step change here. Copilots based on unified AIOps data can provide relevant inferences in dashboards and reports, or as part of the decision-making process flow.
With AI
Let us see how AI and generative AI-based approaches can bring efficiency and responsiveness to these operations.
Monitoring and observability:
AI models can process alert, log, and metric data at scale. This can help achieve very high levels of observability and alert noise reduction with minimal human effort. AI-based prediction models can help avoid critical outages. The following are key use cases for AI when the goal is to improve observability and MTTD:
- Alert correlation and clustering – Streaming alerts from several sources can be correlated and clustered using AI models and experience-based rules. This can help reduce the alert noise and monitoring fatigue that result from first-line infrastructure and application monitoring telemetry.
- Automatic triaging and routing – AI and rule-based approaches can create tickets and route them to the right teams. They can wait for recovery alerts in case the systems are restored automatically, and they can auto-close tickets when appropriate, saving significant effort and time.
- Incident or service outage prediction – Predicting outages based on patterns in alerts, metrics, and historical data can help avoid critical outages. As forecast accuracy improves, the potential for loss avoidance becomes very high, making this a must-try use case for business-critical systems.
Compared to traditional alert monitoring, the AI-based approach can result in 70 to 90% alert correlation efficiency and fewer tickets raised from alerts. Beyond that, triaging and auto-closing tickets based on observation and automatic remediation can significantly improve operational efficiency and save effort.
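As a rough illustration of the correlation and routing use cases above, the sketch below groups alerts for the same service that arrive close together in time and raises one routed ticket per group. It is a simple rule-based stand-in, not an AI model: the `Alert` fields, the 300-second window, and the `ROUTES` table are all assumptions made for the example, and the sample data says nothing about the reduction a real deployment would achieve.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    ts: int        # arrival time in seconds (illustrative)
    service: str   # affected service (hypothetical field)
    message: str


@dataclass
class Cluster:
    service: str
    last_ts: int
    alerts: list = field(default_factory=list)


def correlate(alerts, window=300):
    """Group alerts per service; a gap longer than `window` seconds starts a new cluster."""
    clusters, latest = [], {}
    for alert in sorted(alerts, key=lambda a: a.ts):
        cluster = latest.get(alert.service)
        if cluster is None or alert.ts - cluster.last_ts > window:
            cluster = Cluster(alert.service, alert.ts)
            clusters.append(cluster)
            latest[alert.service] = cluster
        cluster.alerts.append(alert)
        cluster.last_ts = alert.ts
    return clusters


# Hypothetical routing table: service name -> owning team.
ROUTES = {"db": "database-team", "web": "web-platform-team"}


def to_tickets(clusters):
    """Create one ticket per cluster and route it to the owning team."""
    return [
        {
            "team": ROUTES.get(c.service, "service-desk"),
            "service": c.service,
            "alert_count": len(c.alerts),
        }
        for c in clusters
    ]


alerts = [
    Alert(0, "db", "CPU high"),
    Alert(60, "db", "slow queries"),
    Alert(120, "db", "connection pool exhausted"),
    Alert(130, "web", "HTTP 5xx rate rising"),
    Alert(1000, "db", "CPU high"),
]
clusters = correlate(alerts)
tickets = to_tickets(clusters)
reduction = 1 - len(tickets) / len(alerts)
print(f"{len(alerts)} alerts -> {len(tickets)} tickets ({reduction:.0%} fewer)")
# prints "5 alerts -> 3 tickets (40% fewer)"
```

A production system would typically also correlate across services using topology and learned patterns, which is where AI models add value over a fixed time window.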
Incident management:
AI copilots can significantly reduce the time engineers spend on incident troubleshooting and incident reporting. When an incident occurs, the AI agent can open group communication channels, bringing together the right teams and experts. It can also provide automated log summaries and analytics. The following use cases indicate AI's potential to speed up incident management:
- Automatic identification of probable root causes
- Summarization of historical patterns from similar incidents
- Intelligent and automated search of the troubleshooting knowledge base
- Automatic generation of incident summaries and incident reporting documentation
- A copilot that can open a collaboration space in Teams or Slack, notify and call the required individuals based on the major incident support roster, surface similar incidents from historical incident data, and run health checks when required
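To illustrate the idea of surfacing similar historical incidents, the sketch below ranks past incidents by word overlap with a new incident summary. Jaccard similarity over word sets is used here as a deliberately simple stand-in for the semantic search an AI copilot would more likely use; the incident IDs, summaries, and fixes are invented for the example.

```python
import re


def tokens(text):
    """Lowercase word set of a free-text summary."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_incidents(summary, history, top_k=2):
    """Return up to `top_k` past incidents with any word overlap, most similar first."""
    query = tokens(summary)
    scored = [(jaccard(query, tokens(h["summary"])), h) for h in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for score, h in scored[:top_k] if score > 0]


# Hypothetical incident history.
history = [
    {"id": "INC-101", "summary": "checkout api latency spike after database failover",
     "fix": "restarted connection pool"},
    {"id": "INC-102", "summary": "email delivery delayed due to queue backlog",
     "fix": "scaled out queue workers"},
    {"id": "INC-103", "summary": "database failover caused payment api timeouts",
     "fix": "raised client timeout and retried"},
]

for match in similar_incidents("payment api latency spike database", history):
    print(match["id"], "-", match["fix"])
```

With this sample data, the two database-related incidents are returned and the unrelated email incident is filtered out, giving the on-call engineer earlier remediations to start from.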
Service desk:
L1 service desk teams often interface with users, identify the intent, and do the initial triaging and remediation as per the runbook. Many high-volume service request types can be automated, provided that the user's intent and information are captured accurately. Advancements in conversational intelligence, large language models, and generative AI services can bridge this gap. There is vast potential for agentic AI solutions that raise both the level of automation and the adoption of automated service request processing. Here are a few indicative use cases that leverage AI for the service desk:
- Agentic conversational AI solutions for service desks: An AI agent can engage a user in a natural language conversation, identify the intent or issue, and trigger the right automation workflow for remediation.
- Intelligent Q&A copilots for SOP and policy search and self-service enablement
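The intent-to-workflow flow described above can be sketched as follows. Keyword matching stands in for the language model that would identify intent in a real agentic solution, and the intent names, keywords, and workflow identifiers are all hypothetical. Requests that match no known intent are escalated to a human agent.

```python
import re

# Hypothetical intents and the keywords that signal them.
INTENT_KEYWORDS = {
    "password_reset": {"password", "reset", "locked"},
    "vpn_access": {"vpn", "remote"},
    "software_install": {"install", "software", "license"},
}

# Hypothetical automation workflow for each intent.
WORKFLOWS = {
    "password_reset": "run_password_reset_runbook",
    "vpn_access": "run_vpn_access_runbook",
    "software_install": "run_software_install_runbook",
}


def identify_intent(message):
    """Pick the intent with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    best, best_hits = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best


def handle_request(message):
    """Map a user message to an automation workflow, escalating unknown requests."""
    intent = identify_intent(message)
    if intent is None:
        return {"intent": None, "action": "escalate_to_human_agent"}
    return {"intent": intent, "action": WORKFLOWS[intent]}


print(handle_request("I'm locked out and need a password reset"))
print(handle_request("My monitor is flickering"))
```

Keeping an explicit escalation path matters as much as the automation itself: accurate intent capture is the precondition for automating a request, so anything uncertain should fall back to a person.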
Unlocking value: Four key areas where AI drives ROI in IT operations
The ROI on AI spend can come from the following four areas:
- Improved observability, resulting in incident avoidance
- Higher productivity through automation that covers more comprehensive, end-to-end processes
- Better incident management, resulting in greater overall responsiveness and reduced MTTR
- Faster insights from operational data, enabling more objective decisions and actions
Although some use cases may not deliver immediate results, slowing down AI investments in enterprise IT operations could jeopardize long-term success and competitiveness.
Foundry and UST surveyed 100 IT leaders to explore the current challenges, priorities, and expectations that IT operations teams face in modernizing IT incident detection and response. "Read the results: AIOps Infographic Survey."