Ramping Up Enterprise Resiliency with Site Reliability Engineering


By Alok Uniyal, Vice President and Head of IT Process Consulting Practice, Infosys

Enterprises are on the fast-track to digitalisation and the cloud. Their larger objective is to stay relevant in the digital world by meeting changing customer needs promptly and keeping them engaged with superior experiences.

In this scenario, IT uptime means everything for CIOs: their new IT organisation must stand up to the challenge on stability, availability, reliability, and resilience. Enterprise resiliency tops this list because without it the business could suffer serious damage to its reputation, customer base, and revenue.

Ramping up enterprise resiliency with Site Reliability Engineering (SRE) is a strategic approach focused on enhancing the reliability and robustness of complex systems within the enterprise. What sets SRE apart is the way it brings together tried-and-tested software engineering practices and operations expertise to create a culture of reliability, scalability, and availability.

Tackling the to-do list for CIOs to strengthen system resiliency

Any business that has experienced system failures, outages, slow response times, platform instability, service non-compliance, or any other form of downtime can vouch for the losses that follow.

To improve enterprise resiliency and move past standard business expectations, CIOs are first trying to understand how resilient their current systems are and then looking to improve their predictive analysis and service management capabilities.

They have a lengthy to-do list before the enterprise reaches the desired level of resiliency. The tools provided by SRE help immensely by detecting issues, predicting failures, and resolving root causes. The primary focus is to create reliable, secure, and scalable systems with built-in observability. As a result, these systems can hold up even under the most stressful conditions, such as heavy workloads and peak demand.

SRE is built on solid principles that drive Service Level Objectives (SLOs), incident management, and automation of repetitive tasks. SLOs help benchmark service availability and performance and can be tracked closely to observe system performance in near real-time.
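To make the idea of tracking an SLO concrete, the short Python sketch below shows one common way to express an availability target as an "error budget", that is, the number of failed requests a service can absorb before it misses its objective. The class name, the 99.9% target, and the request counts are illustrative assumptions, not figures from any particular enterprise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability objective for a service."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def error_budget(self, total_requests: int) -> float:
        """Number of requests that may fail while still meeting the target."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget(total_requests)
        if budget == 0:
            # A 100% target leaves no room for failure at all.
            return 0.0 if failed_requests == 0 else float("-inf")
        return 1.0 - failed_requests / budget


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999)
    # Illustrative numbers only: one million requests, 250 of which failed.
    print(f"Budget remaining: {slo.budget_remaining(1_000_000, 250):.0%}")
```

A team might review the remaining budget regularly and slow down risky releases when it runs low, though the exact policy is a choice each organisation makes for itself.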

With SRE enabling process automation where possible and making systems far more dependable, the business can expect fewer manual interventions, better cost-effectiveness, and staff who are freed to move up the value chain and take on more strategic tasks.

Getting SRE right

By adopting SRE principles, enterprises can build the kind of systems that adapt to changing demands, reduce the impact of disruptions, and deliver exceptional user experiences. The benefits go beyond DevOps engineering (CI/CD) automation and collaboration and extend to development and operations teams working together as a cohesive entity.

However, not every enterprise has the technical skillset or advanced toolset already available. Change management and establishing the right culture can be an overwhelming exercise. Success with SRE is not straightforward, but these hurdles should not deter decision-makers. Instead, there are several practices they can establish beforehand to ease the SRE transition for the enterprise. For example, building continuous observability into systems and applications supports proactive testing: possible problem areas are identified early and rooted out before they lead to system issues. This can be achieved by enabling and encouraging joint efforts by the SRE and development teams.
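As a small illustration of what catching problem areas early can look like in practice, the Python sketch below keeps a rolling window of response times and flags when the 95th-percentile latency drifts above a threshold. The class name, window size, minimum sample count, and 200 ms threshold are all hypothetical choices made for the example; real observability platforms offer far richer signals.

```python
from collections import deque
from statistics import quantiles


class LatencyWatch:
    """Hypothetical rolling check on 95th-percentile response time."""

    MIN_SAMPLES = 20  # assumed minimum before a percentile is meaningful

    def __init__(self, window: int, p95_threshold_ms: float) -> None:
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.p95_threshold_ms = p95_threshold_ms

    def record(self, latency_ms: float) -> bool:
        """Add a sample; return True if the window's p95 exceeds the threshold."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.MIN_SAMPLES:
            return False
        # With n=20, quantiles() returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(self.samples, n=20)[-1]
        return p95 > self.p95_threshold_ms


if __name__ == "__main__":
    watch = LatencyWatch(window=100, p95_threshold_ms=200.0)
    for _ in range(100):
        watch.record(100.0)          # healthy traffic
    for _ in range(5):
        alert = watch.record(1000.0)  # a handful of slow responses
    print("Investigate latency" if alert else "All clear")
```

Using a percentile rather than an average means that a few unusually slow responses become visible without a single outlier raising a false alarm.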

Another proactive measure is to use chaos engineering methods, which typically create controlled failures in systems, study the impact, identify possible vulnerabilities, and strengthen those areas in a focused manner.
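The loop of injecting a failure, studying the impact, and strengthening the weak spot can be sketched in a few lines of Python. The example below is a toy simulation under stated assumptions: the "dependency" is a stand-in function, the failure rate is made up, and retries are used as one possible hardening measure. It is not a substitute for a real chaos engineering tool.

```python
import random
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised deliberately by the experiment to simulate a dependency outage."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that it fails on purpose at the given rate."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], str], attempts: int) -> Optional[str]:
    """Hardening measure under test: try again when the dependency fails."""
    for _ in range(attempts):
        try:
            return call()
        except InjectedFault:
            continue
    return None


def run_experiment(failure_rate: float, attempts: int,
                   trials: int, seed: int = 0) -> float:
    """Return the share of user requests that still succeed under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(lambda: "ok", failure_rate, rng)
    successes = sum(call_with_retries(flaky, attempts) == "ok" for _ in range(trials))
    return successes / trials


if __name__ == "__main__":
    print("No retries:   ", run_experiment(0.3, attempts=1, trials=2000))
    print("Three retries:", run_experiment(0.3, attempts=3, trials=2000))
```

Comparing the two success rates shows how much a given safeguard helps under the simulated failure, which is the kind of evidence a team can use to decide where to focus its hardening effort.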

Enterprises have reached a stage where they must transform their operating models to sustain business growth and competitiveness. Current ways of working must be reassessed for their effectiveness and relevance to future needs. The problem may be a lack of alignment with what customers want, or between business and technology teams that function in isolation. Either way, collaboration and innovation are impeded, and the business pays the price.

Adopting Agile, DevOps, and SRE practices will help improve collaboration between software developers and operations teams, sharpen their responsiveness to change, and support continuous delivery, while ensuring their systems remain resilient through the toughest of conditions. Above all, these practices will infuse the customer-centricity, product thinking, experimentation, and innovation that enterprises were missing earlier, helping them accelerate their digital journey and create value for the business faster.

