Site Reliability Engineering, along with DevOps, is on the rise in the financial services sector. And with good reason; the advantages are numerous.
SRE started at Google in 2003 as an attempt to organise system administration and development more effectively, creating accountability and common objectives for the whole team by blending software engineering with operations. Since then, SRE has grown into a virtually indispensable field within the technology arms of firms ranging from start-ups to global corporations.
With their system-critical services and low tolerance for downtime, banks and larger financial institutions have generally held off moving to new ways of working, to mitigate the risk associated with early adoption.
SRE, however, is a slightly different kettle of fish, since it doesn’t directly concern the technology (codebase and environment) but the working methodology. If anything, a transition to SRE has every chance of delivering an almost immediate positive impact on a system or product, as well as improving future iterations.
And as we’re seeing in the marketplace, that caution is now giving way in a significant manner: virtually all high street banks are hiring aggressively for SREs, as are many hedge funds (Citadel, WorldQuant) and fintech start-ups like Starling Bank and TransferWise.
The key concept in Site Reliability Engineering that makes it so compelling for these types of firms is the relentless focus on the end-user, via the Service Level Agreement (SLA). Implemented correctly, it should offer very high system availability, with a minimum of errors, along with an accelerated pace of deployment of new functionality. In a high-frequency trading scenario, for example, even 99.99% availability, which still allows roughly 52 minutes of downtime a year, can end up very costly.
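To put a number on that, here is a minimal sketch that converts an availability target into the downtime it still permits over a year. The function name and the 365-day year are our own assumptions for illustration; a real SLA would define its own measurement window and what counts as downtime.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target still permits."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare "three nines", "four nines" and "five nines".
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why the choice of target matters so much for latency-sensitive businesses.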
So if the SRE approach is so much better than the traditional division between sys-admin and software development, why isn’t everyone doing it?
We think it’s indisputable that the world of software products, almost regardless of what they are, is moving to either a DevOps-type approach or SRE. There are of course particular cases where the traditional sys-admin approach makes more sense, maintenance of legacy software being one.
A major constraint that we see is the availability of SRE and DevOps candidates. The relative youth of the field, at around 15 years, is one contributing factor, and the speed of digital transformation another. The fact that SRE was created by one company, and designed for its particular needs, goals, and environment, is also likely to have played a role; the number of teachers (ex-Google SREs) with first-hand knowledge of SRE, its tools, limitations, and opportunities has been limited.
Another issue is, of course, the different skill sets required for the different roles, and the career paths a candidate is likely to have had before moving into DevOps or Site Reliability Engineering. The main difference is that SRE/DevOps is a hybrid role (as suggested by the moniker DevOps). The candidate will spend some of the time managing a system, or stack, and some on software development, be it software to monitor, improve or act as a fall-back for the system itself, or, as is done at Google, software for public deployment.
This will likely be a real obstacle for someone with a sys-admin background, unless they have acquired software development skills elsewhere. Coming from a pure software development background tends to be a more viable path, and makes the transition to an SRE-type role easier.
If you’d like to know more, give us a call to discuss SRE/DevOps in more detail, whether you’re looking to move into the space, want to move to the next level, or just want to see what’s on the market.
Some useful links for further reading: