Neither DevOps nor SRE: PRS
We at Wondo did not know what to call the role of the engineers who used to be called system administrators. We feel that term does not cover the range of tasks these engineers perform, and it clashes somewhat with the DevOps philosophy that we happily embrace.
Some organisations use DevOps as the name of the role, but that does not feel right to us, since DevOps is a way of doing things rather than a set of responsibilities.
The biggest names in the industry use a related term: SRE (site reliability engineering). Our team does indeed work on reliability, but its role covers more ground, so I think SRE falls short.
We came up with the acronym PRS, which stands for Productivity, Reliability and Security. I think it sums up the team's responsibilities well, and I like that it focuses on what to achieve rather than on how to achieve it. Let me expand on it by sharing the guidelines that our PRS team uses:
Repetitive processes must be automated as far as possible, requiring no human intervention, so that people can dedicate themselves to making the system grow.
The dev team should have tools to automate most tasks so they can focus on developing. The system should be abstracted for them, not hidden from them.
Developers must have all the tools they need for their job, and they should be able to operate those tools without assistance.
Systems should be available at all times. If an outage happens, mechanisms must be in place to minimise its impact.
Systems must respond in as little time as possible, minimising latency and maximising throughput. Degradation in performance must be detected and acted upon.
Systems must be capable of meeting any reasonably expected demand and should be able to anticipate future demand.
Systems should only allocate the needed resources, minimising consumption of idle resources.
Monitoring and Alerts
All operational information should be accessible to all stakeholders on request. The only limitation is the personal data of customers, employees and users.
The system must be able to detect anomalies and faulty conditions and send alerts to relevant stakeholders.
The system should anticipate outages and send alerts to stakeholders so they can prevent the outage from happening.
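As one illustration of anticipating an outage, the sketch below fits a straight line to recent disk-usage samples and raises an alert when the disk is projected to fill within a chosen horizon. It is only a minimal sketch, not a description of our tooling: the function names, the sample format and the 24-hour horizon are assumptions made for the example, and a real system would likely use its monitoring stack's own forecasting.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (hours since start, units of disk used).
Sample = Tuple[float, float]


def hours_until_full(samples: Sequence[Sample], capacity: float) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    Uses an ordinary least-squares line through the samples. Returns None
    when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # usage is flat or shrinking, nothing to anticipate
    intercept = mean_u - slope * mean_t
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


def should_alert(samples: Sequence[Sample], capacity: float,
                 horizon_hours: float = 24.0) -> bool:
    """True when the projected fill time falls inside the alert horizon."""
    eta = hours_until_full(samples, capacity)
    return eta is not None and eta <= horizon_hours
```

With usage growing by 10 units per hour from 50 towards a capacity of 100, `should_alert([(0, 50), (1, 60), (2, 70)], 100)` reports that stakeholders should be warned, since the disk is projected to fill well within a day.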
All relevant parts of the system must be thoroughly documented so an outsider can gain an understanding of the system with little or no person-to-person interaction.
All relevant parts of the system must be understood by at least two team members.
Provide traceability in the system so we can find out what happened after the fact.
Have third parties audit the system periodically and make sure to address the issues they raise.
Minimum Access Policy
Regarding personal and security data, stakeholders must have frictionless access to the information they need and no access to the information they do not need.
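A minimum access policy can be pictured as a deny-by-default lookup: a role sees a category of data only if it has been explicitly granted. The sketch below is purely illustrative; the role names, data categories and the `can_access` helper are made up for the example and do not describe any particular access-control product.

```python
from typing import Dict, FrozenSet

# Hypothetical grants: each role lists only the data categories it needs.
GRANTS: Dict[str, FrozenSet[str]] = {
    "support": frozenset({"customer_contact"}),
    "payroll": frozenset({"employee_salary"}),
}


def can_access(role: str, category: str,
               grants: Dict[str, FrozenSet[str]] = GRANTS) -> bool:
    """Deny by default: allow only categories explicitly granted to the role."""
    return category in grants.get(role, frozenset())
```

Here `can_access("support", "customer_contact")` is allowed, while `can_access("support", "employee_salary")` and any request from an unknown role are refused without needing an explicit deny rule.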
We were not able to find an industry-accepted term covering this whole area, but I would be more than happy to stand corrected if one exists and I missed it.