Grading your Service Ops Team


As an Ops manager responsible for a SaaS product, I’m always thinking about how I can do better as a manager and how my team can be better. How do I assess the current state of my team? How do I measure their progress and growth over time? How do I know if I’m doing a good job?

To keep myself honest and sane, I’ve come up with 3 useful buckets: people, practices, and technology. Nothing original here (100% common sense) and many are not quantitative or really metrics per se, but I find them to be a useful touchstone to see how I’m doing.

You’ll notice that traditional operational metrics like uptime, reliability, and recovery time are missing.

I consider those the outputs. These are the inputs.

People

Solid individual and team performance is critical. Yeah, understatement, I know. This is the foundation for your success or failure. Without the right people you will fail as a leader. The business will fail. You MUST hire the right people, grow the people you have, and make tough decisions to gently (or not so gently) nudge out the door those who aren’t a fit. As I’ve learned over the last year, it is much easier to build a team from scratch than to transform an existing team.

Autonomy and Alignment — are individuals able to work independently? How well do they perform in the absence of direction or without clear guidance? I read a lot of military history, which provides some useful analogies. What happens when the commander is put out of action? Do the troops keep pushing forward into battle or fall back in retreat and disarray? Do they understand the mission and “commander’s intent” should the situation or the plans change?

Accuracy & Consistency — are tasks done properly and correctly? How often is “human error” the root cause of a failure? Is there the necessary attention to detail? Are tasks done the same way regardless of who performs the job? This requires delicate care, as you don’t want to stifle innovation and creativity, but you also cannot have people working in a completely random and ad hoc fashion. Of course, this is where automation comes in.

Capacity & Readiness — are teams fully staffed to support the business? How full is your backlog and work queue? Are there skill or knowledge gaps? Where are the bottlenecks that block the flow of value?

Collaboration & Engagement — do team members actually work as a team across functions, specializations, and geographies? How well do they work with external support teams, sales, and engineering to support your customers? Is overcommunication encouraged and enforced? How well are the hand-offs done with engineering and QA? If you have a global “follow the sun” organization, how well are shift transitions performed? Is there constructive conflict?

Practices

I was going to say “processes,” the dreaded P-word, but I didn’t. Processes are a necessary evil in Ops. However, the more senior your team, the better they communicate, and the better they understand and are aligned on principles and values, the fewer processes should be needed. For each of these there are two considerations: how well and consistently is it done, and how well is it documented?

Deployment & Validation — how do you push code, build environments, update infrastructure for customers, or apply patches to address bugs? How long does it take for requests to be processed and fulfilled? What is the cycle time? Are your deployment processes siloed or shared? How effective is the validation of proper end-to-end functionality following a deployment? See automation.
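To put a rough number on cycle time, one option is to measure the elapsed time between a request being filed and being fulfilled. The following is only a minimal sketch: the `requests` list and the `cycle_time_hours` helper are made up for illustration, and in practice the timestamps would come from whatever ticketing or deployment system you use.

```python
from datetime import datetime
from statistics import median

# Hypothetical request log: (requested_at, fulfilled_at) as ISO 8601 strings.
# These values are invented for illustration only.
requests = [
    ("2024-01-08T09:00", "2024-01-08T15:00"),
    ("2024-01-09T10:00", "2024-01-10T10:00"),
    ("2024-01-10T08:00", "2024-01-10T10:00"),
]


def cycle_time_hours(requested_at: str, fulfilled_at: str) -> float:
    """Hours elapsed between a request being filed and being fulfilled."""
    start = datetime.fromisoformat(requested_at)
    end = datetime.fromisoformat(fulfilled_at)
    return (end - start).total_seconds() / 3600


cycle_times = [cycle_time_hours(r, f) for r, f in requests]
median_cycle_time = median(cycle_times)  # typical request
worst_cycle_time = max(cycle_times)      # the outlier worth asking about

print(f"median: {median_cycle_time:.1f}h, worst: {worst_cycle_time:.1f}h")
```

The median is used here rather than the mean so that one stuck request does not hide how the typical request fares; tracking the worst case separately keeps that outlier visible.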

Architecture — how effectively are problems defined and solutions developed for more complex problems? Can you actually build stuff? Ops “done right” is an engineering effort. Do you have people who can actually build stuff, or do they just run things? See capacity & readiness.

Maintenance & Prevention — what proactive measures are taken to detect and prevent problems and ensure reliable and secure steady-state operation? You are doing maintenance, right? See security.

Change Management — what is the process for tracking, approving, and communicating changes to production systems? How well is it followed? Is it flexible enough to handle times when you need to “move fast and break things” as well as times when you need to exercise more caution? See security.

Inbound/Outbound Escalation — how well are issues passed to and from other support and engineering teams? How well do you share knowledge and “pass the baton” to other teams? See collaboration & engagement.

Troubleshooting & Problem Resolution — how effectively are problems, incidents, and outages managed? Is there a “reboot and return to work” mentality? How quickly are they analyzed and resolved? Do you learn from mistakes? What is the team’s “operational memory”? Is there situational awareness? Is that awareness shared by developers, or is Ops the only team that has a clue about production and customer-facing environments? See metrics.

Technology

In DevOps, tools get much of the attention because they are “fun” and can be sold. These are fairly obvious, so I won’t spend much time here.

Alerting — what is the signal-to-noise ratio? Do you have alert fatigue? What is the coverage of your stack? Where are the gaps? How good is your on-call? Do you rely on email lists, or use PagerDuty or Slack notifications? See Inbound/Outbound Escalation.
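One rough way to quantify signal to noise is the share of alerts that actually required someone to act. This is a sketch under stated assumptions: the `alerts` history, the actionable flag, and both helper functions are hypothetical, and deciding what counts as "actionable" is a judgment call your team has to make for itself.

```python
from collections import Counter

# Hypothetical alert history: (alert_name, was_actionable).
# In practice this would be exported from your paging or alerting tool.
alerts = [
    ("disk_full", True),
    ("cpu_high", False),
    ("cpu_high", False),
    ("cpu_high", False),
    ("api_5xx", True),
    ("cert_expiring", True),
    ("cpu_high", True),
    ("queue_depth", False),
]


def actionable_ratio(history):
    """Fraction of alerts that required a human to do something."""
    if not history:
        return 0.0
    return sum(1 for _, actionable in history if actionable) / len(history)


def noisiest(history, top=3):
    """Alert names ranked by how often they fired without being actionable."""
    noise = Counter(name for name, actionable in history if not actionable)
    return noise.most_common(top)


print(f"actionable: {actionable_ratio(alerts):.0%}")
print("noisiest alerts:", noisiest(alerts))
```

The ranked list of noisy alerts is arguably more useful than the ratio itself, since it points at specific thresholds to tune or alerts to delete.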

Logging — how are application and infrastructure logs indexed, searched, and analyzed? Are your logs parsed or unparsed? Are they actionable? Do you alert on logs? Do you create dashboards based on logs?

Metrics — how much of the infrastructure and application stack has proper metrics for measuring resource and component utilization, errors, and saturation? How sophisticated are your dashboards? Is there a culture of measurement, or do folks stumble in the dark?
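The "how much of the stack" question can be turned into a simple coverage check: for each component, do you collect utilization, errors, and saturation? A minimal sketch follows, assuming a hand-maintained inventory; the component names and the `inventory` mapping are invented for illustration.

```python
# The three signal types named above.
REQUIRED = {"utilization", "errors", "saturation"}

# Hypothetical inventory: component -> metric types currently collected.
inventory = {
    "load_balancer": {"utilization", "errors", "saturation"},
    "api_server": {"utilization", "errors"},
    "database": {"utilization", "errors", "saturation"},
    "message_queue": set(),
}


def missing_metrics(components):
    """Map each under-instrumented component to the metric types it lacks."""
    return {
        name: sorted(REQUIRED - collected)
        for name, collected in components.items()
        if REQUIRED - collected
    }


def coverage_fraction(components):
    """Fraction of components that collect all required metric types."""
    if not components:
        return 0.0
    covered = sum(1 for collected in components.values() if REQUIRED <= collected)
    return covered / len(components)


print(f"fully covered: {coverage_fraction(inventory):.0%}")
print("gaps:", missing_metrics(inventory))
```

A check like this only says whether a metric exists, not whether anyone looks at it, so it complements rather than replaces the culture question.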

Automation — configuration management, provisioning, source code control, build and packaging. The usual suspects. How does your team decide what to automate and when and why?

Infrastructure — how performant, cost-effective, and elastic are your physical or cloud infrastructure and core lower-level services? Do they meet the needs of the business or constrain it?

Asset Management — how do you keep up with systems, customers, and applications as you get into the hundreds to thousands of nodes? What are your sources of truth?

Security — See automation, infrastructure, asset management, metrics, alerting, logging, change management, architecture, maintenance.

I’d love to hear how other Ops managers and leaders grade their teams!