The Failures no one Celebrates

Photo by Jeromey Balderrama on Unsplash

Failing, Fast or Slow

Over a decade ago, I was a new manager. I was also new to the world of software delivery, having previously only been in security roles (albeit for quite a while). I had a lot to learn. It was early days reporting to a new director I was struggling to get aligned with.

Or at least that is how I would describe it now. At the time I didn't understand it and certainly didn't have a language to describe it. I wasn't even aware of the concept of "managing up." All I knew was that I'd had three bosses in the span of a year and it really sucked. I was in a new industry and I was confused.

I was also new to the world of business stakeholders making requests (scratch that, demands) and technical/non-technical leaders negotiating what could be delivered by when, why, and how much it would cost.

I was learning the hard way about the iron triangle of scope, time, and cost: basically the "CAP Theorem" for product delivery. This was when I learned (and kept repeating) the phrase "least bad option" to describe what tech teams were stuck with. Now I pretty much accept this as normal. Negotiate what is possible under the assumption that it will be inadequate, and that you'll have to iterate on it when it cracks.

I can't remember exactly what it was we were discussing, but my director said, "We need to let this fail."

What? Why would you knowingly let something blow up? I didn’t get it.

Now, unlike in 2012, there is a canon of technical leadership books from leaders such as Camille Fournier, Michael Lopp, Will Larson, Melissa and Johnathan Nightingale, Sarah Drasner, and more.

At the time, the only real management training I had was the required course for timesheets and performance reviews when I first became a “people leader” at the company I had just left. Not helpful at all.

Failure is the only Option

Fast forward a dozen years. Since then, I've reported to four more Directors, two VPs, and the C-level twice. I've gotten a bit of coaching. I've done a ton of reading and self-reflection. I've seen a lot more failure. And yeah, some great successes, although that is not the topic at hand. I've been part of the decision-making process that caused failures. I'm a lot better at managing up, and, more importantly, at talking to my own teams about risks and about managing up.

The blast radius of a failure is a question I raise with senior leaders when I see risks. Do we let this fail? How much? Why? I also have that discussion with members of my team who raise concerns about the consequences of decisions or actions. I've seen customer escalations to the highest level of the business. I've felt the heat and frustration of angry customers caused by production outages in SaaS applications that had a negative impact on their customers. I've seen support cases from end users balloon (and the call center light up!) due to a production change my team made that went badly.

Failures cascade from customer to customer in a technology ecosystem even if you aren’t a big cloud provider.

There have been failed migrations, deployments rolled back, and a few near misses on security events that could have been catastrophic had we not been lucky. This doesn't count all the missed deadlines, delays, and products and features that never shipped, and one product rewrite that was canceled after it failed in the market because we never achieved feature parity with the original POC that went into customers' hands way too soon.

I've seen senior leaders quit (or get fired in re-orgs) and entire teams disbanded. I've had members of my own team resign, causing 3–6 month delays in deliveries, despite heroic efforts. There have been so many shapes, sizes, speeds, and colors of failure.

I still don’t like it. It still seems wrong, but at least now I know it is the only way. It is unavoidable. Something is always broken. You just don’t know it yet — or know how the breakage will manifest itself and how bad it will be.

Failure aligns teams where they were incapable of getting on the same page before. Failure puts a focus on issues that previously didn't seem important. Failures teach lessons that cannot be learned any other way because they are seared into your memory.

Where we are going, there is no Pre-Mortem

There is this nice idea of conducting a pre-mortem. I think some companies call this the “Lunar Lander Game.”

Basically, you game out the areas where you think things might go wrong, so you can plan accordingly. I've tried this maybe a couple of times, but in the vast majority of failures you really have no control over the precipitating event. The car crash of a project or product is already happening. It is a question of how you steer, where you pump the brakes, and whether or not your seatbelt is buckled and your car has airbags.

Also, in my experience there is already someone raising risks all along. It isn’t like nobody knows. It is just a question of where and when. We will go over budget. This is not sustainable. This solution is unsupportable. You get the point.

By the time you show up, the accident has already happened. Decisions have been made that set a course of action. The train has left the station. The deal must close by the end of the quarter. A key customer demands a certain feature, causing a reprioritization of your roadmap. It is always something.

It is a matter of how you respond to shape the events. Where do you try to intervene? How much do you let natural consequences happen?

Failure Triangulation: Team, Business & Tech

During one of my turnaround gigs in the past decade, sometime during the first year, my boss told me, "Our customers are resilient. They can take a lot."

This was profound and somewhat reassuring, which was the intent of sharing it with me. I think.

(I really don't remember if this was before or after product development took a serious flubbing in a summertime QBR and I was told that I wasn't making enough progress on turning things around with my team. I needed to stop the bleeding, but that is a different discussion.)

The point is that in some enterprise software segments that are sufficiently sticky, you have far more leeway than you think. When a tool is embedded in critical business processes and may take 6–9 months (or more) to integrate, it is not so easy to fire you as a vendor. You really have to fuck it up repeatedly. Even the most truly hostile customer (and they are few and far between) cannot drop you as fast as they want to, or say they want to. They don't want to change vendors. They want you to succeed.

Customers will of course threaten. I've done it myself. It is their job. It is part of the vendor management game. They will complain to their board, which might have your competitors on it. They will threaten to go to that competitor. They will do a lot of things, but it is actually harder to rip and replace a core piece of software than you think. This doesn't mean you shouldn't communicate with them or be transparent. You have to go there. You can't freeze up and say nothing. You have to manage expectations. You have to be as transparent as you can. Make progress.

But in some cases you have a lot more rope than you think no matter what the customer-facing teams are telling you — or what they are saying to the execs.

On one of my many trips to India, I drew a triangle on a whiteboard as part of a discussion about where I thought the team was and what we needed to keep in balance (and consider) for success:

  • The Team — individually and collectively, are we growing? Do we feel safe? Do we have purpose, autonomy, and mastery? Is the compensation fair and appropriate? Are we firefighting all the time and spending 60–70% of our time on break/fix or work from others vs. building? Are we treading water, getting further behind, or repaying technical debt and innovating?
  • The Business — is your department or function meeting the expectations of the stakeholders? This could be delivering a service. This could be building new capabilities that can be sold, or supporting the teams that are customer-facing. Are you really at risk of losing customers for any number of reasons?
  • The Tech — this is what engineers (and architects) are primarily concerned with. Is what you are building “done done?” Is it built the right way? Is technical debt sustainable? Or are you piecing something together that you know is garbage? Good engineers want to build good things. They don’t want to cobble shit together.

My point is that if you think about a triangle of forces that you have to consider and keep in balance, what often has the most force (either pushing or pulling) behind it is the business. But you have to keep things in balance. Or else.

Protect your People

When determining how much you can fail, your first priority should be your people. What are the guardrails for the individuals when something goes wrong? Remember the car crash analogy from a few paragraphs ago. Are there airbags? Are seatbelts buckled? What is the human cost? What is the possible damage to organizational relationships? This is where continuing to use failure as a forcing function to educate and transform is so critical.

Most technical or project or business failures have a human cost. They induce stress. They cause long hours of work. Worse, there is re-work. They force people to have difficult discussions or explore places where folks don’t often go — or want to go. Legacy software. Tools someone else wrote. Flaws in thinking or processes or technology are exposed. This causes discomfort. We are not robots.

But many people love a good firefight. I know I do. And if I'm honest, starting fires is something I have to be careful about, because I've seen the effectiveness of a good controlled burn. Individual and group heroics are too often celebrated by leaders, with praise for those who went "above and beyond" to rescue a customer.

Maintain Credibility

Rough socio-technical situations that teams find themselves in are usually the result of poor business decisions. These could be sins of commission or sins of omission. Risks were accepted or not, but when you find yourself in a pickle, you aren't getting out soon. Contracts have been signed. Expectations set. Solutions will not be quick. Things might not get better soon, but what you can do is have open and honest discussions about the current situation within the teams responsible. And some of the shit you are in might be because it isn't clear who is responsible now. That needs to be figured out fast.

Owning the situation with all its flaws, and not sugar-coating or diminishing its gravity, is where you have to start. If it is a customer or business stakeholder, this means not debating their perspective. Acknowledge it. I'm ambivalent about whether you apologize or not, because apologies sometimes come off as performative, especially in written RCAs. Of course everyone is "sorry" for the outage, but does that even matter? This is business. This isn't about hurt feelings. This is about acknowledging reality and getting better.

The most damage that can be done is not reading the room, or being tone deaf. Once, a leader I reported to gave a sunnier-than-necessary (and out of touch with reality) All Hands the day before we axed a bunch of contractors and embarked on another round of cost cutting. I got questions from my team. I don't remember what I said (or whether I covered for him), but that was one of the "three strikes" for me in that role.

Weaponize Failure to Raise the Tech Stakes

In product companies, the customer may not be right, but they are the most right. The same request from a customer (whether that is a resilience feature or a security vulnerability) will often be taken far more seriously than if it was internally reported, or agitated for. Highly visible failure is a forcing function that, if managed appropriately, can bring light and not just heat.

But, after all, you are playing with fire, so you need to understand how combustible the situation is. How fast can it burn? How much heat can the materials handle? Humans can only handle so much (you know, cognitive load and all), so keeping the fire under control and manageable is critical to rebuilding and reshaping whatever technical and organizational shortcomings caused the conflagration in the first place.