Reliability and performance in software tools

We shape our tools, and thereafter our tools shape us — Marshall McLuhan.

It’s been a couple of years now since I discovered my deep passion for building developer tools. I’m particularly interested in tools that become an indispensable part of developers’ workflows. Most such tools have one thing in common: they provide an exceptional developer experience.

I’m lucky that I work on developer tools at Netflix. As I grow into my career and gain more knowledge, I’ve started to develop mental models of software tools and what makes them particularly great.

Today, I’ll share some of those models, focused on software reliability and performance, which I think are the most important features of any software we build. I’ll also share some principles that I think could guide us in building reliable and performant software tools.

Why invest in tools

We developers love our tools and often spend hours or days customizing them for our workflow. We know it’s mostly justified, because it pays off (unless you’re me, who can spend a day choosing the right font for an editor). Every minute we save compounds and, in the long run, can have a huge impact, because it increases the velocity at which we deliver business value to users. Since the tools we create act as force multipliers for developers, and eventually for the business, it should be of foremost importance to invest in them wisely.

Why reliability and performance are important

Since reliability and performance can mean different things depending on the type of software, we should define these terms in the context of software tools.

To me, a software tool is reliable if it performs its intended function and, upon failure, provides clear and actionable steps for its user to recover fully. I mostly see performance as a function of one variable: speed. The tools we build should be fast if our goal is to increase developer velocity. If a tool is fast, it's more likely to become an integral part of the developer's workflow. When a tool is slow and laggy, it's an expensive distraction for developers and can easily break their flow. Slow programs also tend to increase user-facing complexity, because users commonly route around slow workflows, and those workarounds are often frustrating in their own right.

To build tools that provide a delightful developer experience, we should treat reliability and performance as vital product features. Speed and reliability often go hand in hand, and speed becomes a good measure of quality and software craft. As with any product feature, we should improve these features incrementally, based on user feedback and defined principles.

Apart from these two software characteristics, other vital contributors to developer experience are accessibility, scalability, documentation, and maintainability. However, I intentionally left those for another blog post, and here I’ll focus only on performance and reliability.

Principles

The ideas above are relatively high-level, so I've broken them down into a small set of guiding principles that should be more actionable for engineering teams.

I see these principles as heuristics rather than rules to obey at any cost. In software engineering, tradeoffs are everywhere, so treat them as mental models to apply when working with software tools.

Optimize for typical developer workflows

Tools should be fast for everyday developer tasks and workflows. If developers can complete repetitive tasks faster, they will ship features sooner and deliver more value to customers. For each tool, it's crucial to identify the most common developer workflows and try to optimize those workflows first.

One good way to identify critical workflows is to examine the tool's main objective and how it fits into engineers’ workflows. This can help prioritize the right features on the roadmap and help developers do impactful work.
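Usage data can complement that examination. Here is a minimal sketch, assuming the tool already logs each workflow run as a pair of workflow name and duration in seconds. The event format, the `rank_workflows` helper, and the sample numbers are all hypothetical and only illustrate the idea: rank workflows by the total time developers spend in them, so that frequent or slow workflows rise to the top.

```python
from collections import defaultdict
from statistics import median


def rank_workflows(events):
    """Rank workflows by total time spent, largest first.

    `events` is an iterable of (workflow_name, duration_seconds) pairs.
    This log format is an assumption made for the sketch.
    """
    durations = defaultdict(list)
    for name, seconds in events:
        durations[name].append(seconds)

    ranked = [
        {
            "workflow": name,
            "runs": len(times),
            "median_s": median(times),
            "total_s": sum(times),
        }
        for name, times in durations.items()
    ]
    # Total time is a rough proxy for how much a speedup would pay off.
    ranked.sort(key=lambda row: row["total_s"], reverse=True)
    return ranked


# Hypothetical usage log for a command-line tool.
events = [
    ("build", 40.0), ("build", 50.0), ("build", 45.0),
    ("deploy", 120.0),
    ("lint", 2.0), ("lint", 3.0), ("lint", 2.5),
]

for row in rank_workflows(events):
    print(row)
```

Total time is only one possible ranking. A team might instead weight by how many distinct developers hit a workflow, or by how often it interrupts their flow.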

Sometimes this requires eliminating unnecessary features that introduce mental clutter and degrade performance. Keeping our tools small and embracing the idea of doing one thing and doing it well can be key to building high-performance workflows and tools. If we decide to add new features, we should always ask how they contribute to the main objective.

There should be two-way communication between tools and users. If we intend to make self-serve tools that are easy for developers to use, we should make them more interactive, with a fast feedback loop. This means that a tool responds to every user action and provides clear next steps so that users can complete their intended workflow.

To make tools responsive, we should prioritize performance and make them fast. A fast tool with a tight feedback loop keeps the user within the current task context and doesn't require constant context switching. This significantly improves the developer's velocity and experience.

Measure reliability

The tools and user interfaces we build should reliably complete the tasks users give them. Without reliability, it's hard to build trust: users lose confidence in the tools and eventually stop using them. Of course, no software is bug-free, but we should be more proactive and less reactive in how we approach bug fixing.

To get there, we should know what's broken and understand its impact on developers' workflows. It's sometimes okay to rely on intuition, but intuition can lead to oversights, and without data it is hard to measure the impact of a bug or incident. If we can measure reliability, we can better prioritize bug fixes and other tasks.

It's futile to aim for 100% reliability, but we should define loose SLIs (Service Level Indicators) and SLOs (Service Level Objectives) for our tools so that we can measure and improve reliability over time. SLOs set the right expectations for customers and help us decide when to prioritize performance and reliability work over other features.
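To make this concrete, here is a minimal sketch of an availability-style SLI and the error budget that follows from an SLO. It assumes we can count successful and total runs of a tool over some window. The function names, the 99% target, and the 25% threshold are illustrative assumptions, not a prescribed setup.

```python
def availability_sli(successes, total):
    """SLI: fraction of tool runs that succeeded in the window."""
    if total == 0:
        return 1.0  # Assumption: no runs means nothing failed.
    return successes / total


def error_budget_remaining(successes, total, slo_target):
    """Fraction of the error budget still unspent.

    1.0 means no failures yet, 0.0 means the budget is used up,
    and a negative value means the SLO has been missed.
    """
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - successes
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def should_prioritize_reliability(budget_remaining, threshold=0.25):
    """Hypothetical policy: favor reliability work once little budget is left."""
    return budget_remaining < threshold


# Hypothetical window: 2,000 runs, 10 failures, and a 99% SLO.
sli = availability_sli(1990, 2000)
budget = error_budget_remaining(1990, 2000, slo_target=0.99)
print(f"SLI: {sli:.3%}, error budget remaining: {budget:.0%}")
print("Prioritize reliability work:", should_prioritize_reliability(budget))
```

In this example, roughly half of the error budget is spent, so under the assumed policy feature work would continue. The useful part is less the exact numbers than having an agreed signal for when reliability work should move ahead of features.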