Over the years, I have seen many ways of measuring engineering performance. Some were formal, some were informal, and some were not really called measurement at all. They still shaped how people were perceived, and that perception often started long before a dashboard or a performance review entered the picture.
The first layer is usually peer judgment. A teammate reads your code, hears your ideas, or sees how you approach a problem, and forms an opinion. Sometimes that opinion is fair. Sometimes it is mostly a reflection of style, preference, or local context. If someone does not see the system the same way you do, it is easy for them to read your decisions as weaker than they are. Not because the work is necessarily bad, but because it does not match how they would have done it. That is already a kind of performance signal, even if nobody names it that way.
That opinion can become contagious. Developers talk. Teams build narratives about each other. Someone can slowly become seen as a low performer because a few people frame their work that way, and others start looking through the same lens. The team starts to carry a shared sense of who is strong, who needs help, who slows things down, and who can be trusted with difficult work. Sometimes that sense is accurate. Sometimes it is incomplete, biased, or based on work that is easier to see than work that is more valuable.
Formal measurement often arrives on top of that human layer. A Product Owner may look at burndown charts, estimates, carryover, or whether the sprint is trending late. A lead may look at delivery, pull requests, comments, production bugs, feedback from the team, and the performance of the team itself. Those signals can help balance perception with evidence, but they can also create a false sense of objectivity. Many of them can be gamed. Estimates can move. Tickets can be sliced differently. Pull requests can be made smaller or larger depending on what gets rewarded. Comments can become performative. Even bugs in production need context, because some teams own riskier surfaces than others.

At some point, the question becomes less about which metric is right and more about what we are actually trying to learn. Are we trying to increase productivity so the team delivers faster? Are we trying to reduce risk? Are we trying to make work more predictable? Or are we trying to understand whether the team is healthy enough to keep performing over time?
From velocity to delivery health
I understand why engineering ends up under this pressure. In many companies, engineering is one of the largest cost centers, and many other teams depend on it. If marketing is waiting for a feature that might bring revenue, they cannot prove the idea until engineering ships it. If product has a roadmap commitment, it becomes real only when the system changes. That creates attention, and attention usually creates measurement.
Earlier in my leadership work, I leaned heavily into sprint visibility. Before COVID, I had dashboards on TVs showing sprint progress, stories going over estimates, and burndown charts drifting away from the plan. The intent was useful: unblock quickly, notice risk early, and challenge the team when the sprint was slipping. Those tools gave us visibility, but they also reinforced a narrow idea of performance. They made the visible work easier to discuss while leaving a lot of important work outside the frame.
Later, I started looking at DORA-style metrics. I wanted to understand how long it took for a story to move from active work to production, how long pull requests waited for review, and eventually how we might connect that to deployment frequency, change lead time, failed deployment recovery time, change failure rate, and deployment rework. We did not have all the instrumentation I wanted, and I was not able to push that as far as I would have liked. If production had been visibly on fire all the time, it probably would have been easier to justify. But our systems, even old ones that throw plenty of known and accepted errors, generally kept working.
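To make that concrete, here is a minimal sketch of how two of those signals could be derived from exported timestamps. The field names and sample records are hypothetical; in practice the data would come from the ticket system and the deployment pipeline, and the hard part is the instrumentation, not the arithmetic.

```python
from datetime import datetime
from statistics import median

# Hypothetical export: one record per change, with the moment active work
# started and the moment the change reached production.
changes = [
    {"id": "ABC-101", "started": datetime(2024, 3, 4, 9, 0), "deployed": datetime(2024, 3, 6, 15, 30)},
    {"id": "ABC-102", "started": datetime(2024, 3, 5, 10, 0), "deployed": datetime(2024, 3, 5, 17, 0)},
    {"id": "ABC-103", "started": datetime(2024, 3, 7, 11, 0), "deployed": datetime(2024, 3, 12, 9, 45)},
]

# Change lead time: how long value waits between active work and production.
lead_time_hours = [
    (c["deployed"] - c["started"]).total_seconds() / 3600 for c in changes
]

# Deployment frequency: deployments per week over the observed window.
deploy_dates = sorted(c["deployed"] for c in changes)
window_days = max((deploy_dates[-1] - deploy_dates[0]).days, 1)
deploys_per_week = len(deploy_dates) / (window_days / 7)

print(f"median change lead time: {median(lead_time_hours):.1f} hours")
print(f"deployment frequency: {deploys_per_week:.1f} per week")
```

Even a rough version of this turns "how long does value wait" from a feeling into a number the team can look at together.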
DORA still felt like an improvement because it moved the conversation away from whether a sprint looked clean and toward whether the delivery system was healthy. It asks better questions: can we ship safely, can we recover, do changes create instability, and how long does value wait before reaching production? But even that view is incomplete. DORA is useful for understanding software delivery performance. It does not, by itself, explain the full experience of doing the work. A team can improve delivery metrics while accumulating cognitive load, weakening architecture, burning people out, or losing the collaboration habits that made the speed possible in the first place.
When visible output becomes less reliable
Agentic AI made that limitation much harder to ignore. Large pull requests became easier to generate. Code review changed. The cost of producing code started dropping, but the cost of understanding, validating, and aligning that code did not disappear. If anything, it became more important.
That is the part that worries me most. If a team can generate more code in less time, it can also push more code toward production before anyone has really understood it. The risk is not only that the code is wrong. It is that the developer approving it may not be able to explain the tradeoffs, the edge cases, the assumptions, or how the change fits with the rest of the system. When that happens, review can become a checkpoint in name only. The team may still have pull requests, approvals, and deployments, but the actual understanding behind those steps is weaker.
That is where traditional activity metrics become dangerous. Pull request count, commit count, tickets closed, and lines changed were always weak signals when used alone. With AI, they become even weaker. A developer can generate more code than before and still make the system worse. Another developer may produce less visible output because they are pairing, reviewing, unblocking others, clarifying architecture, or preventing a bad direction from spreading through the codebase. The second person may be creating more value, but the logs may make the first person look busier.
This is especially true for senior developers and leads. A senior developer may spend time helping someone think through a problem, answering questions, testing an edge case, aligning with stakeholders, or preventing two parts of the system from drifting apart. A lead may spend hours in meetings that keep the team pointed in the right direction. None of that shows up cleanly in pull request metrics. If the code itself becomes more generated over time, the invisible work around the code becomes even more important.
The risk is an illusion of acceleration. More code appears. More changes move. More activity shows up in the systems we already know how to measure. But if more unreviewed or poorly understood code reaches production, the costs can surface later as inconsistent patterns, shallow abstractions, review fatigue, regression bugs, ownership confusion, or architecture that no longer feels coherent. That risk is no longer theoretical. A Guardian report about AWS incidents involving AI tools described outages where Amazon disputed that AI itself was the root cause and pointed instead to user error and safeguards. Even with that nuance, the example is useful because it shows why production access, peer review, context, and guardrails matter more when agents can act quickly. AI can amplify good engineering practices, but it can also amplify bad ones.
Why SPACE feels closer
That pushed me to look for broader ways to think about engineering performance. I came across SPACE, which stands for Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Around the same time, I also started looking into DevEx, or developer experience. The more I read about both, the more they seemed to support each other in what they try to measure.
What I like about SPACE is not that it gives a perfect answer. It does not. What I like is that it refuses to reduce engineering performance to a single number. DevEx then gives useful language for some of the signals that can feed that view. It focuses on the lived experience of developers and the friction they face while delivering software. Its three dimensions (feedback loops, cognitive load, and flow state) map naturally into SPACE. They can inform satisfaction and well-being, efficiency and flow, and even communication and collaboration. A slow build, a confusing codebase, an unclear requirement, an exhausting review process, and an AI tool that creates more cleanup than progress are not just annoyances. They are productivity constraints that SPACE needs a way to surface.
Satisfaction and well-being matter because a tired team can still produce output for a while. That does not mean the system is healthy. In the AI era, I think this becomes more important, not less. A tool can reduce cognitive load when a developer trusts it, understands the output, and can move faster without losing control. The same tool can add load when the developer constantly has to undo generated code, restart jobs, repair misunderstandings, or review changes that are too large to reason about comfortably. A monthly pulse on burnout, tool friction, cognitive load, and confidence could tell us something that delivery dashboards will miss.
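A pulse like that does not need heavy tooling. Here is a small sketch that averages anonymous one-to-five answers per question; the questions and the threshold are placeholders of my own, not an established survey instrument.

```python
from statistics import mean

# Hypothetical monthly pulse: anonymous 1-5 answers per question, per respondent.
responses = {
    "I ended most days with energy left":           [4, 3, 2, 4, 3],
    "Our tools help more than they get in the way": [3, 2, 2, 3, 3],
    "I can hold the changes I review in my head":   [2, 3, 2, 2, 3],
    "I am confident in what we shipped this month": [4, 4, 3, 4, 4],
}

ATTENTION_THRESHOLD = 3.0  # arbitrary cut-off for "worth a team conversation"

for question, scores in responses.items():
    avg = mean(scores)
    flag = "  <- discuss" if avg < ATTENTION_THRESHOLD else ""
    print(f"{avg:.1f}  {question}{flag}")
```

The output is a conversation starter, not a score. The point is to notice a trend before it shows up as attrition or incidents.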
Performance still matters, but I would rather measure it through quality and system outcomes than through raw output. Failure rate, regressions, production bugs, test reliability, system stability, and debt reduction all matter. Debt reduction is hard to measure cleanly, but there are useful signals: aging known-risk items, repeated incidents in the same area, dependency freshness, code hotspots, flaky tests, and parts of the system that only one person understands. Shipping fast is useful only if the system remains dependable. Producing more code can make a team look productive while the product becomes more fragile.
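Code hotspots are one of the easier debt signals to approximate. A rough sketch, assuming a local git checkout, is to count how often each file changed recently and then read the result next to incident and complexity data rather than on its own.

```python
import subprocess
from collections import Counter

# One rough hotspot signal: files that keep changing. This only counts recent
# churn; combine it with complexity or incident data for a fuller picture.
log = subprocess.run(
    ["git", "log", "--since=6 months ago", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

churn = Counter(path for path in log.splitlines() if path.strip())

print("most frequently changed files in the last six months:")
for path, count in churn.most_common(10):
    print(f"{count:4d}  {path}")
```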
Activity can still be useful, but only as context. Pull requests, commits, deployments, tickets, AI usage, and review participation can help explain what is happening. They should not become the goal. If a metric becomes a target, people will eventually optimize for the metric instead of the outcome. At the team level, activity can reveal patterns worth discussing. At the individual level, it can become unfair very quickly.
Communication and collaboration may be the hardest to measure, but they are also one of the areas I worry about most with agentic AI. If everyone starts working mostly through agents, some team conversations can disappear. Architecture decisions can become private interactions between one developer and a tool instead of shared decisions owned by the team. That is risky. Systems need coherence. If every developer and every agent follows a slightly different interpretation of how things should be done, the codebase can drift even while everyone feels productive.
Efficiency and flow complete the picture. Waiting on reviews, blocked work, flaky tests, slow CI, unclear requirements, interruptions, and context switching all reduce performance without always showing up as failure. This is where DevEx becomes especially useful as a measurement input inside SPACE. Review wait time, review burden, build time, test reliability, interruption load, and the ability to get into flow all say something about whether the team can turn judgment into software without wasting energy. AI may improve implementation flow while hurting organizational flow if it creates more review pressure, more coordination work, or more uncertainty about what changed and why.
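Review wait time is a good example of a DevEx signal that is cheap to compute once the events are exported. A minimal sketch, with hypothetical pull request data, could look like this.

```python
from datetime import datetime
from statistics import median

# Hypothetical export: when each pull request became ready for review and
# when the first human review arrived.
pull_requests = [
    {"ready": datetime(2024, 3, 4, 9, 0),  "first_review": datetime(2024, 3, 4, 15, 0)},
    {"ready": datetime(2024, 3, 5, 14, 0), "first_review": datetime(2024, 3, 6, 10, 0)},
    {"ready": datetime(2024, 3, 6, 11, 0), "first_review": datetime(2024, 3, 6, 11, 45)},
]

wait_hours = [
    (pr["first_review"] - pr["ready"]).total_seconds() / 3600
    for pr in pull_requests
]

# The median is less distorted by the occasional PR that waited for days.
print(f"median review wait: {median(wait_hours):.1f} hours")
print(f"longest review wait: {max(wait_hours):.1f} hours")
```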
Measuring without flattening
I am still working through how to collect these signals in a useful way. I do not want a surveillance dashboard, and I do not want a single productivity score. I want enough visibility to understand where the team is healthy, where the system is creating friction, and where a metric is telling only part of the story. A practical version might use SPACE as the organizing frame, DORA-style metrics for delivery performance, and DevEx signals like feedback-loop delays, cognitive load, and flow as inputs into the relevant SPACE dimensions. None of these should stand alone. The value is in the tension between them.
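To show what I mean by using SPACE as the organizing frame, here is one possible grouping of the signals mentioned in this post. The mapping is my own reading, not a canonical one, and none of these signals is meant to be read in isolation.

```python
# One possible grouping of the signals discussed above under the SPACE
# dimensions. My own reading, not a canonical mapping.
space_signals = {
    "Satisfaction and well-being": [
        "monthly pulse on burnout, tool friction, cognitive load, confidence",
    ],
    "Performance": [
        "change failure rate", "regressions and production bugs",
        "test reliability", "debt signals (hotspots, aging known risks)",
    ],
    "Activity": [
        "pull requests, commits, deployments, AI usage (context only, never a target)",
    ],
    "Communication and collaboration": [
        "review participation", "shared ownership of architecture decisions",
    ],
    "Efficiency and flow": [
        "change lead time", "review wait time", "build and CI time", "interruption load",
    ],
}

for dimension, signals in space_signals.items():
    print(f"{dimension}:")
    for signal in signals:
        print(f"  - {signal}")
```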
That means being careful about what gets surfaced and to whom. A metric interpreted without context can do real damage. Someone may see lower pull request activity and assume lower performance, when the person was mentoring, reviewing, handling incidents, or doing alignment work that prevented larger problems. Someone may see faster delivery and assume improvement, while quality is quietly dropping. Measurement should help teams ask better questions, not give leaders a shortcut around understanding the work.
The point, for me, is not to maximize engineering speed forever. If AI helps a team deliver features faster than before, the next question is not automatically how to squeeze out more. It is whether the team is delivering what the business needs at a sustainable pace, while staying healthy, communicating clearly, and keeping the system understandable enough to maintain.
Maybe I will change my mind in a few months and look at another framework. That would not surprise me. But I am convinced that measuring carefully is better than pretending we can manage this only by instinct. Bad measurement can be worse than no measurement if it creates fear, gaming, or unfair individual evaluation. The work is changing too quickly to rely only on old signals. The easier code becomes to generate, the more important judgment becomes, and the more careful we need to be about what we call performance.
- Patrick