I spent some time going through the Glasswing post from Anthropic, and what stuck with me wasn’t the model capability or benchmarks. It was the shift in responsibility.
We’re not just building software anymore. We’re managing behaviour.
For most of my career, building felt predictable. You write a function, define inputs and outputs, test edge cases, and ship. If something breaks, you trace it back to a line of code. There’s a clear chain of cause and effect.
AI systems break that chain. With something like Glasswing, the core problem is not “does it work?” but “how does it behave over time?” And that’s a much harder question to answer.
The Illusion of Control
Traditional systems give you the comfort of determinism. Even in complex distributed systems, there’s still a sense that everything is ultimately explainable. Logs, traces, metrics. You can reason your way through failures.
With AI, especially large language models, you lose that neat boundary.
You can prompt the same system twice and get slightly different answers. Most of the time it’s fine. Sometimes it’s better. Occasionally it’s worse in ways that matter.
For example, imagine you’re building a tool that summarises financial reports. On Monday, it produces a clean, accurate summary. On Tuesday, with slightly different phrasing in the input, it misses a key risk disclosure. No errors, no crashes, just a subtle degradation in quality.
In a traditional system, that kind of inconsistency would be unacceptable. In AI systems, it’s expected. The question becomes how you detect it, and what you do about it.
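One way to make that Monday-to-Tuesday degradation detectable is a post-hoc check that required content survives summarisation. A minimal sketch, assuming a fixed list of phrases to watch for (the list below is invented for illustration, not a real compliance standard):

```python
# Illustrative check: did required disclosures survive summarisation?
# The disclosure list is a made-up example, not a real compliance standard.
REQUIRED_DISCLOSURES = ["going concern", "litigation", "currency risk"]

def missing_disclosures(report: str, summary: str) -> list[str]:
    """Return disclosures mentioned in the report but absent from the summary."""
    report_l, summary_l = report.lower(), summary.lower()
    return [d for d in REQUIRED_DISCLOSURES
            if d in report_l and d not in summary_l]
```

A check like this won't catch every regression, but it turns one class of silent failure into an explicit signal.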
That’s where systems like Glasswing come in. Not as a feature, but as a necessity.
You’re Designing Boundaries, Not Just Features
One thing that becomes obvious is that building with AI is less about adding features and more about defining boundaries.
You’re constantly asking:
- What should this system never do?
- What is an acceptable level of error?
- How do we detect when it crosses that line?
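Those three questions can be encoded directly. A rough sketch, with invented rules and thresholds, of what a hard "never do" boundary plus an error budget might look like:

```python
import re

# Invented examples of "never do" rules for a support assistant.
FORBIDDEN_PATTERNS = [
    r"guarantee.*refund",   # never promise refunds the policy doesn't back
    r"legal advice",        # never position answers as legal advice
]

def violates_boundary(response: str) -> bool:
    """Hard line: True if the response matches any forbidden pattern."""
    return any(re.search(p, response, re.IGNORECASE) for p in FORBIDDEN_PATTERNS)

class ErrorBudget:
    """Soft line: track the error rate of reviewed outputs against a threshold."""

    def __init__(self, max_error_rate: float):
        self.max_error_rate = max_error_rate
        self.total = 0
        self.errors = 0

    def record(self, is_error: bool) -> None:
        self.total += 1
        self.errors += int(is_error)

    def exceeded(self) -> bool:
        """Detects when the system crosses the acceptable-error line."""
        return self.total > 0 and self.errors / self.total > self.max_error_rate
```

Real systems would use richer classifiers than regexes, but the shape is the same: explicit rules, an explicit budget, and an explicit signal when either is breached.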
Take a customer support assistant as an example.
It’s easy to get something working. You connect a model, feed it documentation, and it starts answering queries. It feels magical in the first demo.
Then reality kicks in.
A user asks about refunds, and the model confidently invents a policy that doesn’t exist. Another user asks about a technical issue, and the model gives a plausible but incorrect workaround. Nothing obviously broken, but the cost of being wrong is high.
So you start adding layers.
You constrain responses. You add retrieval. You introduce validation checks. You log outputs and review them. You build internal tools to evaluate responses at scale.
At some point, you realise most of your effort is no longer about the “assistant” itself. It’s about everything around it.
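Those layers compose into a pipeline around the model call. A sketch of that shape, where `retrieve_docs` and `call_model` are stand-in stubs rather than a real retrieval system or LLM API:

```python
# Sketch of the layers around a model call: retrieval, generation,
# validation, logging. The stubs below stand in for real systems.
import re

audit_log: list[dict] = []  # stand-in for real output logging and review tooling

def retrieve_docs(query: str) -> list[str]:
    # Stand-in for vector search / keyword retrieval over documentation.
    return ["Refund policy: refunds are available within 14 days of purchase."]

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "Refunds are available within 14 days of purchase."

def validate(answer: str, docs: list[str]) -> bool:
    """Crude grounding check: every number in the answer must appear in the docs."""
    corpus = " ".join(docs)
    return all(n in corpus for n in re.findall(r"\d+", answer))

def answer_query(query: str) -> str:
    docs = retrieve_docs(query)                      # retrieval layer
    prompt = f"Answer only from these documents:\n{docs}\n\nQ: {query}"
    answer = call_model(prompt)                      # generation
    ok = validate(answer, docs)                      # validation layer
    audit_log.append({"query": query, "answer": answer, "valid": ok})
    if not ok:
        return "I'm not sure about that. Let me connect you with a human."
    return answer
```

Notice how little of this is the "assistant": one line generates; everything else constrains, checks, and records.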
That’s governance.
Evaluation Becomes a First-Class Concern
Testing AI systems doesn’t look like writing unit tests.
You can’t just assert that given input X, the output must equal Y. There isn’t always a single correct answer. Instead, you’re evaluating quality across a distribution of possible outputs.
This is where the mindset shift really hits.
You start creating evaluation datasets. You define what “good” looks like. You run the system against hundreds or thousands of examples and look for patterns.
- Does it hallucinate under certain conditions?
- Does performance drop for longer inputs?
- Does it behave differently for edge cases you didn’t anticipate?
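That loop can be mechanised. A toy harness, with a deliberately simplistic keyword-overlap scorer standing in for a real rubric or model grader, bucketed by input length to surface the kind of drop the second question asks about:

```python
# Toy evaluation harness. The keyword-overlap scorer is a deliberately
# simplistic stand-in for a real rubric or model-graded evaluation.
def score(output: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the output (toy quality metric)."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / len(expected_keywords)

def evaluate(system, dataset: list[dict]) -> dict:
    """Average score per bucket, split by input length to spot length-related drops."""
    buckets: dict[str, list[float]] = {"short": [], "long": []}
    for example in dataset:
        s = score(system(example["input"]), example["keywords"])
        buckets["long" if len(example["input"]) > 200 else "short"].append(s)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items() if vals}
```

Run it over hundreds of examples and the per-bucket averages start answering the questions above with numbers instead of anecdotes.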
And even then, you’re not done.
Because behaviour can drift. Model updates, prompt changes, new data sources. Small changes can have non-obvious effects.
So evaluation isn’t a one-time task. It’s continuous.
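One concrete form of "continuous": keep a baseline of evaluation scores and gate every prompt or model change on regressions, the way you'd gate a merge on a test suite. A minimal sketch, with arbitrary metric names and threshold:

```python
# Regression gate: compare fresh eval scores to a stored baseline.
# Metric names and the 0.05 threshold are illustrative.
def regressed(baseline: dict[str, float], current: dict[str, float],
              max_drop: float = 0.05) -> list[str]:
    """Names of metrics whose score dropped by more than max_drop vs baseline."""
    return [name for name, base in baseline.items()
            if base - current.get(name, 0.0) > max_drop]
```

If the returned list is non-empty, the change doesn't ship until someone has looked at why.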
Glasswing, in that sense, is less about a tool and more about a philosophy. Treat evaluation as part of the product, not an afterthought.
Monitoring Is No Longer Just About Errors
In most systems, monitoring is about uptime and failures. Are requests succeeding? Are latencies within limits? Are there exceptions?
With AI, a request can succeed by every one of those measures and the response can still be wrong.
So you need a different kind of visibility.
You care about things like:
- Confidence or uncertainty signals
- Consistency across similar queries
- Deviation from expected behaviour patterns
For instance, if your AI writing assistant suddenly starts producing more verbose answers after a prompt tweak, is that a bug or an improvement? It depends on your product goals. But you need a way to notice it first.
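Noticing it can be as simple as tracking a behavioural signal over a rolling window and comparing it to a baseline. A sketch using response length as the signal, with invented baseline and tolerance values:

```python
# Behavioural drift monitor. Response length in words is the tracked signal;
# baseline_mean and tolerance values are illustrative.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_mean: float, tolerance: float, window: int = 100):
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance            # allowed relative deviation
        self.lengths = deque(maxlen=window)   # rolling window of observations

    def observe(self, response: str) -> None:
        self.lengths.append(len(response.split()))

    def drifted(self) -> bool:
        if len(self.lengths) < self.lengths.maxlen:
            return False                      # wait for a full window
        mean = sum(self.lengths) / len(self.lengths)
        return abs(mean - self.baseline_mean) / self.baseline_mean > self.tolerance
```

The monitor doesn't decide whether verbosity is a bug or an improvement; it just makes sure a human gets to ask the question.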
This is subtle work. It’s not as obvious as a 500 error in logs.

The Role of the Engineer Changes
This is the part I find most interesting.
As engineers, we’re used to precision. We like systems where we can reason about every branch and every outcome. AI forces you to get comfortable with ambiguity.
You’re no longer just implementing logic. You’re shaping behaviour under uncertainty.
That means:
- Thinking in terms of probabilities, not guarantees
- Designing feedback loops instead of one-off solutions
- Accepting that some level of imperfection is inherent
It also pushes you closer to product thinking.
You can’t separate “engineering decisions” from “user experience” as cleanly anymore. If your model is occasionally wrong, you need to decide how that shows up to the user. Do you add disclaimers? Do you provide sources? Do you allow corrections?
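Each of those choices can be made explicit in the response contract rather than left implicit in the UI. A sketch, with an invented confidence threshold:

```python
# Response contract that carries uncertainty and provenance to the user.
# The 0.7 threshold is an invented example, not a recommendation.
from dataclasses import dataclass, field

@dataclass
class AssistantReply:
    text: str
    sources: list = field(default_factory=list)
    confidence: float = 1.0

    def render(self) -> str:
        """Attach a disclaimer when confidence is low, and cite sources if known."""
        out = self.text
        if self.confidence < 0.7:
            out += "\n\nNote: I'm not fully certain about this. Please verify."
        if self.sources:
            out += "\nSources: " + ", ".join(self.sources)
        return out
```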
These are not purely technical questions.
A Concrete Shift in How We Build
If I compare a typical React app I’ve worked on with an AI-powered product, the difference is stark.
In a React app:
- Bugs are usually deterministic
- Testing is about correctness
- Monitoring is about failures
In an AI product:
- Issues are often probabilistic
- Testing is about quality and alignment
- Monitoring is about behaviour
That changes how you prioritise work.
You spend less time polishing edge-case logic and more time building systems to observe, evaluate, and guide behaviour.
Where This Is Headed
What Glasswing hints at is a future where every serious AI product has a layer dedicated to evaluation and control.
Not as an internal tool that a few engineers use, but as a core part of the system.
Because the alternative is flying blind.
And that’s risky when your system is generating outputs that users might trust.