Don’t Let AI Invert The Testing Pyramid


Recently, I came across an article titled “Quality Engineering with AI”. I have seen it shared with a dangerous level of enthusiasm within organisations. To the untrained eye, the article appears to validate a so-called “modern” testing strategy; to anyone who understands software economics and the mechanics of software delivery, it is plainly disconcerting.


‼️ Caution: Written in a moment of extreme professional clarity (and total lack of patience). Fuelled by equal parts whisky and disbelief. Handle with care: contains high traces of professional exasperation.


The article’s TLDR goes like this …

  1. Focus tests at the E2E level, as this will be more static than the code.
  2. Build your testing scope around the lines you will not cross (regressions, security, maintainability).
  3. You can’t review everything, so use E2E tests and AI reviews to find smells that you want to deep dive into.
  4. Use shift right testing to cover business context and pick up what your tests might have missed, feed back learnings as context to improve your AI-generated tests over time.

The idea in the first point, to focus on end-to-end tests, is justified by the notion that AI outputs are non-deterministic. The generated code will change continually, making it difficult for traditional tests (read: Unit Tests) to keep working. The author assumes that, given the amount of AI refactoring, the code becomes ephemeral, while the product and its features remain static. Therefore, testing should focus on that level. As a result, we should invert the traditional Testing Pyramid to include more end-to-end tests, using Behaviour Driven Development (BDD) to test the desired behaviours. Testing at this level should be more comprehensive, including errors and edge cases, to ensure that AI code changes do not introduce any regressions. According to the author, this would be more effective, even though it would be slower and less efficient. WTF!

Point two is a bit messy. I had to re-read the article several times to understand what the author meant. Basically, it boils down to the fact that AI can modify large amounts of the codebase unexpectedly. Therefore, there is a risk of introducing regressions, data leaks and sloppy code, resulting in a maintainability nightmare. Hence, the author once again advises end-to-end tests, as well as shifting right with observability, to avoid regressions; alongside security scans, penetration testing and reviews to prevent security issues such as data leaks; and cyclomatic complexity checks and human reviews to spot code smells. Security scans, code quality scans and penetration testing are certainly good initiatives, regardless of AI. Reviews to prevent security issues are delusional. We will come back to the end-to-end tests and shifting right later on.

Point three comes down to the AI generating too much code for the team to keep up with reviewing. Thus, we will use the AI to review the code that the AI generated. That is going to work well. Luckily, we have those slow end-to-end tests to save us, while the author also suggests shifting right when the team does not have the time to test. Sure, that sounds like a plan.

Lastly, point four essentially relates to the fact that the AI does not understand the intrinsic details of the business and domain, preventing it from generating the tests properly. To improve, the author suggests writing more verbose user stories to feed the AI, and, again, shifting right to observe where failures occur after releasing AI-generated code, to inform future tests (generated with AI?).

This all looks especially haphazard, and non-deterministic. Low-performing organisations already applied everything advocated here before AI even existed - except maybe for security and code complexity scanning, which was advanced back then but should be the default today - yet it still did not work. Imagine what will happen once these organisations start adopting AI. Yes, indeed, it will not look pretty. There will be casualties.

First of all, AI has not changed a single thing about the mechanics of software delivery. The principles are still valid.

There is a reason for the Testing Pyramid: feedback and economics. It goes together with Continuous Integration and the Deployment Pipeline from Continuous Delivery. We want plenty of Unit Tests. They need to be quick, so engineers can execute them all the time, for every change, as part of the Local Build, to learn whether they broke something. They signal whether we introduced regressions. The same goes for AI. With every code change by the AI, these tests must pass. We can even instruct the AI: as long as these tests do not pass, your job is not done. Start over.
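That gate can be sketched in a few lines. The `gate` helper and the toy suite below are hypothetical illustrations, not from the article:

```python
# Hypothetical gate: an AI-proposed change is only accepted once the
# entire unit suite passes; otherwise the AI must start over.
def gate(change, unit_suite):
    """Return True only when every unit test passes for the change."""
    return all(test(change) for test in unit_suite)

# Toy example: the 'change' is a function, the tests exercise its behaviour.
proposed_add = lambda a, b: a + b
suite = [
    lambda f: f(2, 3) == 5,     # happy path
    lambda f: f(-1, 1) == 0,    # edge: negative operand
]

print(gate(proposed_add, suite))          # accepted
print(gate(lambda a, b: a - b, suite))    # regression: start over
```

The point is the feedback loop, not the helper: the suite must be fast enough that running it after every AI change costs nothing.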

It is economics, because most regressions should be detected by Unit Tests, giving engineers the confidence to move on once these tests pass the Commit Build. Consequently, they must be extremely fast, allowing engineers to execute them many, many times as part of the Local Build, and machines to execute them remotely as part of the Commit Build, to uncover regressions quickly. It is downright impossible to achieve this level of fast feedback with end-to-end tests, because they lag; and, not to mention, they will never achieve the same coverage as Unit Tests, preventing them from uncovering some regressions.
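A back-of-envelope calculation makes the economics concrete. The counts and timings below are illustrative assumptions, not figures from the article:

```python
# Illustrative numbers: a healthy pyramid vs an inverted one.
unit_suite_seconds = 2000 * 0.01   # 2000 unit tests at ~10 ms each
e2e_suite_seconds = 200 * 30       # 200 end-to-end tests at ~30 s each

print(f"unit suite: {unit_suite_seconds:.0f} s per run")        # 20 s
print(f"e2e suite:  {e2e_suite_seconds / 60:.0f} min per run")  # 100 min
```

At twenty seconds a run, engineers execute the whole suite on every change; at a hundred minutes, they cannot, and feedback moves from seconds to hours.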

When code becomes ephemeral because the AI alters the code continually, and hence we cannot use Unit Tests anymore, something is seriously wrong with the use of AI. Teams need to reconsider how they use AI in that case.

Research by Lund University and CodeScene, confirmed by early field observations from ThoughtWorks, indicates that AI-assisted coding performs better on high-quality code. Consequently, we can only take advantage of the AI benefits and thus avoid the maintainability nightmare when supported by strong engineering practices, which also involves having a well-layered testing strategy. Engineering excellence has always been a competitive advantage, and it will always be.

Using AI on low-quality codebases poses considerable risks, requiring a high level of human intervention and assistance from code quality assessment tooling. However, according to CodeScene’s Code Red white paper, only 10% of organisations measure code health and technical debt, while engineers waste 42% of their time dealing with poor-quality code. At the same time, from experience, most codebases are a facade: a decent-looking house, supported by seemingly working end-to-end tests, yet sitting on a foundation of rotting timber. The tragedy is that people often do not realise the danger; they have never seen what a solid foundation is, nor what good design and quality code look like. So, good luck with that!

Focusing testing only on end-to-end tests is a mistake. Most organisations I encountered in the past tried that, but, as mentioned earlier, it did not work. Well, somehow, they managed to make it work, but at the expense of lead times, reduced feedback, declining quality, and rising staff fatigue. Trying to test all errors and corner cases with only end-to-end tests is inconceivable and simply unmanageable. There is a plethora of permutations and combinations that one cannot foresee. Another aspect is the over-reliance on end-to-end tests - the belief that we can change any line of code and the end-to-end tests will catch any regressions. However, this ignores the fact that end-to-end tests do not necessarily cover all modified code paths, due to the infinite combinations. In that case, the end-to-end tests become the curb appeal, providing a false sense of security. So, that’s that.

However, we do need Automated Acceptance Tests (not end-to-end tests). We implement them using BDD. Yet, they only cover the happy paths. The errors and corner cases are handled by Unit Tests, because it is significantly easier to reason about edge cases with narrow tests. We do have Smoke Tests, which are end-to-end in nature, to test the integration of the system after deploying, but before releasing. Is the backend there? Are third-party services available? Can we execute the most important transaction? However, we only have a couple of them, five at most. Ultimately, having a Vast Amount of High-Quality Automated Tests, implemented in a layered approach, is still a necessity with AI-assisted coding.
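The layering is easy to show on a toy example. The `transfer` function and its rules below are hypothetical, chosen only to show which level owns which case:

```python
def transfer(balance, amount):
    """Toy domain rule: withdraw an amount from a balance."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

# Acceptance level (BDD happy path):
# Given a balance of 100, When transferring 30, Then 70 remains.
assert transfer(100, 30) == 70

# Unit level: errors and corner cases, cheap to enumerate with narrow tests.
for bad_amount in (0, -5, 200):
    try:
        transfer(100, bad_amount)
        raise AssertionError("expected a ValueError")
    except ValueError:
        pass
```

The Smoke Tests would then only confirm, end to end, that the deployed system can execute that most important transaction at all.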

To train the AI to understand the domain, we do not need longer, more detailed user stories. No! That is called upfront design. It is once more a return to waterfall. We still want one-line backlog items, as long as they are good conversation starters. As before AI, we still need precise and concise specifications. When picking up the story, we start a conversation using Example Mapping and/or Event Storming to clarify the intent of the story. There is nothing new here. We already did that before. What is new, though, is that we can now feed those specifications to the AI. Teams are already experimenting with this and are seeing good results, provided the codebase is of high quality.

I am all for Shift Right, with monitoring and observability, when there was first a Shift Left. Shift Right alone is a bet we will lose, for sure, I give you that. Shift Right on its own is what happened with waterfall. We know how that went. It was no oil painting - definitely not pretty.

When I challenge this “end-to-end test-first” approach, the common defence is to reference the Testing Trophy, backed with Martin Fowler’s On the Diverse And Fantastical Shapes of Testing.

Whilst reading Martin Fowler’s article, I did not immediately understand the point they were trying to make. In particular, using Martin Fowler to uphold end-to-end tests is certainly a confident move. Then I understood: they conflate Integration Tests with Sociable Tests to justify avoiding Unit Tests entirely. It is a brazen misuse of authority to uphold a strategy that is, frankly, unsustainable and, eventually, will not scale.

Sociable Tests are Unit Tests focused on one single class in collaboration with other classes (that have their own Unit Tests). The fixture creates that object with all the necessary dependencies to exercise the object’s methods as part of the test. Solitary Tests, on the other hand, are mockist, London-style Unit Tests in which mocks replace the collaborators. So, Sociable Tests are definitely not Integration Tests. They are Unit Tests as defined by Michael Feathers. They do not test integrations. Integrations are tested with Contract Tests. In the end, Integration Tests are just a scam.
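The distinction is easy to show in code. The `PriceCalculator` and `TaxPolicy` classes below are hypothetical examples, not taken from Fowler’s article:

```python
from unittest.mock import Mock

class TaxPolicy:
    """Collaborator, covered by its own Unit Tests elsewhere."""
    def rate(self, region):
        return 0.21 if region == "EU" else 0.0

class PriceCalculator:
    """The single class under test."""
    def __init__(self, tax_policy):
        self.tax_policy = tax_policy

    def total(self, net, region):
        return round(net * (1 + self.tax_policy.rate(region)), 2)

def test_total_sociable():
    # Sociable Unit Test: the real collaborator participates.
    calc = PriceCalculator(TaxPolicy())
    assert calc.total(100, "EU") == 121.0

def test_total_solitary():
    # Solitary, London-style Unit Test: the collaborator is mocked.
    policy = Mock()
    policy.rate.return_value = 0.21
    calc = PriceCalculator(policy)
    assert calc.total(100, "EU") == 121.0
    policy.rate.assert_called_once_with("EU")

test_total_sociable()
test_total_solitary()
```

Neither variant touches a network, a database, or another process, which is exactly why conflating Sociable Tests with Integration Tests does not hold.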

In Conclusion

As one SoCraTes France participant once said, in the IT industry, people do not know their history. That is why Laurent Bossavit published The Leprechauns of Software Engineering. With every new technology introduced, the same mistakes repeat.

AI is not a “get out of testing free” card; it is, in all reality, the reason to double down on the basics. Again, the real shift is not AI. It is engineering discipline. It has always been.

Remember that writing code has never been the bottleneck. Analyse the bloody value stream. It will shout at us where the bottleneck is. What if the bottleneck is the market? If that is the case, we have been accelerating code delivery for something nobody wants. So, there we are, in all embarrassment …

Acknowledgements

Elizabeth Zagroba and Lisi Hocke for reviewing and providing insights to round out the article. My best friend Martin Van Aken for removing the sharp edges.

Definitions

Commit Build

The Commit Build is a build performed during the first stage of the Deployment Pipeline or the central build server. It involves checking out the latest sources from Mainline and, at minimum, compiling the sources, running a set of Commit Tests, and building a binary artefact for deployment.

Commit Tests

The Commit Tests comprise all the Unit Tests along with a small, simple smoke test suite executed during the Commit Build. This smoke test suite includes a few simple Integration and Acceptance Tests deemed important enough to get early feedback.

Local Build

The Local Build is identical to the Commit Build performed by the Deployment Pipeline or the central build server, except for some additional actions only performed by the Commit Build, in particular uploading build artefacts. See Run a Local Build for more details.