You've built it, now you support it

Humans make mistakes, it’s a fact of life. Despite our best intentions, and no matter how many checks we put in place, sometimes things go wrong. It’s no different with software. To move at speed then it’s inevitable that production issues will happen. Sometimes it’s your code that doesn’t work as expected, sometimes dependencies break, sometimes infrastructure breaks. Sometimes that once in a lifetime event happens. Again.

This talk is a story about support. It’s a real life story of how what was a chaotic situation, with unhappy customers and developers, has been turned around to something that keeps our customers happy, and is even fun to be a part of.

single engineering team

Learning 1: you don’t need much when you start, short feedback loops are broken by too much process

two engineering teams

start to get chaotic

product teams have their favourite engineer to fix problems

-> introduce single slack channel for support

=> supporty: slack bot written in Typescript (by Bloom&Wild)

Supporty

file support tickets via Slack form, engineers reply, everything is handled in single Slack threads

gives a simple view of latest status
clarity of ownership
prioritisation
data to learn from

Clear priorities build trust

P1: Business Critical (SLA - 1 hour)
- customers unable to access the website
- production platform is inaccessible
- security issues
P2: Business Affected (SLA - 3 hours)
- issues with admin that block business usage but don’t delay deliveries
P3: Business Updates (SLA 24 hours)
- cosmetics

Learning 2: More Complexity = More Support, use tools at the right time to help bring order to the chaos

There is a trap when starting to use tools -> hide behind tools

we love it when you

flag issues
remember that there’s humons behind supporty
give as much detail
respond promptly

we’d rather you didn’t

slack your favourite engineer ..
…

The cost of an issue

P2 issue

reports the app not working in the US
why? we had put in some additional security restrictions
143 replies in the thread
12 different people
5 hours to identify and resolve

review: 1 star going to 5 stars after fixing (time well spend?)

=> statistics and data are your friend

Average open time vs Priority
Number of Requests vs Number of code changes needed: when no code changes needed indicates that part of the app is extremely complicated

Time and Cost of Support

Time and cost of support

=> Make it visible

Change Failure Rate = number of production incidents / the number of production changes, how often do we make changes that break production

simplisticly: number of deployments vs number of major incidents

Minimise Time to Recover

invest in good monitoring, alerting and observability
have runbooks for key scenarios
invest in fast deployment pipelines
run game days to practice
hold post mortems

Learning 3: Collect data so you can learn from it -> discover the cost of support

up until now everything was fine within business hours

after 6 pm what would happen? usually make its way to the CEO, then VP of Engineering receives a call, who tries to find someone

=> horrible situation for ones mental health

=> paid out of hours support

it’s fair
- not knowing when your phone will ring is not fair
- thinking your phone might ring in the middle of the night is not fair
it sets clear expectations
it’s good for mental health
it’s good for your company and your engineers
it’s not expensive really
- compared to the cost of a production incident

Learning 4: Make it easier and fairer out of hours support

But support is not just about user requests is it?

flaky tests
slow queries
…

=> gardening

even evil masters need a hobyy to escape the incompetence of their …

The Gardener

monitoring and fixing production issues
fixing staging
merging and testing security fixes
upgrades
…

advantages

enables engineers to understand areas outside of their domain
helps shared ownership
makes capacity planning clearer

makes a shared problem more clearly owned

Learning 5: supporting your system is not the same as supporting users, shared problems with single ownership

Scaling Challenges

less knowledge of the whole domain since it’s bigger
we find The Gardener isn’t working
- it’s weekly and therefore focusses on short term quick fixes
- doesn’t anchor responsibility
- when we really need capacity the gardner are often taken for high priority features

-> Each Squad has a Gardner

tied together in a Community of Practice
ownership and accountability
craft and capabilty
…

Making the investment visible -> Number of Stories: operational, tactical, strategic

Why didn’t we make a team of Gardeners?

horrible team to be in

Learning 6: …

You Built it, Now Support It

set clear expectations
introduce process and tools at the right time
collect data and learn
pay for on call
make ownership and autonomy your goal