Technical Debt Horror Stories
If your organization has a lot of technical debt, chances are something (or everything!) will go wrong one day. Do you keep those dumpster fires private or share them with the world? As it turns out, loads of people are happy to share them, so here is our collection of the scariest and the most entertaining tech debt stories. Enjoy!
Most people have encountered the silent company killer (tech debt) by now, and luckily for us, they’ve been sharing their personal failures, complaints about the worst databases, and nuclear-level accidents that cost millions of dollars.
- How to lose $462 million in 45 minutes
- One Small Thing to Destroy a Bank’s Reputation and Ruin a Team
- Instagram Database Catastrophe
- The “wants-to-be-technical” manager nightmare
- The “first job” sins
Management Warning: This post may cause ostrich-itis. You might not want to see how bad it can get. Definitely don’t bring these up next time you’re debating new features vs technical debt during sprint planning!
How to lose $462 million in 45 minutes
By Simon Sharwood, APAC Editor at The Register
What happened?
Knight Capital lost $462m in 45 minutes because of resurrected dead code.
In June 2012, the New York Stock Exchange (NYSE) received permission from the SEC to launch its Retail Liquidity Program. Designed to offer individual investors the best possible price, it was set to go live on August 1st. Thus, trading houses had roughly a month and a half to write code and take advantage of the new feature.
Knight’s new RLP code was written to replace dormant code in its SMARS order router, left over from a discontinued feature called Power Peg, and the team intended to delete that old code.
When did it go wrong?
“During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.”
On the day when the software went live with the new feature, “Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server.”
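To make the mechanism concrete, here is a minimal, purely hypothetical Python sketch of that failure mode: a flag whose meaning has been repurposed still routes into a dormant legacy handler on any server the deployment missed. None of the names below come from Knight’s actual system.

```python
# Purely illustrative sketch, not Knight's code: a repurposed flag still
# reaches a dormant legacy handler on any server the deployment missed.

def handle_rlp(order):
    """New Retail Liquidity Program path: one child order, then done."""
    return [f"child order for {order['shares']} shares"]

def handle_power_peg(order):
    """Dormant legacy path: it no longer tracks fills, so it keeps slicing
    the parent order into child orders (capped here only so the sketch ends)."""
    return [f"child order #{i}" for i in range(10_000)]

def route_order(order, new_code_deployed):
    # The repurposed flag means "RLP" on updated servers but still triggers
    # the Power Peg path on the one server that never received the new code.
    if order["flag"] == "repurposed":
        return handle_rlp(order) if new_code_deployed else handle_power_peg(order)
    return []

print(len(route_order({"flag": "repurposed", "shares": 100}, new_code_deployed=True)))   # 1
print(len(route_order({"flag": "repurposed", "shares": 100}, new_code_deployed=False)))  # 10000
```

The point is not the specific code but the pattern: as long as old code paths stay deployed, any reused signal can wake them up.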
How did it end?
The unpatched server therefore kept creating "child" orders for more shares than Knight or its customers wanted. So many orders, in fact, that some stock prices fluctuated wildly, and Knight was left holding shares nobody wanted, acquired at prices nobody was willing to pay. By the time the market had moved on, Knight was sitting on $462m of losses.
One Small Thing to Destroy a Bank’s Reputation and Ruin a Team
By Willem-Jan Ageling, Co-founder/editor of Serious Scrum
What happened?
The Tidal Wave team was getting ready for a big day: this Sprint they would make the product available for the complete international internal payment flow of the bank, with millions of payments totaling billions and billions of dollars. But it did not go as planned.
When did it go wrong?
There was a small thing that Claire from the Development Team was worried about:
“Puneet thinks he found the cause of the issue in production. It has to do with the shortcut we chose to apply instead of a rigorous, time-consuming alternative. It doesn’t work properly. Fixing it will take about half the capacity of our current Sprint.”
Jesse, the Product Owner, disagreed:
“We really shouldn’t, Claire. I don’t want to spend all this time on this. The first analysis tells us that we can easily work around it. I propose that we make sure we all know the work-around and do the fixing later. We have some very important new stuff in the backlog that requires our full attention.”
“It’s only a first assessment, Jesse. We’re not 100% sure that we found the cause.”
“Well, Puneet did the analysis and he knows the code inside and out. He is certain. I trust him fully on this. I will also put it on the Product Backlog.”
On the day of the big launch of the Payment Router, everything works smoothly.
At 11:30 PM, when the party is over and the team is ready for a good night’s sleep, they get a call from the help desk: there’s a flat-liner on the monitoring screen; it looks like processing has stopped.
Puneet and Claire try to fix the issue, but the problem is more complicated than they initially thought.
There’s an enormous delay in the processing of the payments. Everything piles up. The more Claire and Puneet dig into the code, the more they are at a loss as to what is causing the issue. On top of that, they are getting really tired.
By 11 AM, the team manages to get the Payment Router up and running again with a band-aid.
On the one hand, they are happy that the system works again; on the other, the work-around appears rather shaky.
They decide to create a band-aid on top of the band-aid. By now they have also decided to work in shifts: John, Carlos and Alice will work from 8 AM to 9 PM; Puneet, Karthik and Claire will take the 8 PM to 9 AM shift. With this they should be able to survive the first week, long enough to resolve the issues.
How did it end?
Three months after the launch, the situation has improved considerably. In the first week after the launch, the team had around 20 ‘priority 1’ incidents per 24 hours (six of which arrived in the middle of the night). Now they are down to one ‘priority 1’ incident per day.
The team managed to save the bank’s reputation. But ignoring technical debt had cost the bank millions of dollars. Extreme efforts to repair the situation saved the bank from losing hundreds of millions.
The team grew weary of working 90 hours a week. Puneet called in sick with burnout symptoms. Claire threatened to leave the company if she had to keep working on the Payment Router team; she moved to a different team.
The remaining four do the best they can to continue improving the stability of the system. Their only reason to remain on the team is their sense of duty.
👻 Tell us your own horror stories!
Instagram Database Catastrophe
By Falon Fatemi, Contributor, and Mashable
What happened?
Instagram nearly fell prey to the growing pains of scale early on. When the Instagram team launched its iPhone app in October 2010, they ran the operation off a single server in LA. But after an onslaught of traffic nearly crashed that server, Instagram pivoted in three days to an EC2-hosted database.
When did it go wrong?
On October 6, 2010, Instagram launched its mobile photo-sharing service for iPhone. Within six hours, the back-end operation was completely overwhelmed.
Instagram officially went from a local server-run operation to an EC2-hosted shop in the wee hours of Saturday morning, October 9, 2010. Co-founder Mike Krieger compared the transfer to open-heart surgery, and he now works to preemptively address technical debt before it leads to catastrophe.
"It was a really rough night," says Krieger. "The hardest part was the database transfer — the database was the heart and soul of the app."
Since the switch, Instagram keeps up with the pace of service activity by adding machines to the Amazon cloud when needed and preemptively troubleshooting the app and system activity.
"Amazon can make you lazy," says Krieger. "We made a lot of dumb mistakes early on. Now the science is going in there and analyzing individual actions, diving into exactly what's going on."
How did it end?
Krieger talks of no longer throwing more machines at scaling problems. Instead, the Instagram team has learned to improve the efficiency of application activity to get the most out of the machines already in use.
Instagram set up data-logging graphs for nearly every in-app activity and system process using the open-source package RRDtool. The graphs track everything from network speed between actions, CPU activity, and memory status to how long it takes for a user to "like" a photo.
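For flavour, here is a minimal sketch of that kind of per-action logging, assuming the python-rrdtool bindings; the metric name, schema, and function are invented for illustration and are not Instagram’s actual setup.

```python
# Illustrative only: log how long a "like" takes into a round-robin database
# that graphs can later be rendered from. Assumes `pip install rrdtool`.
import time
import rrdtool

rrdtool.create(
    "like_latency.rrd",
    "--step", "60",                 # expect roughly one sample per minute
    "DS:latency_ms:GAUGE:120:0:U",  # the measured value
    "RRA:AVERAGE:0.5:1:1440",       # keep 24 hours of per-minute averages
)

def timed_like(photo_id):
    start = time.monotonic()
    # ... perform the actual "like": database write, cache update, etc. ...
    elapsed_ms = (time.monotonic() - start) * 1000
    rrdtool.update("like_latency.rrd", f"N:{elapsed_ms:.1f}")

timed_like(photo_id=42)
```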
The “wants-to-be-technical” manager nightmare
By Michael V Thanh, Owner of Static Void Academy | Software Engineer
What happened?
“A bunch of us working on a project were all young, junior engineers. That in itself isn’t a problem if you have strong oversight and management, but we also lacked that.
Our manager at the time was what you’d call a “wants-to-be-technical” manager. As a manager, your job is no longer to code; it’s to manage. But instead of managing, they were always just coding, and as a result the rest of us didn’t have any supervision, direction, or anyone to review PRs.
With no direction, every day was a firefight. Sometimes we’d work on a task for a few days only to scrap all that work because it turned out not to be what a customer wanted. Comically disastrous.
Anyway, you had all of these junior developers sort of just slapping together features in an application that slowly ballooned out of control. Because we were rushing just to get stuff working, ludicrous code and bad practices popped up everywhere:
- Changing global variables from random places in the code. Generally speaking, if you use global variables (a strong if; global variables are just bleh in general), you want to make them immutable. Otherwise, you get to do what I did: stay up until 3 AM digging through thousands of lines of code, trying to figure out where this one variable keeps getting changed and breaking everything else.
- Mutating arguments passed by reference. It’s generally bad practice in object-oriented programming to modify a parameter within a function; it’s better to return a new copy of the object with the desired changes. There were several functions where people passed a list in and modified it directly. I wasted hours and hours trying to figure out why values were changing or disappearing, seemingly at random.” (See the sketch below.)
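The two pitfalls above are easy to reproduce. Here is a minimal Python sketch (invented names, not the project’s actual code) showing a mutable global being rewritten from far away and a list argument being mutated in place, alongside the safer copy-and-return alternative.

```python
# Pitfall 1: a mutable global that "random places" in the code can change.
RETRY_LIMIT = 3            # intended as a constant...

def some_faraway_function():
    global RETRY_LIMIT
    RETRY_LIMIT = 0        # ...silently rewritten here, so every other caller
                           # now behaves differently and nobody knows why.

# Pitfall 2: mutating a list that was passed in by reference.
def add_tax_in_place(prices, rate):
    for i, price in enumerate(prices):
        prices[i] = price * (1 + rate)   # the caller's list changes underneath them

def add_tax(prices, rate):
    # Safer: build and return a new list, leave the argument untouched.
    return [price * (1 + rate) for price in prices]

cart = [10.0, 20.0]
add_tax_in_place(cart, 0.2)          # cart is now [12.0, 24.0] -- a 3 AM surprise
taxed = add_tax([10.0, 20.0], 0.2)   # the original data stays intact
```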
How did it end?
“What was the culmination of the above? Near the end of my time at the company, I tried to pick up an effort to refactor a bunch of this code.
How’d I do? Utter failure. I found a 500-line function that I tried to split up into multiple functions. When I tried to refactor that one function, functionality in about 20 other places broke.
I abandoned that branch after poking at it for ~4 days.”
The “first job” sins
By Ian Wilson, Full Stack Developer
What happened?
“For my first largish project as a paid developer, I was tasked with putting together an admin dashboard. Not too complicated, mostly just CRUD operations. However, I committed a number of dev sins in the process.
When did it go wrong?
- I reused very little (i.e., I rolled most of the components myself). I thought I could handle it, but in retrospect I should’ve started with a React component set like semantic-ui or material-ui.
- Unpaginated requests to the backend (fetching all 1,000 or so resources on page load).
- Lousy use of constants (magic numbers and magic strings).
- Lots of monkeypatching to fix bugs. The PMs drilled in very hard that there should be fewer bugs, which put pressure on me to fix them quickly. With my lack of experience, that meant more hard-coding to patch them over.” (See the sketch below.)
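Two of those sins translate directly into code. Below is a minimal, generic Python sketch contrasting an unpaginated fetch-everything call and magic numbers with a paginated request using named constants; the endpoint, names, and the `requests` library are assumptions, and the original project was a React front end, so this only illustrates the idea, not that stack.

```python
# Generic illustration only -- not the original project's code.
# Assumes the `requests` library and a hypothetical REST endpoint.
import requests

API_URL = "https://example.com/api/resources"   # placeholder endpoint
PAGE_SIZE = 50                                  # named constant instead of a magic number

def fetch_everything():
    # The original sin: pull all ~1,000 resources on page load.
    return requests.get(API_URL, params={"limit": 1000}).json()

def fetch_page(page=1):
    # Paginated alternative: fetch only what the current view needs.
    return requests.get(API_URL, params={"page": page, "limit": PAGE_SIZE}).json()

first_page = fetch_page(1)
```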
How did it end?
“At this point it's annoying because I've had to live with my sins or slowly resolve them. If tasked with doing it again, I think I could save a boatload of time by having good test coverage.
Though I'm slowly improving it now, it has persisted for so long because at least it worked.”
👻 Do you have your own horror story? Tell us about it!
How can you avoid ending up on this list?
How can you ensure that your company is continuously dealing with and reducing its technical debt?
That’s exactly what we developed Stepsize for: a SaaS platform that helps your company manage technical debt by making it super easy for development teams to report and prioritise the most important pieces of technical debt that need addressing.
This helps software engineers and managers deal with technical debt without drastically altering their workflow. More importantly, it empowers them to quickly produce great software while fixing the most important pieces of technical debt.
If you’re looking for a sustainable solution for tech debt management in your engineering team, you need a tool that does this. For engineers, using Stepsize means no more context-switching, plus high-quality issues that can be prioritised by impact and fixed. And because it integrates with Jira, your PM will love it too.
Learn how to get started with Stepsize here. It works with Visual Studio, VS Code, and JetBrains IDEs.