You know the old saw about writing hard code and debugging it? It goes like this:
Senior programmer: “What’s harder? Writing code, or debugging it?”
Junior programmer: “Debugging it for sure.”
“If you write the hardest, most tricky code you’re capable of, are you then qualified to debug it?”
“Uh, I guess not.”
The before and after of this scenario looks like this:
And you know what? Sometimes we end up in the ‘I have no idea what I’m doing’ camp without even deserving it.
Sooner or later you’ll end up debugging someone else’s crappy code and find yourself sitting there like that dog, rolling your eyes and looking nervously over your shoulder.
I want to talk to you today about what to do when you’re in that situation.
You’ve seen this before, right?
There are two ways of solving this problem.
The first goes like this:
That’s fine (especially for learning new things), because hey – sometimes it works!
Or you end up with this:
So now your problem just got a whole lot worse.
What’s the other approach? You already know it. After yanking randomly for a while, we all settle down, work out where the crux of the tangle is, and work from there.
With some gentle adjustments the headphones are free and you’re ready for some tunes.
Well, when confronted with an overwhelming problem (remember that dog?), it’s really tempting to just change things randomly in the hope that’ll fix something.
Sure, sometimes it will. But you’ll never feel good about that code again. Chances are you won’t want to touch it, and you’ll feel a little shiver of fear every time you hear about a problem with that part of the app.
That’s because “solving” a problem like that means we never got to the bottom of it or understood the root cause.
But if you do work on understanding the problem completely, you’ll come up with better solutions and you’ll be armed with new information and tactics the next time something like that comes up.
But what have you got if you just fixed it by changing stuff around? Nothing, except the ability to try changing stuff around in the future.
So here’s a little case study.
This goes back to when I was building an API in Java. The company I worked for had been engaged to build the data layer to a large and complex web application, and to do it in cooperation with another vendor.
The other vendor was building a caching and business logic layer on top of our data layer, so I spent a lot of time working on integration with them.
One day I got a call from my contact with the other vendor. She had some bad news.
“We’ve been doing some load testing and found that the response times on your end have been blowing out to 30 seconds.”
Oh wow. The absolute longest response time we could live with was 300 milliseconds. 30 milliseconds would be better. But a call that took 30 seconds to complete? What a disaster.
I sighed and made some arrangements to coordinate the load testing with her later that day. I then sent around a heads-up email to the project lead and the company boss.
Then I received an email from the other vendor and she had a little more bad news for me – “Those 30 second response times are where we time out. They could actually be taking longer.”
And already I was getting emails back from the project lead and my boss, in harmony:
“How much memory does the JVM have?”
“It’s probably doing STW garbage collections.”
So they both thought that the platform my code ran on was running out of memory and then doing a ‘stop the world’ (STW) garbage collection to release every last bit of unused memory.
So the first thing I try is reducing the memory on my local JVM (Java Virtual Machine – it’s what the Java server and application run on) and hammering it with requests. Things slow down, but I can’t reproduce the 30-second timeouts.
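Locally, that reproduction attempt looks something like this (the heap size and jar name are illustrative, not the real project’s):

```shell
# Cap the heap well below its normal size to force memory pressure,
# and log every garbage collection so STW pauses show up in the output.
# (Heap size and jar name here are made up for illustration.)
java -Xmx256m -verbose:gc -jar our-api.jar
```

If STW collections were the culprit, you’d expect to see long `Full GC` pauses in the `-verbose:gc` output lining up with the slow responses. They didn’t.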
By now it’s time to start load testing with my counterpart at the other vendor. We’ve updated our API configuration with more memory, more connections, more caching.
More! More! This is just pulling on different parts of the tangle and hoping it makes a difference.
We’re watching the logs, the garbage collector, the CPU, everything.
Re-running the test, we saw response times hovering around 250-300 milliseconds until one minute had passed, and then watched them climb smoothly to 30 seconds and stay there.
That’s the client’s response times. Our code was still reporting 250 millisecond completion times.
What? We send a response and the client gets it 30 seconds later? Or not at all?
In fact, once those response times blow out, our code is sitting around doing mostly nothing.
Now what? What can we possibly improve on our side? Not much it seems.
But what about the library we’re using for communication? It’s a common library used for SOAP interfaces. What does it have to say?
We re-configure our logs to include the log output of the SOAP library so we can see what it’s up to, and re-run the test.
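With `java.util.logging`, turning up a library’s log output looks roughly like this. A minimal sketch, assuming a JUL-based setup; the logger name `com.example.soap.transport` is invented for illustration, so check your SOAP library’s documentation for its real logger names:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class SoapLogging {
    public static void main(String[] args) {
        // Turn up the SOAP library's logger. The logger name here is
        // hypothetical -- your library documents the real one.
        Logger soapLog = Logger.getLogger("com.example.soap.transport");
        soapLog.setLevel(Level.FINE);

        // Handlers filter too, so the console handler also needs FINE
        // or the extra detail never reaches the console.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        soapLog.addHandler(handler);

        soapLog.fine("connection opened"); // now visible in the output
    }
}
```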
We’re shocked to discover that the SOAP library records our request and response in a little over 250 milliseconds.
But when requests start timing out, we see that the SOAP library begins opening connections and…leaving them open. What on earth could it be doing?
At this point watching log files wasn’t going to tell us anything further.
Finally, we ask our friendly co-vendor to re-run their test, and when requests start timing out I send a command to the JVM asking it to dump a stacktrace of every single thread it’s running at the time.
A thread is a job inside the JVM, and a stacktrace shows what command a thread was executing and how that command was called.
Threads do things like receive incoming connections, monitor database connection pools, run the garbage collector – like Shiva’s thousand arms they keep the server’s world running.
I spent an afternoon sifting through the threads, first to find the ones responsible for accepting incoming requests, then to see what they were doing, if anything.
What I found was that every single one of them was responding to a connection request from the vendor’s code and waiting for something to come down the line.
But nothing did. Turns out they were starting a request to our API but never finishing it.
Now, we had something to send our friendly co-vendor with a question attached – “Hello, it seems your client is opening connections and doing nothing with them, why is that?”
After some to-and-fro it turned out our friends at the other vendor couldn’t answer our question. The framework they were using was so ancient they couldn’t configure anything to do with connection management or logging.
Our mutual customer had been watching all of this very closely, and proceeded to yell at our friends at the other vendor until they upgraded their framework. And then the problem never happened again.
So finally, frustratingly, the problem was resolved with another version of changing stuff and seeing what happens.
These are my rules for approaching problems when I don’t know what’s going on:
Let’s ignore that our co-vendor solved their problem without understanding it. I could at least rest easy and know that if there were future performance problems, they wouldn’t be landing in our lap 🙂
But let’s imagine that what we originally tried – increasing memory, connections and caching – had made a difference. Or that the problem had temporarily gone away after we’d done that.
Then that code, original problem and all, goes into production and sits there, waiting to blow up.
It’s much better to understand a problem waaay down to the root cause, and sleep better at night because of it.