My Favorite Bug

Years ago, at a financial services company…

It started with some mild grumbling. Some people were complaining about getting signed out of our customer portal. But it was probably just user error, right? The reports kept coming, though. Then employees who worked on the product started to get signed out.

At that point we had to admit there was a problem: every so often, someone using the site was suddenly booted out.

It happens seemingly at random. If you have a bug that you cannot reliably reproduce, your goal is to gather enough information until you can. “Seemingly at random” is a euphemism for “caused by some piece that we aren’t considering.”

Being able to reproduce a bug is important. First, you want to be confident that you have fixed the issue. If you can’t be sure you’ll trigger the bug, you can’t be sure your fix worked. Second, the factors that go into triggering the bug give you insight into where the problem is. Working toward a consistent reproduction is helping you triangulate the problem. So we started gathering information in hopes that we could reproduce it.

One thing you can do is ask yourself “what changed?” One of the things that changed was that we were rolling out a new version of our in-house PHP framework. This felt important, but also not, since the affected site was on the old framework. What’s more, both frameworks used the same way to delete sessions. Something like:

sprintf('DELETE FROM `sessions` WHERE `expires_at` < %d', now());

Which is a pretty reasonable way to expire sessions. The session creation code was similarly reasonable:

sprintf('INSERT INTO `sessions` (`cookie`, …, `expires_at`) VALUES (%s, …, %d)', $cookie, …, now() + SESSION_LENGTH);

Jason, one of our analytics folks, was able to add HTTP monitoring. We figured out what the problem looked like from the browser side (a request with a session id setting a different session id in less than n seconds) and started logging how often it happened.
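The signature we monitored for can be sketched as a simple predicate. This is a hypothetical illustration only; the field names and the threshold parameter are invented, and our real monitoring lived in Jason's tooling:

```javascript
// Hypothetical sketch: flag a request that carries session cookie A but
// whose response sets a *different* session cookie B within some small
// threshold of A being issued.
function looksLikeEarlyExpiry(request, response, thresholdSeconds) {
  return Boolean(
    request.sessionId &&
    response.setSessionId &&
    response.setSessionId !== request.sessionId &&
    response.timestamp - request.sessionIssuedAt < thresholdSeconds
  );
}

// A session replaced 2 seconds after being issued looks suspicious:
console.log(looksLikeEarlyExpiry(
  { sessionId: 'a', sessionIssuedAt: 0 },
  { setSessionId: 'b', timestamp: 2 },
  5
));
```

Counting how often that predicate fired gave us the frequency data below.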

What we learned from that was that the frequency was accelerating. The longer it took us to fix, the worse it was going to get.

We still couldn't reliably reproduce it at this point, so we had to keep gathering information. I was tempted to deploy speculative fixes to see if they worked. This is dangerous: if you don't understand the problem, you don't know whether a fix will inadvertently make things worse. Remember: if it seems random, you don't understand it.

Adam, one of the devs on the team, added logging to our session delete code. Before it ran the DELETE statement, it did something like

debug_log(query(sprintf('SELECT cookie, expires_at FROM `sessions` WHERE `expires_at` < %d', now())));

so we could see exactly which sessions were being deleted. That’s when things started to get weird.

We would compare that log to our HTTP monitoring, find the session, and see that based on the timing it definitely should not have expired. But the session logging said that, based on the time it was logged, it definitely should have been deleted.

At this point we reach the “Eureka” moment. For context, “Eureka” is what old Archimedes shouted as he jumped out of the bathtub. While the bathtub is clearly useful in realizing that water displacement can measure volume, I don’t think enough credit is given to the bathtub as a place of thought in that story. I consider showering (along with dog walking, driving, and sleeping) to be an important part of debugging.

And so I was, when I was trying to figure out how to reconcile the two facts:

  1. Sessions were expiring before they should be
  2. Sessions were expiring on time

I realized the disconnect, hopped out of the shower, and sent an email to our Unix admins asking them to check the clocks on our production servers. Sure enough, one of the load-balanced servers wasn't running ntp and its clock had drifted out of sync.

What was happening was that whenever the session code ran on the rogue server, now() was wrong. That meant sessions created on it would live extra long, while sessions deleted on it would die young.
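A toy model of the failure, in JavaScript for illustration (all numbers are hypothetical; the real code was PHP):

```javascript
// One load-balanced server's clock has drifted ahead by SKEW seconds.
const SESSION_LENGTH = 3600; // sessions live 1 hour
const SKEW = 900;            // rogue server runs 15 minutes fast

const trueNow = 1000000;                  // "real" wall-clock time
const goodNow = () => trueNow;            // healthy server's now()
const rogueNow = () => trueNow + SKEW;    // drifted server's now()

// A session created 50 minutes ago on a healthy server; it has
// 10 minutes of legitimate life left.
const session = { expiresAt: (trueNow - 3000) + SESSION_LENGTH };

// Cleanup on a healthy server keeps it...
console.log(session.expiresAt < goodNow());  // false: not expired
// ...but cleanup on the rogue server kills it 10 minutes early.
console.log(session.expiresAt < rogueNow()); // true: "expired"
```

The same skew applied in reverse on creation: a session created on the rogue server got an expires_at fifteen minutes further in the future than it should have.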

Why was it accelerating? It turns out that in V1 of our framework the session-delete code was mistakenly never hooked up; that was fixed in V2. PHP runs the session cleanup function probabilistically, on roughly one out of every n requests. We were moving more and more traffic to V2, which meant the cleanup function was being called more and more often.

I like debugging. Well, I like having debugged. It can be maddening in the middle of it, but there is a lot of satisfaction in solving a puzzle. A bug is a puzzle with stakes: if you are spending your time on a bug, that means it's important. There's also satisfaction in finally arriving at a simple, testable theory of a bug.

I like this bug because it required us to understand so many pieces that make up the system: HTTP requests, session persistence, load balancing, and server clocks. It’s also nice when the fix is to adjust the clock instead of recode your entire app.

Image credit: Public Domain image NH 96566-KN The First “Computer Bug”

Keeping up with “modern web technology”

Chris has a pretty good point. When you sign up to be a developer, you are signing up to never stop learning. That is true no matter what you do in life, but doubly true in computers, and triply so if you're putting things on the web in 2016.

The fatigue is real. And the meta-fatigue. But each complication that we add to our toolbox needs to justify its existence. Especially new tools. Are we adding them because we need them or because we want an excuse to try them out?

When talking about “modern web technology” you could easily be talking about React, Angular, Angular 2, TypeScript, elm, async/await, Web Sockets, Web Workers, microservices, GraphQL. I can come up with some pretty good reasons for not using them.

When you find out what technologies Chris is trying to use, you may get a little more sympathy for his predicament. I did.

What can you do in a situation like this? Part of the job of a software engineer is to be able to justify your decisions. That means being able to point out the benefits that outweigh the costs, and being cognizant of both. The reason the build chain is so complicated is that each link in that chain adds value. If you want to use a tool, you need to be able to explain the problems it solves.

That said, sometimes you just have to recognize when you’re in an environment that doesn’t want to change. Given that your career hinges on changing and learning, you should give a good hard look as to whether your current environment is long-term beneficial for your career.

One last thing on the subject of using work projects to try out new technologies: I kind of like the Choose Boring Technology camp. It may actually be better stated as "Choose Mostly Boring Technology," since it's about limiting yourself to one or a few new technologies at a time. But it's critical to always be trying out new things. It's critical that companies ensure people have space to try new things. It's critical to your career to have that space to try new things.

Photo credit Ashim D’Silva

Testing Code: Intentional vs. Incidental Behavior

Sometimes a maintenance programmer is another developer with an axe. Sometimes the maintenance programmer is ourselves after we have forgotten all about the project. I want whoever gets it to be happy and confident when they make changes.

Having inherited projects in the past (both from others and from past-George), I think it’s important to make sure that a project comes with good automated tests.

Hopefully, in 2016 that isn’t a controversial statement. I do want to explicitly enumerate some of the value tests provide the maintenance programmer:

  1. Testing forces us to simplify and decouple our code, which makes maintaining a system easier. If it’s hard to test, it’s going to be hard to maintain.
  2. Tests serve as living, executable examples of how to use the code.
  3. Testing infrastructure lowers the barrier to entry. It's easier to add one more test when other parts of the code already have tests, and the existing suite gives someone getting up to speed examples to mimic.
  4. Good tests define what behavior is intentional.

Let’s talk about #4.

Virtually all code has both intentional behavior and incidental behavior. Incidental behavior is what you get when a detail doesn't matter to the functionality being developed. An example would be when your UI displays a list in whatever order the datasource provides it. You probably shouldn't write a test for that order; there's no specified behavior.

In a codebase with tests for the intentional behavior, a maintainer can be confident that any changes they make aren’t unknowingly undoing past decisions. The tests answer the question “is it supposed to be that way?”

When retroactively adding tests (post-hoc testing) you run the risk of documenting the code instead of the requirements. If you find yourself writing tests with the goal of verifying the code works, you aren’t confirming that the code meets the requirements. You’re confirming that the code is the code.

Should you find yourself writing post-hoc tests, what can you do? Try to put yourself in the shoes of the original author. Use git blame to discover why the code was written. Write a test that verifies the original feature and no more.

Finally, I suspect that only testing intentional behavior is in conflict with 100% (or any) code coverage targets. If you have 100% coverage, either you are writing tests to cover code that isn’t part of the specification or you have a specification that is comprehensive to a fault.

Towards an understanding of technical debt – Laughing Meme

I’ve spent the last few years rather flippantly stating, “Technical debt doesn’t exist.”

What I was trying to say was, “I’m deeply uncomfortable with how our industry talks about this thing, I think it’s probably harmful, but I don’t know quite how to express what I mean. Also, given that everyone seems to be really focused on this tech debt concept, I’m kind of worried that the problem is me, not the term or the industry”.

When I first heard Peter Norvig say, “All code is liability”, that felt closer to right.

via Towards an understanding of technical debt – Laughing Meme

Ubuntu on Windows

Windows 10 Anniversary Update came out last week. I haven't heard many folks talking about one of the most interesting features I've seen from the New Microsoft: Linux.

I think part of the reason people aren’t excited is because they don’t understand what this is:

  • It’s not a virtual machine
  • It’s not a container like Docker
  • It’s not Cygwin – you can run pre-compiled ELF binaries

This is basically WINE in reverse – it translates Linux system calls to Windows system calls at runtime. That means it can run pretty much anything that Linux can. To take it for a test drive, I installed nvm and node, then Calypso and then ran make run. It worked! (I had to sudo apt-get install gcc g++ make first, but that’s just more proof that this is Ubuntu)

I love developing on Unix-like systems, and OS X (now macOS) has been a great balance of a Unix system with good consumer support. However, I'm concerned that so much of Apple is focused on iDevices that they will start to care less and less about the computer. As of August 23, it's been 462 days since they updated the MacBook Pro, and they've discontinued the Thunderbolt Display.

Meanwhile, Microsoft is open-sourcing things left and right. And now it's actually running Linux on Windows, something that 10 years ago would have sounded ridiculous. It's not perfect – the terminal window still sucks compared to iTerm or even – but now you can use a real OpenSSH instead of PuTTY. Maybe soon I can live the dream of using the same machine for development and gaming.

The funny thing is that IBM tried this with Windows 20 years ago. IBM had OS/2, which had a Windows compatibility layer so that OS/2 could run Windows 3.1 apps. This backfired for IBM – as a dev if I have to choose between writing an app for OS/2 or Windows, why not write for Windows since it will run on both? A comment on Hacker News claims that the Linux subsystem was originally developed for Windows phones to run Android apps. Maybe Microsoft chose not to go down that path for fear of the same fate as OS/2 Warp?

I strongly recommend that anyone who likes the Linux command line give it a try and see what works and what doesn't (I had to manually install the update first). There are a lot of shortcomings, but if Microsoft keeps supporting it, this could lead to more devs switching back to Windows.

Yahoo! Pipes 😭

I just deleted a draft I wrote about Yahoo! Pipes closing because Reflections on the Closure of Yahoo Pipes says it all.

looking back, it seems to me now that the whole mashup thing arose from the idea of the web as a creative medium, and one which the core developers (the coders) were keen to make accessible to a wider community. Folk wanted to share, and folk wanted other folk to build on their services in interoperation with other services. It was an optimistic time for the tinkerers among us.

The web companies produced APIs that did useful things, used simple, standard representations (RSS, and then Atom, as simple protocols for communicating lists of content items, for example, then, later, JSON as a friendlier, more lightweight alternative to scary XML, which also reduced the need for casual web tinkerers to try to make sense of XMLHttpRequests), and seemed happy enough to support interoperability.

The death of Pipes doesn't sting because it was a tool I needed to get through my day. It stings because it's a sign that the times have changed and API openness isn't valued like it was a decade ago. It's an "end of an era" kind of sting.

A world that used Pipes as often as Excel is a world I want to live in: a world where every company I do business with says "of course we have an API!" Imagine what Siri or Google Now or Amazon Echo would look like in that world. That's the world Pipes represented, and the world I'm mourning.

Passing configuration to Angular

This is something that we got wrong on our project initially, then had a guy named Mark come on board and nail a very clean solution.

What we were trying to accomplish: we wanted to give our Angular Single Page App some configuration data for initialization. Things like a CSRF token and an API URL, so not necessarily things we could load from a web service.

The wrong way to do it:

We started off using an ng-init on our element. If you RTFM on ng-init, the docs make it very clear that you should not use it for that purpose. In our defense, "init" is right there in the name, and the warning wasn't as bright red in earlier versions of the documentation.

A better way to do it:

What we are doing now is putting this in our server-side template:

angular.module('ns-config', [])
    .constant('config', {{config|js}});

and then inject the ns-config module into our project. By using Angular’s constant() instead of value() the config object is available in our application’s .config() block.
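For example, the constant can then be consumed during the config phase. This is a sketch of the consuming side only; the csrfToken key is a hypothetical config entry:

```javascript
// Sketch of consuming the ns-config module above. Because 'config' is
// registered with constant(), it is injectable in a .config() block;
// a value() would not be. The csrfToken key is a made-up example.
angular.module('app', ['ns-config'])
  .config(['config', '$httpProvider', function (config, $httpProvider) {
    $httpProvider.defaults.headers.common['X-CSRF-Token'] = config.csrfToken;
  }]);
```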

Sprint length… do you think it is ever a good idea to go to 1 week sprints?

A friend sent me the above question about sprint lengths over Twitter, but I think some other people might want to know my answer too. A question like that doesn't have a clear-cut answer; it's more of an open-ended question that can't be answered with a simple "yes" or "no." But the answer is no.


I don't think it's valuable to perform all of the Scrum ceremonies in a single week; the planning-to-production ratio is way off with one-week sprints. In an effort to seem like the kind of person you would ask this kind of question, I responded with a question of my own: "What are you trying to accomplish by shortening the sprint length?" And that revealed the real answer: "I am leaning towards 1 week sprints to help developers understand how long something really takes."

A developer that is over-optimistic with their estimates? Now I have heard everything!

Shortening the sprint isn’t the first tool I would use to accomplish that. Instead I would be focusing on velocity. I’m not a huge fan of velocity (or pure Scrum for that matter) but it definitely can provide some perspective in sprint and long-term planning. I don’t care if you’re using an average or Yesterday’s Weather or what, as long as you are keeping good track of how much you are getting done.
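For example, a "Yesterday's Weather" number is just an average over recently completed sprints. A quick sketch, with made-up numbers:

```javascript
// Toy "Yesterday's Weather": plan the next sprint around what the team
// actually finished in the last few sprints, counting only stories that
// met the definition of done. The point totals here are hypothetical.
const completedPoints = [18, 22, 17];
const yesterdaysWeather =
  completedPoints.reduce(function (sum, p) { return sum + p; }, 0) /
  completedPoints.length;

console.log(yesterdaysWeather); // 19
```

If the team keeps committing to 50 points against a number like that, the conversation in planning gets a lot easier to have.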

First I would make sure that we are not half-assing Planning Poker. The goal is to come up with a size that everyone can agree on, and step 1 is using real cards or an app. Get some reference stories. I might be projecting a bit (why else would I have a blog?), but if you commit to a size and don't have an easy out, it forces a much better conversation about size. If I'm using my fingers, it's pretty easy for me to fudge that 5 into a 2 when I see what other people are saying. Everyone has a smartphone; use it.

The next piece is to have a good definition of “done.” The traditional goal of Scrum is to have a potentially shippable product at the end of every sprint, which lends itself to a decent definition of “done” but your team knows what’s right for them. Once you have that definition in place you do not count a single point until a story has reached that state. No “This 8 pointer is 75% done, so we’ll count 6 points this sprint and re-label it as 2.” No points until it is done.

Once you have that in place you should have a pretty clear picture of velocity emerging. At that point you can have the sprint planning session that you were waiting for all this time:

“Well we’ve committed to 50 points but we’ve never finished more than 20 points. Why is it a realistic amount for us to commit to this sprint?”

There will probably be some excuses and you can either push to reduce the sprint or let the team commit to it and bring the velocity up in the retro as a reason why the team couldn’t complete the sprint with a sustainable pace. (Aside: how did Scrum ever figure that you should “sprint” at a “sustainable pace”?) I would lean towards pushing the team to reduce the sprint size because I suspect the team is aware they are not finishing sprints on time, and another missed sprint would demoralize folks more. You can always offer the carrot “let’s commit to 20 and pull in more stories if we have time.”

Like I said, I don't love velocity, but I think it's the right tool for solving this problem. It isn't about having a high score to beat; it's about having a yardstick for judging whether a sprint is realistic.

Angular’s $http.error()

Earlier I promised (ha! PROMISE!) to explain why I don't like Angular's $http.success() and .error(), but this guy beat me to the punch:

Don’t use $http’s .success()

First, I hate anything that creates inconsistency in my code. If we use success() we’ll still have to use then() in every other place that has a promise, which makes this syntax sugar more of a cognitive leak in my opinion than anything useful.

Second, it couples your code in a subtle way – you’re telling everyone you’re actually reaching to the network to get something.

These are big issues, but I think the article misses the biggest one: you lose most of the Power of Promises™ by using .success().

Promises are amazing. You can use them to make async code reasonable. You can use them to compose small async bits into a functioning whole. And $http.success() throws it all away. Take a look at this code:

app.controller('MainCtrl', MainCtrl);
function MainCtrl($scope, $http) {
  $scope.started = 'Started';
  $http.get('/api/date') // placeholder URL; the original is elided
    .success(function(resp) {
      $scope.time = resp.time;
      return $http.get('/api/ip'); // placeholder URL
    })
    .success(function(resp) {
      $scope.ip = resp.ip;
    })
    .finally(function() {
      $scope.finished = 'All done';
    });
}
See the issue? Here it is on Plunker – the IP address isn't getting filled in. Why? Because you can't chain HttpPromises together like you can real Promises. .success() returns the original request's promise, so the second .success() just receives the Date response again; the promise returned by the first callback is thrown away. If you were reading that code, would you expect that? Here's the same code using .then():

app.controller('MainCtrl', MainCtrl);

function MainCtrl($scope, $http) {
  $scope.started = 'Started';
  $http.get('/api/date') // placeholder URL; the original is elided
    .then(function(resp) {
      $scope.time = resp.data.time;
      return $http.get('/api/ip'); // placeholder URL
    })
    .then(function(resp) {
      $scope.ip = resp.data.ip;
    })
    .finally(function() {
      $scope.finished = 'All done';
    });
}

That actually works like you would want it to. It works how Promises are supposed to work: if your .then() returns a Promise it will pass the resolved value to the next .then(). I’m not even getting into error testing. And if you were to pass that to some other system they could add to the chain however they wanted and it would Just Work™.
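That chaining rule holds for any Promises/A+ implementation, not just $q. Here it is with plain ES6 Promises; the two stub functions stand in for the HTTP calls, and their payloads are made up:

```javascript
// If a .then() callback returns a promise, the next .then() in the chain
// waits for that promise and receives its resolved value.
function getTime() { return Promise.resolve({ time: '03:15:22 PM' }); }
function getIp()   { return Promise.resolve({ ip: '203.0.113.7' }); }

getTime()
  .then(function (resp) {
    console.log(resp.time);
    return getIp(); // the returned promise is what the next .then() sees
  })
  .then(function (resp) {
    console.log(resp.ip); // the IP response, not the time response again
  });
```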

Then (ha! THEN!) there's the issue of interoperability. Promises are in ES6, and anything that supports the Promises/A+ API should work together. That means Angular can create a Promise that Bluebird can wrap in a timeout. Want to split 1 $http call into 5 $http micro-service calls because that's the Hot New Architecture? If you were using .then() you could just wrap your calls in $q.all(), but $q.all() doesn't have a .success() method. You lose all that power if you're calling .success() and .error() all over the place.

So please, please, please stop using .success() and .error() in your Angular projects and stick to POPO: Plain Ol' Promise Objects.