My Favorite Production Software Bug

When I first graduated from college I worked for a small company doing custom development work in .NET 1.1.

Our largest client (coincidentally where our offices were) had a print shop and a web site for financial agents to set up and send mailings to folks inviting them to a dinner and telling them about the latest & greatest annuities that they should invest all their money in.

The system was pretty interesting.  With a batch job they'd print out letters, a bio card that showed the agent's photo on it, and other inserts, such as tickets to the dinner.  These would be collated, folded, and stuffed into an envelope that would be licked, sealed, and affixed with a real stamp. (People are 20 times more likely to open a letter if it has a real stamp -- and yes I just made that number up).  It was very impressive to watch it all work.

The website we built allowed the agents to place these orders (with optional inserts) and mail them to a set of folks matching a given demographic all online.

Oftentimes the agents would purchase an upgrade to have a reminder card sent to each person a week before the event.  These cards were special, and even though we had a room full of expensive printers, we couldn't print them ourselves.  So we had to outsource the job to another print shop across town.

The process went something like this:

  • We'd compile all the info, along with a TIF of the agent's photo and FTP it over to the other company
  • They'd print them all and drive them to the post office for mailing
  • They would charge us money

All of this just worked, and I never had to see the internals of this system.  That is, until my boss went on vacation to Mexico (at the time it was just me and him).

You see, an agent had sent a card to himself and a couple of his friends.  He never received them.  Since he had paid for the upgrade he was understandably upset.  They asked me to look into it.

I was slightly familiar with the tables, and so I went looking.  There was a table along the lines of ResponseCardQueue.  It contained columns such as agent_id, recipient, address, city, state, zip, and date_sent.

There were tens of thousands of these records.  I issued this query:

SELECT * FROM ResponseCardQueue WHERE date_sent IS NULL

It returned about 2100 records.  For some reason these weren't being processed.

I finally found the code that read this table.  It looked something like this:

public void ProcessCards(Card[] cards)
{
  try
  {
    foreach (Card c in cards)
    {
      // the filename is derived from the agent id stored in the queue table
      string tifFilename = @"\\SOME\NETWORK\PATH\" + c.AgentId + ".TIF";
      //copy details + tif image to some folder
    }
    //zip up folder
    //FTP the file to the other print shop
    //mark date_sent to DateTime.Now
  }
  catch
  {
    // empty: any exception is silently swallowed
  }
}

There are two things to notice.  The first is that the filename is built from a column in the database.  The second is the empty catch block, which lets errors go unnoticed.
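To make the empty catch block concrete, here is a minimal sketch of the same kind of batch loop with failures logged per card instead of swallowed. It is written in Python purely for illustration, not in the original C#, and every name in it (`process_cards`, `copy_card_assets`, `fake_copy`, the card dictionaries, the set of missing photos) is a hypothetical stand-in rather than anything from the real system.

```python
import logging

logger = logging.getLogger("reminder_cards")


def process_cards(cards, copy_card_assets):
    """Copy assets for each card and return (processed, failed) lists.

    `copy_card_assets` is a hypothetical callable that raises OSError
    when a photo is missing. A failure is logged and recorded instead
    of being swallowed, and the remaining cards are still processed.
    """
    processed, failed = [], []
    for card in cards:
        try:
            copy_card_assets(card)
            processed.append(card)
        except OSError as exc:
            # Unlike an empty catch block, this leaves a visible trace.
            logger.error("card for agent %s failed: %s", card["agent_id"], exc)
            failed.append(card)
    return processed, failed


# Made-up data: pretend this agent has no photo on the share.
MISSING_PHOTOS = {"5555"}


def fake_copy(card):
    """Stand-in for the real copy step; raises when the photo is missing."""
    if card["agent_id"] in MISSING_PHOTOS:
        raise FileNotFoundError(card["agent_id"] + ".TIF")


if __name__ == "__main__":
    cards = [{"agent_id": "4001"}, {"agent_id": "5555"}, {"agent_id": "4002"}]
    done, bad = process_cards(cards, fake_copy)
    print(len(done), "processed,", len(bad), "failed")
```

Whether a bad card should be skipped or should abort the whole batch is a design choice. Either way, the point is that the error ends up somewhere a person will see it.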

In this system an agent id was an identity column in another table, so the numbers were incrementing by 1 with each new account.  After much searching, I realized that the column type for the agent id in this table was defined as a char(4).  So as soon as we had our 10000th record in the system, it started looking for filenames that didn't exist on the network share.

It would be something like this:

Agent id 10200 would get truncated to 1020, which didn't exist in our system (most of the numbers started in the 4000s).  So the file didn't exist either (and it's probably better that it errored out here rather than choosing the wrong picture for the card!).  The code threw an exception, which the empty catch block swallowed, and it stopped processing the remaining records.
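The failure is easy to reproduce in a few lines. This is a rough sketch in Python, not the original code: `to_char4` imitates a column that keeps only the first four characters of the id (an assumption about how the truncation happened), and the set of photo files on the share is made up for the example.

```python
def to_char4(agent_id):
    """Imitate storing an integer id in a 4-character column (assumed to truncate)."""
    return str(agent_id)[:4]


def tif_filename(stored_id):
    """Build the photo filename from the stored id, as the batch job did."""
    return stored_id + ".TIF"


if __name__ == "__main__":
    # Hypothetical photos present on the network share.
    photos_on_share = {"4000.TIF", "4001.TIF", "9999.TIF", "10200.TIF"}
    for agent_id in (4001, 9999, 10200):
        name = tif_filename(to_char4(agent_id))
        # Print the id, the filename the job would look for, and whether it exists.
        print(agent_id, name, name in photos_on_share)
```

Ids up to 9999 survive the round trip unchanged, which is why nothing went wrong for so long. From 10000 onward the stored id loses its last digit, so 10200 is looked up as 1020.TIF, a file that is not on the share.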

And so the unsent records piled up.  For 4 months.

So I diligently made the column type int and updated the records that were below that threshold to correct their agent id numbers.  So guess what happened?  I fixed the clog and with one big TWOOOSH all of the records were processed.

I felt mighty proud.

 

Until.........

 

A few hours later I realized that the cards would actually now be mailed!  How embarrassing it would be to remind someone of an event that took place 3 months ago?

Once I was able to explain all of this, someone jumped in their car and got to the post office just in time to grab the entire batch before it was mailed.

We were still charged for the printing & postage of those cards, but we saved ourselves the embarrassment of explaining to all of our customers that we screwed up big time.

I learned a valuable lesson that a simple oversight can cost a company a ton of money (and in this case... reputation).

So what's your favorite production software bug?

#1 Andy Dyer
8.21.2009
9:44 AM

I can't help but be reminded of this quote from Office Space:

"Ok! Ok! I must have, I must have put a decimal point in the wrong place or something. Sh*t. I always do that. I always mess up some mundane detail."

Using float instead of decimal and several other similar oversights. Yep, I've been there before.


#2 Brandon Ryan
8.21.2009
10:24 AM

I hope that wasn't the actual code... with "\\SOME\NETWORK\PATH\" in it, and the slashes not being escaped... :)


#3 benscheirman
8.21.2009
10:26 AM

No, that was me typing without thinking :)


#4 Mohammed Nour
8.22.2009
6:05 AM

I had a similar production bug, and like yours it lay dormant until the table reached a certain record number. I called it "Resident Evil" :) I felt proud when it got solved: innovativeperspective.wordpress.com/.../resident-evil


#5 Moran
8.24.2009
10:24 AM

Great post man. Do you think that companies are doing enough to prevent bugs? I think that many software development companies release software with bugs, because they think the public will accept this.

I wrote about it here: http://blog.typemock.com/2009/08/why-do-software-development-companies.html


#6 Trans
8.24.2009
11:33 AM

Long ago I worked on a huge DOS-based compiled Basic accounting system. I had been put in charge of adding color to the system (ooh aah ANSI codes). I completed the work and the product shipped with much fanfare (on like 14 3.5" floppies), only to learn a few days later that half our users' installations were bombing! It took some time to track down the issue. Somehow, without my knowing it, I had truncated a single line of code. But it was cut off at just such a point that it didn't raise an immediate error. Rather, it caused the system to crash if and only if the program's database was on a non-C: drive. The odds of that exact accidental mishap had to be astronomical. Lucky for me QA got most of the blame for not catching it before the product shipped.


#7 JBland
8.24.2009
3:03 PM

I worked at a small educational software company (3 developers), and was responsible for developing an online testing app. We got frantic calls one afternoon about students not being able to take their tests. This was back in the bad old days of classic ASP, and there was little in the way of instrumentation.

It took us a few hours of Response.Writes and peeking at SQL Query Analyzer to figure out the problem: the system was about a year old, and with an increase in customers, it took that long to overflow the int16 PK column of the table holding student responses to tests.


#8 anon
8.27.2009
2:01 PM

I freakin hated Postmark