1. 34

Alternatively, what embarrassingly simple bugs have you fixed recently?


  2. 21

    We were running into a bug in the field where certain very simple operations were taking 1,000,000x longer than they should. It was only happening at certain sites and we just could not replicate it locally.

    Turns out that, under certain workloads and after a certain amount of data set growth, the query planner in the embedded database engine we’re using would start picking exactly the wrong index for a certain commonly-run query, resulting in what amounted to a full table scan on a billion-row table dozens of times a second…

    The fix was a simple “FORCE INDEX” hint in the query.

    Another interesting one was in statistics reporting for traffic capture. The stats would report that we were processing hundreds of gigabits per second of traffic on a 10Gbps box. Turns out that the packet validation code was correctly handling corrupt IP headers but not until after the size was recorded for the purposes of stats gathering, so a malformed IP header could cause a calculation to go negative and overflow. The packets would later get dropped correctly, but it was a bit of a puzzler for a day or so because we were convinced that it was the stat calculation that was wrong (i.e. we trusted the packet data that the stat calculation was using).
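    The arithmetic failure is easy to sketch; here's a hypothetical Python version (the function names and the 32-bit counter width are invented for illustration, not taken from the actual capture engine):

```python
# Hypothetical sketch: a corrupt IP header claims a total length smaller
# than the header itself, so the payload size goes negative; in 32-bit
# unsigned arithmetic that wraps to a huge number, inflating the stats.
def payload_bytes(total_len: int, header_len: int) -> int:
    return (total_len - header_len) & 0xFFFFFFFF  # unsigned 32-bit math

print(payload_bytes(40, 20))  # sane header: 20
print(payload_bytes(20, 40))  # corrupt header: wraps to 4294967276

# The fix: validate the header *before* recording the size for stats.
def payload_bytes_checked(total_len: int, header_len: int) -> int:
    if total_len < header_len:
        raise ValueError("malformed IP header")
    return total_len - header_len
```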

    (What’s embarrassing is that I wrote both the stats calculation and the capture engine with the header validation, so I really should’ve found the bug sooner…)

    Now for a stupid one:

    Our traffic capture engine in our test rig was reporting 30% malformed packets during tests. I was pulling my hair out trying to find where things were getting parsed/validated incorrectly. Turns out someone had accidentally replaced the “clean” test data set with one of the data sets that had, you guessed it, 30% bad packets to test the validation code…

    1. 15

      under certain workloads and after a certain amount of data set growth, the query planner in the embedded database engine we’re using would start picking exactly the wrong index for a certain commonly-run query

      I had a fun one like this a while ago. We use base 36 IDs for a lot of stuff internally (so counting looks like 1,2,…,9,a,b,c,…,z,10,11, etc). Sometimes postgres would do full table scans for this very simple query using psycopg2 (a common python postgres driver)

      transaction.execute("SELECT * FROM users WHERE id=%s", [id36_to_int(the_id)])

      Here id is the primary key. You shouldn’t ever have to do a full table scan to do = queries on it. So what was happening?

      After a lot of debugging we saw that instead of the_id being a string like “d54q6”, it was sometimes a human-readable string like “lorddimwit”. It turned out that sometimes our callers thought they should be passing us a user name instead of a user ID. We only spotted this because of seemingly unrelated errors about not being able to convert strings containing - and _.

      id36_to_int(something_reasonable) returns a Python int which psycopg2 maps to the bigint (int64) type of the column, which Postgres can look up in the index. But id36_to_int(something_huge) can exceed the bigint range, and psycopg2 very helpfully sends it to Postgres as a numeric instead. Postgres can compare bigints to numerics, but it can’t use the index to do so. So it would scan the whole table, individually casting every id from bigint to numeric to do the comparison.

      We added an if too_big(the_id): raise "hell no" and the problem went away.
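      The guard amounts to a range check before the value ever reaches the driver; a hypothetical Python sketch (int(s, 36) stands in for id36_to_int, and the helper names are invented):

```python
INT64_MAX = 2**63 - 1  # largest value a Postgres bigint column can hold

def id36_to_int(s: str) -> int:
    return int(s, 36)

def to_lookup_key(the_id: str) -> int:
    n = id36_to_int(the_id)
    # Anything wider than bigint makes psycopg2 send a numeric, which
    # defeats the index -- reject it before it reaches the database.
    if n > INT64_MAX:
        raise ValueError(f"{the_id!r} is too big to be a user ID")
    return n

print(to_lookup_key("d54q6"))      # a plausible base-36 ID: fine
# to_lookup_key("x" * 20)          # a "user name": raises ValueError
```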

      1. 4

        Hah! We had something similar happen too: the character encoding for some column was set specifically and differently from the rest of the DB (why I don’t know; it was like that when I got here).

        So doing a join or a query involving that column would cause the DB to do a full table scan, convert that column to the other encoding, do the equality check, and then go to the next row…for millions of rows.

        1. 2

          I do not understand this behaviour: if the coercion is built-in and the column is int64, Postgres should cast the lookup key to int64 and return an empty result set if the conversion overflows. Seems like a bug in their code too.

      2. 12

        I was working on an operating system project for school. The teachers provided us with boilerplate code to get started, and we had to implement basic features such as writing to VGA memory, handling interrupts, or writing a round-robin task scheduler. They gave us a version of malloc named dlmalloc and written by Doug Lea, which allocated an initial chunk of memory on the heap like so:

        extern char mem_heap[];
        extern char mem_heap_end[];
        static char *curptr = mem_heap;

        The memory addresses corresponding to the start and end of the heap were set using a linker script. While the project was compiling and running fine on the school’s computers, I was getting page fault exceptions on my laptop. Using GDB, I figured out curptr was initialised to 0, causing malloc to write to the first memory page and raise an exception as it was protected. I then used objdump to check if the value of curptr was correctly set during compilation:

        Disassembly of section .data.rel:
        001156e0 <curptr>:
          1156e0:       ec
          1156e1:       e7 11

        It was! I had never seen a .data.rel ELF section before, but a quick Google search revealed it can appear in executables with position-independent code. The executables compiled on the school’s computers didn’t have this peculiarity. readelf then revealed that this section was located after the .data one:

        Section Headers:
          [Nr] Name
          [ 0]
          [ 1] .text
          [ 8] .data
          [ 9] .got.plt
          [10] .data.rel
          [11] .bss

        I dug deeper into the boilerplate code, and found the following snippet in the kernel entry point:

            /* Blank all uninitialized memory */
            movl    $_edata,%edi
            xorl    %eax,%eax
        0:  movl    %eax,(%edi)
            addl    $4,%edi
            cmpl    $mem_heap_end,%edi
            jb      0b

        _edata was the first address after the .data section, which means all memory between it and the end of the heap was zeroed. This included the .data.rel section and curptr!

        The solution to this issue was to disable the generation of position-independent code using GCC’s -fno-pic flag, which put curptr back in the .data section.

        1. 11

          Not that interesting to fix, but certainly interesting to find. We had an operation on one of our Ruby process worker servers that would get some data from one service, write it to a file, perform a few brief operations on it, and then send the data to another service. It always worked fine locally and in all development and testing environments, but would occasionally fail on being unable to find the file in Production. Seemed really strange - how in the world would a file get deleted with no explicit delete steps only a few lines of code after it was created?

          It turned out that we were using the Ruby Tempfile class, creating an instance, and then returning the path of that tempfile to be used by the next step. Turns out that the Tempfile Ruby class silently deletes its file on finalization, and instead of returning the whole object, we were just returning the path from it, and letting the actual object go out-of-scope when the method returned. Most of the time, that object would happen to stick around for a while, and no problem would be seen, but in the tighter memory conditions on Production, the GC would sometimes run between the steps, collect that object, and delete that file before we were done with it. D’oh. Lesson learned - pay close attention to the scope of your Tempfile objects, and don’t pass their paths around separately from the object itself.
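          For comparison, Python's tempfile module has the same trap; a minimal sketch (behavior shown is CPython's, where refcounting collects the object as soon as the function returns):

```python
import os
import tempfile

def make_scratch_file() -> str:
    f = tempfile.NamedTemporaryFile()  # file is deleted when f is finalized
    return f.name                      # bug: return the path, drop the object

path = make_scratch_file()
# By the time the caller looks, the wrapper object is gone -- and so is
# the file it owned.
print(os.path.exists(path))  # False on CPython
```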

          1. 5

            Had a similar issue! We learned that Tempfile.create[1] is what we wanted — it doesn’t automatically delete the file.

            [1] https://ruby-doc.org/stdlib-2.6.5/libdoc/tempfile/rdoc/Tempfile.html#method-c-create

            1. 2

              I tend to prefer:

              Tempfile.create do |file|
                # do stuff with file
              end

              That assures you the file will exist as long as you’re in the block, and that it’ll be disposed as soon as you leave it. Leaking tempfiles in case of errors has caused us problems too.

          2. 11

            Braintree doesn’t use globally-unique ids for their transaction records. 48 hours I’ll never get back.

            1. 1

              Isn’t that a payment system? This seems pretty worrying…

              1. 1

                Yes… and I have plenty more horror stories where that came from.

              2. 1

                Hi, I work at Braintree and we do and have always used globally unique ids for transactions. You may have run into a recent issue we had, though (https://status.braintreepayments.com/incidents/n1hf4hj89lks). We have since cleaned up the duplicate ids and put measures into place to help prevent issues like this in the future. If that timeline doesn’t line up with what you saw, we’d love to know more.

                1. 1

                  No, this was caused by switching a dev environment to a different sandbox account. Your API was giving us transaction IDs that were colliding with ids we received from the previous merchant. Thankfully, this never impacted production.

                  I sincerely doubt that your ids are globally unique because they’re far too short for that, and because of this blog post.

                  1. 1

                    That post is referring to the fact that a transaction id might be the same as a customer id, and that given only an id, we don’t know what type it is. Transaction ids are unique against all other transaction ids.

                    1. 1

                      Nevertheless, your sandbox did repeat transaction ids. I have two days’ wasted time to show for it.

              3. 11

                I helped discover a hardware bug, of all things! Not even in computer hardware. Though I wasn’t the one who fixed it, I confess.

                I was working on a device with a rotating sensor platform. In the course of doing some unrelated stuff I saw it magically stopped rotating quite suddenly, and nothing from the driver or control software gave any indication why. It has some fairly sophisticated software and motor control stuff behind it because it has to rotate at a very well-known rate, and if something gets wedged or tangled in the spinny bit it will detect that it’s stuck and stop in place instead of just trying to keep turning and breaking the sensor, so that seemed the obvious place it was going wrong. After some work a coworker and I managed to reproduce the mysterious issue: just leave the device running long enough and the spinny bit would eventually stop spinning. Try to start it again and generally it worked fine, at least for a few minutes. So we went through the interface software, then through the motor driver, then called the motor controller vendor and got a class on the firmware as well. No dice. As far as we could tell the software was fine.

                With some chagrin we gave the thing up to the tender mercies of the hardware people, and they took the thing apart and started going at it with multimeters and oscilloscopes and other such scary tools. They couldn’t find any problem either… until one of them touched part of the motor assembly inside the thing after it had stopped, and said “Dang, that is toasty.”

                Turns out a bearing in the spinny bit had not been manufactured to spec, and was rubbing against its housing. Not enough to really be felt while turning the thing by hand, unless you really were watching for it, but over a few hours it would keep rubbing and heat up and expand, and seize up. The control software correctly detected this and stopped the motor spinning… but also told the motor to hold the platform in place, so after trying to turn it by hand and not being able to you said “oh the motor is holding it”, went to the console to turn it off, and by the time you were done it had cooled down enough that the thing had un-seized.

                Robotics is fun! Makes it so clear that software engineering and any other kind of engineering are exactly the same process, with the same kind of issues and the same ways of troubleshooting. Reproduce the problem, isolate every possible thing that can go wrong, check them all systematically one by one… and sometimes, get a little lucky.

                1. 8

                  This is both interesting and embarrassing. At work we’re heavy users of Redis’ HyperLogLog type for analytics, and it runs on AWS ElastiCache. We haven’t been its biggest fans, though, because it used a lot of memory. One day our ElastiCache cluster ran out of memory. We didn’t have an alert, so we didn’t notice until customer data started looking funny: Redis was deleting keys to make room for new keys to be written.

                  So we spun up a new ElastiCache instance with more memory, which was also two major versions newer than the old one, which hadn’t been upgraded in two years. I thought, well, there are no breaking changes to Redis for our use case, so this should be fine. We pointed our batch job at the new instance, ran it, inspected the data, saw that it looked OK, and cut over. Everything was back to normal.

                  At this point I poked around in CloudWatch, noticed that memory usage was 5x lower than I expected, and became very worried. Was something wrong with the batch job? Did it not do a full rebuild of the data? Then I looked at the number of keys in the new instance, and it was slightly more than the old one. OK, what if the keys don’t have the correct values? I checked again in our product and the data looked correct; there were no anomalies. Finally the thought crossed my mind that upgrading Redis had caused some kind of memory usage improvement. I did some googling and found two big improvements to Redis:

                  The first was technically a bug in Redis where merging HLLs would use more memory than necessary. The second was a new algorithm within HLL itself, which is an amazing improvement. I don’t think the postgresql-hll extension has this yet. Since we upgraded we’ve been able to scale down our ElastiCache cluster to a much smaller size than previously used, saving a lot of $$$.

                  The main lessons learned:

                  • Have usage monitoring for all your production data stores. It’s embarrassing, but we didn’t have it for Redis even though all of our other databases had these checks in place.
                  • Keep up with changes in new versions of your datastores, even if they’re as boring as Redis. Someone could contribute something really useful.
                  1. 6

                    My compiler had the craziest bug for a while. If you passed an array to a function and then indexed it specifically using *(p + offset), the resulting binary would segfault. Every other use case worked fine - if you used it in the original function, if you used p[offset], it was just this specific form that broke. The craziest thing is they’re handled by the same function in the compiler!

                    It turned out that I had like 3 different errors that all cancelled each other out except in this exact case. I couldn’t believe it for a while, every time I fixed a bug it would break everything else. The way I figured it out was just writing down all the steps and convincing myself every step along the way was accurate, then going from there. That finally let me figure out what the original bug was - I was treating *(p+offset) the same as &p[offset], not the same as p[offset]!

                    I don’t want to say how many hours of my life this cost me but you can kind of see it from the bug report: https://github.com/jyn514/rcc/issues/123

                    1. 4

                      Another bug I fixed by writing things down: For one of my classes I had to write a MIPS program to calculate the square root of a number in fixed point. I had the sqrt algorithm itself working, but the number was stored in hexadecimal format and for the life of me I couldn’t figure out how to convert it to decimal. We’d done something similar previously, but it only worked for integers.

                      I had the idea to convert one part at a time: convert the integer part of the number to decimal, then convert the fractional part, then do the proper shifts and OR them together. This works well when the fraction is small, but it breaks down when the fraction gets too big: 0xfffff (the largest possible fraction in our format) is more than 5 decimal digits, so when I OR’ed it, it would overwrite parts of the integer.

                      The professor gave us a different algorithm, but I had no idea how it worked. It went like this:

                      • Multiply value by 10^5 (100000)
                      • Shift 13 bits to the right
                      • If bit 0 is set then add 2
                      • Shift one bit to the right
                      • Use the algorithm from the last lab

                      I tried implementing this and it gave me the wrong answer every time, I had no clue what was going wrong. I wrote up a very lengthy email that looked something like this:

                      I’ve tried multiplying by 10,000 both as fixed point (shifted left 14 bits) and without shifting, but neither works. Do you have any idea what I’m doing wrong? I know that it’s going wrong somewhere between here and bin2dec because $a0 is 0x0003a120 and it should be hex(5_00000) == 0x7a120. Actually now that I look at it it seems to only be missing the highest digit?

                      I never sent that email. The second I compared the actual value to the expected I saw the problem - the highest bits after the multiplication were getting cut off! I drew some diagrams and after that the lab was easy :)
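                      For reference, the professor's recipe amounts to the following Python sketch, assuming 14 fractional bits (in Python the product can't overflow; on MIPS the multiply fills both hi and lo, and dropping hi is exactly the bug described above):

```python
def fixed_to_5digits(value: int) -> int:
    """Scale a fixed-point number (14 fractional bits) by 10^5,
    rounding to nearest, so it can be printed as decimal digits."""
    x = value * 100000  # multiply by 10^5; needs more than 32 bits!
    x >>= 13            # shift 13 bits to the right
    if x & 1:           # if bit 0 is set, add 2...
        x += 2
    return x >> 1       # ...one last shift: divide by 2^14 in total

print(fixed_to_5digits(5 << 14))  # 5.0 in this format -> 500000
print(fixed_to_5digits(1 << 13))  # 0.5 -> 50000
```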

                      Here are the diagrams:

                      # keep the overflow!
                      # |     hi (t2)     |       |     lo (t1)      |
                      # |_________________|       |__________________|
                      #   18 - O  | 14 - I         4-I| 14-P | 13 - E
                      #            --------------------------          <-- this is what we've currently got in $v0
                      # We've kept I and P, but since we're shifting right 13 bits, we need to get those overflow bits out of hi
                      # |     hi (t2)     |       |     lo (t1)       |
                      # |_________________|       |___________________|
                      #  14-Z | 5-X |13 - O          18 - I  | 14 - P
                      #              -------------------------- <-- this is what we want to keep (1 bit of P)
                      # Z - zeros
                      # X - discarded
                      # O - overflow
                      # I - integer
                      # P - original precision
                      # E - extra precision
                    2. 5

                      Interesting only in a “groan at the technical/organizational debt way”:

                      A health test for something mysteriously started failing. After looking at it for an obnoxiously long time, I saw that my coworker had done a mass replace of all instances of System.out and Exception.printStackTrace to use a proper Java logger. Unfortunately, the health check was actually looking for the output of a process and parsing it. The relevant code was previously:

                      print “here comes data”
                      ...print some data...
                      print “all done now”

                      …the logger put [INFO] in front of each of those lines, and the parser didn’t expect it.

                      1. 5

                        I recently spent an hour root-causing and an hour face-palming after we “simplified” some code by replacing a struct with a single integer.

                        #include <iostream>
                        #include <vector>

                        class C {};

                        int main() {
                          std::vector<C> m_cvals{100};   // old: 100 default-constructed elements
                          std::vector<int> m_ivals{100}; // new: a single element with value 100
                          // Something non-trivial but captured by this.
                          std::cout << "m_cvals.capacity " << m_cvals.capacity()
                                    << "; m_ivals.capacity " << m_ivals.capacity() << std::endl;
                        }

                        $ ./inits
                        m_cvals.capacity 100; m_ivals.capacity 1

                        In retrospect, this is obvious: with braces, an initializer-list constructor is preferred whenever the arguments can form one, so std::vector<int>{100} is a one-element vector, while std::vector<C>{100} falls back to the size constructor. This is covered in C++ Guru of the Week #1. Still, when it hits, it takes ages to narrow down to a line.

                        1. 4

                          Ever had one of those bugs where the more you look at it, the more you have a cringy feeling you’re going to be saying rude words to yourself when you spot it?

                          I made a tiny change to booting a device…. device went deaf and ignored me….

                          Close inspection reveal my changes were only after we first said “hello” to the device… so how could I have broken it?

                          After much sorrow….. I remembered I had also made some changes to booting the device that talks to the device that had gone deaf….. Mentally I was thinking of them as independent entities.

                          I had assumed that wasn’t the problem because that device had booted successfully and seemed to be working…. even though actually it was totally mad.


                          Experience is that which teaches you when to cringe.

                          1. 3

                            I just spent the past hour debugging an “error: unreported exception FooException; must be caught or declared to be thrown” message coming from a test.

                            I thought the exception was being thrown at runtime and was trying to figure out what faulty logic was throwing it.

                            It’s a compile error.

                            1. 1

                              I did the same thing switching from Eclipse to Intellij IDEA :*(

                            2. 3

                              In the embarrassingly simple category:

                              Reset password (via email) links breaking because a non-URL-safe parameter in the link wasn’t URL-encoded.
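                              The fix for that one is a single call; a minimal Python sketch (the token value and URL are invented):

```python
from urllib.parse import quote

token = "abc+def/123=="  # e.g. a base64 value: '+', '/', '=' are not URL-safe
# Unencoded, '+' decodes as a space and the reset link breaks.
reset_link = "https://example.com/reset?token=" + quote(token, safe="")
print(reset_link)  # ...?token=abc%2Bdef%2F123%3D%3D
```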

                              A more interesting one:

                              HAProxy returning a “no servers available” response with a very specific combination of chained PROXY calls (i.e. from a load balancer to an app server, both running HAProxy), also with chained PROXY calls on each box (i.e. doing some routing ‘internally’ in HAProxy between proxy components), and setting the “src IP” for the request to an IPv6 address from a header field (i.e. the XFF field from a trusted CDN server).

                              If the XFF-specified address was IPv4, it worked fine. If we didn’t set the “src IP”, it worked fine. The “fix” (for now at least) was to be explicit about a port number where the config had been implicit previously. Some chats with the HAProxy project revealed that our config has a non-ideal setup in terms of the internal routing (not completely fixed yet), but it’s still not clear why the issue presented the way it did.

                              1. 3

                                Can you keep a secret? … So can I ;-)

                                1. 1

                                  Zerodium would’ve paid a million for you to not debug that. I admire your willingness to turn down money. ;)

                                  Note to security folks: iOS payout dropped down to $1 mil with Android going up to $2.5 mil. Wonder what that means.

                                2. 2

                                  I am working on a Lua interpreter. When doing the lexing, I generate a list of symbols + keywords + identifiers. Standard stuff.

                                  I get decently far into the parser, and for the life of me I can’t get the parsing of even simple fragments to work. I mess with it for hours on end.

                                  Turns out that in the parsing stage I wrote a helper (kwd) and it expected keywords. But I ended up using it for symbols as well. I wrote kwd("for") but also kwd("+"), all over the parser. Like dozens of instances of this mistake. My lexer was giving me a bunch of symbols, but the parser was expecting keywords only.

                                  So I just wrote a thing at the entrypoint to turn all my symbols into keywords. I’ll clean it up later. Maybe.
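                                  That entrypoint shim might look like this (a hypothetical Python sketch; the interpreter's real token representation isn't shown in the post):

```python
# Hypothetical (kind, text) token pairs from the lexer.
def symbols_to_keywords(tokens):
    """Relabel symbol tokens as keywords, so a parser that calls
    kwd("+") as well as kwd("for") matches both."""
    return [("keyword", text) if kind == "symbol" else (kind, text)
            for kind, text in tokens]

lexed = [("keyword", "for"), ("name", "i"), ("symbol", "="), ("number", "1")]
print(symbols_to_keywords(lexed))
```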

                                  1. 1

                                    In what language are you writing it? :) are you planning to go with the classic Lua bytecode format or not? I have a soft spot in my heart for Lua, so pretty interested :)

                                  2. 2

                                    Embarrassing: I had a list of discount values (positive floats), I subtracted them all from each other (why? why did I do this?), and then negated the result, creating a positive. This is for a tax reporting XML document that should show a negative value for discounted prices ($5 off, for example, needs to be -5.0).

                                    So I summed all the values instead and then cast the sum as a negative float.
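                                      The before-and-after, as a Python sketch with invented values:

```python
discounts = [5.0, 2.5, 1.0]  # positive discount amounts

# Wrong: subtracting the discounts from each other gives an
# order-dependent value with no meaning.
wrong = discounts[0] - discounts[1] - discounts[2]

# Right: total the discounts, then negate the sum for the XML report.
reported = -sum(discounts)
print(reported)  # -8.5
```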