1. 36

Share your mysterious bugs. Was the cause surprising, or was it just you? Was it rewarding to track down? How long did it take? Did it change your view of what you were working with?

  1. 47

    A good few years ago, I was working for a company that made a proxy with deep protocol analysis - as in, it looked inside the protocols, analysed them, and so on. Pretty exciting stuff. I was L3 customer support; we got the bugs that baffled L1 & L2 and required diving deep into the code. One such bug was from a customer who had trouble SSHing into some of their machines when using our software. It only happened when targeting certain machines, from certain clients, and wasn’t 100% reproducible. Below is my recollection of the bug hunt, years later, so some details might be slightly off, but I believe it is mostly accurate nevertheless.

    First, we narrowed down that the target system was always SunOS/sparc, and the client usually Linux. So we set up a test environment and tried to reproduce it locally, which initially failed. After a bit of back and forth, we figured out that they were using a SunOS older than what we had in our test system, so we switched to the same version, and lo and behold, bug reproduced! So whatever the bug was, it was likely fixed or worked around in later versions of the OS, which strongly suggested the bug wasn’t in our software. Nevertheless, the same SSH client could always connect directly, so there was something we were doing differently, and that needed to be fixed, or worked around.

    The symptom was that when connecting through our proxy, sometimes we got immediately disconnected. Looking at the proxy logs, it looked like the server side just dropped the connection. So I went looking server side, and it turns out, it dropped the connection, because the sshd crashed. Oopsie. Not the main sshd, mind you, but the process it forked off to handle the new connection.

    Now, why does it crash, and how can we prevent it from happening? After a lot of hair pulling, and diffing the exchange between a good connection and a failed one, we narrowed it down to key exchange. In particular, to which key exchange is agreed upon during handshake. There was one which made the server crash. Nice, we can work with that! We patched our proxy to blacklist said KX, and as far as the customer was concerned, the problem was fixed.
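
    The workaround boiled down to filtering the offered algorithm list before forwarding the handshake. A minimal sketch in Python (the blacklisted algorithm name below is hypothetical; the story doesn’t name the actual culprit):

```python
# Proxy-side workaround sketch: strip the crash-triggering KEX algorithm
# from the client's offer before forwarding the handshake to the server.
BLACKLISTED_KEX = {"some-crashy-kex-algorithm"}  # hypothetical name

def filter_kex_offer(client_offer):
    kept = [kex for kex in client_offer if kex not in BLACKLISTED_KEX]
    if not kept:
        raise ValueError("no acceptable key-exchange algorithms left")
    return kept
```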

    I wanted to know more, though, so I tried to dig deeper: I compiled the same version of SSH, with debug symbols and whatnot - but I was unable to replicate the problem with that. Weird. No matter, I can do blackboxing! I enabled debug logging in the system sshd, and followed the source going by the logs, which led me to a block of hand-crafted assembly code. Okay, that’s scary territory, I don’t speak sparc assembly, but this is a good opportunity to learn some! There wasn’t anything particularly bad in that code, and newer SSH versions did not have a change there, either. So it wasn’t the asm. What else could it be?

    The compiler, of course. There was a bug in the compiler that miscompiled that particular asm code. If using a different optimization level, or a different version of the compiler, the generated code was correct. But one particular version of the compiler did miscompile it, and if the agreed KX was this particular one, with a hand-coded asm part, we had a high, but not 100% chance of corrupting memory and crashing. Luckily, the miscompilation was pretty mild, as in, the generated binary differed in about 5 bytes, so we could even provide a binary patch for the customer. They opted not to use it, though, and just go with the workaround of blacklisting said KX.

    Why didn’t it happen when connecting directly, though? Because the KX priority was different. You see, with our proxy, you can override the priority, and we shipped our own defaults, which at the time were more modern than those of the system ssh clients. When we asked the client to connect directly with the problematic KX, the problem became reproducible without the proxy too.

    This took me quite some time to hunt down, find a workaround, and then dig deeper for the root cause. But it’s one of my favourite bug hunts of all time.

    1. 24

      I once worked on an automated “getout” trading system. When a human would buy stock at the exchange, the program would time the sell to get a good price.

      We were not allowed to have a computer on the exchange’s network, so one of the software engineers came up with a plan: connect a Y-cable to the printer port. Back then we had to keep hardcopy records of every trade, so this let us automatically know when a trade had occurred.

      I had written a program to monitor the connection; if we didn’t see a trade in 30 seconds, an alarm would sound. One day, after the system had been running for weeks, the alarm started going nuts.

      I called the head trader, who told me everything looked fine. I asked if everything was plugged in. Yes. I asked if they were still trading. Yes. Try a different cable. Nothing. We went through everything that we could think of. We had stopped getting data, and we had no idea why.

      Hours later I got a phone call. Are you getting data now? Yes, what did you do? The printer was out of paper, so I added paper? You… what? Turns out if there’s no paper, the printer tells the computer to stop sending data. Since we were sipping from the same straw as the printer, so to speak, when the printer stopped drinking, so did we.

      To this day, when someone describes a system designed Rube Goldberg style, I shrug my shoulders. Little surprises me any more. If you told me that all of Google’s servers are powered by genetically enhanced hamsters, I might believe you. The world is full of crazy designs, and many of them even work.

      1. 11

        Not my story, but one I wanted to share in response to your “system designed Rube Goldberg style” comment.

        A sysadmin was hired at a small data center where the previous sysadmin had left rather unceremoniously. As a result there was barely any documentation or sense to the configuration of the racks. The only information that had been inherited was the set of passwords to all of the servers. There was random equipment just left on the ground and everything was messy. The new guy decides to go through and redo the whole thing, writing meticulous documentation while he physically cleans up the data center rack by rack. The owners of the data center had fixed operating hours, so he was free to mess around after the work day.

        At one rack, the sysadmin stumbles upon an open laptop sitting on top of a pile of networking manuals. It was plugged into a nearby outlet and seemed to have been forgotten for months. He scoffs at the previous guy for being so stupid as to leave a logged in laptop in the middle of a data center (and to forget about it!), cleans up all of the manuals and brings everything back to his office.

        The next day, the sysadmin gets woken up by an emergency call from management. No client payments were going through their POS system! The sysadmin rushes to work and narrows down the issue to one server in one rack which appeared to have frozen overnight. SSH didn’t work and neither did any other port, nor did a serial connection to console. Left with no other choice, the sysadmin decides to physically restart the machine which brought it back to a working state. Given that this problem had never happened before, the sysadmin chalks it up to a bizarre bug and continues on with the work day.

        The next day, the same problem occurs! The server had frozen overnight. Management is getting angry and demanding answers he doesn’t have. He quickly restarts the server and elects to spend the rest of the day tearing its software apart to determine why it’s freezing so suddenly.

        Problem is, the software was written by a contractor, in an esoteric language (think FORTRAN but weirder), on a proprietary operating system that was no longer supported by the vendor, who had gone out of business. Now, the obvious choice would be to upgrade to a better setup, but given that this server held up the POS system for the entire company, it would take months or even years to build a replacement that met all the requirements. Our sysadmin had until tomorrow. You’d think the fix would be simple: just set up a cronjob that restarts the server automatically. However, looking at logs, our sysadmin sees that the server crashes at random points through the night. The crashes aren’t consistent! If the server freezes before the cronjob is set to run, then of course it won’t restart itself.

        Our hero spends the entire night trying to decode this strange system. Hours pass and the server freezes again. He goes back to its rack to restart it. He’d been here all night and still had no idea how to fix the problem. If the server froze during operating hours again, he’d lose his job!

        The sysadmin decides to take another look at everything the previous guy had left. Surely he had to have run into the same freezing issue and figured out how to fix it…? More time passes until there’s only a couple of minutes before opening time. One thing he hadn’t looked at was the strange laptop on top of the pile of manuals, so he boots it up, logs in, and is presented with a desktop with a single generic-name executable file.

        While he’s busy exploring the file system, the CD-ROM drive opens up. And then closes. Confusion sets in. He launches the executable on the desktop. The CD-ROM drive opens, and closes again. But he knows the first time, the executable launched on its own. Is this malware?

        He opens up Task Scheduler. Sure enough, there’s a recurring task set at just a few minutes before the business opens, and it launches the executable that toggles the disk drive. What the hell is this?

        The sysadmin refers to his documentation and goes back to the rack where he found the computer. He sets it up just like the previous guy had, networking manuals and all. What was this laptop supposed to do with its disk tray every morning just before opening?

        Hit the power button to the POS server, of course.

      2. 23

        Many years ago when I worked for (what was then) RIM I was working on a team that was creating a new hardware device (spoiler alert: it was never released). I was the system software person and one of the hardware guys ran into what seemed to be a compiler bug with memcpy.

        We take a look and sure enough, copying this chunk of memory from place A to B resulted in the copy not being the same as the original. A few more tests were run and it certainly seemed to be miscompiled. We look at the assembler output, hoping to be right and… it’s fine! Everything makes sense. But we can see it being wrong. So what’s up?

        After a bit of sheer confusion it’s time to break out the hardware examination kit. Upon close examination with a magnifying glass, we find the culprit: the RAM connection was wired wrong, so bytes are being swapped when they are moved from one place to another.

        Unsurprisingly, that sample run of devices was deemed to be no good.

        1. 19

          A long time ago I worked on an iOS application for a large airline that travelers could use to download and use mobile boarding passes (it was novel at the time!). We received a bug report that whenever the user sat in a certain part of a particular Swiss airport, all of their downloaded boarding passes would disappear. They flew the same route every week and sat in the same place, and each week the same thing would happen. The app worked fine during all of their other travels.

          We didn’t really believe it initially but after a second user reported exactly the same thing the bug became too interesting to pass up. The initial debugging wasn’t interesting: logging was added, more edge cases were unit tested. The bug would not reproduce in our office. We only eventually worked out what the issue was by flying someone to the airport with a laptop and reproducing the bug ourselves.

          There was a misconfigured WiFi access point in one corner of the terminal that would respond with a normal redirect to a captive portal if you requested HTTP, but would respond with an empty JSON array if you requested JSON. Our app would try to sync its list of boarding passes, get an empty response, and dutifully delete all of the user’s boarding passes. This was before TLS was widely in use for anything other than login pages (a dark, dark time).

          We added TLS to the network connections and improved the sync protocol so that the server would have to actively tell the device to delete boarding passes rather than serving up a desired state that should be blindly converged upon. Last I heard this was all still being used to this day!
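
          The hardened protocol can be sketched like this (hypothetical shape, not the app’s actual wire format): deletions must be explicitly commanded, so an empty or malformed response becomes a no-op instead of a desired state to converge on.

```python
# Explicit-operation sync sketch: the server must actively command a
# deletion; an empty reply (e.g. from a captive portal) changes nothing.
def apply_sync(passes, response):
    ops = response.get("ops", []) if isinstance(response, dict) else []
    for op in ops:
        if op["action"] == "add":
            passes[op["id"]] = op["data"]
        elif op["action"] == "delete":
            passes.pop(op["id"], None)
    return passes
```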

          1. 1

            This leads me to wonder what the hell kind of bug the captive portal people were trying to fix, where requesting JSON had to return an empty JSON response instead of a redirect!

          2. 17

            The fuses at our office kept blowing. We checked everything, turned everything off … they kept blowing. We checked the neighbouring office and theirs did the same thing. We went downstairs to the veterinary office and checked everything as well. We were mystified. Then it turned out that one of the dogs at the vet had been slobbering all over an extension cord in a corner somewhere.

            Okay, not really a bug I suppose, but still surprising and somewhat amusing :-)

            1. 5

              That Doggo was lucky….

              My house had a door that would swing close by itself… Doggo entered room, couldn’t get out.

              Doggo got desperate and made a wee….

              ..on a power board.

              We heard a YELP, the lights went out, silence.

              Look all over house, no doggo, no problem, reset the trip switch…


              From under the cupboard.

              Doggo having had his precious bits zapped, had hidden under the cupboard. (A very very desperate move, it was very very low).

              When we reset the trip switch, power board full of dog wee began sizzling and boiling and making a horrid stink.

              Doggo decided power board from hell had come alive again and was after his sensitive bits, and had assumed the End of Times had come!

            2. 11

              So I have a few books’ worth of these kinds of stories (and taught courses on the subject of ‘systemic debugging’ many years ago), and worked on and off as a debugging firefighter when things were costly, bugs were insane, the devs involved did not know what to do, and factories were literally at a standstill.

              Many of those are still in the ‘lawyers, or people with access to lawyers, would be angry if I told them’ category, but some of the distilled lessons can be found on www.systemicsoftwaredebugging.com - it’s closing in on a decade old by now, though some parts are still reasonably relevant.

              The story to be told is in the low-to-mid range here, but can be told as the guilty party was me (young me wrote much of the software involved). The software in question was a disk cloning tool, part of a grander thing. This was at least 17 years ago.

              To set the scene: In order to manage experiments in the security end of a distributed laboratory (think botnet research for critical infrastructure) we had a suite of tools to manage experiment life-cycles: booting, cloning, collection, post-mortem analysis and so on. Hardware relays that flipped the electron supply, and nodes that, when booted, got a custom built PXE image chosen based on VLAN-tag and MAC address with an OS that fulfilled some specific task.

              The imaging system in this scenario was dumb and smart in the sense that since it was a ‘lab’, you could afford to prepare “the sample”. When an image was built, we started by 0ing out the disk, installed “whatever”, and then the thing tar:ed and compressed pretty well without having to be too clever about file systems - which is where Norton Ghost and similar products of the time wasted their effort. The tactic worked fine for a variety of Windows versions, linuxes, proprietary SCADA stuff (heck, even OS/2 Warp) and so on.

              The OS maintenance kernel sent over TFTP/PXE was a FreeBSD 3.x:ish (and into the 4.x transition, which matters here). When applying a disk image, the code was rather simple, though it had to be optimised for size and environment; it was statically linked and used curl. The basic task boiled down to “open disk device, stream image from server, DEFLATE in memory and write to device”. There were few reliable channels to communicate here; the system didn’t have distributed tracing etc., you were running blind. Granted, we’d had data corruption in the past from how fragile PXE boot, custom boot chain-loaders and TFTP are in a pseudo-hostile network - but this wasn’t any of the sort, and when that happened stuff typically didn’t get past the boot loader.

              The code had been tested and running for months, my home lab used the same setup (though Linux+NFS instead because I was even more of an idiot at the time) and it wasn’t that ugly to begin with. Seemingly out of the blue, we started to get sporadic failures in the lab when before, we had none. Not real serious “I couldn’t extract the image” kind of failures, but rather booting to a partially corrupted OS kind of a failure. This did not reliably reproduce.

              Some things made it worse: cloning 5-10 machines and booting them at once seemed to aggravate it, but not always. I physically removed the hard drives from the broken machines, extracted the images, and sat there staring at tens of gigabytes of hex dumps looking for anomalies. This was COTS (commercial, off-the-shelf) hardware - but intermittent hardware failure didn’t explain this. A rough estimate would land at 2-5% failure under heavy network load, 1% otherwise.

              Now - for time/wear/cost saving reasons, disks weren’t always 0:ed out when written to; only when a base image was created, or when an operator had explicitly requested “give me a full Gutmann 32-pass wipe of everything, whatever I did”. This actually masked some of it.

              To make a long story slightly shorter: there is such a thing as a natural sector size for writes to a raw disk device - 512b at times, but if you were dealing with SCSI or newer ATA it could be 1k, 2k, 4k, … In the kernels this code had been tested on, if you wrote a short block (i.e. < natural sector size), the OS would first fetch the sector, apply the partial write, and write the full sector back.

              FreeBSD stopped doing that somewhere between 3.x and 4.x (or 4.small_x and 4.larger_x). I’d like to blame PHK (both in general and for this) but I am slightly dim about the attribution here - the behaviour had changed, though, and that uncovered this assumption. A non-sector-aligned write silently failed (shortly thereafter it failed with an error, but this wasn’t an env where write() = -1 failures propagated or even had anywhere to be logged; printf() was not a thing!), whatever initially occupied the block remained there, and the intended write was lost.

              So why did this happen at all? It ties into ‘buffer bloat’, at a time before that was a word. In the normal lab env, things >naturally< converged on buffers of % some size (which with curl at the time matched the sector size), so the issue was masked further. The streamed output across the network got binned by the TCP stack, binned by libcurl, and binned by libz, and very likely the output buffer sizes on each callback (in a callback in a callback) ended up % 512 == 0. When the network was congested and you got smaller partial reads, however, they turned out shorter than that, and corruption ensued. If the disk hadn’t been wiped, the OS often recovered, but not always.
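
              One defensive fix for this class of bug is to never hand the raw device a short write in the first place. A sketch in Python (not the original C; padding the final partial sector with zeros is an assumption):

```python
SECTOR = 512  # natural sector size of the target device

class AlignedWriter:
    """Buffer an incoming byte stream so the raw device only ever sees
    whole-sector writes, no matter how the network, decompression, and
    callback layers happen to chunk the data."""
    def __init__(self, dev):
        self.dev = dev
        self.buf = b""

    def write(self, chunk):
        self.buf += chunk
        full = len(self.buf) // SECTOR * SECTOR
        if full:
            self.dev.write(self.buf[:full])  # whole sectors only
            self.buf = self.buf[full:]

    def flush(self):
        if self.buf:
            # Pad the last partial sector rather than issue a short write.
            self.dev.write(self.buf.ljust(SECTOR, b"\x00"))
            self.buf = b""
```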

              This is why I drink. Memory is not safe. Wear a helmet.

              1. 10

                The “Classic” Mac OS (pre-OS X) was a hotbed for super-mysterious bugs. The entire OS had no kernel, no memory protection … a stray write could corrupt not only your own process’s state but also another app’s, or system state, or even interrupt vectors. Or for extra credit, it could corrupt filesystem or driver buffers and persist the problem across reboots!

                Apple’s “develop” magazine used to have a column, “KON and BAL’s Puzzle Page”, where Konstantin Othmer and Bruce Leak discussed some horrible bug and eventually figured it out. There was a score shown in the margins, so the sooner you figured out the bug while reading along, the higher a score you got. Only, I could never figure out the problem before the end. I have no idea how anyone could.

                I know there’s some nostalgia for old systems. But when I think about this stuff, I have no desire to ever go back there again. It’s like people who feel romantic about the Middle Ages but don’t think about plagues, lack of sanitation, famine, warfare…

                1. 4

                  I wrote a bug once where I overflowed a stack-based string in a program running on Windows 98. Win9x did have some level of memory protection, but not much by modern standards. When I compiled my code through an optimizing compiler, the layout of the stack meant that the overrun landed on unused variables and everything worked. When using a debug build, it overran the function return address, causing execution to jump to some random place; the machine would trap and immediately reboot.

                  Debugging was painful, just because halfway through stepping through the code in the debugger the entire machine rebooted, long after the actual corruption occurred.

                  Like you, I don’t miss those systems.

                  1. 2

                    I’m not fully clear on what the story was with memory management at the time of OS 9, since I feel like I’ve read conflicting accounts. I know memory protection never happened, but I’ve read that at least per-process memory isolation was available starting from OS 7, similar to the Windows 95/98 situation at the time – so was the ability to just nuke any part of physical memory by abusing pointers a thing that was limited to OS 6 and before?

                    1. 4

                      No, there was no memory protection in the classic OS up through 9.x. It was all one shared address space and anything could write anywhere. Well, except for executable code being memory-mapped read-only on PowerPC.

                      You could dereference null pointers with impunity, which was a real PITA for debugging. As a workaround, there was a system extension you could install called Mr. Bus Error, later EvenBetterBusError, that would write an invalid address to locations 0-3, causing double-derefs (Handle derefs) to crash. It also polled those addresses and dropped you into MacsBug with a warning if their contents changed, indicating something had written to null in the last few milliseconds.

                      1. 1

                        Memory protection requires cooperation from the hardware (e.g. the processor) in order to work. It can’t be done entirely in software, because there’s no concept of privileged or unprivileged instructions until you’re on a CPU that supports it. However, it obviously has to be configured by the OS before it begins working. Given that virtual memory, segmentation and paging had all come out by 1985, when the i386 was released, I would say it’s a good bet that memory protection only started being used by operating systems once processors with memory protection started outnumbering those without (Bill Gates infamously called the i286 a “brain dead” chip because of its woes with memory protection and quality control).

                        1. 2

                          Right, I was eliding the software/hardware distinction in the case of Apple because I figure the fact of their controlling the whole stack meant that they could introduce both a computer with an MMU and an OS supporting virtual memory at the same time. Windows seems to be a different case — as I understand it, there wasn’t real memory protection (in the sense of distinguishing between privileged/unprivileged accesses, as opposed to just memory isolation with each process having its own address space) until well after CPUs supporting it became ubiquitous, because the non-NT model effectively had everything running as root.

                    2. 9

                        This was for an application used to personalize smart cards when issuing credit cards, government IDs, passports, etc. It was used all around the world, but for one customer it would crash on startup. We tracked it down to a database error. Our installer created the database schema and inserted the current version into a field called VERSION. When our application started up, it checked the “version” field to confirm that it held the expected value (to ensure it was running against the database schema version it was written against).

                        Manually checking the value showed that it was fine. And it worked fine everywhere except at this one customer. The error was that there was no column named “version”, but looking at the schema, there clearly was. The only difference was that the schema-writing code used UPPERCASE and the SQL in the app was lowercase. But SQL isn’t case sensitive, and we confirmed that the customer had set their SQL Server instance to be case-insensitive.

                        I was two days away from an overnight flight to Ankara when I figured it out. Luckily for my employer, I had spent a few weeks in Turkey doing the post-college backpack around Europe thing, back when recent college graduates could still afford that kind of thing, so I knew a little bit about Turkey and specifically the Turkish alphabet. Which has two ‘i’s - one with the little dot and one without. So “select version from app” doesn’t uppercase to “SELECT VERSION FROM APP” when using the Turkish alphabet. It becomes “SELECT VERSİON FROM APP” (look closely).
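
                        The effect is easy to demonstrate. A Python sketch (Python’s str.upper() is locale-independent, so the Turkish mapping i → İ is applied by hand below):

```python
import unicodedata

# Default (locale-independent) Unicode uppercasing, as most runtimes use:
assert "version".upper() == "VERSION"

# Turkish casing maps dotted i (U+0069) to dotted İ (U+0130), not to I.
def turkish_upper(s):
    return s.replace("i", "\u0130").upper()

assert turkish_upper("select version from app") == "SELECT VERSİON FROM APP"
assert turkish_upper("version") != "VERSION"  # the column lookup fails
assert unicodedata.name("\u0130") == "LATIN CAPITAL LETTER I WITH DOT ABOVE"
```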

                      A quick google today shows that lots of people have been bitten by that Turkish ı, but this was in the 90s and was the first time I had heard about it.

                      1. 8

                          This one isn’t necessarily unheard of, but it was particularly novel at the time: I encountered it at my first job, and no one was able to track down the cause.

                        We would get builds that would fail for no particular reason but intermittently with no obvious pattern.

                        I think we shrugged it off for like 2 months before someone got annoyed and spent an entire day trying to figure it out.

                        Basically, we were using a commit sha as the version for a Docker image and not stringifying it. I believe the combination was that when you had all numbers and a single e, it would be interpreted as scientific notation.

                          I think Helm 2 was the culprit: it would interpret what we assumed was a string and turn it into a number, which was not a valid value.

                          I’ve seen this pop up a few times since in other projects, but it’s one of those things that is rare enough to shrug off for a long time.

                        1. 5

                            That’s a “feature” of YAML, not specific to Helm. The YAML spec is quite large and there are a lot of surprising things in there that crop up when you least expect it.
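
                            For instance (parser-dependent: reportedly the Go YAML library that Helm used resolves an unquoted all-digit scalar with a single “e” as a float, while some other parsers don’t):

```yaml
image:
  # An unquoted, all-digit commit sha containing one "e" can be
  # implicitly typed as a float in scientific notation:
  tag: 1234567e8            # may be read as the number 1.234567e+14
  # Quoting forces a string, which is the usual fix:
  tag_quoted: "1234567e8"
```

                            Quoting version- and sha-like values (or tagging them explicitly with !!str) sidesteps the implicit resolver entirely.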

                        2. 8

                          Algol had a curious, umm, feature. Call by name.

                          If you passed an expression as a parameter, it evaluated that expression when (and if) that parameter was referenced.

                          The “and if” part of that is actually quite a good idea if the expression was computationally expensive or liable to trigger recursion.

                          However, the “when” part meant it was re-evaluated every time it was referenced.

                          Which did surprising things if the expression had side effects.

                          The case that boggled my mind the most was passing “rand()”.

                            It took me ages to realise that every time I referenced that parameter, for example in a debug print… it evaluated to a different number…
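
                            Call-by-name can be simulated in most languages by passing a thunk (a zero-argument function) and re-evaluating it at every reference. A sketch of why that debug print surprised me:

```python
# Simulating Algol call-by-name: the caller passes a thunk, and every
# *reference* to the parameter re-evaluates the expression.
def use_param_twice(param):
    first = param()   # expression evaluated here...
    second = param()  # ...and again here, repeating any side effects
    return first, second

counter = {"n": 0}
def next_value():
    counter["n"] += 1
    return counter["n"]

print(use_param_twice(next_value))  # (1, 2): two references, two evaluations
```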

                          A semi-related one was in C….

                          Spot the bug in my ex-colleagues code…

                          a( i=0, ++i, ++i);
                          1. 2

                              Ha, somewhat related: a while ago I wrote a nested for loop to draw an image pixel by pixel – pretty typical stuff, the first loop did rows and the second loop did columns for each row.

                            Spot the bug:

                            for (x = 0; x <= img.length; x++) {
                                for (y = 0; x <= img[x].length; y++) {

                            Spent the entire afternoon on this, couldn’t figure out why it would freeze whenever I clicked the button to render the image…

                          2. 5

                              I was implementing a programming language with multi-threading, and I was heavily testing it when I started having weird crashes that I could not understand. It took about a week to discover that the problem was linked to my reference counting mechanism. Sometimes the reference count would drop to 0 and the object would be deleted while still in use by another thread. Sometimes the count would stay at 2 and the object would never be destroyed, creating massive memory leaks (it was implemented in C++). Eventually, I understood that it was a concurrency problem and implemented the reference counter as an atomic value, which saved the day.
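
                              The fix can be sketched in Python, with a lock standing in for the atomic increment/decrement that the actual C++ code used (an illustration, not the original implementation):

```python
import threading

class RefCounted:
    """Thread-safe reference count; the lock plays the role that
    std::atomic<int> played in the real C++ fix."""
    def __init__(self):
        self.count = 1
        self._lock = threading.Lock()

    def incref(self):
        with self._lock:
            self.count += 1

    def decref(self):
        with self._lock:
            self.count -= 1
            return self.count == 0  # True: last owner, safe to free

obj = RefCounted()
threads = [threading.Thread(target=lambda: (obj.incref(), obj.decref()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(obj.count)  # 1: no lost updates, no premature free
```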

                            1. 4

                              It’s weird to read now, but I wrote this up a couple of years ago: https://ltriant.github.io/2019/03/23/debugging-ev-and-perls-signal-handlers.html

                                I stopped and came back to it a couple of times over a period of a few months, finally figuring it out a few weeks before I finished up at that job, and then left Perl behind for prod systems for good.

                              1. 3

                                I am currently fighting a bug in my memory profiler for Python (github.com/pythonspeed/filprofiler/issues/149). Basically the profiler tracks every allocation, and keeps a hashmap from address to allocation size + callstack. Somehow, an allocation is disappearing.

                                  After a lot of thinking and debugging, I found one cause (a race condition—Rust doesn’t help with “my cached copy of this data got out of sync with the original data” bugs). But there’s more… and plausibly it’s the reentrancy prevention code. If the tracking code does malloc(), you can end up in an infinite loop as you track the allocations from the tracking code, so there are reentrancy flags to prevent that… but doing it wrong can result in memory disappearing. Currently trying to figure out ways to log just enough to get the info I need, without spewing massive amounts of data.
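
                                  The reentrancy guard pattern looks roughly like this (a generic sketch, not Filprofiler’s actual code). Note the hazard: the inner, guard-skipped allocation is never recorded, which is one way tracked memory can seem to disappear:

```python
import threading

_tls = threading.local()

def tracked_malloc(size, real_malloc, record):
    # If the tracking machinery itself allocates, skip tracking that inner
    # allocation, or tracking would recurse forever.
    if getattr(_tls, "in_tracker", False):
        return real_malloc(size)  # NOT recorded: a potential blind spot
    _tls.in_tracker = True
    try:
        addr = real_malloc(size)
        record(addr, size)  # may allocate itself; guarded by the flag
        return addr
    finally:
        _tls.in_tracker = False
```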

                                  It doesn’t help that the reproducer is a complex Python program that doesn’t always reproduce the problem.

                                1. 3

                                  The context was unifying data from a lot of reports which all had enough data to match the entities described in different reports… but in a different way each time. Oh, and there were some corrections in the reports coming once in a few days.

                                  I quickly decided that I was not careful enough to maintain the correspondence by hand in the presence of corrections. So the plan was to write a kind of limited inference engine: rename the columns in a consistent way, see which combinations of fields were reasonable identifiers, and track what we could find out about the entities from each report and what implications could be drawn. All went well until ~10% of the entries in one file failed to match. The goal was of course a 100% match, so a sudden failure to match some records with no clear reason was annoying. It was not the first report with matching problems, but it was the first where the majority of entries were fine while a non-negligible minority did not match what it should have matched.

                                  After a ton of debugging of the inference engine, it turned out that the problem was never there. The file was incorrectly preprocessed: I had mistyped the header of the most convenient column for identifying entries! It turns out the inference engine was not worse than I expected but better: even without the key column it managed to find the correspondence for most records, but for some rows no «approved» combination of fields was sufficient to draw any conclusion. This was not what I expected from a mislabeled file, so I did not recheck the file attentively enough to notice one small typo…

                                  1. 3

                                    I don’t remember what the symptoms were, but I remember the cause: an unsigned integer field being initialized to -1 (in C) and then checked for a negative value with < 0. This was on an embedded device, a CNC machine controller board in particular. We had no help from the compiler on that one.
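                                    A minimal sketch of that class of bug (the field name and width are hypothetical): assigning -1 to an unsigned field wraps it to the type’s maximum value, and a `< 0` test on an unsigned value can never be true, so the “still uninitialized?” check silently never fires.

```cpp
#include <cstdint>

// Hypothetical reconstruction of the bug: -1 stored into an unsigned
// field wraps around to the maximum value instead of staying negative.
uint16_t position = -1;  // actually becomes 65535

// The check the code relied on: always false for an unsigned field,
// so the "not yet initialized" branch is never taken.
bool uninitialized = position < 0;
```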

                                    1. 3

                                      A client was unable to log into their CMS, which I was a part of building and migrating. The only way they could log in was to submit a reset-password request and go through the emailed link, which reset the password and subsequently logged them in. After a week or two of trying to figure it out, I realized my mistake was the character width of the VARCHAR column holding users’ encrypted passwords in the MySQL database. MySQL was silently truncating the data. The reset-password hack worked because that path did not have to verify the newly set password before starting a session: it just stored it, truncated, every single time.

                                      1. 3

                                        I blogged about my favourite bug during my early years learning to write C code. Years afterwards I even named my company after the experience. https://martinrue.com/zzuy-a-lesson-in-perseverance/

                                        1. 1

                                          Funny story, even if I did guess the ending. Because that’s the sort of thing that I’d do.

                                        2. 2

                                          The port of BIND (the DNS server package) to NT had an issue which, due to a subtle interaction with the Windows high-resolution timing facility, would cause it to peg every core/thread when it executed some periodic/scheduled tasks. I was only able to reproduce it because I happened to start bringing music to work, and foobar2000 also used mmtimers.

                                          1. 2

                                            I lived in an apartment building that had fiber to the basement, and then Ethernet delivered to the apartments. From time to time, Facebook and some other sites would stop working. Sometimes it was CDNs that were blocked, sometimes it was DNS servers: basically, intermittent random packet loss on specific IP addresses. Cutting 1.5 months of diagnosis short, I learnt to use Wireshark and led the ISP tech through the issue to confirm that it was a bug in the router’s PPPoE stack / firewall, which they had to rely on the manufacturer to fix. The reward was a working internet connection (useful when working from home). No idea what the exact issue was to this day.

                                            1. 1

                                              Well there was this one time with 41 ms of unshakeable latency…

                                              But then I took an arrow to the knee.

                                              Jokes aside, I’m loving this question and the answers to it.