1. 3

    I looked at that and went “oh, that’s weird, must be a graphite bug”, and continued on.

    That’s an odd perspective…

    1. 7

      Graphite is flaky as hell, in my experience. I have some charts in production right now that are very wonky - look like randomly scattered pixels. The same data goes to newrelic and the graphs there look fine.

      1. 4

        Yeah. I was in the middle of other things at the time (as always), and just glanced over it. I’ve had metrics sit at 0 latency before for other reasons (often in the metrics stack), and it was a dumb mistake ;) That’s most of the point of the post though - missing simple things like that can lead to nasty problems down the line. These days I keep a list in a text file under “incuriosity killed the infrastructure”, and try and write things down in there if I notice something even slightly odd.

        1. 2

          It’s extremely healthy to question the accuracy of your time-series data, and try to prove it wrong. 0 values are not indicative of anything other than the fact that the front-end has not been able to draw any pixels higher than that. This may be because the backing DB is having transient issues with write volume. This may be because you value reliability of your metrics over best-case latency and your machines buffer outbound requests in a local persistent queue before forwarding them to the metrics system, and your implementation adds a bit of latency. Maybe you have an aggregation layer before it hits the metrics system, and your aggregation window isn’t up yet.

          It’s not uncommon for users of metrics systems to learn to ignore the leading edge. There’s a lot of variance in how long that edge may be. I guess it becomes “an odd perspective” when it’s been 0 for longer than you’ve learned to discount the system, and the author definitely discounts the system pretty heavily.

          1. 1

            Yeah, the story seemed mostly reasonable until he ignored the flat 0 successful hits to the database.

          1. 9

            Just yesterday I was reading “The byte order fallacy”, which seems like it could apply to dates and times as well. It shouldn’t matter whether the server is running UTC or not. Instead, any software running on it and communicating with other processes either needs to include the timezone in its messages, or convert to/from UTC. Assuming UTC is like assuming everything is little endian and saying “don’t plug any big endian machines into this network”.
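            The “include the timezone in the message” idea can be sketched quickly. This is a hypothetical Python illustration, not from the post (`serialize` and `to_utc` are made-up names):

```python
from datetime import datetime, timezone, timedelta

# Sketch: always attach an explicit offset when a timestamp leaves the
# process, and normalize to UTC when one arrives, so nothing downstream
# has to guess what zone the sending server was running in.
def serialize(ts: datetime) -> str:
    if ts.tzinfo is None:
        raise ValueError("refuse naive timestamps; attach a timezone first")
    return ts.isoformat()  # ISO 8601, keeps the offset, e.g. "-05:00"

def to_utc(wire: str) -> datetime:
    return datetime.fromisoformat(wire).astimezone(timezone.utc)

local = datetime(2014, 3, 9, 1, 30, tzinfo=timezone(timedelta(hours=-5)))
assert to_utc(serialize(local)) == local  # same instant, regardless of zone
```

Refusing naive timestamps at the serialization boundary is the date/time analogue of never letting raw byte order leak onto the wire.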

            I don’t know what version of cron the author is using. The man page for “vixie” cron (the one true cron) says this in a section about DST:

             If time has moved forward, those jobs that would have run in the interval
             that has been skipped will be run immediately.  Conversely, if time has
             moved backward, care is taken to avoid running jobs twice.
            
            1. 1

              Yeah, I wasn’t too explicit there, but there are a bunch of programs that emulate cron-like functionality inside your VM (e.g. resque-scheduler, Quartz, etc.). I think it’s better to just avoid the whole headache of DST entirely.

            1. 2

              Figuring out how to get yellerapp.com (the exception tracker I built and run) tracking which users were hit by an exception, on top of riak. Going well so far, but now have to write docs a bunch, and change the client libraries a little bit, which is always an annoying part.

              1. 3

                Whole bunch of UI improvements to my product, yellerapp.com (an exception analytics SaaS). I’m focussing on three areas:

                1. Speed - Yeller’s real good at server side speed (average page render is 20ms or so), but less good at client side perf. This update is working towards fixing that.
                2. Density - I want to fit as much on the screen as I can, so you’re less likely to miss crucial information
                3. Understanding - Whilst I care about density a bunch, a big focus is making it so you understand your exceptions, so that you can fix them faster.
                1. 2

                  Density - I want to fit as much on the screen as I can, so you’re less likely to miss crucial information

                  This may not be the right solution to the “making information visible” problem, though you probably know more about the problem space and the existing UI’s problems than I do.

                  1. 1

                    Yeah, it’s tricky. My description here didn’t really capture what I mean by density - the real thing I’m aiming for is

                    “everything you need, at a glance”

                    at least on the individual exception page

                1. 2

                  Author here. Yeller does drop messages on the floor if it has to, it just tries really, really, really hard not to. But it doesn’t apply backpressure on the clients (which is the other choice), because that means they get hit with unexpected latency.

                  1. 1

                    What kind of latency are you worried about? Couldn’t you just signal “not now” and have the clients send it later? I think this is how scribe does it.

                    1. 1

                      Sure. The worry there is that my client libraries run in other people’s production systems, and there are tradeoffs in how complex they get. I’m wary of messing up that code in some way such that my users hit bugs in those libraries. I also don’t wanna impact users’ memory/disks if Yeller goes down - I’m pretty certain they’d rather miss some error messages than have downtime or lots of GC-induced latency spikes.

                      There’s also a bunch of maintenance cost. I currently maintain 4 libraries for different languages, and that’s only going to get larger as time goes on.

                      This isn’t saying I’m not going to do this eventually, just I’m worried about the tradeoffs right now, so I’m holding off.
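                      The client-side tradeoff described above (never block the app or grow unbounded; drop reports instead) can be sketched as a bounded, non-blocking buffer. This is a hypothetical illustration, not Yeller’s actual client code; `ErrorReporter` and `send` are made-up names:

```python
import queue
import threading

# Sketch: a client that buffers error reports in a small in-memory queue
# and ships them on a background thread. If the backend is slow or down,
# reports are dropped rather than blocking the host application or eating
# its memory/disk.
class ErrorReporter:
    def __init__(self, send, max_buffered=100):
        self._q = queue.Queue(maxsize=max_buffered)
        self._send = send  # function that ships one report over the network
        t = threading.Thread(target=self._drain, daemon=True)
        t.start()

    def report(self, error):
        try:
            self._q.put_nowait(error)  # O(1), never blocks the app thread
        except queue.Full:
            pass  # drop on the floor: lose a report rather than hurt the app

    def _drain(self):
        while True:
            item = self._q.get()
            try:
                self._send(item)
            except Exception:
                pass  # best effort; any retry here must also stay bounded
```

The whole complexity budget is a fixed-size queue and one thread, which is roughly the kind of thing that’s defensible to run inside someone else’s production process.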

                      1. 1

                        It might be worth looking into seeing how zipkin deals with this problem. It exposes a scribe interface, so clients can either use scribe, or write to it directly using thrift. Scribe is also not the only game in town, but this method works for other strategies, like kafka, too. Then your clients can decide how hard they want to work for their exceptions, and it also enables having two classes of exceptions–one class of which is “please work really hard to not lose these”, and another is “best effort” (maybe fatal vs. nonfatal exceptions).

                  1. 2

                    Writing a course and set of posts on debugging production web apps. Covering stuff like common shortcuts to fixes, methodologies for dealing with the really hard bugs even though your hair is on fire (e.g. differential diagnosis), using metrics effectively for debugging and so on.

                    I’d seriously love some proofreading/feedback from folk who run (or have run) server software in production.

                    I’d also love to hear/read your favorite production debugging stories/postmortems - I want to compile a big ass list of good examples of that kinda stuff somewhere.

                    1. 2

                      I’m building a regex-enabled, indexed text search engine on top of riak (using roughly the same techniques as http://swtch.com/~rsc/regexp/regexp4.html). Going pretty well so far.
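                      The core of the linked technique can be sketched roughly as below. This is a hypothetical toy, not the riak-backed implementation, and it takes a literal substring explicitly instead of deriving trigram queries from the regex itself as the article describes:

```python
import re
from collections import defaultdict

# Sketch: index every trigram of each document, answer a query by
# intersecting posting lists for the trigrams of a required literal,
# then confirm candidates against the real regex.
class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for i in range(len(text) - 2):
            self.postings[text[i:i + 3]].add(doc_id)

    def search_literal(self, literal, regex):
        grams = [literal[i:i + 3] for i in range(len(literal) - 2)]
        if not grams:
            candidates = set(self.docs)  # literal too short to prune anything
        else:
            # only docs containing every trigram can possibly match
            candidates = set.intersection(*(self.postings[g] for g in grams))
        rx = re.compile(regex)
        return sorted(d for d in candidates if rx.search(self.docs[d]))
```

The index prunes the candidate set cheaply; the regex only ever runs over the few documents that survive the trigram intersection.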