1. 1

    I’m glad to see I’m not alone in the camp of running multiple backup methods on my Mac.

    I have my desktop and laptop configured to have Time Machine back up to my FreeNAS box. It works fine for a while, but like the author, after some time I get the dreaded popup that there’s a problem with the backup image and it needs to start from scratch. There are sometimes ways to fix it, but they don’t always work. As far as I know this error only happens when using Time Machine over the network - where it uses a sparsebundle - and not when using a local disk.

    I also have my machines back up using Arq and it’s been great. On my desktop Arq backs up my entire $HOME to FreeNAS and a subset of valuable data to Backblaze B2. My desktop also has a USB disk which SuperDuper! clones to every night as a full bootable backup.

    I’ve contemplated ditching Time Machine multiple times, but the thing is if you get a new machine or reinstall the OS, Time Machine is the only thing which can do a full bare metal restore at OS install time. I get all my applications, user config, etc. restored by Time Machine. Arq has all my user data but I can’t do that kind of restore.

    1. 1

      As far as I know this error only happens when using Time Machine over the network - where it uses a sparsebundle - and not when using local disk.

      That’s what I’ve seen. I think that the underlying cause is grabbing the laptop and leaving the house while it’s in the middle of a backup.

      1. 1

        That’s been my presumption too, although I haven’t proved it yet.

    1. 4

      We had a test instance at work sometime early-mid last year. It was mostly set up as a hobby/trial thing so we weren’t paying for it.

      When that instance was too much of a burden for the couple of people who were running it to maintain, it got shut down - and suddenly everyone realised just how valuable it was and begged for it to be set up again. We’ve now got an enterprise license (I think) for it and a full team who owns it.

      I think it’s great. Our internal VCS is self-hosted Bitbucket, and the search in that is awful. Sourcegraph is fast and seems to just work, from an end-user perspective. We have thousands of internal repos and it returns search results much faster than I would expect.

      1. 1

        I quite like it as well.

        My only gripe is that it is absurdly expensive for what it is.

      1. 3

        I’m sticking with Mojave for now purely because of Lightroom. I’ve got one of the last standalone (non-CC) versions. AFAICT they don’t do standalone anymore so I can’t upgrade - CC isn’t worth the monthly fee as I use it so infrequently.

        Until I can find something to replace LR (and it was a huge pain to migrate from Aperture way back when), I’m kinda stuck.

        1. 5

          Last job: kibana - hated it. For some reason I’m incompatible with the syntax. I often tried to just narrow it down to minutes/hours and then still used grep and jq on the structured logging json files.

          Current job: Nothing except ssh and grep - not ideal, but still beats kibana for low number of hosts. Also not my or even our choice, customer’s boxes.
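          For anyone curious what the grep + jq workflow on structured JSON logs looks like in practice, here’s a minimal sketch. The log file, field names, and values are all made up for illustration:

          ```shell
          # Hypothetical structured (JSON-lines) log file.
          cat > app.log <<'EOF'
          {"ts":"2019-07-02T10:00:01Z","level":"info","msg":"request ok"}
          {"ts":"2019-07-02T10:00:02Z","level":"error","msg":"upstream timeout"}
          {"ts":"2019-07-02T10:00:03Z","level":"error","msg":"upstream timeout"}
          EOF

          # Narrow to a time window with grep, then count error messages with jq.
          grep '2019-07-02T10:00' app.log \
            | jq -r 'select(.level == "error") | .msg' \
            | sort | uniq -c
          ```

          Crude, but it works anywhere ssh works, which is kind of the point.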

          TLDR: No idea what’s a good solution, really. I’d probably give graylog a spin and then maybe write my own thing. Only half-joking.

          But to elaborate on what I see as your actual question: Logging isn’t enough. Honeycomb sounds like a mashup of logging and metrics. I liked Prometheus when I used it, but there was no good way to mix it in with the logs.

          Also, the “typically 300-400 dimensions per event for a mature instrumented service” is not to be taken lightly. Given who posted the linked blog post, and that I seem to disagree with most of their angry tweets… this is probably a solution for a specific type of problem that I usually don’t find myself having.

          1. 4

            Honeycomb sounds like a mashup of logging and metrics.

            As a (very happy) Honeycomb user (on a simple rails monolith): It’s really not.

            I’ve used Splunk and NewRelic extensively; they are not the same kind of thing. The best summary of the difference I can offer is this:

            • Splunk and NewRelic are designed to answer a large-but-finite set of known questions.
            • Honeycomb is designed to answer new questions.

            IME, the ‘fixed set of questions’ approach will get you a pretty long way, but it often leads to outages being misdiagnosed (extending time-to-fix) because you have to build a theory from the answers you’re able to get.

            • Honeycomb lets you formulate a question to confirm your theory during the outage
            • Splunk lets you add a new indexing strategy so you could answer that question during the next outage, and
            • NewRelic tells you to be happy with the answers you’ve already got.

            Also, from a cost perspective the orders of magnitude don’t really line up. If you already pay for splunk, the cost of Honeycomb is a rounding error.

            1. 1

              If I had written “a mashup of logging and metrics and more”, would you have agreed with that more? I can only guess, as I have never tried it.

              I haven’t used splunk - for me it’s “that process that runs on the managed machines we deploy to and that sometimes hogs CPU without providing any benefit because we don’t have access to the output” :P And no, don’t ask about this weird setup.

              1. 3

                Still not quite. “Nested span” really is a separate type of thing.

                The failure of logs is that they are too large to store cheaply or query efficiently.

                You can extract metrics from logs (solving the ‘too large’ issue), but then you can only answer questions your metrics suit. If you have new questions you have to modify your metrics gathering and wait for data to come in, which is not an option during an outage.

                Nested spans are (AFAIK) the most space-efficient and query-efficient structure to store the data you would usually put into logs, but they require you to modify your application to provide them (which is a nontrivial cost). In practice, nested spans typically also combine logs from multiple sources (eg load balancer, app server and postgres logs can all go into the same nested span).

                The core observations IMO:

                • It is possible (with clever tricks) to pack all useful info from a log line into a key/value map without increasing the storage required. This is the ‘arbitrarily wide’ term.
                • Most events are part of one parent event (eg: one DB query is part of responding to one http request). This is the ‘nested’ term.
                • All events which can be logged have a start and end time, even if the duration is nearly 0. This is the ‘span’ term.

                Taken together, this gives you the term ‘arbitrarily-wide nested spans’. Several optimizations are possible from there - eg the start/end times of a child span tend to be very small offsets from the parent event, whereas storing as logs requires the full-precision timestamp.

                When your data is stored in this sort of nested-span structure, it’s both smaller than the original logs (thanks to removing redundancy in related lines) and more complete (because it can join logs from multiple sources).
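                To make that concrete, a nested span for one request might look something like the sketch below. All field names and values here are invented for illustration - Honeycomb’s actual event schema is up to you:

                ```json
                {
                  "trace_id": "abc123",
                  "name": "GET /orders",
                  "start": "2019-07-02T12:00:00Z",
                  "duration_ms": 412,
                  "http.status": 200,
                  "user.id": 42,
                  "children": [
                    {"name": "db.query", "offset_ms": 10, "duration_ms": 380},
                    {"name": "render", "offset_ms": 395, "duration_ms": 15}
                  ]
                }
                ```

                Note how the children carry offsets relative to the parent rather than full timestamps, and how load balancer, app, and DB information can all hang off one structure.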

                You can further shrink your data by sampling each top-level span and storing the sample rate against the ones you keep. For instance, my production setup only stores 0.5% of 200 OK requests that completed in under 400ms. Because the events which are kept store the sample rate, all the derived metrics/graphs can weight that event 200x higher. This means that I have more storage available for detailed info about slow/failed requests.
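                The sample-rate trick is simple to sketch. Each kept event records the rate at which it was sampled, and any derived count just sums those rates (field names below are hypothetical):

                ```shell
                # Hypothetical kept events: each records the rate it was sampled at.
                cat > events.jsonl <<'EOF'
                {"status":200,"duration_ms":120,"sample_rate":200}
                {"status":200,"duration_ms":95,"sample_rate":200}
                {"status":500,"duration_ms":30,"sample_rate":1}
                EOF

                # Estimated true request count: weight each kept event by its rate.
                jq -s 'map(.sample_rate) | add' events.jsonl   # 401
                ```

                Three stored events stand in for an estimated 401 real requests, with the rare failure kept at full fidelity.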

                This is all sensible stuff you could build yourself, but honeycomb already got it right and they’re dirt cheap (their free plan is ample for my work, which is a pretty damn big site). The annual cost for their paid plan is less than a day of my time costs.

                I haven’t used splunk - for me it’s “that process that runs on the managed machines we deploy to and that sometimes hogs CPU without providing any benefit because we don’t have access to the output” :P And no, don’t ask about this weird setup.

                This is par for the course at most places dysfunctional enough to pay for splunk (at a past employer it cost more than a small dev team, and it took me months to get access to it).

              2. 1

                Could you give some concrete examples of the difference between Splunk and Honeycomb? Specially, what do you mean exactly by “known questions” vs “new questions”?

                1. 1

                  When you configure a splunk install, you setup indexing policies which determine which queries can be efficiently answered.

                  Most of the time, when I wanted to find something out in splunk, I wrote a query and got a result in a second or two. These are ‘known questions’, and splunk uses its index to come up with answers quickly. For instance, if I wanted to see a breakdown of HTTP status code by request route per hour, our splunk install had zero trouble answering instantly.
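                  In SPL (Splunk’s query language) that kind of ‘known question’ looks roughly like the sketch below - the sourcetype and field names are hypothetical and depend entirely on how your install parses logs:

                  ```
                  sourcetype=access_combined
                  | bin _time span=1h
                  | stats count by _time, status, request_route
                  ```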

                  If you want to answer a question which isn’t handled by those indexing policies (a ‘new question’), Splunk has to read the log files in full to answer it. This results in splunk taking a minute or two (dependent on your hardware and load) to give you a results page.

                  For instance, I recently wanted to compare browser share stats between google analytics (blocked by trackers) and honeycomb (records bots as well as real users). So, I did a breakdown by browser + version + IP address, then manually excluded IP addresses with abnormally high request rates and repeated with just browser+version. Took me a minute or so to put the final query together. If I’d tried that with splunk, I’d have had a visit from the operations team asking me how long I planned on generating that much load on the splunk server.

              3. 1

                Observability - and the service Honeycomb provides - is about being able to observe or introspect your system, to ask unknown questions without having to recompile and redeploy.

                Metrics and logs can be a part of observability, but for example metrics usually only give you a hint at what’s going on at a given sampled period and are based on known questions. Similarly for logs, you can add as many log lines as you want but they might not help you answer specific questions at runtime - and you have the overhead of having to process and store all these logs. Inevitably people downsample logs and metrics for long-term storage, which means you lose data.

                If you have an observable system you can, at runtime, ask questions about what the system is doing at a given point. This is generally in the form of some kind of distributed tracing system. In tracing, while processing a request, each component of a system (including external dependencies) starts a span connected to a trace. The sum of all spans and the metadata attached to them lets you see more detail about what’s going on, such as durations for a certain type of request. This would be too expensive in a timeseries system.

                Going back to the original question, ELK still works fine. For small setups like a home environment it’s relatively straightforward. Structured logs (e.g. JSON formatted) do make ingestion into Elasticsearch easier; they’re not required, but they are recommended.

                1. 1

                  I have the nagging feeling you’re disagreeing with me as if I had tried to define observability wrongly, but imho I’m not even trying to.

                  If it’s simply about the sentence with mashup, that was simply my impression from the linked article, as I don’t know about Honeycomb.

              1. 4

                Nice to see mention of the Palm Pre. I had one and absolutely loved it. WebOS was an amazing platform and Palm encouraged using developer mode and modifying the OS. As you note, it’s a real shame it didn’t take off.

                Prior to the Pre I’d had a T-Mobile G1 which was awful, partly because early Android was awful but as a device it got frustrating to keep having to physically rotate the device to use the keyboard (Android had no virtual keyboard then). I quickly grew tired of the G1 and went back to my Nokia E51 until the Pre came out.

                I held out until the iPhone 4 before switching to Apple. Prior iPhones seemed inferior to the Pre - and Apple eventually “borrowed” the webOS task switcher. I’ve tried Android a couple of times since and never got on with it.

                I too would like to see other than Apple and Google in the ecosystem again.

                1. 4

                  Fond memories of my G1. I really used and abused that phone. The keyboard was super annoying for sure.

                  1. 3

                    Pray tell, do you still have that E51 lying around?

                    1. 3

                      I do!

                  1. 1

                    I was hoping to catch up on sleep as this week’s been pretty bad, but work-induced stress is getting the best of me.

                    Otherwise I’m carrying on with the Algorithms course I mentioned last week. Staying on top of deadlines is hard when I don’t have the time or mental capacity during the week.

                    1. 4

                      I’ve got a couple running CoreDNS to run as the internal DNS servers for the home network (deliberately separate from the EdgeRouter and Microserver so I can do maintenance), and a third with a temperature sensor attached so I can monitor the temperature in my office.
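                      For anyone wanting to try the same, a minimal Corefile for a home setup might look like this sketch - the zone name, file path, and upstream resolvers are just examples:

                      ```
                      home.lan {
                          file /etc/coredns/home.lan.zone
                          log
                      }
                      . {
                          forward . 1.1.1.1 9.9.9.9
                          cache
                      }
                      ```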

                      1. 2

                        I started the Princeton Algorithms I course on Coursera last weekend, so this weekend is finishing off the first assignment.

                        I’ve not done any Java before so there’s an extra learning curve to getting the coursework done on time.

                        1. 9

                          I’ve been using NextDNS since they launched, and a local recursor upstreaming to it using DNS over TLS. It works very well and the devs are very responsive to bugs and feature requests.

                          1. 3

                            I was relying on unbound on my router along with Steven Black’s scripts for generating/consolidating bad hosts, as well as keeping my own black/white lists. My setup worked great, but cname cloaking and automated 3rd party tracker domains meant that ads would still poke through on some sites. Blacklisting these domains as I encountered them works but it gets tedious. NextDNS is pretty good about this, and I’ve been using it for a couple weeks without complaints.

                          1. 3

                            Is there some tmux trigger key that doesn’t interfere with readline? Maybe this is a bad habit on my part, but I frequently use C-a to go to the beginning of the line. This is mostly why I’ve been hesitant to use tmux. Similarly with C-b to go back a single character, though I use that much less frequently.

                            1. 3

                              I use the backtick (`) character. I unbind C-b, bind backtick, and then set a double-backtick to produce a literal one. The character comes up infrequently for me, and double-tapping it to make a literal one isn’t much of a challenge when it happens. The key position is close to the escape key, which I enjoy as a vim regular. (I also rebind most movement keys to vim-like as well)

                              Here’s the code that sets my leader
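                              A minimal `~/.tmux.conf` along the lines described (unbind C-b, backtick as prefix, double-backtick for a literal backtick) would be something like this sketch:

                              ```
                              unbind C-b
                              set -g prefix `
                              bind ` send-prefix
                              ```

                              The `send-prefix` binding is what makes a second backtick pass a literal one through to the shell.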

                              1. 2

                                You’ll get a ton of different answers here, but I like M-a

                                1. 2

                                  I’ve been using screen and then tmux with the same keybindings, and typing C-a a to go to the start of a line is now second nature to me. So much so that I get tripped up outside tmux.

                                  1. 2

                                    I’ve been using ctrl-v in both screen and tmux for as long as I can remember for exactly this reason. Ctrl-v is only rarely used (it’s for inserting a literal control character).

                                    1. 2

                                      I use C-o but it could be that it only makes sense with Dvorak as keyboard layout. On the other hand I tend to always have both hands on the keyboard.

                                      1. 2

                                        I use C-z.

                                        There’s a huge discussion of that in this superuser question.

                                        1. 2

                                          This SU question may be related: https://superuser.com/q/74492/18192

                                        1. 4

                                          The Pony project is primarily using:

                                          We still have a few older TravisCI, Appveyor, and CircleCI tasks around, but those are all being deprecated.

                                          GitHub actions are very nice for us as it is much easier to automate portions of our workflow than it was with “external” CI services.

                                          CirrusCI is awesome because… we can test Linux, Windows, macOS, and FreeBSD. But most importantly, we have CI jobs that build LLVM from scratch and CirrusCI allows us to get 8-CPU machines for the job. The only CI services we tried that have been able to handle the jobs that build LLVM are CircleCI and CirrusCI.

                                          1. 1

                                            I switched a project from a self-hosted Buildbot to CirrusCI and I’ve also found it to be great. It’s so useful that they can spin up builders on demand and do so on your own GCP account.

                                          1. 1

                                            This was a great read, thanks!

                                            1. 1

                                              That’s neat. I hadn’t heard of Deployment Manager before. Glad to know there’s an alternative to Terraform, even if it’s GCP-specific.

                                              1. 2
                                                # modern configuration
                                                ssl_protocols TLSv1.3;

                                                Am I the only one that thinks that these tools are really toxic? Folks will just copy-paste all of these things without realising that they’re precluding their users from being able to access the sites. There’s a good reason most real companies (Google included) are still happy to serve you over TLSv1.0. Mozilla markets such configuration as “Old”, with a note that it “should be used only as a last resort”. I guess Google is using a last resort. ¯\_(ツ)_/¯
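                                                For comparison, the broader-compatibility alternative would be something closer to Mozilla’s “Intermediate” profile - roughly the following, though you should verify against the generator itself rather than trust this from-memory sketch:

                                                ```
                                                # intermediate configuration (sketch)
                                                ssl_protocols TLSv1.2 TLSv1.3;
                                                ssl_prefer_server_ciphers off;
                                                ```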

                                                1. 6

                                                  But it defaults to “Intermediate”, and there are short explanations of each next to the radio buttons. “Modern” does say “[…] and don’t need backward compatibility”.

                                                  1. 3

                                                    Which up-to-date browsers do not support TLS v1.3? Sure, you could run IE7 or FF 3.0, etc, but I’d want to do everything in my power to discourage folks who are running outdated browsers from using them to browse the web, including denying them access to any website(s) I am running.

                                                    Google has different motives: show ads to and collect info from everyone.

                                                    1. 3

                                                      It seems to be a common misconception that the internet’s sole reason for existence is now to deliver content to Firefox and Chrome. While this is perhaps true for some people - and may be true for you - it’s certainly not a base assumption you should operate on. There are still TLS libraries out there that don’t support TLSv1.3 (such as libressl), and thus there are tools which can’t yet use TLSv1.3. There is - as far as I’m aware - little need from a security POV to prefer TLSv1.3 over v1.2 if the server provides a secure configuration. If you want to discourage people from using old browsers, display some dialogue box on your website based on their user agent string or whatever.

                                                      Removing support for TLS versions prior to 1.2 is most certainly a good idea, but removing support for TLSv1.2 is just jumping the gun, especially if you look at the postfix configuration. If you want to enforce TLSv1.3 for your users, fine. But to enforce it when other mailservers try to deliver email is just asking for them to fall back to unencrypted traffic, effectively making the situation even worse.

                                                      On a completely unrelated note: it’s funny that server-side cipher ordering is now seemingly discouraged in the intermediate/modern configurations. I guess that’s probably because every supported cipher is deemed “sufficiently secure”, but it’s still a funny detail considering all the tools that will berate you for not forcing server cipher order.

                                                      1. 1

                                                        Thanks for the reminder that some libraries (e.g. libressl) still do not support TLS v1.3. Since practically every browser I use (which extends beyond the chrome/FF combo) supports it, I hadn’t considered libraries like that.

                                                    2. 1

                                                      I was also surprised when I noticed this. I’d used this site before, but back then “Modern” meant only supporting TLS 1.2+, which I think is fitting.

                                                    1. 3

                                                      I’m thinking of redirecting https://cipherli.st/ to the Mozilla generator. Did it once before to the wiki, but that was disliked.

                                                      1. 3

                                                        Ah, I love cipherli.st! Thanks so much for providing it, it’s been a handy reference on several occasions.

                                                        The Mozilla generator is good but cipherli.st is more comprehensive.

                                                        1. 2

                                                          Quick question for an upcoming project: is there a rather canonical, maintained list of ciphers considered “state of the art” that doesn’t come in the form of a webserver config around somewhere?

                                                          1. 2

                                                            No! cipherli.st is more comprehensive than Mozilla’s ssl-config; it includes more services (e.g. dovecot, etc.)

                                                            1. 2

                                                              Disliked by whom? Want me to put you in touch with the author of the config generator?

                                                              1. 1

                                                                I like cipherli.st! I’d be sad if it just would be a redirect to the Mozilla generator.

                                                              1. 2

                                                                Preparing to move house!

                                                                Going through all the junk I’ve accumulated over the years is… not the most fun way to spend a weekend.

                                                                On a related note, if anyone reading is based in London, UK and knows of a good electrical equipment recycling company please do let me know!

                                                                1. 4

                                                                  Useless fun fact: we piped a line to tr in an example above to apply a ROT-13 cypher, but Vim has that functionality built in with the g? command. Apply it to a motion, like g?$.

                                                                  This is amazing. Stupid, but amazing.

                                                                  1. 1

                                                                    Given the one-time dominance of ROT13 as a means of, for example, hiding spoilers in Usenet postings about works of fiction, it’s always seemed like a natural enough thing to me. (Of course you can filter through /usr/games/rot13 if available, but then not everyone is going to have that ready to hand.) At any rate, it saves me effort at least a few times a year…
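                                                                    For anyone who’s never seen the tr version of the pipe mentioned above, it’s a one-liner:

                                                                    ```shell
                                                                    # ROT13: rotate both alphabets by 13 places; applying it twice round-trips.
                                                                    echo 'Uryyb, jbeyq!' | tr 'A-Za-z' 'N-ZA-Mn-za-m'   # Hello, world!
                                                                    ```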

                                                                  1. 6

                                                                    In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before.

                                                                    I am as guilty as anyone, but it’s really interesting that we create these far-fetched explanations when something goes wrong instead of assuming “the last deploy broke something,” which statistically is far more likely.

                                                                    1. 7

                                                                      Well, I don’t know. Cloudflare sees so many attacks of considerable scale every day that it’s probably more common for their monitoring to go crazy because of an attack than because someone pushed code.

                                                                      Although, I do agree that not considering this option right away means that they’re biased in some way towards blaming external actors.

                                                                      1. 4

                                                                        This is a great point! Their statistics for incident causes are likely not the same as most companies, but I’d still expect to see “bad deploys”/human error heavily represented.

                                                                        I guess the other important point is that they are probably used to bad deploys being caught earlier with gradual rollouts. The size of the impact for this incident is atypical for a bad deploy…

                                                                        1. 10

                                                                          I’m an SRE at Cloudflare. Most deploys to our edge are handled by us so we generally know what’s been deployed and when. As the post mentions we use the Quicksilver key-value store to replicate configs near-instantly, globally, and these config changes are either user data or changes made by us as part of a release. The WAF is unique here in that it’s effectively a code deploy performed through the key-value store, and is performed by the WAF engineers directly, via CI/CD, not us.

                                                                          So yeah, when this outage started we weren’t immediately aware that a WAF release had recently gone out, but generally we wouldn’t want to know - they do it frequently so generally it’d be noise to us if we had notifications for it. This is one of the things that led to a few minutes’ delay in identifying the WAF as the cause of the issues, but we had excellent people on hand who used tools like flamegraph generators to identify the WAF in only a few minutes.

                                                                          It’ll be interesting to see how we change the deployments.

                                                                    1. 3
                                                                      • et-see or ee-tee-see
                                                                      • lib (liberty)
                                                                      • char (broiled)
                                                                      • eff-ess-see-kay or “fussuk”
                                                                      • skeema (and I don’t recall pluralising it… but maybe skeemas?)
                                                                      1. 5

                                                                         One thing the author missed with 1Password is that if you use 1Password for Families or 1Password for Teams, members of the family or team can be granted managerial status, which lets them help recover other users in the account. This only applies if you use the hosted version, though, not the standalone one.