1. 9

    Writing tests can prevent problems before they happen.

    1. 4

      I enjoyed reading this - though I read it because your one sentence summary gave me the “but waaait cleanroom” reaction.

      Another summary is “reflecting on your code, mechanically or not, improves quality”.

    1. 5

      I swear by org-mode. https://orgmode.org

      1. 2

        What do you do when you’re not at a computer?

        1. 2

          I’m new to the emacs crowd, but just today I’ve installed Orgzly on my Android phone, syncing is a little bit odd though. I’m not sure though if I prefer the built-in calendar/reminders system or rather go with some emacs <=> CalDav integration (if such a thing exists).

          1. 1

            That post seems a bit like overkill to me. I personally prefer to use the built-in sync with Dropbox (disclaimer: only built in in the Google Play version, not the F-Droid one), but people who want to keep it clean of closed code recommend Syncthing for this.

            1. 1

              You can call it overkill, but right now it’s the only way of syncing with this tool - I don’t have Play Store and I also don’t have Dropbox. I think Dropbox is acting in bad faith.

              1. 5

                Have you considered using Syncthing? It’s a peer to peer file synchronization utility that doesn’t rely on Google, and doesn’t store your data anywhere but your devices.

                1. 1

                  Syncthing is mentioned in the thread I’ve linked to in my initial comment. I’ll still give it a try, since I haven’t considered it at all. Note: I haven’t used Syncthing in the past two years, maybe it has improved.

                  1. 1

                    Syncthing is pretty terrible on Android, regularly was out of sync, and took my battery from ~28 hours to ~4. Wondering if there are specific setups that use less cpu for syncthing.

                    1. 1

                      I must have randomly stumbled into a working configuration, since my Keepass database stays pretty well-synced and my phone will usually last a day without needing charging. Sorry it doesn’t work for you, though.

                  2. 1

                    I keep my org-mode files in my Nextcloud instance, and in the Android app mark all the files to be kept in sync. Orgzly auto-syncs them now, no need for Tasker or anything.

          1. 5

            Notebook folks want many things:

            1. A way to share table/csv data along with a program that runs, excluding data that they can’t share.
            2. A way to view their data & visual analysis through a browser that can execute.
            3. A way to share code, with documentation built in.

            This is the push & pull of executing/running a program vs. publishing a program. Sounds like literate programming problems, and can be addressed with better literate programming tooling for python folks. There are many better formats, Org-mode even lets you share data between languages :)

            But there is hope! Notebooks are the single largest group pushing for better literate programming environments. While the current iteration is rather poop, looks like all of these issues are pretty fixable, even using a notebook-style UI.

            1. 4

              Full Catastrophe Living - take care of yourselves out there folks.

              1. 2

                raco exe + chez will be wonderful!

                1. 1

                  what’s the advantage in having that?

                  1. 1

                    Chez has a better compiler than Racket. It’s faster, and has a simpler architecture (from what I can tell). Some more details here: https://blog.racket-lang.org/2018/01/racket-on-chez-status.html

                    1. 1

                      Have you tried compiling chez in a repeatable manner, that is delivered as an executable? I got stuck for quite a while dealing with all their boot files - raco exe working would mean I don’t need to do that.

                  1. 20

                    Kinesis Advantage. I’ve been using them for almost twenty years, and other than some basic remapping, I don’t customize.

                    1. 2

                      Ditto, I’m at a solid decade. I cannot recommend them enough.

                      1. 2

                        Also Kinesis Advantage for over a decade. On the hardware side I’ve only mapped ESC to where Caps Lock would be. On the OS side I’ve got a customized version of US Dvorak with scandinavian alphabet.

                        I’d like to try a maltron 3d keyboard with integrated trackball mouse. It’s got better function keys too, and a numpad in the middle where there’s nothing except leds on the kinesis.

                        1. 2

                          Me too. I remap a few keys like the largely useless caps-lock and otherwise I don’t program it at all. It made my wrist pain disappear within a couple weeks of usage though.

                          1. 2

                            My only “problem” with the Kinesis, and it’s not even my problem, was that the office complained about the volume of the clicks while I was on a call taking notes.

                            So I switch between the Kinesis and an Apple or Logitech BT keyboard for those occasions.

                            1. 1

                              You can turn the clicks off! I think the combo is Prgm-\

                              1. 2

                                Yeah, it’s not that click, it’s the other one from the switches :-)

                                I can be a heavy typer and for whatever reason, these keys stand out more than I expected to others behind the microphone.

                            2. 2

                              I prefer the kinesis freestyle2. I like the ability to move the two halves farther apart (broad shoulders) and the tilt has done wonders for my RSI issues.

                              1. 2

                                similar, largely I like that I can put the magic trackpad in between the two halves and have something that feels comparable to using the laptop keyboard. I got rid of my mouse years ago but I’m fairly biased on a trackpad’s potential.

                                I’ve sometimes thought about buying a microsoft folding keyboard and cutting/rewiring it to serve as a portable setup. Have also thought of making a modified version of the nyquist keyboard to be a bit less ‘minimal’ - https://twitter.com/vivekgani/status/939823701804982273

                            1. 1

                              The license of this article doesn’t allow derivative works.

                              It seems in conflict with arguing for right to repair if you don’t want the right to remix..

                              1. 7

                                The article was freely given. I have no doubt you can offer the author something for a license to remix their content. The bike was already paid for.

                                1. 7

                                  Could you tell me why one would want to remix an article? The current license lets you quote the whole thing or sections of it for use in your own work. I’m not sure what extra would be gained by derivatives but I am happy to change it if there is reason.

                                1. 11

                                  Nice! For the uninitiated, what’s the difference between this project and corrode?

                                  1. 14

                                    Corrode is completely implemented in Haskell and handles a smaller set of code. This project uses Clang to handle parsing, pre-processing, and type-checking the code. It also takes advantage of Clang’s libtooling library so that we can understand all of the clang command-line flags when processing a C file. This allows us to handle a lot more code.

                                    1. 3

                                      I ran across your tweet but would have mentioned you in the post if I’d known you were a lobster! (I probably should have checked the user list first)

                                      1. 5

                                        No worries, I joined after you posted this :-)

                                        1. 1

                                          glad to see you here, glguy

                                        2. 3

                                          glguy came in to IRC asking for an invite after seeing your post.

                                        3. 2

                                          Makes sense, thanks for the reply! (In fact, I believe this is the same method that rust-bindgen uses, and, as far as I know, rust-bindgen works pretty well, so it’s a good idea…)

                                      1. 4

                                        These vary pretty heavily in quality. Many seem to be missing proper quoting. Use with caution.

                                        1. 4

                                          Use bash with caution.

                                          1. 1

                                            Yeah, but it’s the same as any script you find online: don’t run it if you don’t understand it. The benefit here is that some of the better ones are explained or corrected by other users.

                                          1. 8

                                            I had been vaguely aware of Copperhead OS but never looked into it or used it (I used Cyanogenmod before they imploded, and Lineage OS thereafter). I don’t know anything about the context for this other than the reddit and hacker news links here. Everything I’ve seen so far makes me feel inclined to be sympathetic to this Daniel Micay fellow, so I can’t help but wonder if there’s any information from his former business partner’s side of the story that would make me feel less sympathetic.

                                            1. 12

                                              He’s a fellow Arch Linux Trusted User. He seemed like a pretty ok dude in my interactions.

                                              1. 8

                                                I also chill in a few old irc channels with strncat post my major arch days, he has a lot of people in the open source community that respect his contributions. My bet is he’ll come out ahead of this if he can get untangled from the copperheados company.

                                              2. 16

                                                Daniel Micay was a prolific Rust contributor. (In fact, he is still in the top 20 even if he has been inactive since 2015.) In his Rust work, I found him to be a straight person.

                                                1. 2

                                                  I have a good impression of Daniel Micay after talking with him on IRC. He’s also an unusually knowledgeable programmer.

                                                1. 2

                                                  Getting XOAUTH2 to work with isync / mbsync.

                                                  If not, I’ll be writing yet another program to scrape mail out of Google’s email walled garden.

                                                  1. 7

                                                    Folks - take notes!

                                                    I am shocked at the number of developers/engineers I work with that are debugging an extremely complex problem, and force themselves to keep so much state in their head. If you can write out the debugging steps more like a journal / record of every action you took, it’s much easier to reinflate your subconscious state.

                                                    Make it refined enough someone else could reasonably follow along, and you’ll be able to as well. Lots of coworkers in other functions take detailed daily notes as a habit to show their progress to management, software gets lucky as there is an “output” on a small granularity of work.

                                                    As I get more and more reprioritizations & interruptions in my work, I’ve found it’s helpful to have confidence that all but maybe the last 30 mins of work are recorded in a decent fashion (org-mode!).

                                                    1. 2

                                                      I took notes on running a specific regression test at work. It’s something like 50 steps just to set it up [1]. And even then, others that have tried running it have had to fill in information I’ve neglected. It is hard to know at times what should be written down and what doesn’t have to be written down. And that changes over time, unfortunately.

                                                      [1] Why not automate it? Not that easy when you have to set up the programs and associated data across four machines. And then when it’s automated, it’s an even harder issue to debug if the automation breaks down [2].

                                                      [2] About half the time the test fails anyway because the clocks are out of sync. I had to code a specific test for that in the regression test, and yes, for some reason, ntpd isn’t working right. The other times the test fails is because the Protocol Stack From Hell [3] fell over because someone looked at it funny.

                                                      [3] Six figures to license a proprietary SS7 stack that isn’t worth the magnetic flux used to store it. This is the “best of breed” SS7 stack, sadly.

                                                    1. 4

                                                      I hacked up a small tool the other day that would buffer output from a command into memory until it receives a signal to reconnect to stdout, when it would dump everything that was output in the interim. I want to integrate this into dtach so emacs can have resumable shell sessions on remote hosts for TRAMP workflows.

                                                      Let’s just say it’s a huge distraction from the work I actually need to do and I hope I don’t make too much progress on it.

                                                      1. 2

                                                        The link to your tool is currently 404ed

                                                        1. 1

                                                          Oops, had no http on it: https://github.com/codemac/sigbuffer

                                                          It’s a dumb tool, but it was just a proof of concept that I knew how to use dup2+pipe again.
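
                                                          For anyone who hasn’t touched dup2+pipe in a while, here is a minimal sketch of just that part: run a command with its stdout pointed at a pipe and collect the output in memory. This is not the code from the repo above. The name `capture_stdout` is made up for illustration, it assumes a POSIX system, and the signal-triggered flush back to stdout is left out entirely.

                                                          ```cpp
                                                          #include <stdexcept>
                                                          #include <string>
                                                          #include <vector>
                                                          #include <sys/types.h>
                                                          #include <sys/wait.h>
                                                          #include <unistd.h>

                                                          // Hypothetical helper: run args[0] with its stdout redirected into a
                                                          // pipe and return everything it wrote as an in-memory buffer.
                                                          std::string capture_stdout(const std::vector<std::string> &args)
                                                          {
                                                            if (args.empty())
                                                              throw std::invalid_argument("no command given");

                                                            int fds[2];
                                                            if (pipe(fds) == -1)
                                                              throw std::runtime_error("pipe failed");

                                                            pid_t pid = fork();
                                                            if (pid == -1) {
                                                              close(fds[0]);
                                                              close(fds[1]);
                                                              throw std::runtime_error("fork failed");
                                                            }

                                                            if (pid == 0) {
                                                              // Child: make the pipe's write end its stdout, then exec.
                                                              close(fds[0]);
                                                              dup2(fds[1], STDOUT_FILENO);
                                                              close(fds[1]);
                                                              std::vector<char *> argv;
                                                              for (const std::string &a : args)
                                                                argv.push_back(const_cast<char *>(a.c_str()));
                                                              argv.push_back(nullptr);
                                                              execvp(argv[0], argv.data());
                                                              _exit(127); // only reached if exec failed
                                                            }

                                                            // Parent: drain the read end until the child closes its side.
                                                            close(fds[1]);
                                                            std::string buffer;
                                                            char chunk[4096];
                                                            ssize_t n;
                                                            while ((n = read(fds[0], chunk, sizeof chunk)) > 0)
                                                              buffer.append(chunk, static_cast<size_t>(n));
                                                            close(fds[0]);
                                                            waitpid(pid, nullptr, 0);
                                                            return buffer;
                                                          }
                                                          ```

                                                          A real tool would keep reading in a loop and only write the buffer out once the signal arrives; here the buffer is simply returned when the command exits.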

                                                      1. 3

                                                        I always love thesis dedications. It reminds me how much human life goes into each of these papers I tuck into ~/docs/pdf.

                                                        I dedicate this thesis to you, NH. Your continuous support and love throughout the writing of this thesis and also within my own life helped me in more ways than you probably realize. In the vastness of space and immensity of time, it is my joy to spend a planet and an epoch with you.

                                                        1. 2

                                                          I remember reading one dedication where it was obvious that the author was not pleased with the support of his advisors or something. Basically they said “my parents are awesome, my wife’s wonderful. My advisors were Bob, Sue, and Joe.”

                                                          1. 2

                                                            And don’t forget Olin Shivers’ acknowledgements section for Scsh (1994).

                                                            Who should I thank? My so-called “colleagues,” who laugh at me behind my back, all the while becoming famous on my work? My worthless graduate students, whose computer skills appear to be limited to downloading bitmaps off of netnews? My parents, who are still waiting for me to quit “fooling around with computers,” go to med school, and become a radiologist? My department chairman, a manager who gives one new insight into and sympathy for disgruntled postal workers?

                                                            My God, no one could blame me – no one! – if I went off the edge and just lost it completely one day. I couldn’t get through the day as it is without the Prozac and Jack Daniels I keep on the shelf, behind my Tops-20 JSYS manuals. I start getting the shakes real bad around 10am, right before my advisor meetings. A 10 oz. Jack ‘n Zac helps me get through the meetings without one of my students winding up with his severed head in a bowling-ball bag. They look at me funny; they think I twitch a lot. I’m not twitching. I’m controlling my impulse to snag my 9mm Sig-Sauer out from my day-pack and make a few strong points about the quality of undergraduate education in Amerika.

                                                            If I thought anyone cared, if I thought anyone would even be reading this, I’d probably make an effort to keep up appearances until the last possible moment. But no one does, and no one will. So I can pretty much say exactly what I think.

                                                            Oh, yes, the acknowledgements. I think not. I did it. I did it all, by myself.

                                                        1. 2

                                                          Thanks for posting this paper - really excited for what these types of tools could do for testing and verification.

                                                          nickpsecurity: do you regularly review publications? If so, which? As a storage nerd I read a smaller subset than I should.

                                                          1. 1

                                                            I just run through dozens of them at a time with my Google-fu (now DuckDuckGo-fu) to find the most interesting or practical along many areas of application. I submit some of those regularly to places where people enjoy or can use them. I also keep an eye out for folks building usable tools that might benefit from seeing specific papers. I try to get it to them.

                                                            I also constantly look for connections between it all for new methods of doing things. I’ve spotted some decent ones recently that could simultaneously boost productivity and code confidence by about an equal amount. Usually inverses of each other. So, if not that valuable now, I hope to make something useful later that builds on stuff you see me submit here.

                                                          1. 5

                                                            This is a fascinating case. It’s very unfortunate that the cyclist had to die for it to come before us. However, had the car been driven by a human, nobody would be talking about it!

                                                            That said, the law does not currently hold autonomous vehicles to a higher standard than human drivers, even though it probably could do so given the much greater perceptiveness of LIDAR. But is there any precedent for doing something like this (having a higher bar for autonomous technology than humans)?

                                                            1. 13

                                                              Autonomous technology is not an entity in law, and if we are lucky, it never will be. Legal entities designed or licensed the technology, and those are the ones the law finds responsible. This is similar to the argument that some tech companies have made that “it’s not us, it’s the algorithm.” The law does not care. It will find a responsible legal entity.

                                                              This is a particularly tough thing for many of us in tech to understand.

                                                              1. 25

                                                                It’s hard for me to understand why people in tech find it so hard to understand. Someone wrote the algorithm. Even in ML systems where we have no real way of explaining its decision process, someone designed the system, someone implemented it, and someone made the decision to deploy it in a given circumstance.

                                                                1. 11

                                                                  Not only that, but one other huge aspect of things nobody is probably thinking about. This incident is going to probably start the ball rolling on certification and liability for software.

                                                                  Move fast and break things is probably not going to fly in the face of too many deaths from autonomous cars. Even if they’re safer than humans, there is going to be repercussions.

                                                                  1. 8

                                                                    Even if they’re safer than humans, there is going to be repercussions.

                                                                    Even if they are safer than humans, a human must be held accountable for the deaths they will cause.

                                                                    1. 2

                                                                      Indeed, and I believe those humans will be the programmers.

                                                                      1. 4

                                                                        Well… it depends.

                                                                        When a bridge breaks down and kills people due to bad construction practices, do you put in jail the bricklayers?

                                                                        And what about a free software that you get from me “without warranty”?

                                                                        1. 4

                                                                          No - but they do take the company that built the bridge to court.

                                                                          1. 5

                                                                            Indeed. The same would work for software.

                                                                            At the end of the day, who is accountable for the company’s products is accountable for the deaths that such products cause.

                                                                          2. 2

                                                                            Somewhat relevant article that raised an interesting point RE:VW cheating emissions tests. I think we should ask ourselves if there is a meaningful difference between these two cases that would require us to shift responsibility.

                                                                            1. 2

                                                                              Very interesting read.

                                                                              I agree that the AI experts’ troupe share a moral responsibility about this death, just like the developers at Volkswagen of America shared a moral responsibility about the fraud.

                                                                              But, at the end of the day, software developers and statisticians were working for a company that is accountable for the whole artifact they sell. So the legal accountability must be assigned at the company’s board of directors/CEO/stock holders… whoever is accountable for the activities of the company.

                                                                            2. 2

                                                                              What I’m saying is this is a case where those “without warranty” provisions may be deemed invalid due to situations like this.

                                                                            3. 1

                                                                              I don’t think it’ll ever be the programmers. It would be negligence either on the part of QA or management. Programmers just satisfy specs and pass QA standards.

                                                                        2. 2

                                                                          It’s hard to take responsibility for something evolving in such a dynamic environment, potentially used for billions of hours every day, for the next X years. I mean, knowing that, you would expect to have 99.99% of cases tested, but here it’s impossible.

                                                                          1. 1

                                                                            It’s expensive, not impossible.

                                                                            It’s a business cost and an entrepreneurial risk.

                                                                            If you can’t take the risks and pay the costs, that business is not for you.

                                                                      2. 4

                                                                        It’s only a higher bar if you look at it from the perspective of “some entity replacing a human.” If you look at it from the perspective of a tool created by a company, the focus should be on whether there was negligence in the implementation of the system.

                                                                        It might be acceptable and understandable for the average human to not be able to react that fast. It would not be acceptable and understandable for the engineers on a self-driving car project to write a system that can’t detect an unobstructed object straight ahead, for the management to sign off on testing, etc.

                                                                      1. 2

                                                                        A lot of good stuff, reading C++ errors is like learning an extra language on top of the C++ language, I can’t wait to try it out.

                                                                        How do I get at some private field?

                                                                        That one is interesting! I wonder how smart it is.

                                                                        For instance in the following example:

                                                                        class foo
                                                                        {
                                                                        public:
                                                                          std::pair<int, int> get_coordinates() const { return std::make_pair(m_x, m_y); }
                                                                        
                                                                        private:
                                                                          int m_x;
                                                                          int m_y;
                                                                        };
                                                                        
                                                                        
                                                                        void test(foo *ptr)
                                                                        {
                                                                          if (ptr->m_x >= 3)
                                                                            ;// etc
                                                                        }
                                                                        

                                                                        I wonder if the compiler would be able to figure out that m_x is accessible via ptr->get_coordinates().first ?
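
                                                                        For comparison, a minimal sketch of the version that does compile, going through the public accessor instead of the private field. The constructor is an addition so the example can be exercised, and `test` is changed here to return the comparison result; neither is part of the snippet above.

                                                                        ```cpp
                                                                        #include <utility>

                                                                        class foo
                                                                        {
                                                                        public:
                                                                          // Constructor added for illustration only.
                                                                          foo(int x, int y) : m_x(x), m_y(y) {}

                                                                          std::pair<int, int> get_coordinates() const { return std::make_pair(m_x, m_y); }

                                                                        private:
                                                                          int m_x;
                                                                          int m_y;
                                                                        };

                                                                        // Reads m_x through the public accessor, so no access check fails.
                                                                        bool test(const foo *ptr)
                                                                        {
                                                                          return ptr->get_coordinates().first >= 3;
                                                                        }
                                                                        ```

                                                                        That is the fix-it I would hope the compiler could suggest.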

                                                                        1. 2

                                                                          Hah, you’re also cross-posting to HN as I am.

                                                                          1. 1

                                                                            :) yes, the author was able to reply on HN and even took time to open a suggestion on GCC’s bugzilla

                                                                          2. 1

                                                                            According to godbolt’s trunk gcc, it is not smart enough:

                                                                            <source>: In function 'void test(foo*)':
                                                                            <source>:20:12: error: 'int foo::m_x' is private within this context
                                                                               if (ptr->m_x >= 3)
                                                                                        ^~~
                                                                            <source>:13:7: note: declared private here
                                                                               int m_x;
                                                                                   ^~~
                                                                            Compiler returned: 1
                                                                            
                                                                          1. 3

                                                                            I’ve found when I’m initially developing something that it’s hard to find 3 use cases for everything, just very common things.

                                                                            This has led me to do two things:

                                                                            • Find places in older / unrelated code where my new abstraction might be useful, because I like it so much. This has been a great motivation for me to refactor.

                                                                            • Throw out abstractions with confidence. If I can’t find 3 use cases, then throw it away. The moment a third use case is added or discovered, it’ll be obvious it should be abstracted.

                                                                            1. -2

                                                                              Here’s my issue: if you’re asking me for an estimate, you’re communicating that what I’m doing isn’t that important. If it were important, you’d get out of my way. A deadline is a resource limit; you’re saying, “This isn’t important enough to merit X, but if you can do it with <X, I guess you can go ahead”. If you ask me for an estimate, I have to guess your X (and slip under it if I want to continue the project, or exceed it if I want to do something else). If that seems self-serving or even dishonest, well let’s be honest about the dishonesty, at least here: estimates are bullshit anyway, so why not use the nonsense game for personal edge? Of course, if you’re the one being asked for an estimate, the system and odds are against you in nearly all ways, and you probably made some career-planning mistakes if you’re my age and still have to give estimates, but never mind that for now….

                                                                              There are projects that are nice-to-have but not important and might be worth doing, but that aren’t worth very much and therefore should be given resource/time limits from on high. I just don’t want to work on those. If it’s not worth doing with an open deadline, then assign someone at a lower skill level, who can still learn something from low-grade work. This isn’t me being a prima donna; this is me being realistic and thinking about job security. If having it done cheaply is more important than having it done well, I can (and should) be replaced by someone else.

                                                                              1. 3

                                                                                Businesses regularly put resource limits on investments; I don’t see why software engineering salaries are exempt from this.

                                                                                1. 0

                                                                                  I don’t see why software engineering salaries are exempt from this.

                                                                                  It might have something to do with the fact that the top 5% of us, at least, are smart enough that we ought to be calling the shots, rather than being pawns on someone else’s board.

                                                                                  1. 2

                                                                                    Unless you are literally on the board of a privately held company, you are a pawn on someone else’s board. This isn’t hopeless, it’s just being honest about where actual final financial votes are cast.

                                                                                    How “smart” you are doesn’t mean you deserve to call any shots, as much as anyone who owns the company doesn’t deserve to either. Building relationships, managing expectations, cost analysis and collecting requirements are all part of making engineering estimates, and they are tools for you to exert influence over someone who has ownership/authority.

                                                                                2. 2

                                                                                  What if all work is estimated? These inferences depend on selective estimation.

                                                                                  1. 1

                                                                                    I’ll disagree with you here a bit–I agree with your last paragraph’s approach, but I think you are leaving out a little bit.

                                                                                    It’s worth it to send overqualified engineers into certain projects exactly because they are more likely to know how to fix problems preemptively and because they are more likely to have a narrower distribution on the time taken to achieve the task. If you want something with a known problem space done correctly, to a guaranteed standard, and in a timely fashion, you shouldn’t send people who are still learning.

                                                                                    “This isn’t important enough to merit X, but if you can do it with <X, I guess you can go ahead”.

                                                                                    Unfortunately, this is a lot of business, right? Like, scheduling and organizing coverage and resources for projects often means that, say, a full rewrite would take too many engineers off of customer-facing work, but incremental cleanups are possible.

                                                                                    From the employee side, it is arbitrary, but there is at least a chance of method to the madness.

                                                                                  1. 22

                                                                                    Working on polishing up my new backup tool. I set out to solve a set of problems:

                                                                                    • Client side encryption.
                                                                                    • Deduplication of similar backups to save space.
                                                                                    • Efficient server side pruning of backups.
                                                                                    • Write only mode so a compromised device does not compromise your backups.
                                                                                    • Work over ssh and use ssh keys for access control.
                                                                                    • Trivial user interface that integrates well with unix: accept directories, arbitrary streams (e.g. pipe in pg_dump) or tar streams.

                                                                                    It is approaching something I’m happy with people reviewing and using, though there is lots of testing and stabilization that I want to do.

                                                                                    1. 11

                                                                                      This sounds like the tool I’ve been looking for, plus some features I didn’t know I wanted :-D

                                                                                      1. 3

                                                                                        Sounds interesting! Did you, by chance, try borgbackup, and could you elaborate on the differences between borg and yours? I am not sure about your last point, but at least the others all seem to be supported by it as far as I understand.

                                                                                        1. 4

                                                                                          I was unsatisfied with borg for a few reasons, which I will probably elaborate on in a post somewhere. In general I am prioritizing ease of use, and I think I have a more user-friendly design. I will see if anyone agrees with me once I get it out there.

                                                                                        2. 3

                                                                                          Write only mode

                                                                                          Yes!! Thank you!

                                                                                          I have been so jealous of Borg users for so long, but can’t switch because only Duplicity has this feature.

                                                                                          1. 1

                                                                                            Isn’t borg serve --append-only what we are talking about here?

                                                                                            1. 4

                                                                                              No. Borg only supports symmetric encryption, and closed the public key encryption issue as wontfix: https://github.com/borgbackup/borg/issues/672

                                                                                              By implementing public key encryption, you allow data sources to operate in what @ac calls “write only mode”, because if a compromised device only has your public key, it cannot compromise your backups (there is also the issue of data destruction by overwriting, but even raw S3 can be used as an append only store if you enable object versioning).

                                                                                              My use case is installing backup software liberally on every device I use (and I use more devices than I have sole control over). For example, with Borg, you could not back up your home directory on a shared server without giving the administrator of that system the ability to decrypt your entire repository.

                                                                                              1. 3

                                                                                                My implementation is currently not exactly as you described, but perhaps I can accommodate this with not too much difficulty. edit: I am sitting in a cafe thinking carefully about how to do it without affecting usability for less advanced users right now.

                                                                                                1. 2

                                                                                                  Good points, thanks for the explanation!

                                                                                                  1. 2

                                                                                                    If you trust the server not to leak data, a next-best approach is to have a symmetric key per device and then use ssh access controls to prevent access.

                                                                                                    1. 1

                                                                                                      If you trust the server, just use TLS or SSH tunnels to encrypt in motion. If that’s really your threat model, there is no need for additional complexity.

                                                                                                      1. 2

                                                                                                        For example, with Borg, you could not back up your home directory on a shared server without giving the administrator of that system the ability to decrypt your entire repository.

                                                                                                        You have to back up to a different machine with a different administrator. It is true the first admin can decrypt your data, but he cannot fetch it, because the ssh key can be granted write-only access, even with borg via append-only mode. A random key that is encrypted with a public key and then discarded by the client is probably better, though; I am still thinking about how to do it well.

                                                                                              2. 2

                                                                                                Looking forward to testing this! I have been struggling a lot with current backup solutions!

                                                                                                1. 1

                                                                                                  Awesome! Can’t wait to hear more about this.

                                                                                                  1. 1

                                                                                                    I’d really like to know about how you tackle the intersection of client-side encryption and de-duplication.

                                                                                                    1. 2

                                                                                                      It’s relatively straightforward using a rolling hash function (https://en.wikipedia.org/wiki/Rolling_hash). The ‘shape’ or ‘fingerprint’ of the data guides you in finding split points, and each split chunk is encrypted independently. There is potential that the size of chunks may give some clues about their contents, but there are a few mitigations you can apply, such as random padding, keeping your hash function secret, and a few others.

                                                                                                      Another sticking point is allowing the server to do garbage collection of chunks that are no longer needed while at the same time not being able to read the user data. I came up with a solution built around layering trust, which I hope to get reviewed.
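
A minimal sketch of the splitting step described above, not the tool's actual code. It assumes a gear-style rolling hash whose lookup table is derived from a client secret (one way to read "keeping your hash function secret"). The names `gear_table` and `split_chunks`, the mask, and the size limits are all illustrative assumptions; encryption and padding of each chunk are left out.

```python
import hashlib


def gear_table(secret):
    """Derive 256 pseudo-random 64-bit values from a secret (assumed scheme)."""
    return [
        int.from_bytes(hashlib.sha256(secret + bytes([i])).digest()[:8], "big")
        for i in range(256)
    ]


def split_chunks(data, table, mask=(1 << 12) - 1, min_size=512, max_size=16384):
    """Split data at positions chosen by the content itself.

    A boundary is declared when the low bits of the rolling hash are zero,
    so identical runs of bytes tend to produce identical boundaries even
    when they appear at different offsets in different backups.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes are shifted out of the 64-bit state over time.
        h = ((h << 1) + table[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Each returned chunk would then be encrypted and stored on its own; the minimum and maximum sizes only bound how much the chunk length can vary.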

                                                                                                      1. 1

                                                                                                        I know about splitting a file into chunks, but how do you derive a repeatable IV/key for a given chunk without leaking the contents of it, or opening yourself up to some form of chosen-plaintext attack?

                                                                                                        1. 3

                                                                                                          I use a random IV, and random encryption key, but the content address (i.e. dedup key) generated is repeatable by the client as HMAC(DATA, CLIENT_SECRET). AFAIK the attacker cannot recover the secret or decryption key even if he has a chosen plaintext, and has no way to derive the data without the secret. An attacker also cannot forge chunks because the HMAC address will be obviously wrong to a client.

                                                                                                          There is also a write cache that prevents the same data from being uploaded twice with the same content address but different IV. Though that is more a performance thing than security, I could be wrong. I hope people can shoot down any flaws in my design which is why I need to get it finalized a bit.
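
A rough sketch of the addressing scheme described above, under stated assumptions: HMAC-SHA256 as the HMAC, an in-memory dict standing in for the server, and a caller-supplied `encrypt` function standing in for whatever cipher is used with the random key and IV. `content_address`, `verify_chunk` and `ChunkStore` are hypothetical names, not the tool's API.

```python
import hashlib
import hmac


def content_address(client_secret, data):
    """Repeatable dedup key: HMAC of the plaintext under the client secret."""
    return hmac.new(client_secret, data, hashlib.sha256).hexdigest()


def verify_chunk(client_secret, address, data):
    """After decrypting a chunk, check that it really belongs at this address."""
    expected = content_address(client_secret, data)
    return hmac.compare_digest(expected, address)


class ChunkStore:
    """In-memory stand-in for the server plus the client-side write cache."""

    def __init__(self, client_secret):
        self.client_secret = client_secret
        self.blobs = {}   # address -> opaque encrypted payload
        self.uploads = 0  # how many payloads were actually sent

    def put(self, data, encrypt):
        address = content_address(self.client_secret, data)
        # Write cache: a chunk that is already stored is not encrypted and
        # uploaded again, so it never reappears under a fresh random IV.
        if address not in self.blobs:
            self.blobs[address] = encrypt(data)  # assumed: random key and IV inside
            self.uploads += 1
        return address
```

Because the address depends only on the plaintext and the client secret, the ciphertext can use fresh randomness on every upload without breaking deduplication.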

                                                                                                    2. 1

                                                                                                      That sounds like a really useful tool! I’d seen a reference to convergent encryption today/yesterday, which is what “deduplication of similar backups to save space” sounded like. According to https://en.wikipedia.org/wiki/Convergent_encryption there are fundamental security implications to using it, btw; deduping sounds pretty orthogonal to the rest of what it does, and I’d be excited to see a Show and Tell post :)

                                                                                                      1. 3

                                                                                                        Yeah, I rejected that specific approach for the reasons described. My keys are random, but some of the ideas are similar from a distance.

                                                                                                      2. 1

                                                                                                        This sounds similar to something I’ve had a couple of stabs at (one such stab “currently”). What language are you writing in?

                                                                                                        My approach is built around two basic “processes”.

                                                                                                        A “collection” phase during which a series of scripts do source specific “dumps” (eg dump mysql, ldap, etc, identify user generated file system data, etc) into service backup directories

                                                                                                        A “store” process which compares a hash of each raw file (created by the collection phase) to an existing hash (if found). If no match is made, the hash is stored and the file is gpg encrypted using the specified public key. Once this process finishes, the hash files and gpg files are stored remotely via rsync, using the --link-dest option to create “time machine style” snapshots.

                                                                                                        The heavy lifting is obviously all “written already” in terms of shasum, gpg and rsync. The glue layer is just shell.

                                                                                                        I’d be keen to see how our approaches differ and if we can take any ideas from each other’s solutions.
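
The hash comparison in the “store” process above could look roughly like this. It is only a sketch: the original glue is shell around shasum, gpg and rsync, whereas this uses Python's hashlib, and the directory layout, the `.sha256` suffix and the function names are assumptions. The gpg and rsync steps are not run here.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file, read in blocks so large dumps fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def files_to_encrypt(dump_dir, hash_dir):
    """Return the dump files whose content changed since the last run.

    The stored hash is updated for every file returned, so the caller is
    expected to encrypt each of them (e.g. with gpg) before syncing.
    """
    hash_dir = Path(hash_dir)
    hash_dir.mkdir(parents=True, exist_ok=True)
    changed = []
    for path in sorted(Path(dump_dir).iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        hash_file = hash_dir / (path.name + ".sha256")
        if hash_file.exists() and hash_file.read_text().strip() == digest:
            continue  # unchanged: keep the previously encrypted copy
        hash_file.write_text(digest + "\n")
        changed.append(path)
    return changed
```

Unchanged files keep their existing encrypted copy, which is what lets rsync's --link-dest hard-link them into the next snapshot instead of storing them again.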

                                                                                                        1. 1

                                                                                                          Mine is written in Go currently. A big difference: it doesn’t sound like your approach deduplicates within files or across similar but not identical files. A tool such as mine could easily be hooked into your store phase to deal with cross-file deduplication.

                                                                                                          There are similar tools currently out there, such as ‘bup’, ‘borg’ and ‘restic’, that you should look into. I feel like mine is superior, but those all work and are ready today.

                                                                                                          1. 1

                                                                                                            No, it doesn’t attempt any kind of de-dupe except for not storing dupes of the same file if it hasn’t changed.

                                                                                                            That’s part of why I’m not using those other tools - I want pubkey encryption (as mentioned elsewhere here, it means eg two+ devices can share backup space without leaking data they don’t already possess to the other) and I’d prefer if, when all else fails, I/someone can restore data from a backup by just running regular shell commands.

                                                                                                            This part can of course be built into a companion tool, but being able to do ssh backup-space -- cat backups/20180210-2300/database/users.sql | gpg | mysql prod-restore-copy is a huge bonus to me. No need for the remote end to support anything beyond basic file access, no worrying about recombining files. No worrying about whether I have the same version of the backup tool installed, and/or if the format/layout has changed.

                                                                                                            So possibly we have not as many overlapping goals as I originally thought, but it’s always nice to hear about activity in the same space.

                                                                                                            1. 1

                                                                                                              Yeah, tbh I wouldn’t release a backup tool without fully documenting the formats and having them re-implementable in a simple way, e.g. as a python script. You need a complexity cap to protect you from yourself. I agree using public/private key pairs is a good idea.

                                                                                                              Your system seems decent, though you don’t really have access controls protecting the machine from deleting its own backups (perhaps a worm that spreads via ssh). Do you deal with backup rotation?

                                                                                                              1. 2

                                                                                                                So, the original version of this was built to store on an internally controlled file server, and the “store” process finished by touching a trigger file, which (via inotify) caused a daemon with root permissions to run on the storage host and remove write access to the last backup from the ssh user that rsync connected as.

                                                                                                                The same daemon also handled pruning of old backups.

                                                                                                                The new version is designed to work with offsite storage like rsync.net or similar, so for now it relies on remote-end functionality to protect previous versions (e.g. zfs snapshots).