1. 2

    I think that what you’re referring to is named “Single-instance storage”, where the data is unique at the file level. Data deduplication is done at the block level, using chunks of data. This means that even if multiple files differ, but have a common portion, that portion will only be stored once.

    I was a bit confused when reading your article and was waiting for the actual “deduplication” to happen as I was reading it. The write-up is pretty cool and the tool as well though ! Nice article, besides the confusing title for me ^^

    1. 1

      I see how “deduplication” is confusing and isn’t quite the right word to use. I guess I was thinking about it in terms of the layman’s intuitive definition of deduplication rather than the technical term. Is “single-instance storage” correct? It seems that it means something similar to (the technical term) deduplication, retaining multiple ways to access the file while having a single instance on disk, so e.g. replacing all duplicates with hard links would qualify. I’m not sure what technical term describes what I was going for…

      1. 1

        To be fair, I don’t know ! I pointed out the meaning of deduplication because I worked on such tool a few month back, and was expecting a similar topic here.

        If I had to name what you did I’d say “remove duplicates” ? But you’d loose the pun with the periscope then 😉

    1. 5

      Hi Lobsters! Recently, I decided to take care of a task I had been procrastinating for a while: to organize and de-dupe data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.

      There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don’t help with the work that comes after that. This was problematic (we had ~500k dupes), so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils, so e.g. a psc ls highlights duplicates and a psc rm deletes files only if they have duplicates elsewhere.

      I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I’m curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.

      1. 1

        Very interesting and looks great. I have been wanting a system like this but for 3D layout of woodworking projects because I also dislike clicking around repetitively in visual software, and I don’t really enjoy SolidWorks or SketchUp; I want to just define my furniture in language. Nice to know that Z3 works well, when I started to think about this I got bogged down with trying to understand linear solvers but SMT is easier for me to grok…

        1. 2

          Thanks :)

          Totally agree with you on certain solvers being hard to work with – it always causes a bit of friction to have to express an optimization problem in the form that the solver expects. If you haven’t already, I’d highly recommend trying out Z3’s Python API. I’ve been super happy with it. Z3 doesn’t require you to specify which theories you are working with, or anything like that, and you don’t have to write down your equations in a certain way. You just write what look like regular equations in a fairly natural style, using their DSL, then run solver.solve(), and it just works!

        1. 7

          Hi Lobsters! I’ve been frustrated with standard GUI-based design tools, because they don’t match the way I think: like a programmer, in terms of relationships and abstractions. (Does this resonate with the community here?)

          So I’ve been hacking on this new DSL for design that allows the designer to specify figures in terms of relationships, which are compiled down to constraints and solved using an SMT solver. This also leads to good support for abstraction, because constraints compose nicely.

          The system is still a WIP, but it’s usable enough that I made all the figures and diagrams in my latest paper and presentation with it, so I thought I’d write up a blog post about it and share some results.

          This post is about how I think about design and how the programming language implements this philosophy. It also contains a number of case studies of designing real figures with the tool.

          There are also some (probably useless?) but really amusing results, like what happens if you solve figures via gradient descent (the result is oddly satisfying).

          I’d love to hear what you all think about the ideas in the post. Do you share any of my frustrations with existing design tools? How do you think when doing design – does your model match mine, or is it something different? Do you currently use any design tools that you’re really happy with, that I should know about?

          1. 2

            I was reading the post and thinking how awesome it would be to have this in racket, and then I saw at the end that you were doing exactly that :) looking forward to seeing it!

            1. 2

              Hmm. I might actually use Basalt for something, someday. Thanks for sharing it! I make a fair amount of diagrams in my work, but I’ve found it’s usually easier to just draw them freehand. There are definitely exceptions, though.

              Just skimming through your post, I’m a little surprised to see no mention of TeX, from which TikZ/PGF has sprung. There are quite a few TeX packages and macros which use the constraint-solving machinery of the system to good effect, e.g. asymptote and other more domain-specific packages. I can understand why TeX may not fit your use cases, but it might be worth looking through CTAN for ideas anyway. I think having interactive sliders and instant feedback is very helpful, since (in my experience) modeling the visual optimization problem in sufficient detail is often more work than it’s really worth. Even if you’re going for a fully automated solution eventually, having a ‘visual REPL’ is very helpful for development.

              As for iterative ‘force-directed’ (effectively gradient descent) graph layout, it seems to be a very common feature of web-based graph rendering libraries nowadays. GraphViz of course does constraint solving of some sort, but I’ve never looked into the details.

              1. 1

                I’ve used TikZ before, but not the other TeX-based drawing packages. Thanks for the recommendation, I’ll look into those!

              2. 1

                Have you looked at Apple’s auto layout system or the underlying Cassowary incremental solver? I’m not sure it is powerful enough for your application but it is fairly fast: http://www.badros.com/greg/cassowary/js/quaddemo.html

              1. 3

                I’d been using https://www.anishathalye.com/2015/02/07/an-asynchronous-shell-prompt/ on a large monorepo (~60,000 files) but last night I found that gitstatus (even in synchronous mode) is still faster.

                1. 3

                  Glad you found that post useful :)

                  For anyone else considering following the advice in my post – I think async prompts are super nice (and I’m still using basically that same code today), but I think the implementation in the submission above is cleaner than mine. It relies on zsh-async, which I think is less hacky than the file and signal-based approach I am using.

                1. 2

                  For anyone who wants to see a demo of a difference this makes in practice, here’s a GIF comparing a synchronous prompt with a async one when cding into the Linux kernel source (with an empty buffer cache).

                  1. 1

                    OpenAI’s blog post has a nice high-level summary of the research areas discussed in the paper.

                    1. 5

                      I’m Anish, and my blog is here. I mostly write about open-source projects, research (in systems, security, or deep learning), and hacks (in this sense of the word).

                      1. 2

                        @anishathalye did you run into Beamer by any chance? It was the hot thing a decade ago I think. There is an extension for it for posters. I see that you have :)

                        1. 3

                          Yeah, beamer / beamerposter is awesome! I just didn’t like the way the default themes / existing third-party themes looked, so I made my own.

                        1. 17

                          Hi Lobsters, author here.

                          I wanted to give a little bit of background on the motivation behind this post. For a while, I’ve been making academic posters using PowerPoint, Keynote, or Adobe Illustrator, and while it’s possible to get a high-quality result from these tools, I’ve always been frustrated by the amount of manual effort required to do so: having to calculate positions of elements by hand, manually laying out content, manually propagating style changes over the iterative process of poster design…

                          For writing papers (and even homework assignments), I had switched to LaTeX a long time ago, but for posters, I will still using these frustrating GUI-based tools. The main reason was the lack of a modern-looking poster theme: there were existing LaTeX poster templates and themes out there, but most of them felt 20 years old.

                          A couple weeks ago, I had to design a number of posters for a conference, and I finally decided to take the leap and force myself to use LaTeX to build a poster. During the process, I ended up designing a poster theme that I liked, and I’ve open-sourced the resulting theme, hoping that it’ll help make LaTeX and beamerposter slightly more accessible to people who want a modern and stylish looking poster without spending a lot of time on reading the beamerposter manual and working on design and aesthetics.

                          1. 4

                            Yes, I use LaTeX or ConTeXt for most of my writings, apart from notes in plain text.

                            No, I just don’t think TeX is a great way for posters. Probably because I am a control freak in making posters, I really want my prominent content/figures exactly where they are supposed to be and how large I want them to be on a poster. Sometimes I ferociously shorten my text to just be able to get the next section a little higher, so the section title does not fall off the main poster viewing area. So, yes, I still use pages.

                            I guess the difference is whether I am more focused on explaining things, which I use LaTeX, or I am more focused on laying out text blocks and figures, which GUI-based tools excel.

                            1. 2

                              I often want something in between. Like I want to click and draw arrows and figures but have that turned into LaTeX code so I can still style around that.

                          1. 4

                            Hi Lobsters! I’m one of the authors of this resource.

                            Adversarial machine learning is a relatively new but rapidly developing field. It’s easy to see why people are excited about this research area: ML systems are being increasingly deployed in the real world, and yet, they’re very easy to fool with maliciously perturbed inputs. There have been dozens of proposed attacks and hundreds of proposed defenses against malicious inputs to machine learning systems. To help researchers keep up with developments in this field, we created this community-run reference for state-of-the-art adversarial example defenses.

                            Unlike most subfields of ML, security is a negative goal: the goal is to produce a machine learning system that can’t be fooled. Showing that a system can’t be fooled is really hard.

                            Measuring progress in traditional machine learning can be done through a monotonically increasing objective: if a paper increases accuracy on a given benchmark from 94% to 95%, progress has been made. Future papers may improve on the benchmark, but accuracy will not decrease. In contrast, measuring progress in adversarial machine learning is exceptionally difficult. By definition, the metric used to measure accuracy on a given defense is success on the best attack (that respects the threat model), which may not exist at the time of publication. This is why future third-party analyses of defense techniques are important.

                            robust-ml.org lists current defenses along with analyses of the defenses, making it easy to get a complete picture of which techniques have been shown to be broken and which techniques currently seems to be working.

                            1. 1

                              Cool work! Thanks for posting it.

                              “Adversarial machine learning is a relatively new”

                              I haven’t gotten into this topic yet. The descriptions of what it’s about are pretty exciting given outsiders have worried about the security of ML approaches. Far as new, I wonder if you all would count work like Danny Hillis’ use of adversarial co-evolution? In his work on sorting algorithms, he kept changing the tests to be harder to break the algorithms. They were like parasites in the metaphors. The results over his prior method without co-evolution were pretty impressive.

                              Hillis’ stuff was always one of my favorite stories in that space. I guess I’m just curious if that kind of thing was any inspiration to your field, if you all classify it as a technique in your field, and/or if the field still uses methods like that?I’m also curious if there’s been any general-purpose methods so far in the new research that you think can get interesting results on cheap hardware. What should I tell people at smaller, local colleges to look into that they could do on their desktops or otherwise on a budget?

                              1. 2

                                Hillis’s work on adversarial co-evolution seems more similar to Generative Adversarial Networks than adversarial examples / robustness / machine learning security. Some subset of ML researchers group together GANs and adversarial examples under the label “Adversarial ML”, but many other researchers think of them as distinct research areas.

                                I’m not sure if Hillis’s work / similar efforts were an inspiration for GAN-based methods. I don’t think it was an inspiration for research related to ML security.

                                What’s neat about this research area, especially on the attack side, is that you don’t need that much compute. For example, all the work I’ve done on attacks can be done with a single high-end GPU, and reproducing some of the results on a slightly smaller scale can even be done on a laptop CPU (e.g. see this blog post).

                                1. 1

                                  That’s neat. Good that one can get results on a budget. I’ll keep the link saved for any students that might be interested.

                            1. 13

                              Hi Lobsters! This blog post is about Project Sistine, a hack that I worked on with @antimatter15, Guillermo Webster, and @loganengstrom. We turned a MacBook into a touchscreen using only $1 of hardware (a small mirror, some pieces of a rigid paper plate, and a door hinge).

                              We built this prototype some time ago, but we never wrote up the details of how we did it, and so we thought we should share. We’re happy to answer any questions you might have about the project!

                              1. 3

                                Great work. I love the idea. Does it work on the whole screen? Looking at the pictures in the article, it seems that the areas near the left and right edges are uncovered.

                                1. 3

                                  Thanks, glad you liked our hack :)

                                  Yeah, the current prototype doesn’t capture the whole screen (it probably captures ~1/2 to 1/3 of the screen area), due to the positioning of our flat mirror. We tried moving it farther away so it would capture more screen area, but with the low quality webcam we were using (480p), the resolution wasn’t good enough. A higher resolution webcam might be enough to solve this problem. Another solution might be using a curved mirror.

                                  1. 2

                                    Did you try with a convex mirror to capture a wider view? Will probably drive the dollar cost up - but would be interesting to see if you could find success with it.

                                    Great work, by the way!

                                    1. 4

                                      Thanks, glad you liked our work!

                                      Nope, we haven’t tried a convex mirror yet. I think convex mirror + 720p camera (standard on today’s MBPs) could make this system work a lot better. It would probably complicate the math a little bit, but it should be reasonably straightforward to handle as long as the optics of the mirror aren’t too weird.

                              1. 5

                                This is neat! Adversarial examples at the moment seem to be more of an arms race rather than anything deeply understood, so I’m not surprised that the new ways of making neural nets resistant to existing attacks would end up broken. I don’t think I expected it to be literally within days of the ICLR 2018 accepted paper list coming out though.

                                Do you have any sense of whether constructing more robust defenses is plausible? The only paper I’ve run across that I feel gave me any deeper theoretical insight into adversarial examples is another ICLR 2018 paper, “Adversarial Spheres” by Gilmer et al. Is there anything else out there?

                                1. 6

                                  As far as the “arms race” is concerned: I think it’s also a problem of many papers simply not considering adaptive attacks. In our paper, we noted that some papers (e.g. Xie et al. 2018) were already trivially circumvented using a technique that had been described half a year ago (the existence of robust adversarial examples trivially implies that most randomized input transformation-based defenses are probably broken). I would guess that there would be fewer defenses being published if they were thoroughly evaluated against adaptive attacks.

                                  I think the Towards Deep Learning Models Resistant to Adversarial Attacks paper is pretty good at developing some theory: it presents a view of adversarial examples through the lens of robust optimization. I think adversarial training is the only defense technique that’s shown a demonstrable robustness against white-box attacks in a reasonable threat model (Madry’s paper considers white-box access and a bound on the l-infinity perturbation that the attacker is allowed to make).

                                1. 10

                                  Hi Lobsters! I’m one of the researchers who worked on this. Both Nicholas Carlini (a co-author on this paper) and I have done a bunch of work on machine learning security (specifically, adversarial examples), and we’re happy to answer any questions here! Adversarial examples can be thought of as “fooling examples” for machine learning models. For example, for image classifiers, for a given image x classified correctly, an adversarial example is an image x* such that x* is visually similar to an image x, but x* is classified incorrectly.

                                  We evaluated the security of 8 defenses accepted at ICLR 2018 (one of the top machine learning conferences) and we find that 7 are broken. Our attacks succeeded when others failed because we show how to work around defenses that cause gradient-descent-based attack algorithms to fail.

                                  1. 3

                                    Is it possible to train the networks further on adversarial examples you yourself generate to defeat it? Or can you just keep applying the algorithm repeatedly.

                                    1. 9

                                      What you’re describing is called “adversarial training” – training a network on adversarial examples generated for that network, and repeatedly doing this process – and it’s a pretty good idea! It has shown to increase resistance to black box attacks (see https://arxiv.org/abs/1706.06083), but with current approaches, it doesn’t seem to help in the white-box case.

                                    1. 8

                                      Just as a heads-up: the ML tag refers to the ML programming language and its relatives like OCaml. In fact, I don’t think there is a tag in the tag list for data analysis/statistics/machine learning – these story usually get tagged under either AI or maths, as far as I can tell.

                                      [Edited to linkify the tags]

                                      1. 4

                                        Oops, thanks for pointing that out. Fixed!

                                      1. 1

                                        Please tell me there is some kind of transport encryption involved here

                                        1. 3

                                          Out of curiosity, when you evaluated Knossos performance, were you testing using histories with crashed clients, or happy-path histories where client operations succeed relatively quickly? Knossos makes some choices which optimize for the former, and I think the P-compositionality paper focuses on the latter, but it’s been a few years since I’ve been down in those papers guts. May need to revisit those assumptions if they were slower for your workload.

                                          1. 3

                                            Hi aphyr!

                                            To make the comparison more fair to Knossos, I tested histories where you can’t take advantage of P-compositionality. (in short, P-compositionality is when a history is linearizabile iff all subhistories in a partitioned history are linearizable - e.g. with a map, you can partition by keys and check the subhistories independently, and that’s a lot faster)

                                            I used test data from Jepsen etcd tests: https://github.com/anishathalye/porcupine/blob/master/test_data/jepsen

                                            Here’s the quick-and-dirty benchmarking code I used: https://gist.github.com/anishathalye/a315a31d57cad6013f57d2eb262443f5 (basically, just timing knossos.core/linearizable-prefix).

                                            Even where Knossos is slow, e.g. etcd_002.log and etcd_099.log (timed out after > 2 days). Porcupine seems to do fine, taking hundreds of milliseconds on a single core to check the histories.

                                            Out of the ~100 tests, filtering for ones that Knossos finished in < 1 hour, we have a speedup of 1000x on Knossos’s fastest test (etcd_016.log) and a speedup of 20,000x on Knossos’s slowest test (etcd_040.log). And for the ones that timed out (because I didn’t want to run the tests for way too long), e.g. (etcd_099.log), Porcupine had a speedup of > 10^6.

                                            I haven’t had time to look into Knossos’s implementation in detail and figure out exactly where Porcupine’s speedups are coming from, but that would be cool to do at some point. Jepsen/Knossos are obviously a way more complete solution for testing distributed systems, and it would be cool to speed up the linearizability checking aspect.

                                            1. 2

                                              Ohhhhhh! Yeah, you’re using the original algorithm–that’s definitely slower. Try (knossos.linear/analysis model history) instead–that’s based on the JIT-linearization algorithm from Lowe’s paper, plus some additional optimizations–instead of performing union-find for compaction, we pre-compile the state space into a graph of pointer arrays, which turns the search into immutable pointer-chasing instead of running model code. There are… certain cases where the knossos.core algorithm is preferable (it offers better parallelism) but linear should be significantly faster. Still not good though; I’d like to sit down and figure out some alternative strategies.

                                              And yeah, as you note, we don’t do P-compositionality in Knossos–that’s handled by Jepsen, which performs the decomposition into independent keys for maps, sets, etc, then hands individual histories to Knossos. Would be nice to fold into Knossos later though!

                                              Last thing, if you wind up packaging this for distribution, I’d like to offer a hook in Jepsen so we can pass histories to it. If you’d like to define some sort of serialization format (JSON, tsv, EDN, protobufs, etc) for passing histories in and getting analysis results back, I can wrap that up in a Jepsen checker as an alternative strategy. :)

                                              1. 2

                                                Oops, I didn’t know that. I redid the benchmarking with (knossos.linear/analysis model history), running Knossos on 6 cores and Porcupine on 1 core.

                                                The benchmark results did change: Knossos completed every check significantly faster. On some tests, the new algorithm performed significantly better: e.g. on etcd_040.log, Porcupine has a speedup of 278x, as opposed to a speedup of 20,000x when comparing against the original algorithm (knossos.core/linearizable-prefix).

                                                Porcupine still ran faster on all tests, though; the following is a summary of the results (over the ~100 Jepsen etcd tests):

                                                Min speedup:    8.1x     on etcd_002.log
                                                Max speedup:    21,219x  on etcd_067.log
                                                Median speedup: 1000x
                                                

                                                Ooh, that sounds cool! I’ll probably end up packaging this for distribution in a couple weeks, and I’ll definitely reach out to you once I have an API for communicating with Porcupine.

                                          1. 5

                                            It’s a nice piece. The only deficiency is toward the end on formal methods. The SPIN model checker has probably found more obscure protocol bugs, from system to hardware level, than about anything else. TLA+ has similarities to it with it finding all kinds of protocol bugs in projects that used it. They and/or papers/articles covering industrial use should probably be more prominent in references than the Coq work. The Coq stuff is relatively recent, hard to use, and barely used. Readers might be mislead as to what they should try. Answer is straight-forward: TLA+ due to easy tutorials + cost/benefits + tooling ecosystem; SPIN next due to proven use in industry w/ next-bast cost/benefit & tooling available; Coq-based stuff only if they know Coq, the protocol is relatively simple, and they have plenty of time.

                                            EDIT to add references:

                                            http://spinroot.com/spin/whatispin.html

                                            http://learntla.com/

                                            1. 3

                                              Thanks! Yeah, I agree, the post doesn’t really explain the model checking stuff. I was focusing more on checking implementations for correctness (there’s the formality gap when model checking, because people don’t actually test the implementation – e.g. Paxos has been verified in a model checker, but there are lots of buggy implementations out there).

                                              Thanks for pointing out the SPIN model checker, I hadn’t read much about it before. And yeah, I totally agree, people should try model checking first. Verifying realistic implementations using proof assistants is impractical right now (the Verdi guys spent several person-years on Verdi and verifying Raft, and that was a really impressive engineering effort).

                                              Another thing people might find interesting is how Amazon has made use of lightweight model checking in verifying their systems: http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf

                                              1. 2

                                                Oh, yeah, Amazon is one of the main motivators I was thinking about. Their report generated a surge of new interest in TLA+. I also like the example they gave where the model-checker easily caught an error that would’ve taken 30+ steps in testing to reproduce. That really drives the value home.

                                                Far as your concern about model-checking vs other methods, I submitted a link here before that confirms that which you might find interesting. It also is sort of a guide on what to look for with non-formal methods since those areas are known spots for trouble.

                                                https://locore.cs.washington.edu/papers/fonseca-dsbugs.pdf

                                                1. 2

                                                  Oh neat, looks like an interesting read. Thanks!