Cleanlab (github.com/cleanlab/cleanlab) is a family of algorithms for automatically finding issues in datasets. It might seem surprising that it’s possible to automatically identify label errors and out-of-distribution data; Cleanlab does this using the algorithms published in arxiv.org/abs/1911.00068.
Cleanlab’s algorithms, while clever, are actually relatively simple. To help ourselves (and others!) build intuition for how they work, we built Vizzy, an interactive demo that runs in the browser. Vizzy lets you experiment with an example dataset, tweak the labels, and run Cleanlab to automatically find issues like label errors and out-of-distribution data.
I’m happy to answer any questions related to Vizzy, cleanlab, or confident learning and data-centric AI in general!
Hi all, I’m one of the authors of the blog post / open-source Python package. cleanlab started out as a grad student research project. As we saw data scientists finding the tool useful for real-world applications, and as we did more research that applied the tool to find issues in academic datasets at scale (Lobsters submission), we realized that this was an important real-world problem and decided to spend more time and energy building a framework for solving data-quality challenges.
We’d love to hear any ideas or feedback from the Lobsters community, especially from those doing data science or machine learning. We (me, @curtisnorthcutt, and @jonasm), who all have a background in ML research, would also be happy to answer any questions you have related to cleanlab or data-centric AI.
I might be able to Google a bit to better ask this, so forgive me if I’ve been remiss. I have noticed what seems like a rise in “distributed ML” frameworks and libraries. Obviously there’s whatever comes with Apache Spark but I know there are some other projects out there (Dask) that people seem to talk about.
Distributed ML is an active area of work, in both academia and industry, and has been for some time now. Companies like Google were doing distributed machine learning decades ago. For some use cases, libraries like scikit-learn are totally adequate, while for others, e.g. training sophisticated models that require a lot of compute, or training over large datasets that don’t fit on a single node, distributed computing is essential.
On the topic of data storage: in some cases, system builders do co-design the data storage and data processing, e.g. data processing using Hadoop over data stored in HDFS. Such co-design can give performance gains.
Relatedly, there is a growing movement that says that there may be better solutions than throwing more data and compute at the problem, and that “data-centric” approaches to ML (like cleanlab) might actually reduce the need for some of these complex distributed systems solutions to scaling ML in many situations: better data over big data.
How do you plan to generate the bitmap index and commit-graph (as well as run repack/gc) on the remote side?
git-remote-dropbox’s current on-disk representation is really simple: it stores all objects as loose objects, with no packfiles. Supporting packfiles and repack is tricky because, unlike a “real” Git remote, we can’t run any code server-side: all computation has to happen client-side, and all coordination has to go through Dropbox’s API. Even supporting gc with the current format is nontrivial, because we want proper support for concurrency (just like a real Git remote), and handling e.g. a gc that runs concurrently with a push is hard to get right. We could use repository-level locking, but because all the work happens client-side, we run into a dilemma: with a standard lock, a client that dies while holding the lock leaves the repository “stuck”, and with a lease-based lock, correctness isn’t guaranteed in the presence of clock skew or network delays (though it’s probably safe enough in practice).
I see. Without repack, gc, and packfiles, the on-disk size and the data transfer costs are huge, since delta compression is not used.
So this is git-annex but on a proprietary service?
No. git-annex is for storing extra files in an external service (IIRC Dropbox is a supported backend as well). This project is more like GitHub on Dropbox.
Like @hauleth said, this is more like “GitHub on Dropbox” — it uses a Dropbox folder as a backing store for an entire git repo, not just the large files.
It looks like there are some tools for using Dropbox as a backend for git-annex: https://git-annex.branchable.com/tips/dropboxannex/
I just released v2.0 of this git remote helper that allows using Dropbox as a git remote with all the safety guarantees that Git provides (e.g. safe under concurrent updates by multiple users, such as multiple developers pushing different changes at the same time). See this post for some additional context and the design doc for details on how it works (the key is that it leverages an atomic compare-and-swap that the Dropbox API provides).
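The primitive this relies on can be sketched generically. Below is an illustrative in-memory stand-in for the remote store — the `Store` and `update_ref` names are made up for this sketch, not git-remote-dropbox’s actual code:

```python
# Illustrative sketch of atomic compare-and-swap over a remote store
# (in-memory stand-in; not git-remote-dropbox's actual code).
class CasConflict(Exception):
    """Raised when another writer got there first."""

class Store:
    def __init__(self):
        self.data, self.rev = None, 0

    def read(self):
        return self.data, self.rev

    def compare_and_swap(self, data, expected_rev):
        # Like a rev-based update: fail unless the caller has seen the
        # latest revision of the data.
        if self.rev != expected_rev:
            raise CasConflict()
        self.data, self.rev = data, self.rev + 1
        return self.rev

def update_ref(store, compute):
    # Retry loop: on conflict, re-read and re-compute (analogous to
    # fetching the other client's changes before pushing again).
    while True:
        data, rev = store.read()
        try:
            return store.compare_and_swap(compute(data), rev)
        except CasConflict:
            continue
```

Because every update names the revision it was computed from, concurrent pushers can never silently clobber each other: at most one wins, and the loser retries from fresh data.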
If you are using the rev parameter to implement “compare and swap” with the Dropbox API, you may want to consider using the strict_conflict parameter.
Thank you, I added this based on your recommendation!
The documentation isn’t precise about what exactly strict_conflict=True does, and the two examples it gives (a write/delete conflict, or a write/write conflict where the contents are the same) are not an issue for git-remote-dropbox, but still, it seems better to enable “strict conflict checking”, whatever that means.
In your case it may matter under this series of steps:
You’d want client A to fail until it fetches the other changes made by client B, but without strict_conflict it could potentially pass.
This interests me. I keep a lot of code in bare repositories in Syncthing. How integrated is this with the Dropbox API? Could it be abstracted for Syncthing too? It seems to rely completely on the Dropbox API.
(Hi Vaelatern, I recognize you from GitHub :) )
git-remote-dropbox is specialized to the Dropbox API. It would be neat to write a “meta git-remote-helper” or a library that makes it easy to write git remote helpers on top of a filesystem-like API that supports some kind of concurrency control mechanism (like Dropbox’s atomic CAS), so it could be easy to write adapters for Google Drive, Microsoft OneDrive, etc. I’m not familiar with Syncthing, so I’m not sure what its API looks like / if it has a sufficient set of operations to implement something like git-remote-dropbox on top.
Good to see you!
I think the challenge with Syncthing is that it is not a centralized service: you don’t get atomicity over all syncs, just your local copy. It might not be possible. But then again, it might not be needed, so long as I reconnect my laptop to the web after a long trip before committing from my desktop.
A bit of context: Microsoft developed PhotoDNA to identify illegal images like CSAM – NCMEC maintains a database of PhotoDNA signatures, and many companies use this service to identify and remove these images.
A PhotoDNA hash is not reversible, and therefore cannot be used to recreate an image.
This project shows that this isn’t quite true: machine learning can do a pretty good job of reproducing a thumbnail-quality image from a PhotoDNA signature.
There have been some previous posts about PhotoDNA, reverse engineering the algorithm and claiming that it is reversible, but as far as I know there was no public demonstration of this.
That’s interesting. Does that imply Apple’s NeuralHash is also reversible to some extent?
I haven’t tried it, so I can’t say for sure. Perceptual hashes basically have to leak some information about their input (because they are set up so that d(x, x') small => d(h(x), h(x')) small). But NeuralHash uses a CNN and outputs a 96-bit hash, while PhotoDNA computes image statistics and outputs a 1152-bit hash, so I expect NeuralHash hashes would reveal less information and be harder to invert.
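A toy illustration of that property, using a crude average hash over a tiny grayscale “image” (nothing like PhotoDNA’s or NeuralHash’s actual algorithms):

```python
# Toy "perceptual hash": threshold each pixel against the mean brightness.
# Small input perturbations flip few bits, which is exactly why the hash
# leaks coarse information about the image.
def average_hash(pixels):
    avg = sum(pixels) / len(pixels)
    return [1 if p >= avg else 0 for p in pixels]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

img = [10, 200, 30, 180, 90, 150, 60, 170]
perturbed = [p + 3 for p in img]  # small perturbation of every pixel

# d(x, x') small => d(h(x), h(x')) small:
assert hamming(average_hash(img), average_hash(perturbed)) <= 1
```

A hash with this locality property necessarily preserves some structure of its input, which is what an inversion attack exploits.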
MIT’s 6.824 Distributed Systems class has a nice collection of distributed systems papers. Paper selection varies a bit from year to year, so you can check out older years for even more papers.
Stanford’s CS208 Canon of Computer Science has a nice list of seminal papers in computer science. These are papers mostly on the older side, all pre-2000.
In general, I’ve found course websites to be a great place to find lists of papers: they’ve been curated with care, so they tend to be collections of great papers. Reading papers out of conference proceedings (if you’re interested in distributed systems, check out e.g. SOSP and OSDI) is a good way of getting a sense of current research directions.
Hi Lobsters! One of the authors here.
We found pervasive errors in the test sets of 10 of the most commonly used benchmark ML datasets, so we made labelerrors.com where anyone can examine the data labels. We think it’s neat to browse through the errors to get an intuitive sense of what kinds of things go wrong (e.g. completely mixed-up labels, like a frog being labeled “cat”, or situations where an image contains multiple things, like a bucket full of baseballs being labeled “bucket”), so that’s why we built this errors gallery. To our surprise, there are lots of errors, even in gold standard datasets like ImageNet and MNIST.
For those who want to dig into the details, we have a blog post here: https://l7.curtisnorthcutt.com/label-errors, where we talk more about the study and the implications.
Also of interest might be github.com/cgnorthcutt/cleanlab, the open-source software we used for initially identifying potentially mislabeled data.
Happy to answer any questions here!
I think that what you’re referring to is named “single-instance storage”, where the data is deduplicated at the file level. Data deduplication proper is done at the block level, using chunks of data. This means that even if multiple files differ but have a common portion, that portion will only be stored once.
I was a bit confused when reading your article and kept waiting for the actual “deduplication” to happen. The write-up is pretty cool, and the tool as well! Nice article, despite the title being confusing for me ^^
I see how “deduplication” is confusing and isn’t quite the right word to use. I guess I was thinking about it in terms of the layman’s intuitive definition of deduplication rather than the technical term. Is “single-instance storage” correct? It seems that it means something similar to (the technical term) deduplication, retaining multiple ways to access the file while having a single instance on disk, so e.g. replacing all duplicates with hard links would qualify. I’m not sure what technical term describes what I was going for…
To be fair, I don’t know! I pointed out the meaning of deduplication because I worked on such a tool a few months back, and was expecting a similar topic here.
If I had to name what you did, I’d say “remove duplicates”? But you’d lose the pun with the periscope then 😉
Hi Lobsters! Recently, I decided to take care of a task I had been procrastinating for a while: to organize and de-dupe data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.
There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don’t help with the work that comes after that. This was problematic (we had ~500k dupes), so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils, so e.g. a psc ls highlights duplicates and a psc rm deletes files only if they have duplicates elsewhere.
I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I’m curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.
Very interesting and looks great. I have been wanting a system like this but for 3D layout of woodworking projects because I also dislike clicking around repetitively in visual software, and I don’t really enjoy SolidWorks or SketchUp; I want to just define my furniture in language. Nice to know that Z3 works well, when I started to think about this I got bogged down with trying to understand linear solvers but SMT is easier for me to grok…
Totally agree with you on certain solvers being hard to work with – it always causes a bit of friction to have to express an optimization problem in the form that the solver expects. If you haven’t already, I’d highly recommend trying out Z3’s Python API. I’ve been super happy with it. Z3 doesn’t require you to specify which theories you are working with, or anything like that, and you don’t have to write down your equations in a particular form. You just write what look like regular equations in a fairly natural style, using their DSL, then run solver.check(), and it just works!
Hi Lobsters! I’ve been frustrated with standard GUI-based design tools, because they don’t match the way I think: like a programmer, in terms of relationships and abstractions. (Does this resonate with the community here?)
So I’ve been hacking on this new DSL for design that allows the designer to specify figures in terms of relationships, which are compiled down to constraints and solved using an SMT solver. This also leads to good support for abstraction, because constraints compose nicely.
The system is still a WIP, but it’s usable enough that I made all the figures and diagrams in my latest paper and presentation with it, so I thought I’d write up a blog post about it and share some results.
This post is about how I think about design and how the programming language implements this philosophy. It also contains a number of case studies of designing real figures with the tool.
There are also some (probably useless?) but really amusing results, like what happens if you solve figures via gradient descent (the result is oddly satisfying).
I’d love to hear what you all think about the ideas in the post. Do you share any of my frustrations with existing design tools? How do you think when doing design – does your model match mine, or is it something different? Do you currently use any design tools that you’re really happy with, that I should know about?
I was reading the post and thinking how awesome it would be to have this in Racket, and then I saw at the end that you were doing exactly that :) Looking forward to seeing it!
Hmm. I might actually use Basalt for something, someday. Thanks for sharing it! I make a fair amount of diagrams in my work, but I’ve found it’s usually easier to just draw them freehand. There are definitely exceptions, though.
Just skimming through your post, I’m a little surprised to see no mention of TeX, from which TikZ/PGF has sprung. There are quite a few TeX packages and macros which use the constraint-solving machinery of the system to good effect, e.g. asymptote and other more domain-specific packages. I can understand why TeX may not fit your use cases, but it might be worth looking through CTAN for ideas anyway. I think having interactive sliders and instant feedback is very helpful, since (in my experience) modeling the visual optimization problem in sufficient detail is often more work than it’s really worth. Even if you’re going for a fully automated solution eventually, having a ‘visual REPL’ is very helpful for development.
As for iterative ‘force-directed’ (effectively gradient descent) graph layout, it seems to be a very common feature of web-based graph rendering libraries nowadays. GraphViz of course does constraint solving of some sort, but I’ve never looked into the details.
I’ve used TikZ before, but not the other TeX-based drawing packages. Thanks for the recommendation, I’ll look into those!
Have you looked at Apple’s auto layout system or the underlying Cassowary incremental solver? I’m not sure it is powerful enough for your application but it is fairly fast: http://www.badros.com/greg/cassowary/js/quaddemo.html
I’d been using https://www.anishathalye.com/2015/02/07/an-asynchronous-shell-prompt/ on a large monorepo (~60,000 files) but last night I found that gitstatus (even in synchronous mode) is still faster.
Glad you found that post useful :)
For anyone else considering following the advice in my post: I think async prompts are super nice (and I’m still using basically that same code today), but I think the implementation in the submission above is cleaner than mine. It relies on zsh-async, which I think is less hacky than the file- and signal-based approach I’m using.
For anyone who wants to see a demo of the difference this makes in practice, here’s a GIF comparing a synchronous prompt with an async one when cding into the Linux kernel source (with an empty buffer cache).
OpenAI’s blog post has a nice high-level summary of the research areas discussed in the paper.
I’m Anish, and my blog is here. I mostly write about open-source projects, research (in systems, security, or deep learning), and hacks (in this sense of the word).
@anishathalye did you run into Beamer by any chance? It was the hot thing a decade ago I think. There is an extension for it for posters. I see that you have :)
Yeah, beamer / beamerposter is awesome! I just didn’t like the way the default themes / existing third-party themes looked, so I made my own.
Hi Lobsters, author here.
I wanted to give a little bit of background on the motivation behind this post. For a while, I’ve been making academic posters using PowerPoint, Keynote, or Adobe Illustrator, and while it’s possible to get a high-quality result from these tools, I’ve always been frustrated by the amount of manual effort required to do so: having to calculate positions of elements by hand, manually laying out content, manually propagating style changes over the iterative process of poster design…
For writing papers (and even homework assignments), I had switched to LaTeX a long time ago, but for posters, I was still using these frustrating GUI-based tools. The main reason was the lack of a modern-looking poster theme: there were existing LaTeX poster templates and themes out there, but most of them felt 20 years old.
A couple weeks ago, I had to design a number of posters for a conference, and I finally decided to take the leap and force myself to use LaTeX to build a poster. During the process, I ended up designing a poster theme that I liked, and I’ve open-sourced the resulting theme, hoping that it’ll help make LaTeX and beamerposter slightly more accessible to people who want a modern and stylish looking poster without spending a lot of time on reading the beamerposter manual and working on design and aesthetics.
Yes, I use LaTeX or ConTeXt for most of my writings, apart from notes in plain text.
No, I just don’t think TeX is a great fit for posters. Probably because I am a control freak in making posters, I really want my prominent content/figures exactly where they are supposed to be, and exactly as large as I want them to be on the poster. Sometimes I ferociously shorten my text just to be able to get the next section a little higher, so the section title does not fall off the main poster viewing area. So, yes, I still use Pages.
I guess the difference is whether I am more focused on explaining things, for which I use LaTeX, or more focused on laying out text blocks and figures, at which GUI-based tools excel.
I often want something in between. Like I want to click and draw arrows and figures but have that turned into LaTeX code so I can still style around that.
Hi Lobsters! I’m one of the authors of this resource.
Adversarial machine learning is a relatively new but rapidly developing field. It’s easy to see why people are excited about this research area: ML systems are being increasingly deployed in the real world, and yet, they’re very easy to fool with maliciously perturbed inputs. There have been dozens of proposed attacks and hundreds of proposed defenses against malicious inputs to machine learning systems. To help researchers keep up with developments in this field, we created this community-run reference for state-of-the-art adversarial example defenses.
Unlike most subfields of ML, security is a negative goal: the goal is to produce a machine learning system that can’t be fooled. Showing that a system can’t be fooled is really hard.
Measuring progress in traditional machine learning can be done through a monotonically increasing objective: if a paper increases accuracy on a given benchmark from 94% to 95%, progress has been made. Future papers may improve on the benchmark, but accuracy will not decrease. In contrast, measuring progress in adversarial machine learning is exceptionally difficult. By definition, the metric used to measure the accuracy of a given defense is success against the best attack (that respects the threat model), which may not exist at the time of publication. This is why future third-party analyses of defense techniques are so important.

robust-ml.org lists current defenses along with analyses of the defenses, making it easy to get a complete picture of which techniques have been shown to be broken and which techniques currently seem to be working.
Cool work! Thanks for posting it.
“Adversarial machine learning is a relatively new”
I haven’t gotten into this topic yet. The descriptions of what it’s about are pretty exciting given outsiders have worried about the security of ML approaches. Far as new, I wonder if you all would count work like Danny Hillis’ use of adversarial co-evolution? In his work on sorting algorithms, he kept changing the tests to be harder to break the algorithms. They were like parasites in the metaphors. The results over his prior method without co-evolution were pretty impressive.
Hillis’ stuff was always one of my favorite stories in that space. I guess I’m just curious if that kind of thing was any inspiration to your field, if you all classify it as a technique in your field, and/or if the field still uses methods like that? I’m also curious if there have been any general-purpose methods so far in the new research that you think can get interesting results on cheap hardware. What should I tell people at smaller, local colleges to look into that they could do on their desktops or otherwise on a budget?
Hillis’s work on adversarial co-evolution seems more similar to Generative Adversarial Networks than to adversarial examples / robustness / machine learning security. Some subset of ML researchers group together GANs and adversarial examples under the label “Adversarial ML”, but many other researchers think of them as distinct research areas.
I’m not sure if Hillis’s work / similar efforts were an inspiration for GAN-based methods. I don’t think they were an inspiration for research related to adversarial examples.
What’s neat about this research area, especially on the attack side, is that you don’t need that much compute. For example, all the work I’ve done on attacks can be done with a single high-end GPU, and reproducing some of the results on a slightly smaller scale can even be done on a laptop CPU.
That’s neat. Good that one can get results on a budget. I’ll keep the link saved for any students that might be interested.
Hi Lobsters! This blog post is about Project Sistine, a hack that I worked on with @antimatter15, Guillermo Webster, and @loganengstrom. We turned a MacBook into a touchscreen using only $1 of hardware (a small mirror, some pieces of a rigid paper plate, and a door hinge).
We built this prototype some time ago, but we never wrote up the details of how we did it, and so we thought we should share. We’re happy to answer any questions you might have about the project!
Great work. I love the idea. Does it work on the whole screen? Looking at the pictures in the article, it seems that the areas near the left and right edges are uncovered.
Thanks, glad you liked our hack :)
Yeah, the current prototype doesn’t capture the whole screen (it probably captures ~1/3 to 1/2 of the screen area), due to the positioning of our flat mirror. We tried moving it farther away so it would capture more screen area, but with the low quality webcam we were using (480p), the resolution wasn’t good enough. A higher resolution webcam might be enough to solve this problem. Another solution might be using a curved mirror.
Did you try with a convex mirror to capture a wider view? Will probably drive the dollar cost up - but would be interesting to see if you could find success with it.
Great work, by the way!
Thanks, glad you liked our work!
Nope, we haven’t tried a convex mirror yet. I think convex mirror + 720p camera (standard on today’s MBPs) could make this system work a lot better. It would probably complicate the math a little bit, but it should be reasonably straightforward to handle as long as the optics of the mirror aren’t too weird.