1. 5
  1. 8

    Computationally homogeneous. A typical neural network is, to the first order, made up of a sandwich of only two operations: matrix multiplication and thresholding at zero (ReLU). Compare that with the instruction set of classical software, which is significantly more heterogeneous and complex. Because you only have to provide a Software 1.0 implementation for a small number of the core computational primitives (e.g. matrix multiply), it is much easier to make various correctness/performance guarantees.
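
    To make the “sandwich” concrete, a minimal sketch (mine, with made-up layer sizes) of a network reduced to exactly those two primitives:

    ```python
    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)                    # thresholding at zero

    def mlp(x, weights):
        # To first order, the whole network alternates just two operations:
        # matrix multiplication and ReLU.
        for W in weights[:-1]:
            x = relu(W @ x)
        return weights[-1] @ x                       # last layer left linear

    rng = np.random.default_rng(0)
    weights = [rng.normal(size=s) for s in [(128, 64), (128, 128), (10, 128)]]
    print(mlp(rng.normal(size=64), weights).shape)   # (10,)
    ```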

    Well actually classical software can be made up of only one instruction (NAND) so it’s twice as good as neural networks

    The 2.0 stack also has some of its own disadvantages. At the end of the optimization we’re left with large networks that work well, but it’s very hard to tell how. Across many application areas, we’ll be left with a choice of using a 90% accurate model we understand, or a 99% accurate model we don’t.

    The 2.0 stack can fail in unintuitive and embarrassing ways, or worse, it can “silently fail”, e.g., by silently adopting biases in its training data, which are very difficult to properly analyze and examine when dataset sizes are easily in the millions in most cases.

    This seems like the crux of it, though? If we don’t understand how it works and it can fail in unintuitive and embarrassing ways, how can we actually trust it?

    1. 3

      This seems like the crux of it, though? If we don’t understand how it works and it can fail in unintuitive and embarrassing ways, how can we actually trust it?

      ML is generally good for problems where either:

      • You don’t actually understand the problem,
      • There might not be a correct answer, but a mostly-correct answer is useful, or
      • The problem changes frequently.

      Shape detection is a good example of the first. Philosophers from Plato onwards have tried to define a set of rules that let you look at an object and say ‘this is a chair’. If you could define such a set of rules, then you could probably build a rule-based system that’s better than example-based systems, but in the absence of such a set of rules the example-based approach is doing pretty well.

      The middle category covers a lot of optimisation problems. Even where there is a correct (optimal) answer for these, the search space is so large that finding it sits in a complexity class that makes an exact search not even remotely feasible. Example-based solutions over a large set of examples let you half-arse this and get something that is a lot better than nothing and a lot less computationally expensive than an optimal solution.
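
      A toy illustration of that trade-off (mine, using a plain greedy heuristic rather than a learned model, but the cost/quality shape is the same): exact TSP is factorial in the number of cities, while a nearest-neighbour tour costs almost nothing and is usually close enough.

      ```python
      import itertools, math, random

      random.seed(0)
      cities = [(random.random(), random.random()) for _ in range(8)]
      dist = math.dist

      def tour_length(order):
          return sum(dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
                     for i in range(len(order)))

      # Exact answer: try all 8! orderings.
      best = min(itertools.permutations(range(len(cities))), key=tour_length)

      # "Mostly correct" answer: always hop to the nearest unvisited city.
      def greedy():
          left, order = set(range(1, len(cities))), [0]
          while left:
              nxt = min(left, key=lambda c: dist(cities[order[-1]], cities[c]))
              order.append(nxt)
              left.remove(nxt)
          return order

      print(f"exact:  {tour_length(best):.3f}")
      print(f"greedy: {tour_length(greedy()):.3f}  (no optimality guarantee, tiny cost)")
      ```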

      The last category is particularly interesting. A lot of fraud detection systems are like this: they’re spotting patterns and the attacker adapts to them pretty well. Spam filtering has been primarily driven by ML for a good 20 years (I think the first Bayesian spam filters might have been late ‘90s, definitely no later than 2002) because it’s trivial for a spammer to change their messages if you write a set of rules and much harder for you to change the rules. These things are not flawless for security because they’re always trailing indicators (the attacker adapts, then your defence adapts) but they’re great as a first line of defence.

      Project Silica at MSR one floor down from me used ML for their voxel recognition for data etched into glass to massively speed up their development flow: they could try new patterns as fast as they could recalibrate the optics and then retrain the same classifier and see how accurate it could be. A rule-based system might have been a bit more accurate, but would have required weeks of software engineering work per experiment.
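
      For anyone who hasn’t seen one, the core of those early Bayesian filters fits in a screenful; a rough sketch of the idea (mine, with toy data and add-one smoothing, not any particular production filter):

      ```python
      import math
      from collections import Counter

      spam = ["buy cheap pills now", "cheap pills cheap prices", "win money now"]
      ham  = ["meeting moved to monday", "lunch now or later", "project status update"]

      def counts(msgs):
          return Counter(w for m in msgs for w in m.split())

      spam_c, ham_c = counts(spam), counts(ham)
      spam_n, ham_n = sum(spam_c.values()), sum(ham_c.values())
      vocab = set(spam_c) | set(ham_c)

      def spam_score(msg):
          # Log-odds with add-one smoothing; > 0 means "more likely spam".
          score = math.log(len(spam) / len(ham))
          for w in msg.split():
              p_spam = (spam_c[w] + 1) / (spam_n + len(vocab))
              p_ham  = (ham_c[w] + 1) / (ham_n + len(vocab))
              score += math.log(p_spam / p_ham)
          return score

      print(spam_score("cheap pills"))      # positive: looks spammy
      print(spam_score("monday meeting"))   # negative: looks like ham
      ```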

      Things like Dall-E fit into all three categories:

      • Generating a set of rules for how to create art is a problem that various artistic movements over the centuries have tried and failed to solve.
      • If you really want an image with a particular characteristic, you probably need to hire an artist and have multiple rounds of iterations with them, but an image that’s more-or-less what you asked for and is cheap to generate is vastly cheaper than this and much better than no image.
      • The prompt changes every time, requiring completely different output. Artistic styles change frequently and styles for commercial art change very rapidly. Retraining Dall-E on a new style is much cheaper than writing a new rule-based generator for that style.

      I see ML as this decade’s equivalent of object orientation in the 1980s/1990s and FP in the last decade or so:

      • Advocates promise that it can completely change the world and make everything better.
      • Lots of people buy the hype and build stuff using it.
      • A decade or so later, it’s one of the tools in a developer’s toolbox and people accept that it’s really useful in some problem domains and causes a lot of problems if applied in the wrong problem domain.
      1. 2

        As far as I can tell, software that worked 99% of the time would generally be an improvement.

        1. 3

          As far as I can tell, software that worked 99% of the time would generally be an improvement.

          That’s obvious nonsense. Imagine routers only routed 99% of traffic correctly: after just 4 hops TCP would break down. We are very close to 100% everywhere it matters enough for people to care.
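
          The arithmetic behind that claim (my numbers, assuming each hop independently forwards 99% of packets):

          ```python
          # End-to-end delivery probability if every hop drops 1% of packets.
          for hops in (4, 10, 20):
              delivered = 0.99 ** hops
              print(f"{hops:2d} hops: {delivered:.3f} delivered, {1 - delivered:.1%} lost")
          # Even ~4% loss is catastrophic for TCP, whose throughput falls off
          # roughly as 1/sqrt(loss rate) (the Mathis et al. approximation).
          ```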

          You will get at most 95% on typical ML tasks that people actually care about.

          ML models also tend to suck at failing. A typical router will just reject unintelligible packets, while an ML model will do something totally arbitrary, such as classifying furniture as animals.
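
          The behavioural difference is easy to show; a sketch (mine, with a made-up random linear classifier standing in for a real model and a trivial parser standing in for the router):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          W = rng.normal(size=(4, 64))    # toy 4-class linear "model", random weights

          def classify(x):
              logits = W @ x
              probs = np.exp(logits - logits.max())
              probs /= probs.sum()
              return int(probs.argmax()), float(probs.max())

          label, confidence = classify(rng.normal(size=64))   # pure noise as input
          print(f"garbage in -> class {label}, {confidence:.0%} 'confidence'")

          def parse_header(data: bytes):
              # Classical code gets to refuse unintelligible input outright.
              if len(data) < 4:
                  raise ValueError("malformed packet, dropped")
              return data
          ```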

          What I mean is: please don’t use ML for ABS, and always make it so that it assists real people; never let it run unattended.

          1. 3

            It’s also worth bearing in mind that 1% in terms of a particular sample doesn’t mean 1% in the real world. There are probably no bug-free routers today, but if you buy ten good routers, it’ll take years to get all ten of them to incorrectly route packets due to implementation bugs, and it’ll involve quite a lot of reverse engineering and testing. Meanwhile, you can get most 2.0 software (!?) to fail in all sorts of funny ways with a few hours of trial and error, and I guarantee that each of those systems is backed by a slide deck that claimed 99.99% accuracy on ten different data sets in an internal meeting.

            Bugs in the implementation of a model tend to cluster in poorly-understood or unexamined areas of the model, and you can usually tell which ones they are without even running the software, just reading the code, the source code repository history, and doing a literature survey if it’s a well-documented problem (like routing). Figuring that out on statistical models usually devolves into an exercise in stoner philosophy at the moment.

          2. 1

            As a general statement that is probably true.

            However, we can prove (or disprove) 100% correctness of traditional software. We don’t do it for all software because it’s hard, but we know how to do it in principle. At the same time, interpretability is an open problem in ML. We can reverse engineer (more like guess) the algorithm encoded in some of the simplest models, but it’s far from perfect: the recovered algorithm is usually an approximation, not exact, and can differ subtly from the classical algorithm we infer. We can’t do it at all for big models like GPT-3, and we can’t do it reliably even for all simple models. So it might look like a model works 99% of the time, but you can’t rigorously prove that it does, or that it’s actually 99%.
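
            As a toy illustration of the “we know how to do it in principle” part (mine, assuming Lean 4 and its standard-library lemma List.reverse_reverse), here is an exact spec of a classical function, checked by a machine; nothing comparable exists for a trained model:

            ```lean
            -- Exact, machine-checked specification: reversing a list twice
            -- returns the original list, for every possible input.
            theorem reversing_twice_is_id (l : List Nat) : l.reverse.reverse = l :=
              List.reverse_reverse l
            ```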

            1. 2

              I think you can only (trivially) disprove it by producing a counter-example on which it fails. The example that springs to mind is facial recognition that was trained on mostly white faces failing on black faces.

              There might be ways to construct such examples systematically, like the “adversarial models” used to cause image recognition to fail.
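
              The standard name for those constructions is adversarial examples; a rough sketch of the idea on a made-up linear classifier (mine; real attacks such as FGSM do the same thing using the network’s gradient):

              ```python
              import numpy as np

              rng = np.random.default_rng(0)
              w = rng.normal(size=784)    # toy linear "image classifier"
              x = rng.normal(size=784)    # an input it currently labels confidently

              score = w @ x               # sign of the score is the predicted class

              # For a linear model the gradient w.r.t. the input is just w, so nudge
              # every pixel slightly in the direction that pushes the score past zero.
              eps = 1.01 * abs(score) / np.abs(w).sum()
              x_adv = x - eps * np.sign(w) * np.sign(score)

              print(f"per-pixel change: {eps:.4f}")        # tiny relative to pixel scale
              print(np.sign(score), np.sign(w @ x_adv))    # the label flips
              ```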

              1. 1

                The adversarial model problem strikes me as a hard one to ever solve because any attacker with access to the original model can just use it to train an adversary.

                1. 1

                  Then I’d rather say that instead of being 99% correct, it’s actually 100% incorrect, because it can never be fixed. At least if traditional software has a bug, you can fix it.

                  1. 2

                    Well, you just train the model against the adversary, obviously. :-)

          3. 1

            This seems like the crux of it, though? If we don’t understand how it works and it can fail in unintuitive and embarrassing ways, how can we actually trust it?

            IMO, the crux of it is that “software 2.0” is good at solving a class of problems that are commercially relevant and that “software 1.0” is not so good at: typically domains where we’ve needed expensive humans to do things, and in which practitioners have developed a great deal of tacit knowledge about how to perform their tasks that is hard to make explicit. It really is incredible that we’ve now got a generalizable approach for automating things that used to require practitioners with a great deal of experience.

            But in domains where explicit knowledge is more important, I’d think “software 1.0” will dominate. Though, if AGI ever becomes practical / powerful enough, I don’t discount the idea of “software 2.0” AGI programmers developing (in partnership with humans, at least at first) “software 1.0” systems.

            Anyway, to respond to your actual point, “how can we actually trust it?”:

            1. We won’t necessarily have a choice. Economics being what they are, and human drive being what it is, a technique that is more effective with less effort will win out in its niche, whether or not that’s considered good for us. I can probably mock up a prisoner’s dilemma scenario to illustrate this better, but I’m already writing too much.
            2. At some point of examination, trust will break down in any system. We probably all know about https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf. In math, ZFC set theory is a major foundation of many results, but then there’s this: https://scottaaronson.blog/?p=2725. IMO, the reasonable approach to trusting “software 2.0” systems is similar to the way we establish faith in the sciences: through hypothesis generation and statistical testing (a rough sketch of what that can look like follows below).
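
            To make “statistical testing” a little more concrete, here is a sketch (mine, using a plain normal-approximation interval) of the kind of error bar a claim like “99% accurate” should come with:

            ```python
            import math

            def accuracy_interval(correct, total, z=1.96):
                # Normal-approximation 95% confidence interval for measured accuracy.
                p = correct / total
                half = z * math.sqrt(p * (1 - p) / total)
                return p - half, p + half

            lo, hi = accuracy_interval(correct=9_900, total=10_000)
            print(f"measured 99.0%, 95% CI roughly [{lo:.2%}, {hi:.2%}]")
            # ...and this only covers inputs drawn like the test set; the unintuitive
            # failures discussed above tend to live elsewhere.
            ```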
            1. 1

              In the meantime, I believe we’ll soon experience a new AI winter. All it takes is one spectacular failure, or a huge money sink not paying off, like, let’s say, self-driving cars.