1. 17
    1. 3

      Abstract interpretation means a couple different things to different people. There’s rigorous mathematical formalism thanks to Patrick and Radhia Cousot, our favorite power couple, and there’s also sketchy hand-wavy stuff like what will follow in this post

      an unappreciated fact: abstract interpretation (the formalism) is not a way of doing things, but a way of interpreting things that programming language implementations do. hence, all your hand-waving is perfectly cogent abstract interpretation (as are the optimisations developed before ai was ever formulated)

      abstract interpretation is an over-approximation of the behavior of a program. Correctly implemented abstract interpreters never lie, but they might be a little bit pessimistic

      a sound abstract interpreter does not miss any behaviours of the original program, and a complete one describes only behaviours which were in the original program. i don’t know of much use for complete-but-unsound abstract interpreters, but a sound and complete abstract interpreter is exactly a programming language implementation (both (concrete) interpreters and compilers can be seen this way!)
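
      a tiny illustration of the sound-but-incomplete case, with an invented interval representation (nothing from the post): joining two branches over-approximates, soundly including a value the program never produces

      ```python
      # invented (lo, hi) interval representation
      def interval_join(a, b):
          # least upper bound of two intervals
          return (min(a[0], b[0]), max(a[1], b[1]))

      # if cond: x = 1
      # else:    x = 3
      # the merged interval soundly contains every value x can take, but
      # also 2, which the program never produces: pessimistic, not lying
      print(interval_join((1, 1), (3, 3)))   # (1, 3): sound, not complete
      ```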

      we group them into a finite number of named subsets

      most abstract domains are infinite; eg the domain of constants. there’s really no problem with that. one related property of interest is the finite descending chain condition (which is satisfied by constants, but not eg by intervals); if you don’t have it then the interpreter design will have to be more complex to guarantee termination (generally you use a widening operator), but you can still cope with it
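
      a rough sketch of what the widening operator buys you on intervals (representation invented, nothing from the post): without it, the upper bound chases 0, 1, 2, … forever; with it, any bound that keeps moving jumps straight to infinity, so the iteration terminates

      ```python
      import math

      def widen(old, new):
          # classic interval widening on an invented (lo, hi) representation:
          # a bound that moved escapes to +/- infinity
          lo = old[0] if new[0] >= old[0] else -math.inf
          hi = old[1] if new[1] <= old[1] else math.inf
          return (lo, hi)

      # abstractly iterating the back edge of `i = 0; while ...: i = i + 1`
      state = (0, 0)
      while True:
          step = (state[0], state[1] + 1)   # transfer function for i + 1
          widened = widen(state, step)
          if widened == state:
              break
          state = widened
      print(state)   # (0, inf) after a single widening step
      ```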

      Using abstract values from a lattice promises … [t]he analysis will terminate

      whether the analysis terminates depends on the analysis. and see above re finite descending chain etc.

      Instead, we’ll see if we can optimize operations of the form bitand(X, 1). If we statically know the parity as a result of abstract interpretation, we can replace the bitand with a constant 0 or 1.

      We’ll first modify the analyze function (and rename it) to return a new Block containing optimized instructions:

      distant sobbing noooooooo you’re supposed to do it with a constant domain and then show how if you add a reachability domain you get conditional constant propagation for free

      (ok, you’re doing traces or something for some reason so it wouldn’t work but still. i think you at the very least still converge faster that way)
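
      for concreteness, a hedged sketch of the combination (toy three-block program, all names invented; the real algorithm is wegman and zadeck’s sparse conditional constant propagation, which click builds on): the constant domain decides branches, and phis join only the values flowing in from reachable predecessors

      ```python
      # toy program, all names invented:
      #   b0: x = 1; if x == 1 goto b1 else goto b2
      #   b1: y1 = 2                b2: y2 = 3
      #   b3: y = phi(b1: y1, b2: y2); z = y + 1
      BOTTOM, TOP = "bottom", "top"      # bottom = optimistic "no info yet"

      def join(a, b):
          if a == BOTTOM: return b
          if b == BOTTOM: return a
          return a if a == b else TOP

      reachable = {"b0": True, "b1": False, "b2": False, "b3": False}
      env = {v: BOTTOM for v in ("x", "y1", "y2", "y", "z")}

      changed = True
      while changed:
          old = (dict(reachable), dict(env))
          env["x"] = 1
          # the constant decides the branch: mark only takeable successors
          if env["x"] in (TOP, 1):
              reachable["b1"] = reachable["b0"]
          if env["x"] == TOP or env["x"] not in (BOTTOM, 1):
              reachable["b2"] = reachable["b0"]
          if reachable["b1"]: env["y1"] = 2
          if reachable["b2"]: env["y2"] = 3
          reachable["b3"] = reachable["b1"] or reachable["b2"]
          # the phi joins only values from reachable predecessors
          y = BOTTOM
          if reachable["b1"]: y = join(y, env["y1"])
          if reachable["b2"]: y = join(y, env["y2"])
          env["y"] = y
          if env["y"] == TOP:
              env["z"] = TOP
          elif env["y"] != BOTTOM:
              env["z"] = env["y"] + 1
          changed = old != (dict(reachable), dict(env))

      print(env["z"], reachable["b2"])   # 3 False: z is constant because b2
                                         # is dead; constprop alone gives top
      ```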

      1. 3

        i don’t know of much use for complete-but-unsound abstract interpreters

        I wonder whether those might be useful for bug-finding tools that are trying to not report any false positives, in the same way that incorrectness logic is used.

        1. 3

          i considered the possibility before posting, but i am not sure. for one, you have to decide what the ‘program’ is - you need an ‘entry point’; can you apply this to libraries? and are the type signatures on the entry points tight enough? (no.) also, i really don’t know how you’d implement something like that; how can you abstract meaningfully without sacrificing completeness?

          although. i guess this is basically what fuzzing is. so really the question is more like ‘how can we make a better fuzzer’

          1. 2

            so really the question is more like ‘how can we make a better fuzzer’

            yep, seems your question was the right one. I didn’t read up on this too closely, but “complete-but-unsound” seems to be what concolic testing uses. It’s an approach that tries to find example inputs for specific library functions that maximize code coverage: the program is executed symbolically, and then an SMT solver or a constraint solver finds concrete example inputs that make certain if-conditions go in a direction for which the generated tests have no examples yet. I found this paper, which looks interesting, but I have only skimmed it so far: Abstract Interpretation, Symbolic Execution and Constraints
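
            To make that concrete, here is a toy of the concolic loop (assuming the z3-solver Python package; the function under test is invented): run the program concretely, then ask the solver for an input that takes an uncovered branch.

            ```python
            from z3 import Int, Solver, sat

            def program(x):
                # invented function under test: we want inputs for both branches
                if x * 3 + 1 > 100:
                    return "big"
                return "small"

            x = Int("x")
            branch = x * 3 + 1 > 100       # symbolic twin of the if-condition

            print(program(0))              # "small": the seed misses a branch

            s = Solver()
            s.add(branch)                  # ask for the other side of the branch
            if s.check() == sat:
                new_input = s.model()[x].as_long()
                print(new_input, program(new_input))   # e.g. 34 "big"
            ```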

            1. 1

              https://gist.github.com/VictorTaelin/ i wonder if victor taelin’s crap is relevant

      2. 2

        This is an excellent writeup. I appreciate the diagrams for the lattices.

        We’re approaching this the way that PyPy does things under the hood, which is all in roughly a single pass.

        There are several quirks from the PyPy approach worth noting. The one which sticks out to me is labeling every value with bottom before starting. This is done to bake in another easy optimization: any value which is annotated bottom can be removed, because it can’t possibly be reached by any previous instruction. But note that the given transfer functions will never introduce bottom outputs for non-bottom inputs (a common property of transfer functions), so values are only marked bottom at the end if they were marked as bottom at the start.

        I only say this because otherwise newcomers may be confused: shouldn’t we assume a value is top at the start? Yes, we should assume reachable values are top at the start, and we might do that in a multi-pass abstract interpreter.
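
        For concreteness, here is a small sketch of that transfer-function property in a parity domain (invented names, not the post’s code): bottom only propagates from bottom.

        ```python
        # invented names for a parity lattice
        BOTTOM, EVEN, ODD, TOP = "bottom", "even", "odd", "top"

        def parity_add(a, b):
            if BOTTOM in (a, b):
                return BOTTOM   # strict: bottom in, bottom out
            if TOP in (a, b):
                return TOP
            return EVEN if a == b else ODD   # even+even = odd+odd = even

        print(parity_add(EVEN, ODD))     # odd
        print(parity_add(BOTTOM, EVEN))  # bottom: never conjured from real inputs
        ```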

        1. 3

          multi-pass abstract interpreter

          😱😱❌❌🤮😿

          labeling every value with bottom before starting

          cliff click calls this the ‘optimistic assumption’, and sets up a straightforward dichotomy: optimism is more precise, but pessimism is iteratively sound. intuition the first: starting lower gets you to a lesser fixed point. intuition the second: optimism improves precision specifically in the case of loops. we need induction to reason about loops; optimism is effectively the ‘assume’ step

          (of course, the dichotomy is wrong; you can be iteratively sound without sacrificing precision. but you have to do it differently)
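
          a worked micro-example of the dichotomy in the constant lattice (helper names invented): the loop is x = 1; while …: x = x * 1, i.e. x = phi(1, x * 1)

          ```python
          BOTTOM, TOP = "bottom", "top"   # invented helpers, constant lattice

          def join(a, b):
              if a == BOTTOM: return b
              if b == BOTTOM: return a
              return a if a == b else TOP

          def times_one(v):
              # transfer function for x * 1
              return v if v in (BOTTOM, TOP) else v * 1

          def analyze_loop(initial):
              x = initial
              while True:
                  new = join(1, times_one(x))   # phi(entry: 1, back: x * 1)
                  if new == x:
                      return x
                  x = new

          print(analyze_loop(BOTTOM))   # 1: the optimistic ‘assume’ step pays off
          print(analyze_loop(TOP))      # top: pessimism never recovers the constant
          ```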

          1. 2

            Yeah good point. CF made the same point. I just wanted to include a bottom element because people are going to see it elsewhere (and originally I wasn’t sure whether I wanted to have guards).

            1. [Comment removed by author]

            2. 1

              oh and one more important point i forgot

              classically, the elements of your abstract domain represent not sets of terms, but sets of program states. hence, you can reason about relationships between terms (‘i know nothing whatsoever about the specific values of a and b, but i know a<b always at this program point, so i can kill this bounds check’—the set of all program states in which a is less than b) as well as things that are not per se terms (like memory or call stacks)

              of course, if you happen to specify your abstract domain so that terms are abstractly completely independent and can be varied independently, then you can take advantage of that in the implementation and then it starts to look more like abstract data flow. much cheaper but less precise

              (yadda yadda ssa is abstract interpretation; hide a theorem prover in an abstract domain; term-rewriting and static analysis can substitute for one another but also nest inside each other. it’s an interesting thread no one cares about, apparently)
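
              a cartoon of the relational point (representation invented): the abstract element is a set of facts (a, b) meaning ‘a < b holds in every program state reaching this point’, with no knowledge of a or b individually

              ```python
              # invented representation: a set of strict-inequality facts
              def join(facts_a, facts_b):
                  # a fact survives a merge only if both paths guarantee it
                  return facts_a & facts_b

              def assume_less(facts, a, b):
                  # refine on a branch like `if a < b:` (transitivity omitted)
                  return facts | {(a, b)}

              # inside `if i < length:` we learn a relation between the two
              # variables without knowing either value:
              facts = assume_less(set(), "i", "length")
              print(("i", "length") in facts)   # True: a later i < length
                                                # bounds check is redundant
              print(join({("i", "length"), ("x", "y")}, {("i", "length")}))
              ```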

              1. 2

                I feel like all these comments are interesting but I’m not well versed enough to understand them unfortunately

                1. 1

                  what are you missing?

                  ‘ssa is ai’ is this paper; it has far too many greek letters and took me forever to get through but was well worth it (can give cliffs-notes version if interested). a more approachably explained symbolic abstract domain is octagons

                  conditional constant propagation for free is an example in cliff click’s phd dissertation which i linked upthread; i suggest reading chapters 2 and 3

                  1. 1

                    I am definitely interested in the cliffs notes version of that paper!

                    1. 2

                      the concrete semantics is a control flow graph (no ssa, just assignments). we have an abstract state for each concrete control node which is a mapping from concrete program variables to the values they have at that point in the program. the values are symbolic expressions—a separate language for recursive expressions (like how llvm scalar evolution is a separate language for representing recursive expressions, related to but distinct from llvm ir). in effect, we have a control flow graph, and then we have another structure which is like the dataflow portion of sea of nodes (except phi is a bit different), and then we have abstract stores at every concrete control node to relate the two

                      unlike rvsdg, sea of nodes doesn’t actually do anything useful with control flow (the control portion of son is just a cfg), so this doesn’t really sacrifice any of the advantages of sea of nodes, and it gains all of the advantages of abstract interpretation

                      since the abstract interpretation is sound and complete, having produced this dataflow graph, we can optimise it in situ, and then project it back into concrete control flow land and schedule it. we can completely avoid reasoning about concrete expressions directly—we just have to worry about mapping them into and out of abstract-symbolic-dataflow-land
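
                      a cartoon of that shape (emphatically not the paper’s actual formalization; the expression language is invented):

                      ```python
                      from dataclasses import dataclass

                      # invented symbolic-expression language: a separate
                      # language of terms, related to but distinct from the
                      # concrete IR (cf. llvm scalar evolution)
                      @dataclass(frozen=True)
                      class Var:
                          name: str    # a symbolic unknown, e.g. a parameter

                      @dataclass(frozen=True)
                      class Add:
                          lhs: object
                          rhs: object

                      @dataclass(frozen=True)
                      class Phi:
                          options: tuple   # expressions merging at a join

                      # abstract stores at three concrete control nodes for:
                      #   b0: x = n + 1    b1: x = n + 2    b2: join of b0, b1
                      store_b0 = {"x": Add(Var("n"), 1)}
                      store_b1 = {"x": Add(Var("n"), 2)}
                      store_b2 = {"x": Phi((store_b0["x"], store_b1["x"]))}
                      print(store_b2["x"])   # symbolic value of x at the join
                      ```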

                      1. 1

                        Thanks for the pointers–I have pretty much only read the Sea of Nodes paper