1. 25
  1.  

  2. 10

    I’ve been a professional Python programmer for years, I believe in unit-testing, I’m all about hexagonal architecture (or boundaries if you prefer Ruby or Python over Java). I’ve got a continuous-integration setup, and I love the idea of having a computer slowly, over time, pinpoint all the weak points of my code and automatically notify me with reduced test-cases. I should be, like, patient zero for adopting Hypothesis and ranting at all my friends until they use it too.

    But despite having heard of Hypothesis months ago and occasionally pondering where I might fit it in, I’ve never actually used it for anything—because most of the time I’m not dealing with data-types that are amenable to being randomly-generated, and when I am, the “properties” I’d want to test seem like they’d be exact mirrors of the random-generation rules.

    I’m not saying that Hypothesis is useless, I’m just saying I can’t see how to use it. It took me years of stumbling around trying to write unit-tests for things before I felt I had a good intuition of when code was unit-testable, and a few more years before I knew what to do with code that wasn’t. I quite expect Hypothesis-style testing to have a similar learning curve, and I’m still at the very start.

    All that to say: I don’t think MacIver is paying enough attention to the second of his list of hypotheses: Nobody cares about this problem, so they haven’t bothered to pick the fruit. Not that it’s an irrelevant or useless, just that it’s far enough out there that most people haven’t heard of it, or can’t see how to use it, so they haven’t tried it and there’s still a lot of rough edges to be smoothed out.

    I watched a video about a QuickCheck implementation for Erlang a while ago, and from what I recall they had a similar problem: they had a thing they knew was very cool and very useful, but they found it very difficult to convince other people to try it out. There’s little incentive to make property-based testing more efficient if you can’t make people use it at all.

    1. 13

      I am a user of QuickCheck in Erlang and Hypothesis in Python. I’ve found both extremely powerful and useful, but code has to be written in such a way that it is usable. Specifically, all code has properties, once you recognize that you have to decide how to write code in a way that exposes those properties in a way that they are testable. On top of that, property based testing requires (probably) the most creativity in coming up with a test. In some cases the random thing you generate is not actually the data directly, but a way to generate data in a controlled way. For example, for some tests I do, I generate random integers and it’s really an ‘index’ into another conceptual space. It’s non-trivial, but many errors have been caught, IME, in the process.

      1. 2

        I started writing a reply to this but then realised that I was repeating things I had said enough times that it really should be a blog post. So now it is.

        But I’m not sure how much bearing this actually has on the question. It’s not surprising to me that these tools aren’t wider spread in industry - I’ve got a pretty good handle on how difficult it is to actually write a solid one - but why there isn’t more research on the subject, and there’s enough research on related subjects that I don’t think it’s the accessibility of the ideas.

        ETA: Oh, also, I think you’re probably wrong if you think your values aren’t easily amenable to random generation. The Hypothesis strategy library is really very cutomizable to generating the data you need.

      2. 4

        Are the statements made in this article still valid? It seems like there is still feature development happening on Hypothesis. Is there a follow-up post that explains why this is?

        I’m just curious to know.

        1. 5

          I have not proven very good at sticking to my guns on not working on Hypothesis.

          1. 4

            The author had some good ideas for a new take on Hypothesis, improved them despite himself, and noted he was working on it

            For me 2015 has basically been the year of Hypothesis. This was very much not intended from the outset. I originally intended to put in a couple months work on it. I even made a concerted decision to give it up. Now I’m building a business around it.

            1. 3

              Note that these ideas have all made it into a Hypothesis release now.

              1. 1

                Now I’m building a business around it.

                Sounds interesting, watching to see how this turns out! :)

              2. 2

                It looks like his year in review gives an update. He changed his mind and is now “building a business around [Hypothesis].”

              3. 4

                My optimistic suggestion would be that the intersection of people who care about lower defect rates than you achieve with a small pile of manual unit tests and people using Python is close to empty. It’s pretty easy to use a good type system these days, and once you have that, for any formalizable correctness property why would you ever not lift that property into the type system where it’s not just tested but proven correct?

                My pessimistic suggestion would be some combination of no-one caring about software quality and no-one actually doing anything nontrivial with software.

                1. 4

                  It’s pretty easy to use a good type system these days, and once you have that, for any formalizable correctness property why would you ever not lift that property into the type system where it’s not just tested but proven correct?

                  This seems like a gross oversimplification. Isn’t the answer obvious? Because not all type systems (even “good” ones) are capable of lifting all properties into the type system.

                  Alternatively, there exist properties that aren’t easily formalizable, but are still worth testing. (I wish I could give an example, but I’m not really sure what you have in mind.)

                  Alternatively still, I contest the notion that just because one can lift a property into the type system doesn’t mean one should do it. Depending on your type system and the nature of your invariants, it might make the code a lot more complex or harder to use, which should absolutely be weighed against the benefits of correctness.

                  1. 2

                    Because not all type systems (even “good” ones) are capable of lifting all properties into the type system.

                    In theory maybe. In practice I never seem to hit anything important and formalizable that I can’t express via types.

                    Alternatively, there exist properties that aren’t easily formalizable, but are still worth testing. (I wish I could give an example, but I’m not really sure what you have in mind.)

                    I’m thinking that in any case where I could formalize an invariant well enough to test it in Hypothesis style I could likely just put it in the type system. Rather than me giving an example, do you have an example of something that’s so much easier to test that way?

                    Alternatively still, I contest the notion that just because one can lift a property into the type system doesn’t mean one should do it. Depending on your type system and the nature of your invariants, it might make the code a lot more complex or harder to use, which should absolutely be weighed against the benefits of correctness.

                    Hmm. On the assumption that it’s going to be a similar amount of code either way, I’d rather have it in the types than in the tests - the feedback cycle is shorter, and new code is automatically covered rather than needing to write new tests, which to my mind makes that new code simpler to write, at least for the same level of correctness. Sometimes correctness enforcement isn’t worth it, but for cases where it is, I don’t see tests being simpler.

                    1. 5

                      In theory maybe. In practice I never seem to hit anything important and formalizable that I can’t express via types.

                      I find this to be an incredibly suspicious claim. There is an entire field of study whose main purpose is to find new ways use types to make stronger guarantees. Has all progress in this field been reduced to something purely academic?

                      I’m thinking that in any case where I could formalize an invariant well enough to test it in Hypothesis style I could likely just put it in the type system. Rather than me giving an example, do you have an example of something that’s so much easier to test that way?

                      Well, the classic example is abstract syntax. How can you use the type system to guarantee that there are no expressible inhabitants that your evaluator can’t handle? This comes up a lot in practice for me. Depending on how complicated your invariants are, it’s usually not that hard to write quickcheck style Arbitrary impls to generate and shrink inhabitants.

                      Of course, there are ways to move these guarantees to the type system. GADTs is one way. Kiselyov has a phenomenal paper on it (“Typed Tagless Final Interpreters”). I bet there are other ways that you might educate me on, but I’m at the boundary of my knowledge here.

                      Hmm. On the assumption that it’s going to be a similar amount of code either way, I’d rather have it in the types than in the tests - the feedback cycle is shorter, and new code is automatically covered rather than needing to write new tests, which to my mind makes that new code simpler to write, at least for the same level of correctness. Sometimes correctness enforcement isn’t worth it, but for cases where it is, I don’t see tests being simpler.

                      I guess if your assumption holds, then maybe I’ll grant you that. I’m not sure how good that assumption is though. More types (possibly a lot more, depending on how expressive your type system is) can be required, for example.

                      1. 1

                        I find this to be an incredibly suspicious claim. There is an entire field of study whose main purpose is to find new ways use types to make stronger guarantees. Has all progress in this field been reduced to something purely academic?

                        I think most of the research is about doing this without needing any turing-complete pieces (i.e. doing this kind of analysis in a way that’s complete as well as consistent). That’s a laudable goal and enables many worthwhile things, but not directly necessary for correctness.

                        I mean given Curry-Howard, research can only really be about a) proofs that are known but can’t be expressed in current type systems - very rare (probably impossible since it’s a mechanical translation?) b) conjectures for which no proof is known - not something I’ve ever had to use directly when programming, and not cases that I can see Quickcheck-style checks helping with (maybe it’s more useful for exploratory programming?) c) things that aren’t directly about correctness - e.g. I see a fair bit of research about efficient representations, but that’s not going to be a concern for anyone for whom Python is an option.

                        Well, the classic example is abstract syntax. How can you use the type system to guarantee that there are no expressible inhabitants that your evaluator can’t handle? This comes up a lot in practice for me. Depending on how complicated your invariants are, it’s usually not that hard to write quickcheck style Arbitrary impls to generate and shrink inhabitants.

                        Can’t you just define a fold/visit method? (and then rely on the parametricity theroem, i.e. if it implements a method that takes a Visitor<V> and returns a V then it must be one of the cases of the visitor, barring bottoms). Or am I missing something?

                        1. 3

                          I think most of the research is about doing this without needing any turing-complete pieces (i.e. doing this kind of analysis in a way that’s complete as well as consistent). That’s a laudable goal and enables many worthwhile things, but not directly necessary for correctness.

                          I mean given Curry-Howard, research can only really be about a) proofs that are known but can’t be expressed in current type systems - very rare (probably impossible since it’s a mechanical translation?) b) conjectures for which no proof is known - not something I’ve ever had to use directly when programming, and not cases that I can see Quickcheck-style checks helping with (maybe it’s more useful for exploratory programming?) c) things that aren’t directly about correctness - e.g. I see a fair bit of research about efficient representations, but that’s not going to be a concern for anyone for whom Python is an option.

                          Where did this come from? All you said was “… In practice I never seem to hit anything important and formalizable that I can’t express via types.” To me, this suggests that you think there are no practical benefits to come from further study of type theory. The whole point of the field is to make it not only possible, but convenient to make stronger guarantees using types.

                          I also find your rigidity to be rather distracting. Just because a mechanical translation is possible (which might require an order of magnitude more code) doesn’t detract from my point at all.

                          Can’t you just define a fold/visit method?

                          But how do you prevent construction of nonsense terms? Jones, Washburn and Weirich explain it better: http://www.msr-waypoint.net/pubs/65143/old-wobbly.pdf — You don’t need to read the whole thing (although it is a good read), but I think just the background section is enough to describe what I have in mind.

                          And Kiselyov’s paper is also great for an alternative solution to this problem (type classes instead of GADTs).

                          It’s pretty easy to use a good type system these days, and once you have that, for any formalizable correctness property why would you ever not lift that property into the type system where it’s not just tested but proven correct?

                          Frankly, I find your claim that quickcheck is essentially useless to be rather extraordinary and very controversial. From where I’m standing, you’re completely dismissing a rather large body of evidence that suggests otherwise. (e.g. QuickCheck is used quite a bit among Haskell programmers.)

                          Other things I find quickcheck useful for:

                          • Testing algebraic laws. For example, does your implementation of Monad satisfy the monad laws? Does my implementation of a min-heap satisfy the heap invariant after any sequence of insertions or deletions?
                          • Testing the equivalence of two different implementations of the same idea. For example, does my implementation of regular expressions match the output of a different implementation of regular expressions given the same inputs?
                          • Testing the invariants defined in the contract of functions. For example, if my function’s documentation says, “the output will satisfy property X on any input,” then that seems like a natural use for quickcheck if that property is hard or impossible to move into the type system. (For example, a “sequence of sorted non-overlapping, non-adjacent Unicode codepoints.”)
                          • If I have an AST for a regular expression, print it, and then parse it, do I get back the original AST? That’s pretty straight-forward to test with quickcheck, but I have no earthly clue how you’d move that into the type system.
                          • I wrote a recursive directory iterator, and I wanted to make sure that I could describe a directory hierarchy in memory, create it, walk it and get back the same directory hierarchy that I started with. Quickcheck shined here.
                          • If you’re writing a linear algebra library, then there is probably no limit to the number of properties one might want to test.
                          1. 1

                            All you said was “… In practice I never seem to hit anything important and formalizable that I can’t express via types.” To me, this suggests that you think there are no practical benefits to come from further study of type theory. The whole point of the field is to make it not only possible, but convenient to make stronger guarantees using types.

                            Ok, maybe I am saying that. I don’t have any particular insight into why people are researching or funding research into type theory. I mean I find e.g. alternatives to the boilerplate of monad transformers to be a fascinating idea (I’m working on an implementation of the “freer monads, more extensible effects” paper in my spare time), but I can’t say I’ve actually found the boilerplate prohibitive in practice. So yeah, ok, I’ve never encountered an open problem in type theory in practice, and from my point of view type research probably is of purely academic interest at this stage.

                            I also find your rigidity to be rather distracting. Just because a mechanical translation is possible (which might require an order of magnitude more code) doesn’t detract from my point at all.

                            Shrug. I’ve not seen an order of magnitude blowup (at least, not in a way that couldn’t be encapsulated) in practice. I’ve occasionally had to copy-paste code because a language didn’t have kind polymorphism, but not often enough that I feel it’s particularly important.

                            But how do you prevent construction of nonsense terms? Jones, Washburn and Weirich explain it better: http://www.msr-waypoint.net/pubs/65143/old-wobbly.pdf — You don’t need to read the whole thing (although it is a good read), but I think just the background section is enough to describe what I have in mind.

                            AFAICS the example in the background section is straightforward to implement in any good type system, or even an old and creaky language like Java? All it requires is polymorphism and basic generics, and any serious typed language in this day and age will have those.

                            Frankly, I find your claim that quickcheck is essentially useless to be rather extraordinary and very controversial. From where I’m standing, you’re completely dismissing a rather large body of evidence that suggests otherwise. (e.g. QuickCheck is used quite a bit among Haskell programmers.)

                            You’re right yeah. Thinking about it my claim is pretty much that by using this kind of technique one can implement a spec correctly (i.e. such that the only bugs are ambiguities in the spec, or behaving-as-specified) without using QuickCheck-like tools, which is an extraordinary claim and you should be skeptical. It is my experience, but I don’t know how I could even begin to convince you.

                            For example, a “sequence of sorted non-overlapping, non-adjacent Unicode codepoints.”

                            Sounds simple enough to, if not prove in the type system per se, just not allow the relevant type to be constructed any other way.

                            If I have an AST for a regular expression, print it, and then parse it, do I get back the original AST? That’s pretty straight-forward to test with quickcheck, but I have no earthly clue how you’d move that into the type system.

                            I’d use the same structure to define the parser and the printer - parser combinators are quite close to that already. Looking around there seem to be some Haskell libraries and papers taking that approach.

                            Testing algebraic laws. For example, does your implementation of Monad satisfy the monad laws? Does my implementation of a min-heap satisfy the heap invariant after any sequence of insertions or deletions?

                            Testing the equivalence of two different implementations of the same idea. For example, does my implementation of regular expressions match the output of a different implementation of regular expressions given the same inputs?

                            If you’re writing a linear algebra library, then there is probably no limit to the number of properties one might want to test.

                            These sound like cases where this kind of testing would be useful. I think there’s a common thread of equivalences that one doesn’t necessarily know how to quotient out. I’m used to working in a style where the data has a defined canonical form (or else, defining one). I see some commonality with the stuff I’ve been saying about (Turing-complete) functions and preferring to represent operations as data that can be compared for equality. If you have something that you can only represent as a function then this kind of testing is perhaps the best (only?) way to compare it. Maybe that’s a better (or at least more practical) approach than trying to eliminate functions entirely.

                            Food for thought. Thanks.

                            1. 1

                              Ok, maybe I am saying that. I don’t have any particular insight into why people are researching or funding research into type theory. I mean I find e.g. alternatives to the boilerplate of monad transformers to be a fascinating idea (I’m working on an implementation of the “freer monads, more extensible effects” paper in my spare time), but I can’t say I’ve actually found the boilerplate prohibitive in practice. So yeah, ok, I’ve never encountered an open problem in type theory in practice, and from my point of view type research probably is of purely academic interest at this stage.

                              I think we’re still talking past each other. Here’s a snippet from the SPJ paper I linked that is exactly what I have in mind here:

                              This use of explicit constraints has the advantage that it can be generalised to more elaborate constraints, such as subtype constraints. But it has the disadvantage that it exposes programmers to a much richer and more complicated world. In keeping with our practical focus, our idea is to see how far we can get without mentioning constraints to the programmer at all – indeed, they barely show up in the presentation of the type system. Our approach is less general, but it has an excellent power-to-weight ratio.

                              It’s an explicit acknowledgment of the trade offs involved here, which is critical to developing type theory for practical use. There are all sorts of things that we can prove using types, but if it’s really hard to do it, then you’ve lost something that you might not be willing to give up.

                              That you’re talking about “open problems in type theory” is an enormous distraction from what I’m saying. GADTs didn’t solve any open problems (that I’m aware of?), but they undoubtedly made it easier (for some definition of “easier”) to prove certain properties about one’s code using types.

                              AFAICS the example in the background section is straightforward to implement in any good type system, or even an old and creaky language like Java? All it requires is polymorphism and basic generics, and any serious typed language in this day and age will have those.

                              No. The evaluator is proven correct in the type system. Without that proof, you need to fall back to unit or property based tests to make sure your evaluator doesn’t do something horrible when it’s given an invalid input. GADTs specifically rule out the possibility of an invalid input in the type system.

                              Something similar might be accomplished by using privacy and combinators, where any combination of combinators is guaranteed to produce inhabitants of your AST that your evaluator can handle. But how do you know that’s true? If you got it right, then certainly, for callers it’s true. But internally? Maybe a certain permutation of your combinators can produce an inhabitant that your evaluator chokes on. Then again, I could definitely see how the combinators might be so simple as to be obviously correct, but I don’t think it’s something that is proven by the type system itself.

                              Sounds simple enough to, if not prove in the type system per se, just not allow the relevant type to be constructed any other way.

                              But how do you test that your construction is correct in the first place?

                              I’d use the same structure to define the parser and the printer - parser combinators are quite close to that already. Looking around there seem to be some Haskell libraries and papers taking that approach.

                              But your approach doesn’t replace a property based test because you still can’t guarantee that printing is correct. For example, perhaps my printer doesn’t write the syntax correctly, or misquotes a character, or… And it’s not necessarily straight forward to couple parsing and printing since there can be multiple parses the produce the same AST, but printing the AST is always going to produce exactly one variant. There are algebraic laws at play here! Look:

                              parse(print(expr)) == expr
                              

                              But this does not necessarily hold:

                              print(parse("...")) == "..."
                              

                              I very much want to test the first property!

                              These sound like cases where this kind of testing would be useful. I think there’s a common thread of equivalences that one doesn’t necessarily know how to quotient out. I’m used to working in a style where the data has a defined canonical form (or else, defining one).

                              That is quite convenient. It might help to clarify this earlier!

                              I see some commonality with the stuff I’ve been saying about (Turing-complete) functions and preferring to represent operations as data that can be compared for equality. If you have something that you can only represent as a function then this kind of testing is perhaps the best (only?) way to compare it. Maybe that’s a better (or at least more practical) approach than trying to eliminate functions entirely.

                              That’s a good insight I think! I might buy it.

                              1. 1

                                It’s an explicit acknowledgment of the trade offs involved here, which is critical to developing type theory for practical use. There are all sorts of things that we can prove using types, but if it’s really hard to do it, then you’ve lost something that you might not be willing to give up.

                                I guess. I stand by “if it’s formalizable at all, I’ve (personally) not (yet) found it hard to express in types”.

                                Something similar might be accomplished by using privacy and combinators, where any combination of combinators is guaranteed to produce inhabitants of your AST that your evaluator can handle. But how do you know that’s true? If you got it right, then certainly, for callers it’s true. But internally? Maybe a certain permutation of your combinators can produce an inhabitant that your evaluator chokes on. Then again, I could definitely see how the combinators might be so simple as to be obviously correct, but I don’t think it’s something that is proven by the type system itself.

                                If you define a visit/cata method on the interface, where the visitor has to handle all the valid cases, then you have parametricity theorem style assurance that the valid cases are the only cases, don’t you? And then you’d define the evaluator in terms of that method.

                                But how do you test that your construction is correct in the first place?

                                The same way - either its invariants are obviously true, or break them down (by representing them as types) until they are.

                                And it’s not necessarily straight forward to couple parsing and printing since there can be multiple parses the produce the same AST, but printing the AST is always going to produce exactly one variant.

                                To the extent that this is a problem, isn’t it exactly the same problem if you use property-based testing to test that all the AST -> string -> AST roundtrips are correct?

                                1. 1

                                  I guess. I stand by “if it’s formalizable at all, I’ve (personally) not (yet) found it hard to express in types”.

                                  Do you consider mathematical properties formalizable? How do you encode the various algebraic laws listed here into types? https://en.wikipedia.org/wiki/Matrix_(mathematics)#Main_types

                                  If you define a visit/cata method on the interface, where the visitor has to handle all the valid cases, then you have parametricity theorem style assurance that the valid cases are the only cases, don’t you? And then you’d define the evaluator in terms of that method.

                                  This sounds isomorphic to Kiselyov’s approach with type classes.

                                  To the extent that this is a problem, isn’t it exactly the same problem if you use property-based testing to test that all the AST -> string -> AST roundtrips are correct?

                                  I don’t see how. Writing that test requires production rules for the AST, a shrinker, a printer and a parser. There needn’t be any explicit coupling between the printer and the parser (other than one being a consumer of the AST and the other being a producer).

                                  I mean, if your printer looks like this:

                                  print Dot = "."
                                  print Literal s = s
                                  print Kleene e = print e ++ "*"
                                  print Concat e1 e2 = print e1 ++ print e2
                                  print Alternate e1 e2 = print e1 ++ "|" ++ print e2
                                  

                                  Then what exactly in the type system dictates that "." is actually the correct syntax for Dot? Nothing does, but a quickcheck property that tests parse(print(expr)) = expr can totally check that.

                                  A real world regex engine has syntax more complex than that. For example, in addition to Dot you might also have DotAny, where the latter might need a special flag enabled to actually be used.

                                  1. 1

                                    Do you consider mathematical properties formalizable? How do you encode the various algebraic laws listed here into types? https://en.wikipedia.org/wiki/Matrix_(mathematics)#Main_types

                                    The same way - you give the particular types their own, well, type, require suitable evidence when constructing instances, and supply relevant evidence derived from that evidence. The one tricky part is that simpler type systems don’t always let you generalize (similar to working in first-order logic where you can’t make “for all f”-type statements) - but you never actually need a for-all-like statement in programming because you’re always ultimately going to operate on values. So you can work in a kind of continuation-passing-esqe style where you represent a “for all a” statement as a function from a to the specific instance of the statement.

                                    This sounds isomorphic to Kiselyov’s approach with type classes.

                                    It may well be, but you don’t need typeclasses to do it; you can implement it in any language that has polymorphism and basic generics i.e. any serious typed language.

                                    There needn’t be any explicit coupling between the printer and the parser (other than one being a consumer of the AST and the other being a producer).

                                    There needn’t be, but why wouldn’t there be? Writing the parser means expressing enough information to print the object, so there’s no sense repeating that definition separately to write the printer.

                                    1. 1

                                      The same way - you give the particular types their own, well, type, require suitable evidence when constructing instances, and supply relevant evidence derived from that evidence. The one tricky part is that simpler type systems don’t always let you generalize (similar to working in first-order logic where you can’t make “for all f”-type statements) - but you never actually need a for-all-like statement in programming because you’re always ultimately going to operate on values. So you can work in a kind of continuation-passing-esqe style where you represent a “for all a” statement as a function from a to the specific instance of the statement.

                                      Sounds like you’re making the code quite a bit more complex. I’m not interested in what’s theoretically possible here. I’m interested in what can actually be used. Frankly, I won’t buy this until I see it, so I think we’ll just have to disagree. I also wonder whether your approach can be done such that it is as fast as implementations that don’t need any testing. I find the lengths to which you are arguing against quickcheck as an effective tool to be extremely strange.

                                      There needn’t be, but why wouldn’t there be? Writing the parser means expressing enough information to print the object, so there’s no sense repeating that definition separately to write the printer.

                                      You can’t. We have parse(print(expr)) = expr but not print(parse("...")) = "...". You could probably gain the second law with a sufficiently complex abstract syntax to capture all of the particulars of the concrete syntax, but that abstract syntax is now a lot more complicated and much harder to work with. Hence, putting invariants into the type system just isn’t always worth it. Proof of correctness isn’t everything, but if we can approximate it with property based testing while avoiding the complexities of moving as many invariants into the type system as possible, then maybe that’s just fine.

                                      1. 1

                                        Frankly, I won’t buy this until I see it, so I think we’ll just have to disagree.

                                        Fair. I literally can’t remember the last time I saw a bug in my code that compiled that wasn’t also a bug/ambiguity in the spec (and I don’t think this is particular skill on my part - when I tried to write Python I put an embarrassingly large number of bugs live). But yeah, implausible claim I know.

                                        I also wonder whether your approach can be done such that it is as fast as implementations that don’t need any testing.

                                        No, not remotely. But again, presumably not a concern for cases when Python is an option.

                                        I find the lengths to which you are arguing against quickcheck as an effective tool to be extremely strange.

                                        Shrug. We all have our preferred tools/techniques. If you think you have a better way of doing something, isn’t it natural to argue strongly for that approach?

                                        We have parse(print(expr)) = expr but not print(parse(“…”)) = “…”

                                        Sure, but how does that change anything?

                                        1. 1

                                          No, not remotely. But again, presumably not a concern for cases when Python is an option.

                                          Well, I don’t really care about Python here. I’m trying to poke at your argument that quickcheck is useless in languages with good type systems.

                                          Shrug. We all have our preferred tools/techniques. If you think you have a better way of doing something, isn’t it natural to argue strongly for that approach?

                                          Sure, but there’s a lot of data you seem to be shrugging off…

                                          Sure, but how does that change anything?

                                          It makes the “one unified definition” impossible? I sure would love to see a real regex parser that could prove roundtripping via the type system!

                                          1. 1

                                            Sure, but there’s a lot of data you seem to be shrugging off…

                                            One of the eternal Hard Questions is what to do when the data is contrary to your own experience.

                                            It makes the “one unified definition” impossible? I sure would love to see a real regex parser that could prove roundtripping via the type system!

                                            I’m talking about something like https://github.com/pawel-n/syntax-example-json/blob/master/Main.hs , where you write a single definition of your syntax tree and then that single definition is used for parsing and printing, so roundtripping is necessarily correct.

                                            1. 1

                                              where you write a single definition of your syntax tree and then that single definition is used for parsing and printing, so roundtripping is necessarily correct.

                                              I don’t see how that works when there are many different ways to write the same AST. For example, (?s)foo.bar is has an AST equivalent to foo(?s:.)bar. Similarly, [a-c] might have an AST equivalent to a|b|c.

                                              As I mentioned previously, if you defined your AST such that the above wasn’t true (e.g., [a-c] always has a different AST than a|b|c), then perhaps one could take your approach. But now your AST is much harder to work with. To make it easier to work with, you probably need more types and a lot more code.

                                              1. 1

                                                For example, (?s)foo.bar is has an AST equivalent to foo(?s:.)bar. Similarly, [a-c] might have an AST equivalent to a|b|c.

                                                They don’t have the same AST; their syntax is different. They’re semantically equivalent. You might define a “canonicalize”-type function that would transform ASTs such that equivalent ASTs output the same form (in which case roundtripping would be “free” but it would be desirable to prove that canonicalize was idempotent), or you might define a semantic representation (e.g. the compiled DFA) in which all equivalent values are identical (and then it would be a case of proving that the compiled representation roundtripped via the AST, i.e. that the conversion functions are partial inverses).

                                                1. 1

                                                  They don’t have the same AST; their syntax is different.

                                                  No. Their concrete syntax is different. That doesn’t mean the AST must be different.

                                                  This is exactly my point. I could define an AST that is more faithful to the precise concrete syntax, but now I’ve created a lot more work for myself, including one type that expresses a very precise but unwieldy AST and another type that expresses a “canonical” and more convenient AST that can be used for analysis and/or compilation to a finite state machine.

                                                  Or I could have one AST and write property based tests.

                                                  1. 1

                                                    “AST” literally means abstract syntax tree - a tree whose structure corresponds to the structure of the syntax. What you’re describing is not having an AST.

                                                    To be able to make statements about the concrete syntax you need a formal representation of it, sure. But how could you ever reason about it without that?

                                                    1. 1

                                                      You can argue about definitions all you want, but the fact of the matter is that I have a tree that describes the syntax of a regex and multiple distinct concrete instances may map to a single instance of that tree. I pointed out long ago that this representation makes it difficult to prove round tripping via the type system. Conversely, it simplifies many other portions of the code such as analysis and compilation.

                                                      You’re diving down exactly the rabbit hole I’m trying to point out. To move more invariants into the type system requires more types, more code and more complexity in the API surface. I mentioned this several comments ago as a key trade off.

                                                      I don’t know how many times I have to say this, but the point isn’t “what can I prove using types,” but more, “what can I prove using types convenienty.” You’ve retorted that the two categories are “the same in your experience.” I’ve submitted several examples where they aren’t the same in my experience.

                                                      1. 1

                                                        I think the complexity is there already, and I’d want the code there already. Even if I were working in Python I’d want to have an AST as well as a more semantic representation - if nothing else, if you are doing testing it makes it much easier to test the logic if both input and output are first-class objects rather than strings. And I don’t see how you could possibly do property-based testing of the parser without having a structured representation of its input (I mean, presumably you would actually want to test some input strings that aren’t canonical representations, so you can’t just do all the testing by roundtripping - the case in which you can do that is precisely the case where you have an actual AST where the structure corresponds 1:1 to the textual structure).

                      2. 5

                        In theory maybe. In practice I never seem to hit anything important and formalizable that I can’t express via types.

                        Unless you program in Coq, how would you lift some evidence that a sort algorithm works into the type system? Note, OpenJDK’s sort algorithm was shown to be broken for some inputs relatively recently IIRC.

                        Another example is something like a Paxos or Raft implementation. This goes through several transitions and one needs to prove all transitions do not invalidate any variants. In Quviq’s quickcheck this is building a state machine model and running over it. Again, unless one is programming in Coq, how would you do that in the type system?

                        1. 1

                          Unless you program in Coq, how would you lift some evidence that a sort algorithm works into the type system?

                          Even in say Java you can do something like:

                          public interface CompareResult {
                            <V> V visit(CompareResultVisitor<V> visitor);
                          }
                          public interface CompareResultVisitor<V> {
                            V visitLessThan(LessThan lt);
                            V visitEqual(Equal eq);
                            V visitGreaterThan(GreaterThan gt);
                          }
                          public class LessThan {
                            public final int a;
                            public final int b;
                            LessThan(final int a, final int b) {
                              assert(a < b);
                              this.a = a;
                              this.b = b;
                            }
                            @Override
                            public <V> V visit(CompareResultVisitor<V> visitor) {
                              return visitor.visitLessThan(this);
                            }
                          }
                          //Equal and GreaterThan omitted lest I die of old age, I'd forgotten quite what writing Java is like
                          public class Compare {
                            private Compare(){}
                            public static CompareResult compare(int a, int b) {
                              if(a < b) return new LessThat(a, b);
                              else if(a == b) return new Equal(a, b);
                              else return new GreaterThan(a, b);
                            }
                          }
                          

                          And then you build on that e.g. you might have another method that only accepts a LessThan. If there are intermediate states as you’re sorting the list then you define a class to represent each one, and write the transitions such that they can only ever happen correctly (using visitors). You can take it as far as you like, e.g. in Coq you might define a < b as ∃ c > 0: a + c = b, but you can do exactly that in Java, encoding the naturals with Church encoding or similar. In practice you generally choose a point at which things are “obviously correct” and assume they’re, well, correct.

                          This isn’t proof in the type system proper as you would have in Coq; you don’t have the same level of tool support, and you don’t have provable termination (e.g. you could “cheat” by defining compare as return compare(a, b); - or more subtly, do that only for one of the cases). You can avoid that by disallowing recursive calls and recursive types but in practice that’s impractical (e.g. you’d need to define a concrete type for each list size - even HList as usually defined can be “tricked” into creating a circular structure). Still, I don’t find these to be limiting in practice - even many dependent-typed languages allow recursive types which have the same limitation, and no-one actually writes the malicious infinitely-recursive counterexamples.You can reduce the problem of writing a correct implementation of your algorithm to only instantiating your “primitives” when their invariants hold (i.e. only creating a LessThan for a < b which is easy enough to verify by inspection - there is no public constructor so you only have check code in the same package, if you really need to protect against malicious classes being created in other packages you can split your project into OSGi modules), and if the invariant for any given “primitive” is too complex then you can reduce it to simpler ones in exactly the same way you’d do when writing a proof in Coq.

                          I can see that in high-performance code (like the main sort algorithm) you’d want to reuse the same class for multiple states and in that case separating the proof from the implementation code could be valuable - but I don’t think that’s a concern that applies to any problem where Python is an option.

                          1. 6

                            How does this prove a sort function sorts? The visitors are run at runtime. With asserts removed in a release build I can construct a LessThan that is not less than. So, if I want an invalid sort function to not compile, how does this accomplish that? I’m not much of a type theorist so perhaps I’m missing something obvious.

                            1. 1

                              The visit method is generic, so we have parametricity-theorem-style assurance that it calls one of the cases. Which in turn means it must provide one of the cases, which it can only do by either being one itself or calling compare. So although you can write a custom implementation of CompareResult, the only possible values are the ones defined here or a “lazy” value that’s isomorphic to them and preserves the correctness properties (assuming we disallow e.g. global mutable state).

                              That relies on visit not being implemented as bottom, e.g. return null; or throw new RuntimeException(); or return visit(visitor);. The first two we can rule out by just disallowing those constructs everywhere (as with global mutable state); the last we either accept the risk for simple cases, or for complex mutual recursion cases we can require passing evidence that the recursion is getting smaller.

                              You can’t construct a LessThan that is not less than because the constructor is package-private. We can enumerate all the constructor calls within the package and verify by inspection that it’s only called when a really is less than b (i.e. treating the correctness of compare() as an axiom in our proof), or if we decide less-than-ness is too complex a property to be happy that the code is obviously correct then we break it down into simpler properties using the same techniques (which in this case probably means representing the numbers at the type level as Zero and Successor, like you do in Idris). Even in Coq you can only ever prove things under axiom assumptions.

                              1. 1

                                I still do not see how this actually proves a particular sort implementation is correct. Could you please elaborate on that? On top of that, your solution seems to get farther and farther away from something usable, if I am understanding it correctly (which I don’t think I am).

                                1. 1

                                  I still do not see how this actually proves a particular sort implementation is correct. Could you please elaborate on that?

                                  Given a proof that an invariant holds, you translate each step of the proof into a type; constructing that type requires the assumptions of that step, and that type provides evidence that the outcome holds (implemented using the inputs).

                                  On top of that, your solution seems to get farther and farther away from something usable, if I am understanding it correctly (which I don’t think I am).

                                  The further you go in formalizing it from very basic assumptions it the less practical it gets - I probably wouldn’t want to work with naturals implemented directly in real code. But whatever reasoning it is that convinces you that the invariant you believe in is true, you can encode that same reasoning (in the same terms) into the type system - the parts that you see as obviously true, you just leave as obviously true.

                                  I’ve never had to implement something that I wasn’t convinced was correct (like I said, maybe these things make more sense in an exploratory programming setting where you might genuinely not know whether a given property holds - does that happen?), and pushing that reasoning into the type system enforces that the reasoning itself is correct (in as much as the conclusion follows from the premises). It’s still possible to make errors at the premise level, but I think when making that kind of mistake and using quickcheck-like tools I’d just recapitulate it in the generators.

                                  1. 3

                                    Given a proof that an invariant holds, you translate each step of the proof into a type; constructing that type requires the assumptions of that step, and that type provides evidence that the outcome holds (implemented using the inputs).

                                    For sorting, I have a function like int[] sort(int[])

                                    What are the types that would prove that sort actually sorts? In Ocaml, I can certainly construct a type that encodes the sorting between 2 elements, or 3 elements, but not for an arbitrary N elements. This is where dependent types come in to play as far as I am aware. If I understand the example you gave, you have encoded a relationship between two elements, but you haven’t shown how that proves an entire sequence of values is in a particular order or how to take that and correctly construct an array with that invariant.

                                    A similar example I’ve seen is handling commiting something in Raft only after a majority of nodes have acked and update. Doing it for a fixed number of nodes is easy, you can encode all the states in a phantom type or similar. But doing it for an arbitrary number of nodes is not encodable directly in the type system (at least in Ocaml), unless I missed something trying to do it.

                                    1. 1

                                      What are the types that would prove that sort actually sorts? In Ocaml, I can certainly construct a type that encodes the sorting between 2 elements, or 3 elements, but not for an arbitrary N elements. This is where dependent types come in to play as far as I am aware. If I understand the example you gave, you have encoded a relationship between two elements, but you haven’t shown how that proves an entire sequence of values is in a particular order or how to take that and correctly construct an array with that invariant.

                                      For arbitrarily many elements you do the CPS-like thing - the return type is something like interface SortedListGreaterThanOrEqualTo<N extends Nat> { public <V> V cata(Visitor<N, V> visitor); } where interface Visitor<N extends Nat, V> { V empty(); <M extends Nat> V cons(M head, SortedListGreaterThanOrEqualTo<M> tail, GreaterThanOrEqualTo<M, N> evidence); }

                                      In terms of showing a specific sorting algorithm works I’d start from a proof for that sorting algorithm - in practice if the problem was implementing sorting I’d probably write a mergesort, which might well be simple enough to be obviously correct.

                                      1. 2

                                        You still haven’t shown how you can prove a sorting algorithm sorts via the type system, at least in Java (I’m assuming we agree that something like Coq or Agda can prove such a thing in the type system). Your claim was:

                                        It’s pretty easy to use a good type system these days, and once you have that, for any formalizable correctness property why would you ever not lift that property into the type system where it’s not just tested but proven correct?

                                        And the counter argument is that no, you cannot, for all things. Expressing something like a sorting algorithm is correct in the type system requires a powerful type system. So far you have insisted that you can lift this into the type system but have not proven it. So please, lay down the law and show me that I am incorrect, or admit there is some portion of a program that you cannot express in the type system, and this is where something like property based testing adds value.

                                        1. 1

                                          If the task is “write a sorting algorithm in Java that you’re confident is correct, without using property-based testing” then I’m very happy to do that, but it probably doesn’t involve much that you’d call proof in the type system. But that’s the thing that would have to be difficult in order for property-based testing to be useful.

                                          You want an actual proof, as a demonstration of the techniques? How basic do you want the axioms to be? (E.g. can I assume that the < operator does what it does?) And what, formally, do you want proof of - just that the result is ordered?

                                          1. 1

                                            Let’s not get too crazy, I’m just really curious how you would actually solve this problem without resorting to some kind of testing (and a sorting algorithm is a place where prop testing shines). In the code you’ve been showing, your code always goes through some black-hole of correctness information. Your claim earlier was that you haven’t met an invariant that you haven’t been able to lift into the type system

                                            I assume that means you can define types such that an invalid sorting algorithm does not compile, as that is what almost everyone I’ve met means by that claim. Is that assumption correct? If so, how would you do this? Otherwise what do you mean by lifting to the type system?

                                            1. 2

                                              I assume that means you can define types such that an invalid sorting algorithm does not compile, as that is what almost everyone I’ve met means by that claim. Is that assumption correct?

                                              If you mean by this “any edit to the source that preserved all type signatures would necessarily result in a valid sorting algorithm or something that doesn’t compile”, I think that’s theoretically possible in Java but not practical. I tend to think of representing some of the information by classes with sealed constructors as still lifting to the type system (I see it as similar to having an unchecked theorem in a more fancy typed system), but maybe that’s an important distinction I’m glossing over there. So I guess I’m making a couple of different claims:

                                              1. The extra tools provided by a system like Coq or Agda are never strictly necessary for proving correctness of a concrete program. They make proofs a lot more practical and a lot more human-readable, but anything you can prove (about a single concrete program - I think they might have ways to talk about all possible programs that you can’t do in simpler languages) in them you could (at least in theory) mechanically-ish expand out to prove in any type system that has polymorphism and basic generics.

                                              2. A programmer of moderate skill who is deliberately pushing invariants into the type system (and explicitly not writing for maximum performance) can, in practice, write correct-to-the-specification code without any kind of testing (beyond checking it compiles) in any reasonable typed language (again, polymorphism and basic generics).

                                              I think an example is probably a good idea - I’ve already tried to describe what I’m talking about doing as clearly as I can - looking around for examples of mathematical proof they all seem too simple or too general though. I certainly wouldn’t want to try to express the TimSort verification paper in Java (I had a go at it in Scala once and that was hard enough), but that’s an array-twiddly performance-oriented piece of code where the invariants have dozens of conditional clauses.

                                              Honestly I’m not sure how I ended up arguing for Java here, because I’m all in favour of using systems like Coq to make proving correctness easier. But even if you are tied to Java I think it’s still possible to use types to enforce correctness, and I think that’s more likely to be an effective approach than quickcheck-like techniques.

                                              In particular the bug in JDK sorting was, as far as I know, found by applying formal methods to it, not by a quickcheck-like testing system. And in the scalaz-stream example that was posted recently the tester seemed to spend more time encoding their assumptions into the generative tester than actually finding any bugs with it. So one thing that would be interesting to see would be more examples of bugs found with that kind of tool, and whether the invariant that was violated could have been expressed directly in the code.

                                              1. 4

                                                Thanks for the thoughtful response.

                                                I think they might have ways to talk about all possible programs that you can’t do in simpler languages) in them you could (at least in theory) mechanically-ish expand out to prove in any type system that has polymorphism and basic generics.

                                                I think this is likely untrue, if you want to take a dynamic amount of input at any point. Consider how you would mechanically expand out into the type system that List.add results in a list that is the the current_length + 1. I think this is impossible in a language like Java without really impractical restrictions.

                                                A programmer of moderate skill who is deliberately pushing invariants into the type system (and explicitly not writing for maximum performance) can, in practice, write correct-to-the-specification code without any kind of testing (beyond checking it compiles) in any reasonable typed language (again, polymorphism and basic generics).

                                                I think this is likely untrue as well. Again, I think a sort function is a valid counter example here. It’s easy to construct a relationship between a fixed-number of elements, like your LessThan class, but establishing a relationship between a dynamic number of elements is not possible without a powerful type system. As far as I am aware Scala, Java, and even Haskell, do not have sufficiently expressive type systems (although maybe that’s changing with type families in Haskell). Even a simple mergesort, I think you’d be unable to prove correct using the type system of Scala or Java (although I know less about what magic type features Scala might offer).

                                                But even if you are tied to Java I think it’s still possible to use types to enforce correctness, and I think that’s more likely to be an effective approach than quickcheck-like techniques.

                                                I don’t know where this belief comes from and there doesn’t seem to be any reason to believe it’s true. Not because prop testing itself is special but because there seems to be a clear limit on what can be expressed in Java’s type system.

                                                In particular the bug in JDK sorting was, as far as I know, found by applying formal methods to it, not by a quickcheck-like testing system.

                                                This is absolutely correct. However this thread has not been about proving property based testing is useful but rather exploring your claim that you can move any invariant into the type system.

                                                And in the scalaz-stream example that was posted recently the tester seemed to spend more time encoding their assumptions into the generative tester than actually finding any bugs with it

                                                What’s the problem with this? If there are no bugs to be found then there are no bugs to be found. The post did prove that (for the properties the author cared about) the scalaz-stream code had the properties it claimed.

                                                So one thing that would be interesting to see would be more examples of bugs found with that kind of tool, and whether the invariant that was violated could have been expressed directly in the code.

                                                My best experiences with a property based testing has been validating state machine transitions. Specifically, QuviQs tooling lets one express operations that can be done on a state machine and invariants that must be held after the operation, then it performs a large number of them and tries to find minimal number of transitions when a fault is found. I’ve used this quite a bit in code that has to serialize data to the disk and be crash proof. Some of the state machine transitions where simulating various system crashes and then validate the data was readable (but perhaps missing entries). This found errors that were difficult to reason about, or poorly documented, or just unclear.

                                                A lot of my experience with prop testing has also been finding inputs that I didn’t consider. In many cases it’s around user input that, as the developer, I’m unlikely to think of because it’s clearly wrong. One recent example of this is a program that takes two set of values, one being replaced by the other set. And the original code did not catch the case, because it make no sense given what the program does, when the user put the same value in both sets. Pretty obvious in retrospect but the prop testing found it pretty easily when we were not even testing for it directly.

                                                Thanks again for the response. I still feel unsatisfied with your claim that you can lift any invariant to the type system. So far you’ve just seemed to insist that one can but haven’t demonstrated it. But it would probably require a lot of work to demonstrate you are correct, more work than is probably wroth it for a lobste.rs post, so I don’t expect it.

                                                1. 1

                                                  I think this is likely untrue, if you want to take a dynamic amount of input at any point. Consider how you would mechanically expand out into the type system that List.add results in a list that is the the current_length + 1. I think this is impossible in a language like Java without really impractical restrictions.

                                                  May be my JVM bias showing. I think it’s possible to do this in Java by representing integers as Succ<Succ<...<Zero>...>>. Of course you’d never want to write out 100 in code in that form - but you’d never want to write a 100-item list literal either. If you’re just using these as type parameters they’ll be erased at runtime, so the overhead you get is compile-time-only and proportional to the size of the literals you use.

                                                  I would like to give an example of sorting when I get time - you’re right not to take my word for it - I’ll probably use Scala but java-izable Scala, if that makes sense.

                                                  1. 1

                                                    I just don’t see how that’s going to be possible. What would the type of an append look like for a List? Where we take two Lists of length m and n and we want to ensure the output is a List of length m + n. Each one of those are unique types and the third type is derived from the previous two types. AFAIK, you require dependent types to express that.

                                                    1. 2

                                                      In a dependent typed language we’d write something like:

                                                      interface FixedLengthIntList<N extends Nat> {
                                                        <M extends Nat> FixedLengthIntList<#(N + M)> append(FixedLengthIntList<M> other);
                                                      }
                                                      

                                                      With a simpler type system we can’t write the dependent type, but we can do what a dependently typed language would do by hand: we return a pair of FixedLengthList<O> and a witness/evidence IsSum<M, N, O>, for some (unknown) type O extends Nat. In Java we can write an existential type like that directly; otherwise we can do a CPS-like transform where we write the type “X<T> for some T” as “(X<T> => U for all T) => U for all U”.

                                                      It becomes our responsibility rather than the compiler’s to cart around the value of type IsSum<M, N, O> (representing the fact that M+N=O) to the point where we we actually use the fact that M+N=O. We have to implement any facts we use about + (e.g. associativity). We can either treat our IsSum as a marker that’s axiomatically correct, or we can implement everything by recursion: allow IsSum to be instantiated only for M=0 or N=0 or by adding 1 to each side of a previous IsSum. (In the implementation of append we’d build up the IsSum recursively at the same time we were doing the append recursively - in the base case one of the lists is empty and M + 0 == M, in the recursive case we can append a list that’s 1 longer to get a result that’s 1 longer). And in the opposite direction, IsSum implements a cata-like method that says that an IsSum<N, M, O> is either an M=0 and an N=O, or a IsSum<N, M-1, O-1>. Then we don’t have to take associativity as an axiom, we can implement an associate method with a signature something like

                                                      static <A, B, C, AB, BC, ABC> IsSum<A, BC, ABC> associate(ab: IsSum<A, B, AB>, abc: IsSum<AB, C, ABC>, bc: IsSum<B, C, BC>);
                                                      

                                                      in terms of this cata-like method and the implementation will be provably correct.

                                                2. 4

                                                  So one thing that would be interesting to see would be more examples of bugs found with that kind of tool, and whether the invariant that was violated could have been expressed directly in the code.

                                                  If you want examples, the authors of the Erlang version have found a decent number of non trivial bugs using it. Some references:

                                                  I think all of these (maybe but the last one) are fairly difficult to avoid just relying on type safety. Despite of the actual quality of the code (I haven’t looked at it), the first two ones are caused by concurrency, and specially the second depends on some semantics that are not trivial to describe formally, yet is practical to describe them as quickcheck properties.

                                                  The third family of bugs are arguably just quality problems of the compiler, but formally specifying a compiler is not a trivial task either, again quickcheck provides a cost effective approach to at least test it more intensively.

                                                  The fourth is the one I believe could be possibly avoided by using better coding techniques, yet using property based testing again helped covering more corner cases than the existing tests had done.

                                                  It would be interesting to actually know what could be the cost of avoiding those kind of errors using other techniques.

                                                  1. 1

                                                    Do you have transcripts or similar for either of the first two? I’m particularly intrigued by this idea of semantics that are hard to describe formally but easy to describe as quickcheck properties, because as far as I can see that’s the same thing?

                                                    If I’ve understood that compiler example correctly I’d be very surprised if it wasn’t caught by any type system. Proving that a compiler outputs correct code is hard, but enforcing that the tree is some valid tree at every stage is pretty easy.

                                                    1. 1

                                                      No, I don’t have much more than those links, and the vage memories from when we talked about the actual bugs. Those are all open source projects though, so it should be possible to find the patches in github.

                    2. 1

                      @drmaciver: not sure if its helpful, but my friend did a masters thesis on the topic. Maybe 1 of 76 citations will be useful? http://www.eecs.northwestern.edu/~clk800/klein-ms-6aug09.pdf

                      1. 1

                        Oh, nice, thanks. Don’t yet know if it will be helpful, but this is super relevant to my interests.