I find interesting that this Haskell implementation of the Timsort sorting algorithm exhibits the same bug as Python and Java implementations, despite Haskell having a more advanced type system. This is a reminder that type checking is not a magic bullet, and is hardly enough to prove a program correctness. Formal proofs and whole program static analysis are still necessary, with or without an advanced type system.

Using a more powerful type system like Idris, you could rewrite the QuickCheck properties to types and then prove them. Type checking may not be a magic bullet but are actually perfect for solving this problem.

1/ How is Idris different from KeY (the formal verification tool used to find the bug in Timsort: http://en.wikipedia.org/wiki/KeY)? I just know that Idris is a pure functional language with dependent types that tries to blur the lines between general-purpose programming and theorem proving, but my knowledge stops here :)

2/ What would be the pros and cons of using Idris instead of KeY to verify an implementation of Timsort?

Pro: verification in the same language as the main code, so any Idris tools would understand the proof and e.g. preserve it when doing automated refactoring. Con: Idris probably has less tooling available than Java.

Pro: verification in the same language as the main code, so any Idris tools would understand the proof and e.g. preserve it when doing automated refactoring.

As far as I know, the JML (Java Modeling Language) specifications used by KeY are written directly into the Java source code files, as comments, which implies they can be preserved too.

Con: Idris probably has less tooling available than Java.

Does the fact that Idris is a total language limit in some way the kind of algorithm you can write and/or constrain the way you write them, compared to an imperative language like Java or even a pure functional language like Haskell?

As far as I know, the JML (Java Modeling Language) specifications used by KeY are written directly into the Java source code files, as comments

Which means most Java tools will treat them as simple comments, and not e.g. update them appropriately when performing automated refactoring.

Does the fact that Idris is a total language limit in some way the kind of algorithm you can write and/or constrain the way you write them, compared to an imperative language like Java or even a pure functional language like Haskell?

There are escape hatches even in Idris, so the difference is more that you’re explicit about which parts of your program contain assumptions. The difference between a total language and not is that a non-total language allows you to invert functions when you can’t prove the inverse exists. In practice it would only ever come up if you wanted to write a program that relied on some statement that you couldn’t prove but nevertheless thought was true. There are plenty of those in mathematics (e.g. a function that accepts an even number >= 4 and returns two primes that sum to that number), but it’s hard to think of one that would be useful in programming. But if you did come up with such a thing, in Idris all that would mean is that you’d have an explicit “this program assumes Goldbach’s conjecture” somewhere in your source.

The space of all computable functions is generated by the five primitive recursive operators plus the μ operator ( http://en.wikipedia.org/wiki/%CE%9C_operator ). Informally, the μ operator inverts a function.

Any function defined solely in terms of the five primitive recursive operators is necessarily total and as such is trivial to express in a total language. So the difference between a turing-complete language and a total language is solely about which applications of the μ operator are allowed; that is, a total language only allows you to apply the μ operator when you can prove it is total (i.e. when you can prove that the inverse of the function you’re inverting exists), whereas a turing-complete language always allows you to apply the μ operator even when you can’t give such a proof.

You can’t get around doing the work somewhere if you want a proof - and even a language like Haskell will allow you to not prove all the properties of your program. But in something like Haskell you can do the work in the type system rather than in magic comments. To my mind that’s an improvement.

Yes, I agree that if you want a proof, you have to do the work somewhere. My point is that, in the specific case of this implementation of Timsort, it was not possible to do the “work” in the type system, even in Haskell.

I think it’s more likely that the author wasn’t bothered enough to do it at all than that it was impossible to do it in the type system. I don’t know Haskell but I could take a stab at porting the proof to Scala if you think that would be enlightening.

Are you saying that you think it would have been possible to “refine” the types enough to identify the bug, by just using Haskell’s type system, and without any external proof (like the verification system KeY used to identify the bug in Python and Java)?

But if you’re right, it would imply that the bug in Timsort could have been found by simply implementing the algorithm in Scala with a set of types restrictive enough to exhibit the bug? I’d be curious to see that.

The SKI calculus has been implemented using the Scala type system, which means it’s turing complete. I have no idea how hard it would be to implement something of this complexity using it, however.

So it’s a subtle point, but Turing completeness is true of several typesystems, including Haskell’s and Common Lisp’s, but that doesn’t mean they can necessarily catch this. Dependent types are not more expressive computationally, but they allow more categories of input into typing decisions, and are therefore able to express constraints that “weaker” systems cannot.

Rewriting Timsort in terms of SKI calculus is certainly possible since the SKI calculus has been proven Turing complete, but the code would be a lot harder to write and maintain than in Java, Scala or Python. Moreover, as @lrene wrote, it doesn’t necessarily means the bug would have been found after such a rewrite. It probably explains why people use verification tools like the KeY project.

I read a different page (http://envisage-project.eu/timsort-specification-and-verification/) and was wondering if someone had a page with a list of numbers that allowed us to verify this bug? I’d like to be able to paste the list into Ipython and see the behavior myself, rather than have to run some Java code. Thanks.

The cpython code comments say hey, this code will work for arrays up to length 2**64, but it provably won’t work beyond 2^{49} elements. It’s very academic since no known machine can handle arrays that long right now in memory, but still it’s an interesting point.:

“The Java version of TimSort was ported from the original CPython version. That version also contains the bug and was intended to work for arrays with up to 2^{64} elements. However, on current machines it is not possible to trigger an out-of-bounds error in the Python version: it allocates 85 elements for runLen, which suffices (following our analysis in the full paper) for arrays with less than 2^{49} elements. For comparison, the current most powerful supercomputer http://en.wikipedia.org/wiki/Tianhe-2 has about 2^{50} bytes of memory in total.”

There is a table in the paper that indicates the required stack size for a given list length with both a correct implementation and incorrect. You could always recompile python with a smaller stack size in between those two numbers if you want to see it crash.

I find interesting that this Haskell implementation of the Timsort sorting algorithm exhibits the same bug as Python and Java implementations, despite Haskell having a more advanced type system. This is a reminder that type checking is not a magic bullet, and is hardly enough to prove a program correctness. Formal proofs and whole program static analysis are still necessary, with or without an advanced type system.

Using a more powerful type system like Idris, you could rewrite the QuickCheck properties to types and then prove them. Type checking may not be a magic bullet but are actually perfect for solving this problem.

Interesting. Two questions on this topic:

1/ How is Idris different from KeY (the formal verification tool used to find the bug in Timsort: http://en.wikipedia.org/wiki/KeY)? I just know that Idris is a pure functional language with dependent types that tries to blur the lines between general-purpose programming and theorem proving, but my knowledge stops here :)

2/ What would be the pros and cons of using Idris instead of KeY to verify an implementation of Timsort?

Pro: verification in the same language as the main code, so any Idris tools would understand the proof and e.g. preserve it when doing automated refactoring. Con: Idris probably has less tooling available than Java.

As far as I know, the JML (Java Modeling Language) specifications used by KeY are written directly into the Java source code files, as comments, which implies they can be preserved too.

Does the fact that Idris is a total language limit in some way the kind of algorithm you can write and/or constrain the way you write them, compared to an imperative language like Java or even a pure functional language like Haskell?

Which means most Java tools will treat them as simple comments, and not e.g. update them appropriately when performing automated refactoring.

There are escape hatches even in Idris, so the difference is more that you’re explicit about which parts of your program contain assumptions. The difference between a total language and not is that a non-total language allows you to invert functions when you can’t prove the inverse exists. In practice it would only ever come up if you wanted to write a program that relied on some statement that you couldn’t prove but nevertheless thought was true. There are plenty of those in mathematics (e.g. a function that accepts an even number >= 4 and returns two primes that sum to that number), but it’s hard to think of one that would be useful in programming. But if you did come up with such a thing, in Idris all that would mean is that you’d have an explicit “this program assumes Goldbach’s conjecture” somewhere in your source.

[Comment removed by author]

The space of all computable functions is generated by the five primitive recursive operators plus the μ operator ( http://en.wikipedia.org/wiki/%CE%9C_operator ). Informally, the μ operator inverts a function.

Any function defined solely in terms of the five primitive recursive operators is necessarily total and as such is trivial to express in a total language. So the difference between a turing-complete language and a total language is solely about which applications of the μ operator are allowed; that is, a total language only allows you to apply the μ operator when you can prove it is total (i.e. when you can prove that the inverse of the function you’re inverting exists), whereas a turing-complete language always allows you to apply the μ operator even when you can’t give such a proof.

You can’t get around doing the work

somewhereif you want a proof - and even a language like Haskell will allow you to not prove all the properties of your program. But in something like Haskell you can do the work in the type system rather than in magic comments. To my mind that’s an improvement.Yes, I agree that if you want a proof, you have to do the work somewhere. My point is that, in the specific case of this implementation of Timsort, it was not possible to do the “work” in the type system, even in Haskell.

I think it’s more likely that the author wasn’t bothered enough to do it at all than that it was impossible to do it in the type system. I don’t know Haskell but I could take a stab at porting the proof to Scala if you think that would be enlightening.

Are you saying that you think it would have been possible to “refine” the types enough to identify the bug, by just using Haskell’s type system, and without any external proof (like the verification system KeY used to identify the bug in Python and Java)?

Yes.

But if you’re right, it would imply that the bug in Timsort could have been found by simply implementing the algorithm in Scala with a set of types restrictive enough to exhibit the bug? I’d be curious to see that.

The SKI calculus has been implemented using the Scala type system, which means it’s turing complete. I have no idea how hard it would be to implement something of this complexity using it, however.

So it’s a subtle point, but Turing completeness is true of several typesystems, including Haskell’s and Common Lisp’s, but that doesn’t mean they can necessarily catch this. Dependent types are not more expressive computationally, but they allow more categories of input into typing decisions, and are therefore able to express constraints that “weaker” systems cannot.

Rewriting Timsort in terms of SKI calculus is certainly possible since the SKI calculus has been proven Turing complete, but the code would be a lot harder to write and maintain than in Java, Scala or Python. Moreover, as @lrene wrote, it doesn’t necessarily means the bug would have been found after such a rewrite. It probably explains why people use verification tools like the KeY project.

I would love to see that!

Would also be interesting to hear from someone verse in Haskell about an alternative, bug-free implementation.

I read a different page (http://envisage-project.eu/timsort-specification-and-verification/) and was wondering if someone had a page with a list of numbers that allowed us to verify this bug? I’d like to be able to paste the list into Ipython and see the behavior myself, rather than have to run some Java code. Thanks.

Python will only exhibit the bug for really large lists (larger than you can create) because it preallocates enough space.

Interesting, what does this mean then? How large a list? I thought the bug was exposed by a particular pattern of numbers?

The cpython code comments say hey, this code will work for arrays up to length 2**64, but it provably won’t work beyond 2

^{49}elements. It’s very academic since no known machine can handle arrays that long right now in memory, but still it’s an interesting point.:http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

“The Java version of TimSort was ported from the original CPython version. That version also contains the bug and was intended to work for arrays with up to 2

^{64}elements. However, on current machines it is not possible to trigger an out-of-bounds error in the Python version: it allocates 85 elements for runLen, which suffices (following our analysis in the full paper) for arrays with less than 2^{49}elements. For comparison, the current most powerful supercomputer http://en.wikipedia.org/wiki/Tianhe-2 has about 2^{50}bytes of memory in total.”There is a table in the paper that indicates the required stack size for a given list length with both a correct implementation and incorrect. You could always recompile python with a smaller stack size in between those two numbers if you want to see it crash.