1. 4

Abstract: “The GNU Multi-Precision library is a widely used, safety-critical, library for arbitrary-precision arithmetic. Its source code is written in C and assembly, and includes intricate state-of-the-art algorithms for the sake of high performance. Formally verifying the functional behavior of such highly optimized code, not designed with verification in mind, is challenging. We present a fully verified library designed using the Why3 program verifier. The use of a dedicated memory model makes it possible to have the Why3 code be very similar to the original GMP code. This library is extracted to C and is compatible and performance-competitive with GMP.”


  2. 0

    This paper spends a lot of time pointing out possible reasons there might be faults in the GNU Multi-Precision library and then suggests that if an arbitrary-precision integer library was implemented using a language co-invented by the authors of this paper, the code would be correct.

    Lobste.rs needs a vanity-research tag.

    1. 9

      We might need a fantasy-skepticism tag, too, if you keep selectively reading papers or ignoring prior work like that. Right after the section you mention, they describe the foundation of their solution: the ML-like language with formal specifications and solvers in Why3. The prior work on ML-like languages indicates that using them lowers defects compared to C. The prior work with Why3, by itself or with stuff like SPARK on top, caught plenty of defects of the exact nature they’re describing, such as range errors, aliasing, etc. It’s also a cathedral-style effort where a focused team takes it a component at a time instead of relying on drive-by contributions. That by itself has increased quality in prior projects. Based on prior data, any of these should make for a safer library; a combination, even more so.

      Re-skimming it, I find that they attempt to address some of the problems that come to my mind:

      1. Mismatch between the safe language and the target. They use an imperative subset of Why3’s language which has already been integrated with C, Java, and Ada and seen real-world use. Lots of errors caught, but few introduced. The subset itself is close to C. That increases confidence.

      2. A C memory model can itself have errors. They make a simple one covering just the features of C they need. Less complexity means fewer errors. Legacy code doesn’t get machine-checked against a memory model at all.

      3. Humans need to inspect the code it produces to find errors in their scheme. They include numerous examples of the WhyML code side-by-side with the extracted C, showing that the extracted code is quite readable. That means reviewers can check it. Since it’s C, reviewers can also apply a huge ecosystem of field-proven tools to find bugs.

      4. The provers could get something wrong. Although this area needs more work, this project uses four provers, mainly so each can catch things the others miss. A side effect is an increase in trust for whatever they all agree on. They said that data is hard to know precisely for some reason; fixing that would help with this point in future experiments. They also corroborated the first part when they found that specific provers were really good at handling specific properties. Any empirical confirmation or record of prover use could increase confidence by tying problem types to provers experimentally shown to handle them well.

      5. Difference in integer types for the adder. They use two one-limb additions where GMP uses a two-limb addition, and that leads to a slowdown (see the sketch after this list for the two styles). They say they plan to switch, but I wonder why they broke compatibility there in the first place. Best case, it could be as simple as them not knowing it was important to do the addition a specific way; there’s a hint of that in the problem statement, which says a lot of optimizations are built into GMP that people might not know about. Worst case, it was too hard to verify in time for paper submission.
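
      To make point 5 concrete, here is a rough, generic sketch of the two addition styles (mine, not the paper’s WhyML or GMP’s actual code; limb_t, add_n_single, and add_n_double are made-up names). The single-limb form is easier to specify and verify; the double-width form is closer to what optimized code does:

      ```c
      #include <stdint.h>
      #include <stddef.h>

      typedef uint64_t limb_t;

      /* Style A: single-limb additions; the carry is recovered by comparing
         the wrapped sum against an operand, so no double-width type is needed. */
      static limb_t add_n_single(limb_t *r, const limb_t *a, const limb_t *b, size_t n)
      {
          limb_t carry = 0;
          for (size_t i = 0; i < n; i++) {
              limb_t s = a[i] + carry;   /* may wrap; wrapping is intended */
              carry = (s < carry);       /* carry out of a[i] + carry */
              s += b[i];
              carry += (s < b[i]);       /* carry out of the second add */
              r[i] = s;
          }
          return carry;                  /* final carry limb, 0 or 1 */
      }

      /* Style B: a double-width accumulator, so the carry falls out of the
         high half directly; closer to add-with-carry assembly, but harder to
         model without a 128-bit type. */
      static limb_t add_n_double(limb_t *r, const limb_t *a, const limb_t *b, size_t n)
      {
          unsigned __int128 acc = 0;
          for (size_t i = 0; i < n; i++) {
              acc += (unsigned __int128)a[i] + b[i];
              r[i] = (limb_t)acc;        /* low 64 bits */
              acc >>= 64;                /* keep only the carry */
          }
          return (limb_t)acc;
      }
      ```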

      If you ignore backward compatibility, the work in the related-work section is stronger than this one, especially the work using HOL4 with its small checkers in the TCB, or the work using SPARK, which is field-proven in many projects. A drop-in, compatible replacement for GMP makes sense, though. Even with its deficiencies, this work inspires more confidence than the methods used to build GMP, with their associated vulnerabilities and difficulty of analysis.

      1. 6

        You are highly biased against formal verification research and have already made several incorrect/misinformed claims about it (blog posts). I respect your C expertise, but I think people should be aware that you are extremely biased in this case.

        Besides, frankly, this “Vanity Research or Real Project” blog post is insulting, and I think you could avoid that. You have a different philosophy about scientific research than what is commonly accepted (you seem to disagree with formalism and believe only in experiments, which is an extreme but possible position), but that does not justify insulting or “naming and shaming” people who recognize, accept, and follow other, widely accepted scientific processes.

        1. 2

          “you seem to disagree with formalism and believe only in experiments”

          The amusing thing is that his experiments use formal methods in the form of numbers, algebra, probability, and propositional logic (i.e., the computer gates they run on), and possibly model checking (i.e., the concurrent algorithms they use). He seems to trust math, given that he derives empirical results by plugging data into such mathematical formulas. That’s what people do with formal verification. Except there’s both more of a faith-it-works aspect and higher error margins with the statistical math than with the simpler forms of logic used in program verification. Maybe there’s something I’m missing, but empirical methods seem highly dependent on math functions and on reasoning about the output of math functions. For some of them, especially the probabilistic ones versus simple logics, it also looks like it takes more trust to believe statistical models than logical ones.

          I’ve been thinking about that since he first started talking about the superiority of the [math-based] methods he calls empirical over the math-based methods others call formal. I dug up some articles today in what spare time I had to explore this contradiction further. My guess was that core areas of math/logic could probably be considered empirically proven, given all the field data confirming their basic primitives and rules. The stuff I was finding either corroborated that or vainly tried to reject falsifiability due to math’s abstraction level. I might post some references or summaries on this topic in the future when I have more time and sleep.

          1. 3

            Let me mention that I appreciated your replies in this thread. I found the general skeptical tone of the discussion baffling, but I think that you did an excellent job addressing concerns and discussing them with patience.

            1. 2

              Thank you. :)

              The skepticism comes mostly from his extreme bias, but there’s a kernel of truth in it that makes me address this stuff:

              1. Overselling of what’s accomplished in some projects. For instance, his blog correctly mentions that CompCert was not a verified C compiler: it was a verified Clight compiler, IIRC, with an unverified front-end for C or a C subset. I think they later added something formal to the front-end. Another example is when some people talk as if there can be no errors because they did a portion formally. Countering that stuff is worthwhile.

              2. There’s also a habit of not doing rigorous testing after something has been formally specified or proven. The old Guttman paper had a lot of examples of verified programs failing after someone so much as ran them once or did some testing. The specs or verification methods can be wrong. That’s why doing both that and empirical methods at the same time makes sense. Of course, the high-assurance field has been advocating that longer than any other, with formal proof as one tool among many. It usually catches the deepest bugs, but other methods are more cost-effective for other bugs.

              3. The results in many of these experiments aren’t reproducible or cross-checked enough. It helps that most are using the same FOSS software for the proofs, with a lot of the work published as open source. There have also been efforts, which he doesn’t assess, to reduce the TCB of checkers and extractors down to nearly nothing. There’s still a lot of room for improvement in experiment design, reproducibility, and review, though. My bet is we’ll see errors in some of this work as other parties either take a close look at it or try to apply it to lots of real-world software.

              4. Cost. Formal methods, depending on how heavy they are, can add significant cost in terms of fewer features and longer time-to-market or time-to-completion for a project. Specific industrial tools like Astrée can also cost money. This is a tangent here, given this is just a research report showing how formal methods were applied to a piece of code, but it is worth thinking about when deciding whether to use them and how much. I usually advocate lightweight formal methods combined with review and automated testing if we’re talking about regular software developers with limited interest in or resources for the highest correctness.

              Although I’m still waking up, those are four reasons that come to mind for why you might see a skeptic in formal methods threads. I didn’t think “pixie dust” was warranted for this work, given they picked battle-tested, FOSS foundations to improve a problematic piece of software. It’s one of the better applications, imho. If people want more empirical data, they could have some people code up libraries like this in C and another group do it in Frama-C, SPARK, or WhyML. Then run a bunch of static analyzers and fuzzers on both, looking for the kinds of errors these tools prevent (a sketch of one such check is below). If our side is correct, they’ll see a drop in bug count. I’ve got money on that, given we’ve already seen a drop in bug count every time these methods are used. ;)
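
              As a rough idea of what that head-to-head could look like, here is a sketch of a differential fuzzing harness (libFuzzer style). It assumes a hypothetical candidate function my_add_n, say the extracted code, that follows the same contract as GMP’s mpn_add_n; everything else is standard GMP and fuzzer boilerplate:

              ```c
              /* Build sketch: clang -fsanitize=fuzzer,address harness.c candidate.o -lgmp */
              #include <stdint.h>
              #include <stddef.h>
              #include <string.h>
              #include <assert.h>
              #include <gmp.h>

              /* Candidate under test -- hypothetical name, assumed to compute
                 r[0..n-1] := (a + b) mod B^n and return the carry limb. */
              extern mp_limb_t my_add_n(mp_limb_t *r, const mp_limb_t *a,
                                        const mp_limb_t *b, mp_size_t n);

              #define MAX_LIMBS 64

              int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
              {
                  size_t n = size / (2 * sizeof(mp_limb_t));
                  if (n == 0 || n > MAX_LIMBS)
                      return 0;

                  mp_limb_t a[MAX_LIMBS], b[MAX_LIMBS], r1[MAX_LIMBS], r2[MAX_LIMBS];
                  memcpy(a, data, n * sizeof(mp_limb_t));
                  memcpy(b, data + n * sizeof(mp_limb_t), n * sizeof(mp_limb_t));

                  mp_limb_t c1 = mpn_add_n(r1, a, b, n);   /* reference: GMP */
                  mp_limb_t c2 = my_add_n(r2, a, b, n);    /* candidate */

                  /* Any disagreement is a bug in one of the two. */
                  assert(c1 == c2);
                  assert(memcmp(r1, r2, n * sizeof(mp_limb_t)) == 0);
                  return 0;
              }
              ```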

              1. 3

                Nothing wrong with addressing concerns.

                For the record, I don’t agree with those claims that Compcert (or related work) is oversold. I think that it is perfectly fair to describe Compcert as a “formally verified C compiler”. The Compcert website is clear about which parts of the compiler have or don’t have a formal specification, and the publications about this work are explicit about what precisely has been specified or proven.

                I think that the accusation of overselling comes from taking a scientific work out of its context (a scientific community discussing software verification) and blaming it for generating unrealistic expectations outside this context. I can see how someone misinformed about what formal specification is could believe that “formally verified” means “will not have a bug ever” without asking themselves what specification it was verified against; but I don’t think you can blame Compcert for that.

                I agree of course with your points 2, 3, and 4. But I think it’s also important to point out that those should not necessarily be expected of academic work. Academics that are trying to push the boundary on verified software may find out how to build interesting programs, without being expert on how to do production-testing of those programs, and I don’t think it would be reasonable to expect them to do both – assuming that industrial/production testing methods in the area have not been made publicly and easily available. “There was no production-grade testing of this system yet” is a perfectly good reason for an engineer not to jump ahead and use the code proposed by a verification effort, but I don’t think it is a valid criticism of the research work.

                Finally, I think there is something interesting and delicate to be said about the notion of TCB. In my experience, people that are discovering computer-assisted theorem proving tend to over-emphasize the “who guards the guardians?” aspect (what if there is a bug in the verifier?). Depending on the sort of verification technique used (I’m thinking of proof-term-generating proof assistants or evidence-generating automated provers), it is usually extremely safe to assume that a practical or theoretical bug in a proof assistant would not result in a mistakenly-verified program being written in good faith (I’m not talking about adversarial contexts) – sort of like how the (very real) existence of bugs in text editors does not mean it is important to worry about the source code we write not being the source text we expect. On the other hand, there are other tooling-related parts of verification efforts where mistakes can more easily turn into mistaken assurance; for example, I think it’s right to worry about extractors.

                1. 2

                  You seem to be saying that researchers don’t need to be intellectually honest about what they claim in papers.

                  Is it ok for doctors to claim to have found a cure for a disease when they have solved only part of the problem? Can mathematicians claim to have proved a theorem when they have only proved a subset of it?

                  Soap powder manufacturers are held to a level of honesty about what they claim (by the Advertising Standards Authority).

                  What makes formal methods researchers so special, that they can make claims that soap powder manufacturers are not allowed to make?

                  The claims made in the Compcert paper are not perfectly clear, at least to somebody who has spent several decades working on compilers and formal systems (i.e., me). They have to be carefully unpicked to reveal that some of them are hollow.

                  1. 2

                    I do certainly think that researchers need intellectual honesty.

                    The point I was making is that academic publications are mostly intended for peers (other experts in the field), and the way they formulate their caveats may be different from what you would expect when they are writing for the general public (which also happens and is quite different) or, for example, for industrial people working on critical software or for certification agencies. To reuse your analogy, doctors talking to each other about the effect of a medication may use terms that wouldn’t be effective communication when talking to the general public – to the point of being perceived as misleading if taken out of context.

                    Of course, it may also happen that people oversell their claims (the publish-or-perish system that some academics must live under is an external pressure making this more likely to happen), and I think it is perfectly reasonable to be vigilant – not only about actual malpractice, but also about good-faith people who oversimplify and oversell. I believe that in the specific cases of the work that you are trashing (Compcert, Verasco, and now this unrelated Why3 research), your accusations are not reasonable. The fact that you decided, in addition, to be extremely insulting makes it very difficult for experts to have any hope of a constructive conversation with you.

                    1. 1

                      These papers may be targeted at other academics, but their headline claims get posted and widely cited (by the authors and others).

                      Soap powder manufacturers are under pressure to make money by increasing sales; does this pressure make it ok for them to make overblown claims about their products? Academics work in a publish-or-perish environment; does this make it ok for them to make overblown claims?

                      I certainly did not mean to be insulting. However, the researchers involved are likely to have been very uncomfortable with my soap powder comparison or with my suggestion that their work be labeled as vanity research (what label would you apply to work that invents something claimed to have various properties/characteristics, where the claims are based on the researchers’ preferences and not experimental evidence?).

                      1. 3

                        In my experience, overblown claims in the field of formal software verification are fairly rare, because the research paper format requires authors to precisely state their claims and give experts the necessary material to understand precisely what has been achieved. (There have been, of course, mistakes of various sorts, although this is getting rarer with the adoption of proof assistants to mechanically verify the formal arguments – a level of rigor that exceeds that of many other scientific fields.) Again, titles and abstracts might simplify claims (their purpose is to be concise), but the precise claims are in the articles themselves, and that is what should be looked at – unlike soap powder advertising, paper abstracts are not aimed at the general public but at other researchers in the field.

                        If you don’t mean to be insulting, then maybe you should stop comparing an entire sub-field of research to “soap powder manufacturing”, and stop calling perfectly sensible work (such as the work presented in this thread) “vanity research”. Generating verification conditions and discharging them with SMT solvers seems like a robust methodology, rather than “researcher preferences” – you seem to be misunderstanding or misrepresenting this research.

                  2. 1

                    “I think that it is perfectly fair to describe Compcert as a ‘formally verified C compiler’.”

                    The reason I disagree on that point is the formal semantics. Formally verified parsers for things simpler than C were hard enough work. Formal semantics of C usually had features missing or trouble handling undefined behavior. It wasn’t until just a few years ago that realistic models of C, with and without undefined behavior, were formally specified. So we have two reasons the missing verified C parser is significant: (a) formal specification or verification of tedious, text-processing algorithms often finds errors; (b) C itself has been really hard to specify.

                    So, a “verified compiler”… as in everything is verified… for a hard language should cover each of the hard problems with a formal specification and verification. If it leaves one out, it’s not a verified compiler: it’s a partially verified compiler. I don’t think more accuracy in titles would hurt here. They might do “A Mostly-Verified Compiler for a Useful Subset of C” with a subtitle in smaller print or parentheses saying “we verified everything except the front-end.” People would still be impressed, and nobody would be misled. I should note that I was misled by the original title, summaries, and discussions that talked as if it were fully verified. When I read the paper(s), I got more accurate information that required me to change my claims a bit. The only one that really handled C’s semantics was KCC: a C semantics in the K framework. Derek actually gives them credit for realistic semantics on his blog.

                    “I think that the accusation of overselling comes from taking a scientific work out of its context (a scientific community discussing software verification)”

                    They are promoting it for use in DO-178B software in industry. It’s proprietary software that AbsInt sells. That should probably be considered when assessing what terminology they should use for the target audience. Of course, it would depend on when they first committed to doing that and what they were writing at the time.

                    “Academics that are trying to push the boundary on verified software may find out how to build interesting programs, without being expert on how to do production-testing of those programs, and I don’t think it would be reasonable to expect them to do both”

                    That could be fair if it’s hard to test. I’ve always said they could generate tests from correctness goals or properties, with runtime checks. Even Data61 is now using property-based testing on specs in at least one project. Generating some checks by hand in the final code is straightforward for a lot of properties, with a fuzzer able to deliver the input. I think they should definitely do more on the empirical side, confirming what their specs say with tests. I’m not expecting the full battery of testing a high-assurance product would get, though, if the researchers’ stated goal is just working on formal methods. I’d like to see the tests mainly to check their specifications and tooling for errors.

                    “people that are discovering computer-assisted theorem proving tend to over-emphasize the ‘who guards the guardians?’ aspect”

                    I agree people worry too much about it. I’d like some specific tools, especially checkers, done in a foundational way. Past that, I don’t care as much, since we’re mostly just knocking out problems without believing the result will be perfect. I will note there have been errors in solvers significant enough that they led to the evidence-generating ones in the first place. A lot of the speed demons are already complex, with some written in unsafe languages. That might get worse as they’re accelerated with multicores, GPUs, FPGAs, etc. to run through programs faster or with more lines of code. I’m for a pile of annotated code as a benchmark to throw at all the provers, individual projects and version to version, to get an empirical idea of what they’re catching and missing. Plus trusted checkers for all of them.

                    “for example, I think it’s right to worry about extractors”

                    Total agreement there. The good news is there are quite a few projects verifying them. There’s also an extractor to C now with a lower TCB than its predecessor for ML. Before I saw it (and still, a bit), I was thinking of something like Tolmach et al.’s ML-to-Ada-or-C translator being verified, so people could go from verified extraction to ML, to verified translation to Ada/C with their tooling running, to verified compilation. Like in hardware, formal equivalence checks would be run on each step for a range of values on every path. A verified translator straight to C would be nicer, but I’m not a proof engineer: I can’t say how hard formal-logic-to-C would be versus ML-to-C.

                    1. 3

                      You claim three limitations of Compcert:

                      1. It only covers a subset of C.
                      2. The parser is not verified. (An expert would add: nor the CIL-like post-parsing elaboration pass.)
                      3. Undefined behaviors are not fully modelled.

                      (1) is certainly true. I am not convinced that this can be described as “overselling”, though. MSVC only covers a subset of C (plus some extensions), and most people would not object to calling it a “C compiler” – although it is certainly important to point out, in more detailed descriptions of it, that there are limitations.

                      I don’t agree with calling (2) and (3) overselling, for technical reasons that merit discussion.

                      For (2), the question is: what would a verified parser even be? If you think about it, you will realize that it is very hard to define a formal specification for a parser that is not an implementation (in particular: none existed before Compcert was created), and same for the elaboration pass – or assembly-level printing for that matter. If there is no specification other than the implementation, verification does not make much sense anyway. The best work on what “formally verifying a parser” would mean has been done precisely in the context of Compcert (Validating LR(1) parsers, Jacques-Henri Jourdan, François Pottier and Xavier Leroy, 2012), and what it does is produce a formal “grammar” description that you have to manually inspect (and convince yourself that it corresponds to the grammar of C you have in mind… which is no small task) along with a parser automaton that is formally proved to verify this grammar. If one small part of a tool has no specification other than its implementation (this is not specific to the Compcert work but a general property of C compilers and programming languages), I don’t think it is misleading to call the whole tool “verified” – or is it perpetually invalid to call any compiler “verified” because the parsing phase has no reasonable specification?

                      For (3), there is a misunderstanding in your criticism: a semantics of C does not need to model every form of undefined behavior, or more generally to allow every behavior permitted by the standard, in order to build a verified compiler. It is perfectly valid to use, as Compcert does, a reference semantics that is more defined than what the standard says (typically: it defines some overflow behaviors and fixes the expression evaluation order). This is of course the exact meaning of “implementation-defined” in the standard, but it is also perfectly valid to do the same for undefined behaviors: giving a defined meaning, in an implementation, to a program with undefined behavior is perfectly valid – given that any behavior is valid for this program. We then have to check (informally) that the semantics given by Compcert is, indeed, stronger than the standard’s (it does not make different, incompatible choices in some situations). But it is a misunderstanding to claim that a compiler’s semantics is incomplete because it does not model all undefined behaviors. (In fact, the vast majority of C compilers out there make similar choices of working with a stronger semantics than strictly required by the standard, if only because it makes the compiler simpler.)
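
                      To illustrate the kind of refinement being described, here is a small example of my own (not taken from the Compcert documentation) where a stronger reference semantics simply fixes a choice the standard leaves open:

                      ```c
                      #include <stdio.h>

                      static int trace = 0;
                      static int f(void) { trace = trace * 10 + 1; return 1; }
                      static int g(void) { trace = trace * 10 + 2; return 2; }

                      int main(void)
                      {
                          /* The standard leaves the evaluation order of f() and g()
                             unspecified, so trace may legitimately end up 12 or 21.
                             A reference semantics that fixes one order is stronger
                             than the standard, yet every behavior it produces is
                             among those the standard allows.  The same move works
                             for undefined behavior: giving it one concrete meaning
                             is a legal refinement, since the standard permits
                             anything at all there. */
                          int sum = f() + g();
                          printf("sum=%d trace=%d\n", sum, trace);
                          return 0;
                      }
                      ```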

                      On the other hand, a fine-grained modelling of undefined behavior matters if you want to draw conclusions that are valid for any C compiler, and not just for one fixed implementation. This is why it is an important research endeavor if you want to build static analysis tools for C whose conclusions are valid for any C compiler. The Verasco project, for example, is a static analyser that uses Compcert’s specification, and its results are only valid if you compile the program using Compcert – Verasco can only check that a certain class of runtime errors will be absent from the executable produced by Compcert (or any compiler respecting Compcert’s stronger semantics). Other works, such as KCC (which is more recent than Compcert) and more recently Robert Krebbers’ work, have produced more complete formal specifications of very-undefined semantics for C.

          2. 2

            Cargo cult arguments against C are popular lately. “X is written in C, therefore it’s probably buggy and needs to be rewritten in Y.” Always lots of hedging and talk about what could be, but usually not many real bugs mentioned.

            1. 5

              Your accusation of this work being a “cargo cult argument” is highly unfair. The authors built an implementation of a GMP-like library and defined a specification for it, such that a computer can verify that the implementation matches the specification. This is a very strong argument in favor of the correctness of their implementation, not cargo-culting.

              This sort of work is possible to do in C (see Frama-C, VeriFast) but fairly difficult, due to the complexity of the full semantics of the C programming language. Instead, the authors work in a language designed for formal verification, whose semantics is easier to reason about – and with good tooling support for automating verification – writing their program in such a style that they can translate it back to C.
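
              For a flavor of what the C-side route looks like, here is a minimal, made-up sketch in the Frama-C/ACSL style (the function and its contract are mine, not from the paper): the contract in the special comment is the specification, and the WP plugin turns it into verification conditions that SMT solvers discharge.

              ```c
              #include <limits.h>

              /*@ requires INT_MIN <= x + y <= INT_MAX;
                  assigns \nothing;
                  ensures \result == x + y;
              */
              int checked_add(int x, int y)
              {
                  /* The precondition rules out signed overflow, so the body is
                     provably safe and meets the postcondition. */
                  return x + y;
              }
              ```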

              For a given specification of what a piece of software should do, it is reasonable to assume that a program that has not been verified against this specification has a higher likelihood of having defects than a program that was verified against it. There is nothing cargo-culted in that.

              1. 3

                There have been bugs reported in GMP already. Range errors are also common vulnerabilities in the general case. How much stronger a case needs to be made, at that point, that immune-by-default is the better position?

                1. 0

                  Ah, the magic pixie dust of mathematics, which must be correct – yet is just as buggy as software.

                  1. 2

                    That’s correct. They can screw them up. It’s why I referenced prior work using the language and tool to find defects in software. I thought your view was that experiments or field data showing a method gets results, especially consistently on similar problems, increase the confidence we have in it working. If so, then you should have at least as much confidence in Why3’s math-based approach that got results as you would in a purely code-based approach.

                    A tool that usually works is a tool that usually works. Nothing more, but nothing less. Do explain further if your perspective disallows, as evidence, the field application of a method for software correctness by multiple parties over time.

                    1. 1

                      Yes, if experiments or field data show that a method gets results, I’m all for it (of course the cost/benefit may not work out, but that’s another matter).

                      But “immune by default” applied to formal techniques is a magical pixie dust mindset.

                      Formal techniques, like many other techniques, find problems in code. How cost effective are they in comparison? I wish somebody would do some experiments to find out (the main obstacle is the likely very high cost).

                      1. 1

                        “But ‘immune by default’ applied to formal techniques is a magical pixie dust mindset.”

                        I held off replying to this because I want to handle it in a separate thread. That is, empirical proof of the fundamentals of specific formal methods that would (a) justify or reject trusting them in an empirical sense and (b) maybe help derive empirical evidence for or against them as they’re used. The initial idea is having concrete examples and test patterns for every primitive, logical rule, etc. Obviously, rely on the simplest ones, as the formal methodists are doing with stuff like HOL Zero and the Milawa prover. The specifications and activities get vetted by potentially hundreds to hundreds of thousands of tests with diverse strategies as they’re employed. Another quick one is a scheme like @moyix’s that inserts specific flaws into code that formal methods are supposed to catch. Whoever uses those methods and that tool might let it run overnight on new code to generate background measurements of effectiveness, which a central organization collects and reports for those studying effectiveness.
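
                        As a toy illustration of the seeded-flaw idea (my own example, not the actual tool), the kind of mutation such a scheme might plant is a one-character change to a loop bound:

                        ```c
                        #include <stddef.h>
                        #include <stdint.h>

                        typedef uint64_t limb_t;

                        /* Original: zero out all n limbs of r. */
                        void limbs_clear(limb_t *r, size_t n)
                        {
                            for (size_t i = 0; i < n; i++)
                                r[i] = 0;
                        }

                        /* Seeded mutant: the bound is weakened from < to <=, writing one
                           limb past the end of the buffer.  A deductive verifier carrying
                           the precondition "r points to n limbs" rejects this immediately;
                           whether a fuzzer or sanitizer trips over it depends on the
                           allocation layout of the harness it runs in. */
                        void limbs_clear_mutant(limb_t *r, size_t n)
                        {
                            for (size_t i = 0; i <= n; i++)   /* injected off-by-one */
                                r[i] = 0;
                        }
                        ```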

                        That concept will need more thought about not only the answers (how to test/review) but the right questions to ask (the what). If I put a proposal together, I’ll tag you in it if you’re interested in reviewing it.

                  2. 0

                    Is there any popular software that hasn’t had bugs reported already?

                    There are plenty of good reasons not to use C, but bugs that don’t actually exist aren’t one of them.

                    1. 2

                      “Is there any popular software that hasn’t had bugs reported already?”

                      That’s peripheral to my point. The point is we say don’t use C or trust C apps by default if one is worried about common defects, especially code injections. The reason starts right with the fact that it was intentionally designed to let them hack together code on a PDP-11. Nothing more. To meet their goals within 1970s constraints, they didn’t add any of the safety or maintainability features that were in ALGOL-derived languages on better hardware. Some other languages, especially Ada and the Modulas, did add safety and maintainability features. The C apps had lots of defects over time, with even experts bitten by them. The apps in the other languages had fewer defects due to their design, even when people had the same or less skill. Studies confirmed this for Ada, with the CVE list showing it for a lot of other languages.

                      Given that backdrop, both the logical and the empirical evidence support the view that anyone wanting low bug counts and high maintainability with little effort should avoid C by default unless they have really good reasons to use it. Even then, it may be easier to do a reference implementation in a safe, non-C language from which a C deployment is derived. It’s not even C bashing or anything from this viewpoint: it’s a BCPL derivative with rough semantics designed for unsafe hacking of small, non-security-critical programs on 1970s hardware in benign environments. It was great for that. It was just never designed in the first place for what people are using it for today. (Something similar happened with HTML and JavaScript, for similar reasons, caused by a killer app.) The burden of proof for defaulting to C for security-critical work has been on C promoters for some time, given all the evidence of its risk versus alternative language designs.

                      “There are plenty of good reasons not to use C, but bugs that don’t actually exist aren’t one of them.”

                      The average C app has plenty of bugs that alternatives either never have or rarely have, depending on their design. This tech can prevent them, too. GMP has had them. Writing GMP with alternatives would inherit their property of having either none of these bugs or fewer of them. So, they did. That’s the logic of it. Glad you brought it up, though, since looking at the bug reports shows everything from buffer overruns to these gems in v5.0.3:

                      “There are several problems in the multiply code that can be seen as miscomputations and/or ASSERT failures. These bugs have been triggered in our testing environment, but the probability for miscomputations using uniformly distributed random numbers is extremely small. These bugs have been around since GMP 5.0.”

                      “There are several problems in the functions for gcd, gcdext, and invert, causing miscomputations and/or ASSERT failures. These bugs have been triggered in our testing environment, but the probability for miscomputations using uniformly distributed random numbers is infinitesimally small. These bugs have been around since GMP 4.3.”

                      They not only have bugs: they have bugs that persisted over numerous versions and that they couldn’t eradicate with their informal reviews and testing methods. It’s a good time to try something new with a track record of eliminating the kinds of bugs that get triggered in situations they can’t see or test for.

                  3. 2

                    If a C code base includes calls to unsafe functions like strcpy(43), sprintf(29), strncpy(27), or strncat(1), then it seems likely that one or more of them incorrectly computes the length of a string.
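
                    For readers who haven’t hit these before, here is a generic illustration (not taken from any particular code base) of the length mistakes those functions invite; the code shows the corrected patterns, and the comments describe what usually goes wrong:

                    ```c
                    #include <stdio.h>
                    #include <string.h>

                    void greet(const char *name)
                    {
                        char buf[16];

                        /* strncpy does NOT guarantee NUL termination: if name has 16 or
                           more characters, buf is left unterminated and later reads run
                           past the end of the buffer.  The manual termination below is
                           the step people routinely forget. */
                        strncpy(buf, name, sizeof buf);
                        buf[sizeof buf - 1] = '\0';

                        /* strncat's size argument is the space REMAINING after the
                           current contents (minus 1 for the NUL), not the total buffer
                           size; passing sizeof buf here would overflow once buf is
                           partially filled. */
                        strncat(buf, "!", sizeof buf - strlen(buf) - 1);

                        printf("%s\n", buf);
                    }
                    ```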

                    1. 1

                      “then it seems likely”

                      This is exactly what I mean. There’s either a bug or there’s not.

                      I might as well claim that it “seems likely” a rewrite of GMP will introduce regressions, so everybody should keep using the original in order to avoid them.