1. 35

This library is now one year old and has followed Rust’s many changes along the way. It has been a great experiment in writing zero-copy, completely generic parsers in Rust. Now, at last, it is stable enough for most uses.

  1.  

  2. 8

    Under “Why is Nom faster?”

    It uses the slice heavily, a Rust data structure containing a pointer and a length

    This is how ByteString works in Haskell, so I don’t think it’s a cogent explanation of why Nom is faster if your attoparsec parser is using ByteString.

    It might very well be the case that Nom does a better job of avoiding copying/allocations, but ByteString in Haskell already lets you do that. Vector and Text do too, but via different means.
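
    To make the comparison concrete, here’s a minimal sketch (mine, not nom’s code) of what zero-copy slicing looks like in Rust; a ByteString-backed attoparsec parser gets the same effect through offset/length views into a shared buffer:

      // A &[u8] slice is just a pointer plus a length, so "taking" bytes
      // from the input is pointer arithmetic, not a copy.
      fn take(input: &[u8], count: usize) -> Option<(&[u8], &[u8])> {
          if input.len() < count {
              None
          } else {
              // Both halves borrow the original buffer; nothing is allocated.
              Some((&input[count..], &input[..count]))
          }
      }

      fn main() {
          let data = b"GET / HTTP/1.1";
          if let Some((rest, method)) = take(data, 3) {
              assert_eq!(method, &b"GET"[..]);
              assert_eq!(&rest[..1], &b" "[..]);
          }
      }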

    At first glance the attoparsec HTTP parser looks reasonable, but the benchmarking methodology is wildly different for Rust and Haskell AFAICT, so I’m a little suspicious of the comparison. It would’ve been better to compare Rust ~ Rust with the same benchmarking kit.

    1. 5

      I would really like to improve the benchmarking methodology; the one for C/C++ is not great either. It is still useful for establishing a range and seeing in which category nom lies. The HTTP parser may have been less tested, but people have spent some time improving the MP4 parsers in Haskell.

      I’d like to investigate and see why there is a difference between Rust and Haskell here if attoparsec does it the same way.

      1. 2

        It would’ve been better to compare Rust ~ Rust with the same benchmarking kit.

        The thesis is really that you can write safe parsers without sacrificing speed. Comparing nom to other empirically fast parsers directly supports that.

        While the benchmarking might be “wrong”, I didn’t read it as if the author cares that it’s faster, but rather that it’s just fast enough that you don’t feel you are trading speed for safety. So, unless the benchmark is really wrong, I’m going to submit that this is fine.

        1. 2

          Errr, that’s not really how he put it when he first announced the benchmarks pre-1.0, but okay, we can go with that interpretation if it suits you.

          Also, “sacrificing speed” - are you aware of how fast attoparsec is? Attoparsec already proved you could write safe parsers that were fast. It was already fast and has improved leaps and bounds in the last few years.

          1. 7

            I don’t have any historical reference here, so I’m taking the post at face value. If you have other information that would help me change my mind, so be it. I just reread the post given your reaction. There are more claims of speed than I remembered, but I still don’t see that as the real thesis, nor do I see these claims as outlandish. There are links to benchmarks (even one comparing nom to various other date parsing libraries in Rust), with source available, and even discussion that the benchmarks are intended to show “relative” performance (my words). His words:

            The goal is to compare their usability and their performance on a real world binary file format. As with all benchmarks, the results must be taken with a grain of salt. This is not a formal comparison of languages, but an experiment to check where the nom parser library stands against more established parser libraries, in terms of performance and usability. I welcome any idea or contribution to improve performance for either of the parsers, or improve statistical significance. The parsers have been written in the most naive way, to make them as deterministic as possible. In each benchmark, the files are completely loaded in memory before measuring, and the parser is applied repeatedly to the buffer in memory. The hammer parser has some slight memory leaks, the developers have been notified of this and the bugs will be fixed in the future.

            source

            Why do I care so much about this that I reread the post and pulled out quotes and stuff? Because the general reaction that I see when someone claims something is “faster” is almost always “the benchmark is wrong.” “The benchmark isn’t good enough.” “The benchmark doesn’t take into account $X.” You just happen to be on the receiving end of my reaction to it. Sorry.

            Surveys have very much the same response, btw. We discover flaws after the fact. This sucks, but in many cases the results are “good enough” to draw a conclusion that is “close enough.” I feel, in this case, that holds. Especially since I feel that the benchmarks only substantiate the claim that you can write a safe parser that doesn’t sacrifice speed.

            Also, “sacrificing speed” - are you aware of how fast attoparsec is? Attoparsec already proved you could write safe parsers that were fast. It was already fast and has improved leaps and bounds in the last few years.

            I’m not aware, but can you use an attoparsec parser and link it easily with C code? Or Python? Or Ruby? Or Go? That’s another one of the claims stated in the post.
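
            For context, the Rust side of that claim is the usual C ABI export. A hypothetical sketch follows (the name and signature are made up for illustration; this is not nom’s API); built as a cdylib, it is callable from C, and from Python or Ruby through their C FFIs:

              // Hypothetical: exposing a Rust parser through the C ABI.
              #[no_mangle]
              pub extern "C" fn parse_len(buf: *const u8, len: usize) -> i32 {
                  if buf.is_null() {
                      return -1;
                  }
                  // Rebuild a slice from the raw pointer the caller handed us.
                  let input = unsafe { std::slice::from_raw_parts(buf, len) };
                  // ... a nom parser would run over `input` here ...
                  input.len() as i32
              }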

            1. 2

              “The benchmark isn’t good enough.”

              Criterion in Haskell establishes statistical significance and will let you know when the result is weak. The Rust benchmark harness isn’t doing anything of the sort AFAICT. If you’re going to use a trivial benchmark harness, at least compare like-with-like. The point isn’t that it would change the result, per se.
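
              For reference, this is roughly all the built-in harness does (a nightly-only sketch, not the benchmark code from the post): it prints a per-iteration mean and deviation, with no significance analysis behind it:

                #![feature(test)]
                extern crate test;

                // The harness times the closure over many iterations and
                // prints "N ns/iter (+/- M)"; nothing checks whether the
                // result is statistically meaningful.
                #[bench]
                fn bench_parse(b: &mut test::Bencher) {
                    let buffer = b"input loaded in memory beforehand";
                    b.iter(|| test::black_box(buffer.len()));
                }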

              You can call into Haskell code from C via the FFI. I’ve liked Haskell’s FFI w/ C better than I did Python’s.

              We’ve now exceeded the scope of what I cared about on this particular topic, which was mostly about measuring and process.

              1. 4

                at least compare like-with-like.

                But how is he not comparing like-with-like? You have two parsers doing the same amount of work.

                Would it have been reasonable for him to do a “real”, Criterion-based benchmark for attoparsec (which establishes significance) and then do the same amount of work in the nom equivalent? Is that what you’re after here?

                1. -1

                  I already described what I thought would’ve been more appropriate given reasonable time investment constraints in my original comment.

                  I’m asking politely, but directly now: please stop replying; we have gone beyond the scope of what I care about here.

                  1. 3

                    Feel free to ignore me, but I don’t understand why you think it’s not “like-with-like.” I must be missing something trivial.

                    If I have two runners, one 6' tall and the other 5' tall, and I ask them to run a mile as fast as they can and then time them, is that a like-with-like situation? Or are you going to argue that the wind, the humidity, and the amount of debris on the road somehow make it different?

      2. 3

        This is really exciting. Do you have a list of the projects using nom in practice? You might want to submit to http://langsec.org/ if they have a CfP in 2016.

        1. 4

          Well, I already presented it at Langsec 2015. For Langsec 2016, I would have to show something new. Maybe I have, maybe not ;)

          To find users, I take a look at a GitHub search. I also know it is used at some companies, like Dropbox.

          1. 1

            Awesome… I don’t know how I missed that. Here’s a link to your video: https://www.youtube.com/watch?v=b7M8Uj7k_0Y

        2. 1

          I think it’s great to see “combinator style” parsers; after writing my first parser using Parsec, I feel bad when writing quick-and-dirty parsers in C++/Java.

          This may be pedantic, but I feel the claim “safe by default” is a bit strong. It seems the author wants to say that it is memory safe (no use-after-free or out-of-bounds access), but in the absence of a safety specification for what the parser should do, and a way to show the parser meets that specification, I don’t think you could call it safe. Additionally, the fact that the library was fuzzed to “verify” it is safe seems to hint that there is no proof of the absence of such errors; such a proof is what I would call safe.

          Again, this is probably just me having my own definitions of what “safe” and “verify” mean.

          1. 7

            Again, this is probably just me having my own definitions of what “safe” and “verify” mean.

            In the context of Rust, ‘safe’ means ‘memory safe’.

            in the absence of a safety specification for what the parser should do and a way to show the parser meets that specification I don’t think you could call it safe

            Well, Rust as a language guarantees memory safety as long as you don’t use unsafe, and then it’s up to you to use it correctly. nom currently uses unsafe four times: twice to perform an unsafe cast, and twice to copy some pointers. As long as those four uses are correct, then if nom isn’t memory safe, it’s a bug in Rust.
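
            For illustration, this is the shape of such an unsafe cast (just the pattern, not nom’s actual code):

              // Skipping the UTF-8 check is only correct if the bytes were
              // already validated; get that wrong and safety rests on you,
              // not the compiler.
              fn to_str(bytes: &[u8]) -> &str {
                  // The safe version would be: std::str::from_utf8(bytes).unwrap()
                  unsafe { std::str::from_utf8_unchecked(bytes) }
              }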

            1. 2

              I know, absence of proof is not proof of absence. But short of an actual proof of correctness for the generated machine code (which exists in some specific cases, none applicable here yet), correct combinators that can withstand tools like AFL are a pretty good deal.