1. 30

  2. 8

    In C++, this can be done in zero-copy fashion with string_view. In C, every string has to be null-terminated. Thus, you need to either manipulate the original buffer, or copy it over. I elected the latter.

    I don’t see the reasoning behind using C instead of Rust or C++, but of course you can define a string view/slice type in C (struct { size_t len; char *s; }). Sure, it won’t be interoperable with the standard string manipulation functions (except printf %.*s and sometimes maybe some strn* functions), but forcing usage of different functions should be fine in high performance applications.

    Also: simdcsv

    1. 2

      I wonder how pandas’ CSV parser (which is pretty optimized) compares. Whenever I have to parse huge CSV files in Python, I use pandas just for that.

      1. 3

        I’ve never benchmarked pandas in particular, but have loosely benchmarked Python’s CSV parser. The inherent problem is measurement. What is your benchmark? Let’s say your benchmark is to count the sum of the lengths of all the fields. Well, that means Python will need to materialize objects for every record and every field. And that is probably what’s going to either dominate or greatly impact the benchmark, even if the underlying parser is written in C and could theoretically go faster.
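To make the measurement problem concrete, here is a minimal sketch of the "sum of the lengths of all the fields" benchmark using Python's standard csv module. The function name sum_field_lengths is an illustrative choice, not something from the thread or from any library. Every field comes back as a str object before its length can be taken, which is the materialization cost described above.

```python
import csv
import io


def sum_field_lengths(text: str) -> int:
    """Sum the lengths of all fields across all records in CSV text."""
    total = 0
    # csv.reader yields each record as a list of str objects, so every
    # field is allocated as a Python object before it can be measured.
    for record in csv.reader(io.StringIO(text, newline="")):
        for field in record:
            total += len(field)
    return total
```

Timing this function mostly measures object creation in the interpreter, so it says little about how fast the underlying C parser could go on its own.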

        Pandas’ CSV parser is written in C, and if the comment at the top is true, it’s derived from Python’s csv module. Like the csv module, Pandas’ CSV parser is your run of the mill NFA embedded in code. This is about twice as slow as using a DFA, which is what my CSV parser uses. And the DFA approaches are slower than more specialized SIMD approaches. I’m less sure about the OP’s approach.
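For readers unfamiliar with the distinction, a table-driven state machine of the following shape is one way to picture the DFA style of CSV parsing. This is only a toy sketch: the state names, the classify helper and count_fields_and_records are all made up for illustration. It assumes comma delimiters, double-quote quoting and a trailing newline, and it is not the code used by xsv, pandas or the OP's parser.

```python
# Character classes (columns of the transition table).
OTHER, COMMA, QUOTE, NEWLINE = range(4)
# States (rows of the transition table).
FIELD_START, UNQUOTED, QUOTED, QUOTE_SEEN = range(4)

TRANSITIONS = [
    # OTHER    COMMA        QUOTE       NEWLINE
    [UNQUOTED, FIELD_START, QUOTED,     FIELD_START],  # FIELD_START
    [UNQUOTED, FIELD_START, UNQUOTED,   FIELD_START],  # UNQUOTED
    [QUOTED,   QUOTED,      QUOTE_SEEN, QUOTED],       # QUOTED
    [UNQUOTED, FIELD_START, QUOTED,     FIELD_START],  # QUOTE_SEEN ("" escapes a quote)
]


def classify(ch: str) -> int:
    """Map a character to its class (hypothetical helper)."""
    if ch == ",":
        return COMMA
    if ch == '"':
        return QUOTE
    if ch == "\n":
        return NEWLINE
    return OTHER


def count_fields_and_records(text: str) -> tuple:
    """Count fields and records with one table lookup per character."""
    state = FIELD_START
    fields = records = 0
    for ch in text:
        cls = classify(ch)
        # A comma or newline ends a field unless we are inside quotes.
        if state != QUOTED and cls in (COMMA, NEWLINE):
            fields += 1
            if cls == NEWLINE:
                records += 1
        state = TRANSITIONS[state][cls]
    return fields, records
```

The point of the table is that each input character costs one lookup and the next state is fully determined by the current state and the character class.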

        1. 2

          Thanks! Love Ripgrep! I tried cargo build --release and then time ./target/release/xsv index /tmp/DOHUI_NOH_scaled_data.csv; the index took about 24 seconds to complete (I assume xsv index finds the beginning and end of every cell, which is approximately what I am trying to do here for csv parsing).

          I didn’t explore xsv any further because I’m unfamiliar with the Rust ecosystem. Sorry!

          1. 1

            Thanks. How do I run an equivalent benchmark using your CSV parser? I don’t think I see any instructions.

            1. 1

              It is not packaged separately, and ccv can be built with zero dependencies (meaning you may not have OpenMP enabled), so it is a bit more involved to make sure OpenMP is enabled.

              First install the dependencies with apt install libomp-dev clang, then check out the https://github.com/liuliu/ccv repo. Run cd lib && ./configure to configure it with OpenMP (there should be a USE_OPENMP macro enabled; the configure script should print the exact flags). Then cd ../bin/nnc && make -j compiles the demo csv program under ./bin/nnc

        2. 1

          Recently the guys from Julia started claiming that they have the fastest parser (link).

          1. 4

            It kind of looks like Julia’s CSV parser is cheating: https://github.com/JuliaData/CSV.jl/blob/9f6ef108d195f85daa535d23d398253a7ca52e20/src/detection.jl#L304-L309

            It’s doing parallel parsing, but I’m pretty sure their technique won’t work for all inputs. Namely, they try to hop around the CSV data and chunk it up, and then parse each chunk in a separate thread AIUI. But you can’t do this in general because of quoting. If you read the code around where I linked, you can see they try to be a bit speculative and avoid common failures (“now we read the next 5 rows and see if we get the right # of columns”), but that isn’t going to be universally correct.

            It might be a fair trade off to make, since CSV data that fails there is probably quite rare. But either I’m misunderstanding their optimization or they aren’t being transparent about it. I don’t see this downside anywhere in the README or the benchmark article.
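The quoting problem can be shown with a small sketch. The chunker below is hypothetical and much cruder than what CSV.jl does (it has no column-count check at all). It only illustrates why a newline found by jumping into the middle of the data is not necessarily a record boundary.

```python
import csv
import io

# Three records; the second has a newline inside a quoted field.
DATA = 'id,note\n1,"line one\nline two"\n2,plain\n'


def parse(text: str) -> list:
    """Parse CSV text sequentially with the standard csv module."""
    return list(csv.reader(io.StringIO(text, newline="")))


def parse_in_naive_chunks(text: str, split_at: int) -> list:
    """Hypothetical chunker: treat the first newline at or after
    split_at as a record boundary and parse both halves independently."""
    boundary = text.index("\n", split_at) + 1
    return parse(text[:boundary]) + parse(text[boundary:])
```

With this input, parse(DATA) yields three records, while parse_in_naive_chunks(DATA, 12) splits at the newline inside the quoted field and produces a different result. A column-count check like the one in the linked code catches many such cases, but not every possible input.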

        3. 1

          I’d be really interested to see how this compares to the approach available in the Haskell hw-dsv library. The benchmarks show much slower performance than discussed in the article, but the benefit of hw-dsv is that it can generate extremely efficient indices, with an overhead of (IIRC) just over 1% of the size of the input CSV. If you know you need to look at the data multiple times, then the overhead of generating the rank-select indices should be offset by having constant time indexing into any cell in the CSV in the future.

          1. 1

            Why don’t you use Dask or Apache Spark? They all read csv files in parallel. cudf does it even with the help of the GPU, reading from disk directly into GPU memory. It’s an interesting article nonetheless :)

            1. 6

              I am not the author, but because it’s an interesting engineering problem? I’d rather read something like this than how to install and run Spark.

              1. 1

                I get that, and that’s why I said it’s an interesting read, but the author should’ve at least mentioned or even benchmarked already parallelized implementations.

                1. 5

                  I’ve been working on CSV stuff for a long time (I’m the author of xsv), and I’ve never even heard of cudf. So I wouldn’t fault the OP.

                  And just from quickly glancing at cudf, benchmarking its CSV parser looks non-trivial because the happy path is to go through Python. So you’d have to be really careful there. And what happens if the task you want to do with the CSV data can’t be done on the GPU?

                  Similarly, I’ve never used Apache Spark. How much time, effort and work would be required to get up to speed on how it works and produce a benchmark you’d be confident in? Moreover, if I want to use a CSV library in C++ in my application, is Apache Spark really a reasonable choice in that circumstance?

                  1. 1

                    I didn’t want to be rude. From my perspective (mainly data science) everything is obvious and the tools I mentioned are very popular. I think the main difference between our “views” is that the post focused on libraries and I’m focused on frameworks. The solutions I proposed are fully fledged data processing frameworks like Pandas or R data frames, if you know those. It’s basically Excel as a programming framework, but much faster and more capable. The abstraction level is usually very high. You would not iterate over rows in a column, but apply a function to the whole column. These are not solutions to be used just for their csv implementation, but as the solution for a complete data processing pipeline.

                    “And what happens if the task you want to do with the CSV data can’t be done on the GPU?”

                    cudf, dask and pandas all belong to the Python scientific ecosystem and are well integrated. You would convert it to Dask or pandas (numpy, cupy).

                    Spark belongs to the Hadoop ecosystem and is used by many companies to process large amounts of data. Again nobody would use it just for the csv implementation.

                    “because the happy path is to go through Python”

                    In all frameworks Python is just glue code. All numeric code is written in faster languages.

                    1. 2

                      No worries. I get all of that. I guess my comment was more a circuitous response to your initial question: “why not use {tools optimized for data science pipelines}” where the OP is more specifically focused on a csv library. But I also tried to address it by pointing out that a direct comparison at the abstraction level on display in the OP is quite difficult on its own.

                      But yeah, while I’m aware that data science tools have csv parsers in them, and they are probably formidable in their own right in terms of speed, I’m also aware that they are optimized for data science. Pandas is a good example, because its API is clearly optimized for cases where all of the data fits into memory. While the API has some flexibility there, it’s clear in my mind that it isn’t, and isn’t supposed to be, a general purpose csv library. It may seem like a cop-out, but constraints will differ quite a bit, which typically influences the design space of the implementation.

              2. 1

                Dask is an interesting omission, definitely on me! It would be tricky to do though, as @burntsushi pointed out. Dask tries to be as lazy as possible, and that can be a real challenge. OTOH, Pandas’ csv implementation is uninteresting. It is the reason I started to explore in the first place (it drives me crazy to save / load csv in Pandas!). I love Pandas for other reasons.

                As for Spark, I simply didn’t know it had an interesting csv reader implementation!