Please forgive me if I have the wrong end of the stick, but if you’re writing the data loading in C++, aren’t there scientific libraries in C++ that do the same work as Pandas does in Python?
Good question! Indeed, dataframe libraries are starting to pop up in C++ as well, e.g. xframe and RDataFrame (see links at the bottom of the blog post). They are still quite new and not as mature as Pandas. Also, Pandas has great integration with Matplotlib for plotting, and in general the Python data science stack is really convenient to work in, so I personally wouldn’t dare to take the plunge to full C++ for data science just yet.
Is there a reason why the Pandas data loading logic couldn’t be rewritten to use C++ underneath the hood, so you can still keep the Python API while reaping the performance benefits?
I’ve often wondered that as well. I’m not sure what the development plans are for Pandas, but since its creator is now working on Apache Arrow, I expect that may become a major backend at some point.
I’m a big fan of C++ and I use Python for work, but in contrast to the suggestion in this article, I would say that if you are a Python fan and 90% of your work is doable in Python, but 10% of your code is a performance bottleneck, you should look into Cython. This is especially true since the author (lightheartedly, I think) says
… I guess if you don’t know any C++ it may take a bit longer, but not that much. Just start out with programming like you would in Python, but declare variables with types, put semicolons at the end of lines, put loop and branching conditions in parentheses, put curly braces around indented blocks and forget about the colons that start Python indented blocks… that should get you about 80% of the way there. Oh and avoid pointers for the time being. Oh and use references whenever possible. They’re kinda like pointers but… Well maybe avoid those as well for now.
but omits to mention that, while that is a questionable introduction to C++, parts of it are a pretty good introduction to Cython.
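For anyone curious what that quoted recipe looks like in practice, here is a minimal sketch in C++. The function `mean_positive` and the Python original shown in the comment are made up for illustration and are not taken from the article. The only step beyond the "80%" recipe is passing the vector by const reference so that it is not copied.

```cpp
// Python original (hypothetical), for comparison:
//
//   def mean_positive(values):
//       total = 0.0
//       count = 0
//       for v in values:
//           if v > 0:
//               total += v
//               count += 1
//       return total / count if count else 0.0

#include <vector>

// Average of the strictly positive entries; returns 0.0 if there are none.
double mean_positive(const std::vector<double>& values) {
    double total = 0.0;          // variables are declared with types
    int count = 0;               // statements end with semicolons
    for (double v : values) {    // the loop header goes in parentheses
        if (v > 0) {             // so does the branch condition
            total += v;
            count += 1;
        }                        // curly braces replace indentation
    }
    return count ? total / count : 0.0;
}
```

Calling `mean_positive` on a vector holding 1.5, -2.0 and 2.5 averages only 1.5 and 2.5. Apart from the braces and semicolons, the typed declarations are the part that carries over to Cython.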
In my first draft I had some remarks about Julia and Cython, but I decided to leave them out, since I know far too little about either of them. To summarize: my main reason for not using those languages is that while I’m sure I can get maximum performance using C++, I cannot be sure that will always be the case in Cython or Julia. In fact, the internet is riddled with examples of where C++ will beat all alternatives. I admit, though, that I may be biased, since I’m also just a big C++ fan. Another reason is that C++ is a big industry standard, whereas Julia and Cython are still relatively niche languages. This is the same reason I haven’t tried Rust or Haskell yet :)
This is a good place to ask this: has anyone had experience using Julia to replace Pandas? Say, for something very much like this use case? Thanks!
@egpbos, I’d say for this use case Cython would be a good fit.
With proper type annotations Cython compiles down to “bare C”.
A few years ago I wrote up some of my experiences: https://kaushikghose.wordpress.com/2014/12/08/get-more-out-of-cython/ but rereading it, I see I didn’t make it very detailed. (Edit: this post is possibly more helpful: https://kaushikghose.wordpress.com/2014/07/28/cythonize/ )
Cython can generate annotated HTML files that map your Cython code to the generated C, so you can see how your type annotations affect it. Cython integrates fairly nicely with Python distribution mechanisms, though the last time I fought with it, there was a bootstrapping issue when the user did not have Cython already installed.
The downside is that Cython can look ugly, a kind of franken-language, and it has poorer IDE support.