1. 5
  1. 5

    The title is a little odd; pandas is an essential set of data structures and a user interface that makes rapid, efficient data munging, cleaning, manipulation, and exploration possible. It simply wouldn’t be possible to do such rapid iterative work (in Python) without pandas, because you’d be spending too much time writing code and too little time analyzing data and testing hypotheses.

    What’s nice about pandas (compared to other ways you might work with R-style dataframes) is that you’re already working in a (very) general-purpose programming language no matter what direction you take a project after initial exploratory data analysis (EDA). The library ecosystem is incomparably large across a broad variety of domains, and you rarely have to redo research in a “production language” for high performance (though whether you can still call it Python at that point is debatable, and if we start talking about “big” data and related topics there is more nuance than fits in a brief comment).

    That’s the story we put out at PyCon 2012 and in all our pitches when Wes, Chang, Adam and I were working on it full time. I think I have a bunch of glossy one-sheeters sitting in a box somewhere about the “rapid research-production cycle” surrounded by other cute marketing buzzwords (though I truly believed and still believe that story).

    There are also a lot of warts that necessarily come with the good. There are huge function interfaces that appear to have come about organically, or provide duplicative functionality, or support use cases that might not make sense to you because you’re coming from SQL, or you’re not coming from SQL, or you’re a software engineer, or you’re a scientist, and so on.
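
    To make the “duplicative functionality” point concrete, here is a minimal sketch. The frames `df` and `other` are made up for illustration; the calls themselves are standard pandas. Each pair below is two spellings of the same operation:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical toy frames, only for illustration.
    df = pd.DataFrame({"key": ["a", "b", "c"], "x": [1.0, np.nan, 3.0]})
    other = pd.DataFrame({"key": ["a", "b"], "y": [10, 20]})

    # isna() and isnull() are aliases for the same missing-value check.
    same_mask = df["x"].isna().equals(df["x"].isnull())

    # A left join written as a DataFrame method and as a top-level function.
    merged_method = df.merge(other, on="key", how="left")
    merged_function = pd.merge(df, other, on="key", how="left")
    same_merge = merged_method.equals(merged_function)

    # Column selection by bracket indexing and by .loc.
    same_column = df["x"].equals(df.loc[:, "x"])

    print(same_mask, same_merge, same_column)
    ```

    None of these pairs is wrong, but which one looks natural probably depends on whether you arrived from SQL, R, or general software engineering.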

    Similar to how many things in matplotlib seem (are) crazy if your expectations don’t account for what it meant for scientists to move to matplotlib from MATLAB, pandas has plenty of weirdness if it’s your first foray into data science and you’re thinking like a software engineer (which I expect most of this site’s audience is).

    People who work with pandas dataframes every day (like me) have to cope with plenty of rough edges, but we’re usually trying to go from a nascent idea to a tested hypothesis as quickly as possible. It’s not just that the code we write might be prototype code, but rather that the entire idea behind the need to write code might be bunk.

    The library was originally developed for quantitative strategists at hedge funds. If you’re paying someone millions of dollars for their quantitative research and trying to bring their ideas to market before the market moves on, you end up with a tool that looks like pandas.

    1. 1

      It would be cool to see those one-sheeters from 2012. Any chance you could scan and post them? Or mail them to me and I’ll scan them in.

      I think it’s helpful to see historical marketing material, especially from things “before they got big.” It’s useful as a reference, and it’s fun to see marketing material outlive its intended purpose, since I expect you made these back then to encourage use or investment or something. Today they are probably still useful, just for a different purpose.