1. 9
  1.  

  2. 2

    I worked with R full-time from around 2006 through to 2011. I still turn to it when I have serious data analysis or statistics to do; it’s indeed pretty nice when you’re comfortable with the various functional facilities at your disposal and have gotten past the syntactic quirks.

    Hadley Wickham’s libraries (reshape2, dplyr, ggplot2) will get you pretty far.
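
    For a flavour of what those libraries enable, here is a minimal sketch (not from the parent comment, just an illustration) that summarises and plots the built-in mtcars data set with dplyr and ggplot2:

    ```r
    # Quick interactive exploration with dplyr + ggplot2 on the built-in mtcars data.
    library(dplyr)
    library(ggplot2)

    # Summarise mean fuel economy and horsepower by cylinder count.
    mtcars %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

    # Plot the relationship, coloured by cylinder count, with a per-group linear fit.
    ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)
    ```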

    1. 1

      Yeah, they have really reduced my reliance on trying to guess keywords to find some random built-in function.

    2. 2

      R is a terrible language. It’s not like Haskell or (to go further down the rabbit hole) Idris, which reward your initial difficulty with something new that most languages don’t give you (in the case of static typing, compile-time bug squashing and formal verification). R is over-complex, sloppily designed, and ugly. There, I said it. Hadley is brilliant, but he will not be able to save it. Despite the considerable merits of many tools in R, the language itself is a borked Lisp that supports four (yes, four) varieties of OOP.

      There are a lot of great statistical libraries in R, sure. I don’t think, at least not yet, you can be a true data scientist (as much as I hate that pompous term, “data science” hits on a certain useful skill set) without coming in contact with it. It’s what academics write prototype code in when they come up with new machine learning approaches, so if you want to be on the cutting edge (most data scientists shouldn’t be there, but that’s another debate) you need to be able to work with R.

      R wasn’t designed by language designers, and it includes a number of features that put convenience of use over infrastructural simplicity. Swiss Army Knives make great tools, but terrible infrastructure: you’d build a house out of bricks instead. The problem with software is that there’s no clear delineation between what is a tool and what is infrastructure. Dynamically typed languages legitimately have a better UX, because hammering out explicit conversions at the REPL is a drag, but they’re a really poor choice when you have long-lived code that’s going to have to be read many times by a large number of different people.

      Ultimately, though, I take R as a source of lessons:

      • Very smart people, without organizational and design sense, will build monstrosities. The badness is a faster-than-linear function of the number of such people.
      • Terseness is overrated. Sure, the multiple ways of indexing dataframes make for terser code (I’d bet that R is actually terser than Haskell; see the sketch just below), but I’d much rather have explicitly named functions that force me to acknowledge and understand what I am actually doing.
      • Cleverness is extremely overrated. But I think everyone already knew that.
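
      To make the terseness point concrete, here are a few of the interchangeable ways R lets you pull the same column out of a data frame, next to a more explicit dplyr spelling (mtcars is used purely for illustration):

      ```r
      # Several equivalent-looking ways to grab the mpg column from mtcars:
      mtcars$mpg
      mtcars[["mpg"]]
      mtcars[, "mpg"]
      mtcars["mpg"]   # careful: this one returns a one-column data frame, not a vector

      # Versus explicitly named operations (dplyr), which spell out the intent:
      library(dplyr)
      mtcars %>% filter(cyl == 6) %>% select(mpg, hp)
      ```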

      For my part, I doubt I’d hire a “data scientist” who could only code in R. I’d expect him to be willing to use (at least) Python for any code that would be read by other people, and I’d encourage him toward Haskell or C (depending on needs pertaining to performance, timeframe, correctness and complexity) for long-lived infrastructural code.

      1. 2

        R’s sweet spot to me is in interactive data exploration / fiddling. I haven’t found a better environment for doing that yet - I prefer R’s lisp-y flavour to Python’s imperative style, though I occasionally do ‘data science’ in Python as well. It also beats using bash + an army of unix tools on the interactive front.

        I always felt like mastering J would yield a great interactive experience. When you’re in the swing of things, manipulating data in J almost feels as easy and unconscious as manipulating text in Vim.

        But yeah, my preferred use of R is in research. Often I’ll generate data elsewhere and then explore/analyze it in R (using coda, ggplot2, etc.).

        1. 1

          Have you looked into Clojure? I worked with it a bit back in 2013. The ecosystem didn’t seem quite “there yet” on data science but it was close. One of the main issues was the lack of sparse matrix support in the JVM. This picture may have improved.

        2. 2

          Dynamically typed languages legitimately have a better UX, because hammering out explicit conversions at the REPL is a drag, but they’re a really poor choice when you have long-lived code that’s going to have to be read many times by a large number of different people.

          Do you really need static typing for code doing data stuff, even in production? I expect your data code will only use a couple of data structures (say, time series, matrices, vectors; I might be forgetting some, but I doubt there are that many), so you will know what kind of inputs your functions expect and what they should return. Granted, you will have to deal with bad data manually (time series with no data, or not enough data to calculate a correlation matrix, for example), but it’s not so bad if you are rigorous (I did it), and you get the benefit of the many machine learning libraries in the languages (R and Python) that are most used for this kind of work.
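
          As a rough sketch of the kind of manual rigour that implies, a hypothetical guard before computing a correlation matrix might look like this; safe_cor, min_obs and the returns data frame are all invented for illustration:

          ```r
          # Hypothetical guard before computing a correlation matrix from returns data.
          # 'returns' is assumed to be a data frame of numeric time series, one per column.
          safe_cor <- function(returns, min_obs = 30) {
            # Drop rows with any missing values, then check we still have enough data.
            complete <- returns[complete.cases(returns), , drop = FALSE]
            if (nrow(complete) < min_obs) {
              stop("Not enough complete observations to estimate a correlation matrix")
            }
            cor(complete)
          }
          ```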

          Probably, in the future, Julia will give us the benefit of static typing combined with nice libraries, but I wouldn’t count on it yet.

          1. 1

            Do you really need static typing for code doing data stuff, even in production?

            Well, you don’t need static typing for production code. There are plenty of success stories with Erlang and Lisps. The quality of the programmers matters more than the language itself, and there certainly are great programmers in dynamic languages. I’m strongly in favor of static typing, but the lack thereof isn’t a deal breaker. You’d probably guess (and correctly so) that I’m not satisfied with C’s type system, but I’m not going to argue that C “didn’t work”.

            Static typing is great and, at a typical corporate pace, it’s indispensable for multi-developer projects. That said, dynamic typing can be used to build reliable systems. It’s not even that rare. But you need to develop at a sustainable pace, as opposed to “Agile” corporate haste.

            I expect your data code will only use a couple of data structures (say, time series, matrices, vectors; I might be forgetting some, but I doubt there are that many), so you will know what kind of inputs your functions expect and what they should return.

            So, it gets complicated. Do you let DataFrame be a single type, or do you refine according to the contents thereof? First you need extensible record/row types (which Haskell doesn’t have, yet). Then you can define a data frame type in two ways:

            • (column based) a record of vectors, e.g. {name :: Vector String, age :: Vector Int, isMember :: Vector Bool}
            • (row based) a vector of a record type, e.g. Vector {name :: String, age :: Int, isMember :: Bool}
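
            As an R-flavoured aside, base R’s data.frame is essentially the column-based option: under the hood it is a named list of equal-length column vectors, as a quick toy example shows:

            ```r
            df <- data.frame(name = c("ana", "bob"),
                             age = c(34L, 27L),
                             isMember = c(TRUE, FALSE))

            # A data frame is just a list of column vectors with some extra attributes...
            is.list(df)    # TRUE
            unclass(df)    # the underlying named list of vectors

            # ...whereas a row-based view has to be derived, e.g. as a list of one-row frames:
            split(df, seq_len(nrow(df)))
            ```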

            It gets thicker. A lot of data processing involves data-dependent transformations, e.g. removing highly-correlated (or just collinear) columns. So now you have dependent types by necessity… all for something that will probably be converted into a numerical matrix (possibly with dummy variables for categorical items) later.

            It does seem more appealing to just work with n-dimensional numerical arrays (which are not hard to put types on) and accept that dataframe errors will happen at runtime.
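
            For reference, here is roughly what that pipeline looks like in R, using model.matrix to expand categorical columns into dummy variables and then dropping one column of each highly-correlated pair; the iris data set and the 0.95 cut-off are arbitrary choices, and every interesting failure in this code only surfaces at runtime:

            ```r
            # Turn a data frame into a numeric matrix, expanding categorical columns
            # into dummy variables, then drop one column of each highly-correlated pair.
            x <- model.matrix(~ . - 1, data = iris)   # Species becomes dummy columns

            cors <- cor(x)
            cors[upper.tri(cors, diag = TRUE)] <- 0   # consider each pair only once
            too_correlated <- apply(abs(cors) > 0.95, 2, any)
            x_reduced <- x[, !too_correlated, drop = FALSE]

            dim(x)
            dim(x_reduced)
            ```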