1. 11
  1. 3

    I was in doubt about which language to pick up once more for my projects and go deep into: Python, Lua, Go… And then your post reminded me that I made good stuff five years ago with R. As you pointed out, it is so easy to read and plot data with R. I think R is my winner now, so I’m going to study it in depth again. Thank you!

    1. 2

      A few comments while reading:

      Native data science structures

      Is that true in base R? I always thought that most of the data science structures are provided by tidyverse or data.table. Maybe I am missing something, but I interpret data science structures as data structures that allow for some basic data manipulation (merge, collapse, group by) and variable use. I don’t know if those things are available in base R.

      The glory of CRAN

      CRAN can also be a curse because there is no centralized clearinghouse for packages. It is not uncommon for different packages to use different syntax to do similar things. This becomes even worse when you consider return values and the kinds of data structures that different packages return. S3 is a community-driven effort to try to standardize objects, but I don’t know if it has been widely adopted.

      1. 4

        Base R provides the data frame, which I posit is the data structure for interactive data analysis, except for those problems where it isn’t. (E.g. imaging and image analysis works on pixel arrays, graph analysis is its own thing, etc). In fact, to the best of my knowledge the data frame was invented in R; or rather in the S language specification, which R implements.

        For those who haven’t encountered data frames before, they are indispensable when you have multi-type data and need to work with both rows and columns.

        • A data frame is a matrix-like data structure …
        • … whose columns may be of different data types (unlike in matrices) …
        • … and on which indicating rows, indicating columns, and indicating both at once are all first-class operations. (Unlike a plain dict of column-lists, and unlike a list of row-tuples.)
          • An example of this in pseudo-code: patients_df[(1..4, 12), ('dose', 'response')] to get a new, smaller data frame with the dose and response of the patients in rows 1, 2, 3, 4, and 12.
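
        In actual R syntax, the pseudo-code above might look like this (with a small made-up patients data frame):

```r
# A small made-up patients data frame (hypothetical data)
patients_df <- data.frame(
  dose     = seq(10, 120, by = 10),
  response = c(0.10, 0.30, 0.20, 0.50, 0.40, 0.60,
               0.70, 0.60, 0.80, 0.90, 0.85, 0.95)
)

# Rows 1-4 and 12, columns 'dose' and 'response', as a new data frame
smaller <- patients_df[c(1:4, 12), c("dose", "response")]

# Indicating only rows, or only columns, are equally first-class:
patients_df[1:4, ]     # first four rows, all columns
patients_df[, "dose"]  # the 'dose' column, as a plain vector
patients_df["dose"]    # the 'dose' column, still a data frame
```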

        From A Brief History of S, Richard A. Becker:

        Initial S Concepts

        […] Another data notion in the first implementation of S was the idea of a hierarchical structure, which contained a collection of other S objects. Early S structures were made of named components; they were treated specially, with the “$” operator selecting components from structures[,] and functions for creating and modifying structures. […] By the 1988 release we recognized these structures as yet another form of vector, of mode list, with an attribute vector of names. This was a very powerful notion, and it integrated the treatment of hierarchical structures with that of ordinary vectors. […]


        […] Another major addition to the language concerns data: the data frame class of objects provides a matrix-like data structure for variables that are not necessarily of the same mode (matrices must have a single mode). This enables a variety of observations to be recorded on the subjects: for example, data on medical patients could contain, for each patient, the name (a character string), sex (a factor), age group (an ordered factor), height and weight (numeric).

        My kb has deserted me (edit: I used an onscreen keyboard, then my kb came back, but it’s still bedtime. ), so briefly: tibbles in dplyr/tidyverse and data tables in data.table are both “data frames, but better”, they all represent the same mental concept. Merge/collapse (is this summarize?)/group by are not provided by the data structure, but by functions that operate on these structures. (Although in the case of data.table, I believe the innards were designed to allow the functions that implement group/apply/recombine to be very fast, so.)
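
        To make that last point concrete, here is a sketch of merge and group-by/collapse in base R: plain functions operating on plain data frames (the data is made up, and aggregate() plays the collapse step):

```r
# Made-up example data
sales <- data.frame(region = c("N", "S", "N", "S"),
                    amount = c(10, 20, 30, 40))
regions <- data.frame(region = c("N", "S"),
                      manager = c("Ada", "Bob"))

# merge: join two data frames on a shared column
joined <- merge(sales, regions, by = "region")

# group by + collapse: total amount per region
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
```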

        1. 1

          A late addendum: a data frame has the same shape and contents as a database table, which is a v. useful correspondence.

        2. 3

          There’s a huge cohort of R programmers who have only used the tidyverse stuff and think that’s The Way to get things done. And it does have its benefits, but it’s easy to forget that for a long time, there was only base R. And base R is also very competent. In many ways different, but also competent.

          1. 3

            Yes, base R has data frames. In this post I show tidyverse code, Pandas code, SQL, as well as base R in Appendix B.

            What Is a Data Frame? (In Python, R, and SQL)

          2. 2

            Do people find R’s lazy evaluation useful, or would a strict R work just as well? R is one of the few mainstream languages besides Haskell that uses lazy evaluation.

            1. 5

              It is essential to R’s non-standard evaluation (NSE), which is in turn essential to R’s many DSLs that make interactive usage a joy. An example:

              library(ggplot2)
              countries <- read.csv('countries.tsv', sep='\t')
              ggplot(countries) +
                  geom_point(aes(x=tax_rate, y=prosperity, colour=continent))

              If R had non-lazy evaluation, you’d get an error “undefined variable: tax_rate”. Because it has lazy evaluation, the geom_point function can say “actually, look for a variable called tax_rate inside the (countries) data frame, and use that for the x coordinate”.
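
              The mechanism is visible in plain R: arguments are promises that are only evaluated when used, and substitute() lets a function capture the unevaluated expression and evaluate it somewhere else, such as inside a data frame. A minimal sketch (with_df and the toy data are mine, not from ggplot2):

```r
# Lazy evaluation: the second argument is never forced, so no error
f <- function(x, y) x + 1
f(41, stop("never evaluated"))  # returns 42

# Non-standard evaluation: capture the expression unevaluated, then
# evaluate it with the data frame's columns in scope
with_df <- function(df, expr) {
  eval(substitute(expr), envir = df)
}

countries <- data.frame(tax_rate = c(0.2, 0.3), prosperity = c(5, 7))
with_df(countries, tax_rate * 10)  # uses the tax_rate column, unquoted
```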

              Plotnine, the Python ggplot clone, is forced to ask for geom_point(aes(x='tax_rate', y='prosperity', color='continent')) instead, and those extra quote marks are surprisingly frictionful for interactive usage. OTOH, programming over NSE of variable names can be a right pain in the neck compared to programming over which string literals to pass, to the extent that there is a small cottage industry of packages to decouple ggplot/dplyr code from variable names.
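
              For a taste of the string-literal side of that trade-off, in base R a column name held in an ordinary string can be used with [[ ]], which makes it trivial to store, pass around, or loop over column names (toy data again, mine):

```r
countries <- data.frame(tax_rate = c(0.2, 0.3), prosperity = c(5, 7))

# The column name is ordinary data: it can live in a variable...
xvar <- "tax_rate"
countries[[xvar]]  # the tax_rate column

# ...or be looped over, which NSE makes awkward
for (col in c("tax_rate", "prosperity")) {
  cat(col, "mean:", mean(countries[[col]]), "\n")
}
```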

              Stopping due to awful keyboard probs and bedtime, let me know if you want links to such pkgs and I’ll come back tomorrow.