I enjoyed this post very much. Perfect length for a coffee, easy reading, nice touch of playfulness, I learned something (not least that there’s a non-mutating toSorted in JS, who knew?!) and it made me think too (in the K and Lil sections).
As a bonus, I have been reminded that there’s a fruit beginning with “C” (cherry) so I can avoid my usual standard but awkward three item example array of ["Apple", "Banana", "Carrot"].
An important dimension of how languages express this problem: where do the names live, and where do the expressions get evaluated.
K, and Javascript (lists): column names live in the top scope, expressions get evaluated immediately.
Example: names@>sales, both ‘columns’ accessed directly.
Like you say, @Internet_Janitor, this makes it really easy for the columns to get out of sync when only one of them gets sliced or sorted. To say nothing of name collisions once you work with multiple should-be-tables.
Javascript (list of dicts): column names live in dict scope, expressions get evaluated immediately.
Example: fruits.filter(x => x.name[0] != 'D'). Note the x => x.name lambda: a bare name would get evaluated immediately in the top scope, so we wrap it in a lambda that .filter() can pass each row to.
Lil, and R + dplyr (my example below): column names live in the table scope, and table manipulations take expressions that get evaluated in the table scope.
Example: select orderby count@name asc from fruits.
Example: fruits |> arrange(str_length(name))
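For comparison, here is a rough Python sketch of the first two layouts (parallel lists versus a list of dicts). The variable names are only illustrative, and the data is the fruit table used throughout this thread.

```python
# Layout 1: parallel lists in the top scope (a "bag of vectors").
names = ["Cherry", "Lime", "Durian", "Banana", "Apple"]
sales = [22, 5, 97, 12, 1]

# Sorting one list on its own silently breaks the row alignment.
names_sorted_alone = sorted(names)
misaligned = list(zip(names_sorted_alone, sales))  # pairs "Apple" with 22

# Layout 2: list of dicts. Column names live in each row's scope, so an
# expression has to be wrapped in a lambda that receives one row at a time.
fruits = [{"name": n, "sales": s} for n, s in zip(names, sales)]
by_sales_desc = sorted(fruits, key=lambda row: row["sales"], reverse=True)
no_d_fruit = [row for row in fruits if not row["name"].startswith("D")]

print(misaligned[0])
print([row["name"] for row in by_sales_desc])
```

The third layout (expressions evaluated in the table's own scope) has no direct equivalent in plain Python, which is roughly the point being made.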
Okay, time to show the R example; then I’ll write a few more words on what happens if “evaluate this in the table context” is not available.
library(tidyverse)

fruits <- tribble( # just a data frame, entered rowwise
  ~name,    ~sales,
  "Cherry", 22,
  "Lime",   5,
  "Durian", 97,
  "Banana", 12,
  "Apple",  1,
)

fruits |> arrange(name)              # alphabetically
fruits |> arrange(str_length(name))  # by name length
fruits |> arrange(desc(sales))       # by descending sales

fruits |>
  filter(!str_detect(name, "^D")) |> # banish D-fruit
  arrange(desc(sales))               # by descending sales
How would that last example look if the table namespace had to be specified on every column access?
This is already mildly noisy, and it’ll get even noisier when:
there are multiple columns involved in an expression ((cars$km_preowned + cars$km_work + cars$km_personal) / cars$years_active)
or your library accesses columns like mytable["mycolumn"] instead of mytable.mycolumn
In such cases, libraries commonly take it upon themselves to maintain an expression handler that takes column names, or entire expressions, as strings. Which provides a lot of benefit, but is also a lot of work.
Anyway, returning to the noise of repeated mytable.mycolumn: such noise creates a real pressure to make table names too short. Even a name like fruits can start to feel like a burden when you have to repeat it on. every. column. access. If your language lets the user specify the table once, and then work with columns within it, that’s real ergonomics.
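To make the trade-off concrete, here is a minimal Python sketch of the "expression handler that takes strings" idea. The helpers `arrange` and `where` are hypothetical names borrowed from the examples above, not any real library's API, and the sketch leans on `eval`, so it is only suitable for expressions you wrote yourself.

```python
fruits = [
    {"name": "Cherry", "sales": 22},
    {"name": "Lime", "sales": 5},
    {"name": "Durian", "sales": 97},
    {"name": "Banana", "sales": 12},
    {"name": "Apple", "sales": 1},
]

# Helpers visible inside expressions; everything else comes from the row.
# NOTE: eval is not safe for untrusted input. This is an illustration only.
HELPERS = {"__builtins__": {}, "len": len}

def arrange(rows, expr, descending=False):
    """Sort rows by a string expression evaluated with each row as its scope."""
    code = compile(expr, "<expr>", "eval")
    return sorted(rows, key=lambda row: eval(code, HELPERS, row), reverse=descending)

def where(rows, expr):
    """Keep rows for which the string expression is truthy in the row's scope."""
    code = compile(expr, "<expr>", "eval")
    return [row for row in rows if eval(code, HELPERS, row)]

# Fully qualified access: the row has to be named on every column use.
explicit = sorted(
    [row for row in fruits if not row["name"].startswith("D")],
    key=lambda row: row["sales"],
    reverse=True,
)

# Table-scoped access: the table is named once, columns are bare names.
scoped = arrange(where(fruits, "not name.startswith('D')"), "sales", descending=True)

print([row["name"] for row in scoped])
```

Both versions produce the same rows; the second one just moves the bookkeeping into the helper, which is the "lot of work" a library ends up owning.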
this makes it really easy for the columns to get out of sync when only one of them gets sliced or sorted
It’s a more expressive semantics. Expressiveness is dangerous, but, obviously, very useful. In this case, we avoid compromising modularity, having the ability to add and remove columns on a contextual basis at no cost whatsoever; something which is usually not achievable without a mechanism similar to stealth mixins.
Given that the data is tabular in nature, it only remains valid if row order remains consistent across columns.
In that light, I’d say the bag of vectors does let you express more kinds of invalid data, but does not let you express the main constraint ‘these vectors are columns in the same table and should only be permuted/sliced together’. So not strictly more expressive.
Expressing tables that are midway through an ongoing update is necessary, of course — table abstractions are usually implemented on top of vector abstractions. But that kind of expressiveness is useful to the implementor of a table/data frame, not to the person in the article who has to analyze some tabular data.
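A small Python sketch of what that constraint looks like once it is expressed in a type. The `Table` class and its method names are made up for illustration; the only point is that every permutation goes through one method that touches all columns at once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Table:
    columns: dict  # column name -> list of values

    def __post_init__(self):
        # The invariant a bag of vectors cannot state: equal row counts.
        lengths = {len(col) for col in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")

    def take(self, indices):
        """Return a new table with every column permuted or sliced identically."""
        return Table({name: [col[i] for i in indices]
                      for name, col in self.columns.items()})

    def order_by(self, column, descending=False):
        """Sort all columns together by the values of one column."""
        col = self.columns[column]
        indices = sorted(range(len(col)), key=col.__getitem__, reverse=descending)
        return self.take(indices)

fruits = Table({
    "name": ["Cherry", "Lime", "Durian", "Banana", "Apple"],
    "sales": [22, 5, 97, 12, 1],
})
top = fruits.order_by("sales", descending=True)
print(top.columns["name"])
print(top.columns["sales"])
```

The user of `Table` never gets a chance to sort `name` without `sales`; only the implementor, inside `take`, handles the columns as separate vectors.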
Nice post. K also has tables, and the ability to evaluate expressions containing column names as variables, and a kSQL query language. Citation (for K9):
One of my take-aways from this is that I learned, after a good 30 years programming with languages that have some form of “sort takes a function that returns negative when a < b, 0 when a == b and positive when a > b”, that I can just … subtract.
For decades, I’ve implemented this as:
if a < b: return -1
elif a > b: return 1
else: return 0
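For numeric keys, the subtraction version might look like the Python sketch below. It assumes the fruit data from the post, stored as (name, sales) tuples, and uses `functools.cmp_to_key` because Python's `sorted` takes a key function rather than a comparator.

```python
from functools import cmp_to_key

fruits = [("Cherry", 22), ("Lime", 5), ("Durian", 97), ("Banana", 12), ("Apple", 1)]

def by_sales(a, b):
    # Negative when a sorts first, zero on a tie, positive when b sorts first.
    return a[1] - b[1]

ascending = sorted(fruits, key=cmp_to_key(by_sales))
# Swapping the operands reverses the order.
descending = sorted(fruits, key=cmp_to_key(lambda a, b: b[1] - a[1]))

print([name for name, _ in ascending])
print([name for name, _ in descending])
```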
create table fruits (name text, sales integer);
insert into fruits values ('Cherry', 22), ('Lime', 5), ('Durian', 97), ('Banana', 12), ('Apple', 1);
select name from fruits order by name;
select name from fruits order by sales;
select name from fruits order by sales desc;
select name from fruits where name not like 'D%' order by sales;
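If you would rather try those statements locally than in a fiddle, one option is Python's built-in `sqlite3` module with an in-memory database. This is just a convenience sketch; the `names` helper is made up here, and SQLite's `LIKE` is case-insensitive for ASCII by default, which does not matter for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table fruits (name text, sales integer)")
conn.executemany(
    "insert into fruits values (?, ?)",
    [("Cherry", 22), ("Lime", 5), ("Durian", 97), ("Banana", 12), ("Apple", 1)],
)

def names(query):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(query)]

alphabetical = names("select name from fruits order by name")
by_sales = names("select name from fruits order by sales")
by_sales_desc = names("select name from fruits order by sales desc")
no_d = names("select name from fruits where name not like 'D%' order by sales")

for result in (alphabetical, by_sales, by_sales_desc, no_d):
    print(result)
```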
Indeed it is similar to data frames. More generally, the fruits example is an instance of tidy data, a.k.a. rectangular data [EDIT: not all rectangular data is tidy, see below]. Quoting from the paper:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
This is Codd’s 3rd normal form (Codd, 1990), but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases.
EDIT: not all rectangular data is tidy. In fact, over half of the Tidy data vignette (same link as above) is examples of data that are rectangular, and yet not tidy.
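A tiny Python illustration of the rectangular-versus-tidy distinction. The per-year sales figures are invented purely for this example and do not come from the post or the paper.

```python
# Rectangular but not tidy: the year columns hold values of a variable
# ("year"), so each row mixes several observations. Numbers are made up.
wide = [
    {"name": "Cherry", "2022": 20, "2023": 22},
    {"name": "Lime", "2022": 7, "2023": 5},
]

# Tidy: every column is a variable, every row is one observation.
tidy = [
    {"name": row["name"], "year": int(year), "sales": sales}
    for row in wide
    for year, sales in row.items()
    if year != "name"
]

for row in tidy:
    print(row)
```

Both shapes are rectangles, but only the second one can be sorted or filtered by year without first reshaping it.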
The built-in sorting algorithms weren’t required by the standard to be stable until 2019. Countless millions of devices browse the web today with software more than 4 years old. There is clearly an effort being made to correct this design flaw, but in my opinion it’s still unsafe to expect a stable sort when writing JS today.
As much as it saddens me that we can’t write software that stays useful for more than 4 years, it is very risky to use any internet-facing software that hasn’t been updated in 4 years, especially a web browser with its massive attack surface. New vulnerabilities are publicly released more or less monthly for all browsers.
So I don’t take any effort to support browsers more than a year or so old, because it would be irresponsible to give the impression that these should be used on the open internet (not that I purposely break them).
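For anyone unsure what is at stake with stability, here is a short Python sketch. Python's `sorted` is documented as stable, so it shows the behaviour that code written against a stable sort relies on. The colour column is invented for the example.

```python
# (name, colour) rows; the colours are hypothetical, added only to create ties.
rows = [
    ("Cherry", "red"),
    ("Lime", "green"),
    ("Durian", "green"),
    ("Banana", "yellow"),
    ("Apple", "red"),
]

# Sort by the secondary key first, then by the primary key.
by_name = sorted(rows, key=lambda r: r[0])
by_colour_then_name = sorted(by_name, key=lambda r: r[1])

# With a stable sort, rows that tie on colour keep their alphabetical order.
print(by_colour_then_name)
```

With an unstable sort the second pass would be free to reorder rows within each colour, and the two-pass trick would stop working.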
Lil is such a beautiful little language.
https://beyondloom.com/decker/lil.html#lilthequerylanguage
If someone is looking for more info. “lil” apparently is a pretty overloaded language name.
Note: I downloaded a copy of K9 (non-enterprise edition) from here (li is Linux, mi is Mac): https://web.archive.org/web/20220201221145/https://shakti.com/
Shout out to raising durian awareness. Wish it were a bit more affordable, but good fruit with top-tier texture.
This is a lovely blog post.
Now that I’ve seen it, the idea of “tables” and queries as a primitive in a non-DB scripting language seems so natural. I’m kicking myself for never having wondered about it before.
Have you seen LINQ? https://learn.microsoft.com/en-us/dotnet/csharp/linq/get-started/query-expression-basics
I actually have used it before but only in the context of DBs so didn’t realize it was a more general construct. Not surprising since I’m not a C# programmer.
I’ve never seen it implemented as a-b before.
Mind blown…
In a systems language like C/C++/Rust, be careful of a - b if the arguments are fixed-width N-bit integers that can overflow and wrap around mod 2^N.
Your python code works for strings and lots of other types, but not so for a-b. Consider using in Python if you want something more generic.
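Both caveats can be sketched in Python. Python integers do not overflow, so the wraparound below is simulated with a hypothetical `wrap_int32` helper. The `(a > b) - (a < b)` form is a common idiom for a type-generic three-way comparison, not necessarily what anyone above had in mind.

```python
def wrap_int32(x):
    """Simulate two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31

def cmp_subtract_int32(a, b):
    # What a - b would give if the result had to fit in 32 bits and wrapped.
    return wrap_int32(a - b)

def cmp_generic(a, b):
    # Uses only < and >, so it works for any mutually orderable values.
    return (a > b) - (a < b)

big, small = 2_000_000_000, -2_000_000_000

print(cmp_subtract_int32(big, small))  # negative: wrongly claims big < small
print(cmp_generic(big, small))         # positive, as expected
print(cmp_generic("apple", "banana"))  # strings work too
```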
Ruby even has an operator for that, the spaceship:
SQL is a pretty good fit:
(Try it in your browser over at SQL Fiddle.)
I would like to know what the solutions are like to this problem in APL and J as well, since those are the only two languages in this family I have any familiarity with, although I’m so poor at them I don’t think I could write the code well enough myself.
j, untested
You’re a Dyalog programmer now:
Alphabetic:
By length:
By sales, descending:
Filtering out the ‘D’ names:
I think this retains the fruits starting with ‘D’ which are not desired. But thank you!
See my other reply. This task can also be solved by forming a “table” first like:
or likewise in J;
Thank you!
Nobody asked, but here is a Uiua version.
Unfortunately, this retains the fruits starting with ‘D’ which are not desired.
Ah, missed that, thanks. Fixed
I’m not an expert at Uiua, so perhaps this could be written better.
No.
OK, I’ll bite! :-)
I got curious for how it’d work in pandas, so after some kludging here’s what I got:
It doesn’t feel as elegant as the lil, but it’s tolerable enough. Thanks for the motivation to finally learn some pandas!
That table data structure is pretty cool! I know Lua also has tables, but I’m not sure they’re the same.
I think Ruby is quite elegant too:
With the catch that you need to add .to_h if you want a hash map back. I also could’ve used numbered params, but I don’t like those as much.
I think lua tables are just associative arrays, not queryable tables like we have here. I wonder if this is similar to pandas dataframes!
This isn’t true and hasn’t been for years. (And I think even for long before most or all browsers used a stable sort.)
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort#browser_compatibility