I’ve been using Polars for a project analyzing compensation survey data. I’m absolutely in love with the framework. It’s amazing. It’s expressive. It’s easily testable. Its DSL is entirely understandable at face value and doesn’t require Pythonisms to grok:
# pandas
data = df[df['acolumn'] > 4][['acolumn', 'bcolumn']]

# polars
data = (df
    .filter(pl.col('acolumn') > 4)
    .select([pl.col('acolumn'), pl.col('bcolumn')]))
It’s more verbose in this example, but in my project we’re finding that we create more reusable components in Polars than in pandas, and our code ends up briefer overall. We’re building a product, not optimizing for code golf!
For another project, switching from pandas to Polars shortened my pipeline from around 30 seconds per report on average to less than a second.
I’m also really happy that there are some alternatives to pandas that are trying a different API. I don’t love all of the polars API design, but I do see it as a big improvement. You may already know this, but your polars example could be even more concise. You don’t have to use a list for the select method and you also don’t need to wrap the column names in pl.col unless you’re going to manipulate them in some way.
data = (
    df
    .filter(pl.col("acolumn") > 4)
    .select("acolumn", "bcolumn")
)
Ah, yes, definitely. In my newer project, I’ve got a class defined with all of our expected columns as pl.Expr from pl.col(). It goes through some mapping I have yet to refactor.
question_to_column = {
    "What is your name?": "adventurer_name",
    "What is your quest?": "adventurer_quest",
    "What is the airspeed velocity of an unladen swallow?": "velocity_swallow",
}

class Columns:
    adventurer_name = pl.col("adventurer_name")
    adventurer_quest = pl.col("adventurer_quest")
    velocity_swallow = pl.col("velocity_swallow")
surviving_adventurers = (
    adventurers
    .filter(Columns.velocity_swallow.is_in(["African", "European"]))
    .select(Columns.adventurer_name)
)
Eventually, we’re going to refactor to inline all of the mappings. Something ~cool about the columns is being able to do Columns.adventurer_name.meta.output_name() to get the column’s name as a string for functions that require one, e.g. groupby() and things in Plotly that expect string column names, like the x, y, and color arguments.