Data manipulation is one of the most common and essential tasks in data analysis. Whether you are working with tabular data, time series, spatial data, or any other kind of data, you need to be able to perform operations such as filtering, grouping, aggregating, joining, reshaping, and more. In the R world, the dplyr package is one of the most popular and powerful tools for data manipulation. It provides a consistent and expressive syntax based on the tidyverse principles, which make it easy to write readable and maintainable code. Many other packages extend its functionality, allowing you to work with large and diverse data sources.
However, if you are a Python user, you may not have a clear and easy choice for data manipulation. The (by far) most popular package is pandas but I always found it’s syntax confusing, especially if you have a strong dplyr background.
That is, until now. After reading Emily Riederer’s excellent blog I discovered polars, a new Python library for data manipulation that aims to fill the gap between dplyr and pandas. Polars is a fast and expressive library that offers a syntax similar to the tidyverse, while being extremely fast and scalable. It leverages Apache Arrow as its underlying data structure and is written in Rust, which enables efficient memory management and interoperability with other tools. It also supports lazy evaluation, parallel processing, and query optimization, which make it suitable for working with large and complex data sets. According to some benchmarks, polars is one of the fastest tools for handling and manipulating data.
In this blog post, we will compare the dplyr and polars libraries and see how they can help us perform common data manipulation tasks. We will use a simple data set of the palmerpenguins package, the “new Iris data set”. We will see how to load, filter, group, join, and reshape the data using both libraries, and compare their syntax and output.
Reading data
Reading a csv file in polars is extremely easy (and similar to pandas). The following code reads in the palmerpenguins data set (from a github page). In polars, NA values are null so it’s important to specify this using the null_values parameter.
import polars as pl
df = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA")
polars also supports a scan_csv method for lazy loading, this is extremely useful for large datasets.
In R, for convenience, we use the palmerpenguins package that already contains the penguins data set
library(tidyverse)
library(palmerpenguins)
attach(penguins)
Basic commands
by entering df
we get the first few rows of the data set, similar to the tidyverse
> df
shape: (344, 9)
┌───────┬───────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 181 ┆ 3750 ┆ male ┆ 2007 │
│ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 186 ┆ 3800 ┆ female ┆ 2007 │
│ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 195 ┆ 3250 ┆ female ┆ 2007 │
│ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2007 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 202 ┆ 3400 ┆ female ┆ 2009 │
│ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 193 ┆ 3775 ┆ male ┆ 2009 │
│ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 210 ┆ 4100 ┆ male ┆ 2009 │
│ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 198 ┆ 3775 ┆ female ┆ 2009 │
└───────┴───────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
The following table compares the main functions of polars with the R package dplyr:
dplyr | polars | |
---|---|---|
first n rows |
head(df, n) |
df.head(n) |
pick column | select(df, x) |
df.select(pl.col("x")) |
pick multiple columns | select(df, x, y) |
df.select(pl.col("x", "y")) |
pick rows | filter(df, x > 4) |
df.filter(pl.col("x") > 4 ) |
sort column | arrange(df, x) |
df.sort("x") |
You see, these commands are basically the same between dplyr and polars.
For example, we want to get the bill_length_mm
of all penguins with body_mass_g
below 3800:
> df.filter(pl.col("body_mass_g") < 3800).select(pl.col("bill_length_mm"))
shape: (129, 1)
┌────────────────┐
│ bill_length_mm │
│ --- │
│ f64 │
╞════════════════╡
│ 39.1 │
│ 40.3 │
│ 36.7 │
│ 39.3 │
│ … │
│ 45.7 │
│ 43.5 │
│ 49.6 │
│ 50.2 │
└────────────────┘
Advanced filter
and select
Like in dplyr, polars filter
and select
have many more capabilities:
dplyr | polars | |
---|---|---|
select all columns except x | select(df, -x) |
df.select(pl.exclude("x")) |
select all columns that start with “str” | select(df, starts_with("str")) |
df.select(pl.col("^bill.*$")) or df.select(cs.starts_with("str")) [1] |
select numeric columns | select(df, where(is.numeric)) |
df.select(cs.float(), cs.integer()) [1,2] |
filter range of values | filter(df, between(x, lo, hi)) |
df.filter(pl.col("x").is_between(lo, hi)) |
[1] requires import polars.selectors as cs
[2] Please note that you can also cast() columns from one type to another (e.g. Float to Int).
For example, return all columns starting with “bill” for the penguin species “Gentoo”:
> df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
shape: (124, 2)
┌────────────────┬───────────────┐
│ bill_length_mm ┆ bill_depth_mm │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞════════════════╪═══════════════╡
│ 46.1 ┆ 13.2 │
│ 50.0 ┆ 16.3 │
│ 48.7 ┆ 14.1 │
│ 50.0 ┆ 15.2 │
│ … ┆ … │
│ 46.8 ┆ 14.3 │
│ 50.4 ┆ 15.7 │
│ 45.2 ┆ 14.8 │
│ 49.9 ┆ 16.1 │
└────────────────┴───────────────┘
Creating and renaming new columns
One of the most used commands in my dplyr workflow is the mutate
function for creating columns.
The polars equivalent is called with_columns
and works similarly:
dplyr | polars | |
---|---|---|
create new column | mutate(df, x_mean = mean(x)) |
df.with_columns(pl.col("x").mean().alias("x_mean")) |
rename column | rename(df, new_x = x) |
df.rename({"x": "new_x"}) |
For example, let’s create a new variable with the bill/flipper ratio called bill_flipper_ratio
:
> df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio"))
shape: (344, 10)
┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐
│ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex ┆ year ┆ bill_flipper_ratio │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ str ┆ i64 ┆ f64 │
╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡
│ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 3750 ┆ male ┆ 2007 ┆ 0.216022 │
│ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 3800 ┆ female ┆ 2007 ┆ 0.212366 │
│ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 3250 ┆ female ┆ 2007 ┆ 0.206667 │
│ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ 2007 ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 3400 ┆ female ┆ 2009 ┆ 0.215347 │
│ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 3775 ┆ male ┆ 2009 ┆ 0.256995 │
│ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 4100 ┆ male ┆ 2009 ┆ 0.241905 │
│ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 3775 ┆ female ┆ 2009 ┆ 0.253535 │
└───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
Grouping and aggregating
Aggregating and grouping data are essential skills for data analysis, as they allow you to summarize, transform, and manipulate data in meaningful ways. Again, these commands are very similar between dplyr and polars:
dplyr | polars | |
---|---|---|
group | group_by(df, x) |
df.group_by("x") |
summarize | summarize(df, x_n = n()) |
df.agg(pl.count().alias("x_n")) |
For example, group the data by species and count the number of penguins of each species, then sort in descending order:
> df.group_by("species").agg(pl.count().alias("counts")).sort("counts", descending=True)
shape: (3, 2)
┌───────────┬────────┐
│ species ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════╪════════╡
│ Adelie ┆ 152 │
│ Gentoo ┆ 124 │
│ Chinstrap ┆ 68 │
└───────────┴────────┘
Another one, for each species, find the penguin with the lowest body mass:
> df.group_by("species").agg(pl.all().sort_by("body_mass_g").first())
shape: (3, 9)
┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ 4 ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2007 │
│ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │
│ Gentoo ┆ 272 ┆ Biscoe ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2009 │
└───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
You can see that this result contains the null values, so let’s remove them:
> df.group_by("species").agg(pl.all().sort_by("body_mass_g").drop_nulls().first())
shape: (3, 9)
┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ 4 ┆ Torgersen ┆ 36.5 ┆ … ┆ 181 ┆ 2850 ┆ female ┆ 2007 │
│ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │
│ Gentoo ┆ 272 ┆ Biscoe ┆ 42.7 ┆ … ┆ 208 ┆ 3950 ┆ female ┆ 2009 │
└───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
The dplyr equivalent would be something like this:
> penguins |>
group_by(species) |>
arrange(body_mass_g) |>
summarize(body_mass_g = first(body_mass_g))
# A tibble: 3 × 2
species body_mass_g
<fct> <int>
1 Adelie 2850
2 Chinstrap 2700
3 Gentoo 3950
Joining dataframes
polars has the same left_join
, right_join
and inner_join
functionality as dplyr, for examples please refer to the docs.
dplyr | polars | |
---|---|---|
join dataframes | left_join(df1, df2, by=x) |
df1.join(df2, on="x", how="left") |
Conclusion
In conclusion, polars is a fast and expressive DataFrame library for Python that can handle large-scale data analysis with ease and efficiency. It offers a familiar and intuitive interface, similar to R’s dplyr. Whether you are a data scientist, a data analyst, or a data enthusiast, polars can help you unleash the power of your data.