More polars

krz · February 23, 2024

After reading Rasmus Bååth’s blog Why pandas feels clunky when coming from R, I decided to practice more polars and translate his dplyr examples to polars.

Reading the data is pretty straightforward:

import polars as pl
purchases = pl.read_csv("purchases.csv")

get the sum of amount:

# dplyr:
# purchases$amount |> sum()
# actually the "pure" tidy approach would be:
# purchases |> select(amount) |> sum 😉

purchases.select("amount").sum()

same by country:

# dplyr:
# purchases |>
#  group_by(country) |>
#  summarize(total = sum(amount))

(
    purchases
    .group_by("country")
    .agg(pl.sum("amount").alias("total"))
)

deduct the discount:

# dplyr:
# purchases |> 
#  group_by(country) |> 
#  summarize(total = sum(amount - discount))

(
    purchases
    .group_by("country")
    .agg(pl.sum("amount") - pl.sum("discount")
         .alias("total"))
)

Note that polars does not automatically sort the country column (dplyr does). If you want the sorting, add .sort("country")

Now we remove everything 10x larger than the median:

# dplyr:
# purchases |>
#  filter(amount <= median(amount) * 10) |>
#  group_by(country) |> 
#  summarize(total = sum(amount - discount))

(
    purchases
    .filter(pl.col("amount") < pl.median("amount") * 10)
    .group_by("country")
    .agg(pl.sum("amount") - pl.sum("discount")
         .alias("total"))
)

Finally, we should use the median within each country:

# dplyr:
# purchases |>
#  group_by(country) |>                     
#  filter(amount <= median(amount) * 10) |> 
#  summarize(total = sum(amount - discount))

(
    purchases
    .filter(pl.col("amount") < (pl.median("amount") * 10).over("country"))
    .group_by("country")
    .agg(pl.sum("amount") - pl.sum("discount")
         .alias("total"))
    .sort("country")
)

The last one was a bit tricky. Simply switching the group_by and filter lines like in dplyr does not work in polars.

Twitter, Facebook