`causal-learn` offers a powerful toolkit for causal discovery, enabling researchers and practitioners to uncover causal relationships from observational data.
In this blog post, we’ll explore how to leverage causal-learn to identify and analyze causal structures, providing a step-by-step guide to get you started on your causal discovery journey.
First, let’s generate some data (inspired by this notebook):

```
import numpy as np
from causallearn.search.FCMBased import lingam
from causallearn.search.FCMBased.lingam.utils import make_dot
import networkx as nx
import seaborn as sns
N = 1000
q = np.random.uniform(0, 2, N)
w = np.random.randn(N)
x = np.random.gumbel(0, 1, N) + w
y = 0.6 * q + 0.8 * w + np.random.uniform(0, 1, N)
z = 0.5 * x + np.random.randn(N)
data = np.stack([x, y, w, z, q]).T
```

This is the DAG of our data:

```
nodes = ['X', 'Y', 'W', 'Z', 'Q']
edges = [
    ('W', 'X'),
    ('W', 'Y'),
    ('Q', 'Y'),
    ('X', 'Z'),
]
fci_graph = nx.DiGraph()
fci_graph.add_nodes_from(nodes)
fci_graph.add_edges_from(edges)
nx.draw(
    G=fci_graph,
    node_color='#00B0F0',
    nodelist=nodes,
    with_labels=True,
    pos=nx.circular_layout(fci_graph),
)
```

Now, let’s estimate the causal graph using the LiNGAM method:

```
model = lingam.DirectLiNGAM(random_state=42)
model.fit(data)
```

`causal-learn` has a built-in function to visualize the estimated DAG:

```
make_dot(model.adjacency_matrix_, labels=nodes)
```

We can also use `networkx` to visualize the result:

```
G = nx.DiGraph(model.adjacency_matrix_.T)
nx.draw(G, with_labels=True, pos=nx.circular_layout(G))
```

We see that LiNGAM estimates the true DAG quite accurately! Another way to visualize the adjacency matrix is a heatmap:

```
sns.heatmap(model.adjacency_matrix_.T, cmap="rocket_r", cbar=False)
```

Note that the edge from X to Z has a small estimated weight; if we threshold the adjacency matrix at 0.5, it actually disappears:

```
sns.heatmap(model.adjacency_matrix_.T>0.5, cmap="rocket_r", cbar=False)
```
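If you prefer an explicit edge list over a heatmap, the same thresholding can be applied directly to the matrix. Here is a small numpy-only sketch; the weight matrix is made up for illustration and merely stands in for `model.adjacency_matrix_` (whose entry `[i, j]` holds the estimated effect of variable `j` on variable `i`, hence the transposes above):

```python
import numpy as np

nodes = ['X', 'Y', 'W', 'Z', 'Q']
# hypothetical weights shaped like model.adjacency_matrix_:
# entry [i, j] is the estimated effect of nodes[j] on nodes[i]
adj = np.array([
    [0.0, 0.0, 0.9, 0.0, 0.0],  # X <- W
    [0.0, 0.0, 0.8, 0.0, 0.6],  # Y <- W, Y <- Q
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0, 0.0],  # Z <- X (weak edge)
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

def edges(adj, threshold=0.0):
    """Return (cause, effect) pairs whose |weight| exceeds the threshold."""
    return [(nodes[j], nodes[i])
            for i, j in zip(*np.nonzero(np.abs(adj) > threshold))]

print(edges(adj))       # [('W', 'X'), ('W', 'Y'), ('Q', 'Y'), ('X', 'Z')]
print(edges(adj, 0.5))  # the weak X -> Z edge drops out
```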

Enter SwiftData, a powerful framework introduced by Apple to streamline data management in Swift applications. SwiftData not only simplifies the process of saving and retrieving various types of data but also integrates seamlessly with SwiftUI, allowing developers to create robust and scalable apps with minimal effort.

In this blog post, we will delve into a practical example of using SwiftData to save images selected by the user through a `PhotosPicker`. We will walk through the entire process, from setting up your data model to displaying a list of saved images with descriptions. By the end of this guide, you’ll have a solid understanding of how to leverage SwiftData to enhance your app’s functionality and provide a smooth user experience. Let’s get started!

Using SwiftData requires the following steps:

The model class has to use the `@Model` macro, and we annotate the image property with `@Attribute(.externalStorage)` so the image data is stored outside the database. Don’t forget to import `SwiftData`.
In our app, we want to save an image and a description of it. The class looks like this:

```
import Foundation
import SwiftData
import SwiftUI

@Model
class Item {
    var descript: String

    @Attribute(.externalStorage)
    var image: Data?

    init(descript: String = "", image: Data? = nil) {
        self.descript = descript
        self.image = image
    }
}
```

You need to add a `modelContainer` to the `WindowGroup` of your app:

```
import SwiftData
import SwiftUI

@main
struct PhotoRememberApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
        .modelContainer(for: Item.self)
    }
}
```

Inside your views, you access the model context through the environment:

```
@Environment(\.modelContext) private var modelContext
```

To select a photo, we define a `PhotosPicker`:

```
PhotosPicker(
    selection: $selectedPhoto,
    matching: .images,
    photoLibrary: .shared()
) {
    Text("Select a Photo")
}
.onChange(of: selectedPhoto) {
    Task {
        await loadImageData(from: selectedPhoto)
    }
}
```

Then we build a save button that calls a function to save this image to SwiftData:

```
private func saveImage(_ descript: String, _ data: Data) {
    let newItem = Item(descript: descript, image: data)
    modelContext.insert(newItem)
    do {
        try modelContext.save()
    } catch {
        print("Failed to save context: \(error)")
    }
    // Reset selected photo and imageData after saving
    selectedPhoto = nil
    imageData = nil
    imageDescription = ""
}
```

Let’s put everything together:

```
import SwiftUI
import PhotosUI
import SwiftData

struct ContentView: View {
    @Environment(\.modelContext) private var modelContext
    @Query private var items: [Item]
    @State private var selectedPhoto: PhotosPickerItem?
    @State private var imageData: Data?
    @State private var imageDescription: String = ""

    var body: some View {
        NavigationView {
            VStack {
                List(items) { item in
                    HStack {
                        if let imageData = item.image, let uiImage = UIImage(data: imageData) {
                            Image(uiImage: uiImage)
                                .resizable()
                                .scaledToFill()
                                .frame(width: 50, height: 50)
                                .clipShape(RoundedRectangle(cornerRadius: 8))
                        } else {
                            Rectangle()
                                .fill(Color.gray)
                                .frame(width: 50, height: 50)
                                .clipShape(RoundedRectangle(cornerRadius: 8))
                        }
                        Text(item.descript)
                    }
                }
                .navigationTitle("Photo Selector")

                if let imageData = imageData, let uiImage = UIImage(data: imageData) {
                    Image(uiImage: uiImage)
                        .resizable()
                        .scaledToFit()
                        .frame(height: 200)
                } else {
                    Text("No Image Selected")
                        .frame(height: 200)
                }

                PhotosPicker(
                    selection: $selectedPhoto,
                    matching: .images,
                    photoLibrary: .shared()
                ) {
                    Text("Select a Photo")
                }
                .onChange(of: selectedPhoto) {
                    Task {
                        await loadImageData(from: selectedPhoto)
                    }
                }

                if imageData != nil {
                    TextField("Enter image description", text: $imageDescription)
                        .textFieldStyle(RoundedBorderTextFieldStyle())
                        .padding()
                    Button("Save Image") {
                        if let imageData = imageData {
                            saveImage(imageDescription, imageData)
                        }
                    }
                    .padding()
                }
            }
            .padding()
        }
    }

    private func loadImageData(from item: PhotosPickerItem?) async {
        if let data = try? await item?.loadTransferable(type: Data.self) {
            self.imageData = data
        }
    }

    private func saveImage(_ descript: String, _ data: Data) {
        let newItem = Item(descript: descript, image: data)
        modelContext.insert(newItem)
        do {
            try modelContext.save()
        } catch {
            print("Failed to save context: \(error)")
        }
        // Reset selected photo and imageData after saving
        selectedPhoto = nil
        imageData = nil
        imageDescription = ""
    }
}

struct ContentView_Previews: PreviewProvider {
    static var previews: some View {
        ContentView()
    }
}
```

In this example, we’ll import the following JSON data set of current astronauts in space as a string. This data is available from this repo of awesome JSON datasets.

```
{"people": [{"name": "Jasmin Moghbeli", "craft": "ISS"},
{"name": "Andreas Mogensen", "craft": "ISS"},
{"name": "Satoshi Furukawa", "craft": "ISS"},
{"name": "Konstantin Borisov", "craft": "ISS"},
{"name": "Oleg Kononenko", "craft": "ISS"},
{"name": "Nikolai Chub", "craft": "ISS"},
{"name": "Loral O'Hara", "craft": "ISS"}]}
```

To do this, we first have to create a struct that conforms to the hierarchical structure of the JSON data.
In this case, the hierarchy is an array of `people`, where each entry contains the `name` and `craft` of an astronaut.

```
struct Astronauts: Codable {
    let people: [People]
}

struct People: Codable {
    let name: String
    let craft: String
}
```

Note the `Codable` conformance. `Codable` is a type alias for the `Encodable` and `Decodable` protocols. These protocols allow a type to convert itself into and out of an external representation, such as JSON.
With `Codable`, you can use the `JSONDecoder` class to decode JSON data into Swift objects, and then assign those objects to properties in SwiftUI views. This way, you can load JSON data efficiently and update your user interface accordingly.

Now we can import the data as a string:

```
let input = """
{"people": [{"name": "Jasmin Moghbeli", "craft": "ISS"},
{"name": "Andreas Mogensen", "craft": "ISS"},
{"name": "Satoshi Furukawa", "craft": "ISS"},
{"name": "Konstantin Borisov", "craft": "ISS"},
{"name": "Oleg Kononenko", "craft": "ISS"},
{"name": "Nikolai Chub", "craft": "ISS"},
{"name": "Loral O'Hara", "craft": "ISS"}]}
"""
```

then turn the string into `Data` and create a `JSONDecoder`:

```
let data = Data(input.utf8)
let decoder = JSONDecoder()
```

and finally decode and show the data:

```
if let astronauts = try? decoder.decode(Astronauts.self, from: data) {
    List(astronauts.people, id: \.name) { person in
        VStack(alignment: .leading) {
            Text(person.name)
                .font(.headline)
            Text(person.craft)
                .font(.subheadline)
        }
    }
} else {
    Text("Failed to load astronauts.")
}
```

In this section we will load the exact same data, but from a file instead of a string variable.
First, copy the data to a file `astronauts.json` and add it to your Xcode project.

In SwiftUI, you can use `Bundle` to access files bundled with your app, so we write an extension for it to decode JSON files:

```
extension Bundle {
    func decode(_ file: String) -> Astronauts {
        guard let url = self.url(forResource: file, withExtension: nil) else {
            fatalError("Failed to locate \(file).")
        }
        guard let data = try? Data(contentsOf: url) else {
            fatalError("Failed to load \(file).")
        }
        guard let loaded = try? JSONDecoder().decode(Astronauts.self, from: data) else {
            fatalError("Failed to decode \(file).")
        }
        return loaded
    }
}
```

This makes reading the data extremely easy:

```
let astronauts = Bundle.main.decode("astronauts.json")
```

and we can use the same `List` structure as previously to display the data.

Let’s consider this structural graph:

In this example, we want to estimate the effect of \(X\) on \(Y\).
However, examining the model structure reveals a clear **causal independence** between variables \(X\) and \(Y\). There’s no arrow between them, nor is there a directed path that would connect them indirectly. We will now construct **four models** and investigate the impact of controlling for different variables on the emergence of spurious relationships between \(X\) and \(Y\):

- The first one is a simple model that regresses \(Y\) on \(X\): \(Y \sim X\)
- Then, we’ll add \(A\) to this model: \(Y \sim X + A\)
- Next, we’ll fit a model without \(A\), but \(B\): \(Y \sim X + B\)
- Finally, we’ll build a model with all four variables: \(Y \sim X + A + B\)

Based on your intuition, which of the four models do you believe will accurately represent the causal independence between \(X\) and \(Y\)?

Let’s generate some data:

```
import numpy as np
import statsmodels.api as sm
np.random.seed(42)
NSAMPLES = 3000
a = np.random.randn(NSAMPLES)
x = 2 * a + np.random.randn(NSAMPLES)
y = 2 * a + np.random.randn(NSAMPLES)
b = 1.7 * x + 0.8 * y
```

This is the simple model that regresses \(Y\) on \(X\):

```
X1 = sm.add_constant(x)
model_1 = sm.OLS(y, X1).fit()
```

This results in

```
print(model_1.summary(xname=['const', 'x']))
```

```
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.630
Model: OLS Adj. R-squared: 0.630
Method: Least Squares F-statistic: 5114.
Date: Mon, 04 Mar 2024 Prob (F-statistic): 0.00
Time: 12:59:41 Log-Likelihood: -5173.1
No. Observations: 3000 AIC: 1.035e+04
Df Residuals: 2998 BIC: 1.036e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0465 0.025 1.874 0.061 -0.002 0.095
x 0.8030 0.011 71.510 0.000 0.781 0.825
==============================================================================
Omnibus: 3.944 Durbin-Watson: 2.047
Prob(Omnibus): 0.139 Jarque-Bera (JB): 4.150
Skew: 0.043 Prob(JB): 0.126
Kurtosis: 3.161 Cond. No. 2.21
==============================================================================
```

Apparently, this model finds a spurious effect of \(X\) on \(Y\), indicated by the low p-value (< 0.001).

In model 2, we add \(A\) as a covariate:

```
X2 = sm.add_constant(np.stack([x, a]).T)
model_2 = sm.OLS(y, X2).fit()
print(model_2.summary(xname=['const', 'x', 'a']))
```

```
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.789
Model: OLS Adj. R-squared: 0.789
Method: Least Squares F-statistic: 5600.
Date: Mon, 04 Mar 2024 Prob (F-statistic): 0.00
Time: 13:11:43 Log-Likelihood: -4333.0
No. Observations: 3000 AIC: 8672.
Df Residuals: 2997 BIC: 8690.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0036 0.019 0.190 0.850 -0.033 0.040
x 0.0192 0.019 1.034 0.301 -0.017 0.056
a 1.9714 0.042 47.435 0.000 1.890 2.053
==============================================================================
Omnibus: 0.439 Durbin-Watson: 2.023
Prob(Omnibus): 0.803 Jarque-Bera (JB): 0.394
Skew: 0.024 Prob(JB): 0.821
Kurtosis: 3.028 Cond. No. 5.70
==============================================================================
```

This model seems to recognize the causal independence of \(X\) and \(Y\) correctly (large p-value for \(X\), suggesting the lack of significance).

In this model, we add \(B\) as a covariate instead of \(A\):

```
X3 = sm.add_constant(np.stack([x, b]).T)
model_3 = sm.OLS(y, X3).fit()
print(model_3.summary(xname=['const', 'x', 'b']))
```

```
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.002e+32
Date: Mon, 04 Mar 2024 Prob (F-statistic): 0.00
Time: 13:18:27 Log-Likelihood: 92892.
No. Observations: 3000 AIC: -1.858e+05
Df Residuals: 2997 BIC: -1.858e+05
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.932e-16 1.58e-16 1.858 0.063 -1.63e-17 6.03e-16
x -2.1250 3.48e-16 -6.11e+15 0.000 -2.125 -2.125
b 1.2500 1.45e-16 8.61e+15 0.000 1.250 1.250
==============================================================================
Omnibus: 1.629 Durbin-Watson: 2.010
Prob(Omnibus): 0.443 Jarque-Bera (JB): 1.563
Skew: 0.038 Prob(JB): 0.458
Kurtosis: 3.082 Cond. No. 13.6
==============================================================================
```

Again, this model finds a spurious effect of \(X\) on \(Y\), indicated by the low p-value (< 0.001). Interestingly, compared with model 1, this time the effect is negative.

Time for the last model, which includes both \(A\) and \(B\):

```
X4 = sm.add_constant(np.stack([x, a, b]).T)
model_4 = sm.OLS(y, X4).fit()
print(model_4.summary(xname=['const', 'x', 'a', 'b']))
```

```
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.487e+32
Date: Mon, 04 Mar 2024 Prob (F-statistic): 0.00
Time: 13:22:49 Log-Likelihood: 94094.
No. Observations: 3000 AIC: -1.882e+05
Df Residuals: 2996 BIC: -1.882e+05
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.515e-16 1.06e-16 2.377 0.018 4.4e-17 4.59e-16
x -2.1250 2.45e-16 -8.69e+15 0.000 -2.125 -2.125
a 6.661e-16 3.1e-16 2.147 0.032 5.79e-17 1.27e-15
b 1.2500 1.29e-16 9.7e+15 0.000 1.250 1.250
==============================================================================
Omnibus: 0.182 Durbin-Watson: 2.031
Prob(Omnibus): 0.913 Jarque-Bera (JB): 0.171
Skew: -0.018 Prob(JB): 0.918
Kurtosis: 3.003 Cond. No. 19.0
==============================================================================
```

This model finds a spurious effect of \(X\) on \(Y\), again with a negative effect.

The only model that recognized the causal independence of \(X\) and \(Y\) correctly (large p-value for \(X\), suggesting the lack of significance) is the second model \(Y \sim X + A\). Interestingly, all other statistical control schemes yielded invalid results, including the model without any additional variables accounted for.

Why did controlling for \(A\) succeed while other approaches failed? There are three key factors to consider:

- **Confounding Control**: \(A\) serves as a confounder between \(X\) and \(Y\), and we need to control for it in order to remove confounding.
- **Collider Effect**: \(X\), \(Y\), and \(B\) exhibit a pattern known as a collider. Remarkably, this pattern facilitates the flow of information between the parent variables (\(X\) and \(Y\)) when the child variable (\(B\)) is controlled for, a stark contrast to the outcome when \(A\) is controlled for.
- **Effect of Variable Control**: Interestingly, not controlling for any variable produces the same outcome regarding the significance of \(X\) as controlling for both \(A\) and \(B\). While the coefficient results may differ, focusing on the structural properties of the system reveals that the effects of controlling for \(A\) and \(B\) are diametrically opposite, effectively nullifying each other’s impact.
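These points can also be checked numerically. Below is a minimal sketch using plain numpy least squares (rather than statsmodels) on freshly simulated data with the same structural equations as above; the helper `coef_of_x` is ours, not a library function:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
a = rng.standard_normal(n)          # confounder A
x = 2 * a + rng.standard_normal(n)  # A -> X
y = 2 * a + rng.standard_normal(n)  # A -> Y (there is no X -> Y edge)
b = 1.7 * x + 0.8 * y               # collider X -> B <- Y

def coef_of_x(*covariates):
    """OLS of y on [const, x, *covariates]; returns the coefficient on x."""
    X = np.column_stack([np.ones(n), x, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

coef_of_x()      # ~ 0.8: spurious, confounding through A
coef_of_x(a)     # ~ 0.0: controlling for A blocks the backdoor path
coef_of_x(b)     # -2.125 = -1.7 / 0.8: conditioning on the collider B
coef_of_x(a, b)  # -2.125 again: adding A does not undo the collider bias
```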

Cinelli, C., Forney, A., & Pearl, J. (2022). A Crash Course in Good and Bad Controls. Sociological Methods & Research, 0 (0), 1-34 https://ftp.cs.ucla.edu/pub/stat_ser/r493.pdf

In this post, we translate some `dplyr` examples to `polars`.
Reading the data is pretty straightforward:

```
import polars as pl
purchases = pl.read_csv("purchases.csv")
```

get the sum of `amount`:

```
# dplyr:
# purchases$amount |> sum()
# actually the "pure" tidy approach would be:
# purchases |> select(amount) |> sum 😉
purchases.select("amount").sum()
```

same by country:

```
# dplyr:
# purchases |>
#   group_by(country) |>
#   summarize(total = sum(amount))
(
    purchases
    .group_by("country")
    .agg(pl.sum("amount").alias("total"))
)
```

deduct the `discount`:

```
# dplyr:
# purchases |>
#   group_by(country) |>
#   summarize(total = sum(amount - discount))
(
    purchases
    .group_by("country")
    .agg((pl.sum("amount") - pl.sum("discount")).alias("total"))
)
```

Note that `polars` does not automatically sort the `country` column (dplyr does).
If you want the sorting, add `.sort("country")`.
Now we remove everything 10x larger than the median:

```
# dplyr:
# purchases |>
#   filter(amount <= median(amount) * 10) |>
#   group_by(country) |>
#   summarize(total = sum(amount - discount))
(
    purchases
    .filter(pl.col("amount") <= pl.median("amount") * 10)
    .group_by("country")
    .agg((pl.sum("amount") - pl.sum("discount")).alias("total"))
)
```

Finally, we should use the median *within* each country:

```
# dplyr:
# purchases |>
#   group_by(country) |>
#   filter(amount <= median(amount) * 10) |>
#   summarize(total = sum(amount - discount))
(
    purchases
    .filter(pl.col("amount") <= (pl.median("amount") * 10).over("country"))
    .group_by("country")
    .agg((pl.sum("amount") - pl.sum("discount")).alias("total"))
    .sort("country")
)
```

The last one was a bit tricky. Simply switching the `group_by` and `filter` lines like in dplyr does not work in polars; instead, the per-country median comes from the `.over("country")` window expression.

To generalize our understanding, let’s introduce notation for the potential outcomes framework:

- \(Z\) denotes the treatment/exposure condition, where \(Z = 1\) represents treatment and \(Z = 0\) represents control.
- \(Y\) represents the observed outcome variable.

In this framework, we explore how observed \(Y\) values would vary with different treatments:

- For binary \(Z\), we denote potential outcomes as \(Y^1\) (under treatment, \(Z = 1\)) and \(Y^0\) (under control, \(Z = 0\)).

In reality, individuals are assigned to only one treatment group at a time, so we can’t know both potential outcomes for an individual simultaneously. We’ll address this later; for now, assume we know both \(Y^0\) and \(Y^1\) values.

If we had access to both potential outcomes for each individual, we could derive various statistics to assess treatment effects:

- **Individual Treatment Effect (ITE)**: Calculated as \(Y^1 - Y^0\), directly comparing potential outcomes.
- **Average Treatment Effect (ATE)**: Obtained by averaging all individual treatment effects, representing the difference between the averages of \(Y^1\) and \(Y^0\).

However, a crucial challenge arises: how do we compute ITE or the true ATE when we can’t observe the counterfactual outcome? This highlights the core issue of causal inference, which is essentially grappling with missing data. Since we only witness the actual outcome, the counterfactual remains elusive.

The following table displays data for 12 hospital patients who either interacted with a stress-relief toy (\(Z = 1\)) or did not interact with a stress-relief toy (\(Z = 0\)). The theoretical data contains both potential outcomes, allowing us to compute the true ATE by subtracting the potential outcome averages. The true ATE is -5.8, indicating that interaction with stress-relief toys results in an average decrease in cortisol levels of 5.8 units.

| Z | Y¹ | Y⁰ | Y¹ - Y⁰ |
|---|----|----|---------|
| 1 | 18 | 23 | -5 |
| 0 | 17 | 22 | -5 |
| 1 | 15 | 23 | -8 |
| 1 | 16 | 24 | -8 |
| 0 | 17 | 23 | -6 |
| 1 | 15 | 21 | -6 |
| 0 | 9 | 14 | -5 |
| 1 | 8 | 15 | -7 |
| 1 | 10 | 14 | -4 |
| 0 | 8 | 13 | -5 |
| 0 | 8 | 14 | -6 |
| 1 | 10 | 15 | -5 |
| **Average** | 12.6 | 18.4 | -5.8 |
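With both potential outcomes available, the true ATE is just the mean of the individual effects. A quick Python sketch using the values from the table:

```python
from statistics import mean

# (Z, Y1, Y0) for the twelve patients in the table above
patients = [
    (1, 18, 23), (0, 17, 22), (1, 15, 23), (1, 16, 24),
    (0, 17, 23), (1, 15, 21), (0, 9, 14), (1, 8, 15),
    (1, 10, 14), (0, 8, 13), (0, 8, 14), (1, 10, 15),
]

ites = [y1 - y0 for _, y1, y0 in patients]  # individual treatment effects
true_ate = mean(ites)                       # equals mean(Y1) - mean(Y0)
print(round(true_ate, 1))                   # -5.8
```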

Real data includes only one potential outcome per individual. Notably, \(Y^1\) values are absent for individuals in the control group (\(Z = 0\)), while \(Y^0\) values are absent for those in the treatment group (\(Z = 1\)). Consequently, in practical scenarios, we consistently lack half of the necessary data to compute the true ATE.

| Z | Y¹ | Y⁰ | Y¹ - Y⁰ |
|---|----|----|---------|
| 1 | 18 | ? | ? |
| 0 | ? | 22 | ? |
| 0 | ? | 23 | ? |
| 1 | 16 | ? | ? |
| 0 | ? | 23 | ? |
| 1 | 15 | ? | ? |
| 0 | ? | 14 | ? |
| 1 | 8 | ? | ? |
| 1 | 10 | ? | ? |
| 0 | ? | 13 | ? |
| 0 | ? | 14 | ? |
| 1 | 10 | ? | ? |
| **Average** | ? | ? | ? |

To estimate causal effects without knowing both potential outcomes for individuals, we rely on **randomization**, a method of treatment assignment akin to a coin flip. This ensures similarity between treatment groups, except for the treatment itself. While we can’t observe individual counterfactual outcomes, we can reasonably assume similar individuals received each treatment, allowing us to estimate unobserved potential outcomes.

With randomization, we estimate the ATE by comparing average observed outcomes between treatment and control groups. In our example, the estimated ATE is -5.4, closely aligning with the true ATE of -5.8 calculated earlier.

**Average \(Y\) for \(Z = 1\)**: (18 + 16 + 15 + 8 + 10 + 10) / 6 = 12.8

**Average \(Y\) for \(Z = 0\)**: (22 + 23 + 23 + 14 + 13 + 14) / 6 = 18.2

**Estimated ATE from Randomized Treatment**: 12.8 - 18.2 = -5.4

When randomization isn’t feasible due to ethical or practical reasons, bias can affect causal effect estimates. Selection bias is a primary concern, occurring when individuals are assigned to treatment or control groups non-randomly.

In our stress-relief toy example, selection bias may occur if:

- Individuals self-select for therapy.
- Control group individuals originate from a different hospital.
- Access to therapy is tied to external factors like insurance.

Variables associated with treatment assignment that also affect the outcome (cortisol level) are confounders. These can lead to erroneous conclusions about treatment impact.

To estimate the ATE in the presence of confounders, we must address how treatment assignment impacts outcomes when randomization isn’t feasible. Consider our stress-relief example:

Instead of random assignment, let’s imagine patients can choose therapy. Additionally, we have data on a new confounding variable \(X\) indicating anxiety diagnosis (\(X = 1\)) or absence (\(X = 0\)). Anxiety status influences treatment choice and cortisol levels: those with anxiety may opt for therapy and have higher cortisol levels.

This scenario poses a problem as it could lead to an imbalance in anxiety levels between treatment groups. Conditional exchangeability is crucial here:

- Conditional exchangeability ensures treatment groups are comparable when considering confounding variables.
- This concept, also known as ignorability or unconfoundedness, prevents biased estimates.

By accounting for anxiety diagnosis (variable \(X\)), we avoid biased cortisol level estimates between therapy-receiving and non-receiving groups.

For example, the following table includes data for twelve hospital patients who self-selected into treatment with stress-relief toys. Variables include anxiety diagnosis (\(X=1\) for anxiety, \(X=0\) for no anxiety) and treatment assignment (\(Z=1\) for therapy, \(Z=0\) for no therapy). As this reflects reality, only the observed outcome \(Y\) (cortisol level) is available.

Without considering anxiety (\(X\)), computing the estimated ATE as if treatment were randomized yields -2.9. This suggests that stress-relief toys reduce cortisol levels by an average of 2.9 units.

| X | Z | Y |
|---|---|---|
| 1 | 1 | 18 |
| 1 | 1 | 17 |
| 1 | 1 | 15 |
| 1 | 1 | 16 |
| 1 | 0 | 23 |
| 1 | 0 | 21 |
| 0 | 1 | 9 |
| 0 | 1 | 8 |
| 0 | 0 | 14 |
| 0 | 0 | 13 |
| 0 | 0 | 14 |
| 0 | 0 | 15 |

**Average \(Y\) for \(Z=1\)**: (18+17+15+16+9+8)/6 = 13.8

**Average \(Y\) for \(Z=0\)**: (23+21+14+13+14+15)/6 = 16.7

**Estimated ATE**: 13.8 - 16.7 = -2.9

To address the anxiety variable, we compute the ATE specifically for patients diagnosed with anxiety disorder (\(X=1\)). By subtracting the average cortisol level for control patients from the treated group, we aim to ensure similar cortisol levels before therapy. This yields an estimated ATE of -5.5 for individuals with anxiety, significantly higher than the initial estimate of -2.9.

| X | Z | Y |
|---|---|---|
| 1 | 1 | 18 |
| 1 | 1 | 17 |
| 1 | 1 | 15 |
| 1 | 1 | 16 |
| 1 | 0 | 23 |
| 1 | 0 | 21 |

**Average \(Y\) for \(Z=1\)**: (18+17+15+16)/4 = 16.5

**Average \(Y\) for \(Z=0\)**: (23+21)/2= 22.0

**Estimated ATE for \(X=1\)**: 16.5 - 22.0 = -5.5

Next, we calculate the ATE specifically for patients without an anxiety disorder diagnosis (\(X=0\)). The estimated ATE for individuals without anxiety is also -5.5.

| X | Z | Y |
|---|---|---|
| 0 | 1 | 9 |
| 0 | 1 | 8 |
| 0 | 0 | 14 |
| 0 | 0 | 13 |
| 0 | 0 | 14 |
| 0 | 0 | 15 |

**Average \(Y\) for \(Z=1\)**: (9+8)/2 = 8.5

**Average \(Y\) for \(Z=0\)**: (14+13+14+15)/4 = 14.0

**Estimated ATE for \(X=0\)**: 8.5 - 14.0 = -5.5

When considering \(X\), averaging the ATEs of both groups yields an estimated ATE of -5.5. This is much closer to the true ATE of -5.8, which indicates that stress-relief toys reduce cortisol levels by an average of 5.8 units. Accounting for confounding significantly improved our estimate’s accuracy.

| ATE Estimate | Value |
|---|---|
| Ignoring X | -2.9 |
| Accounting for X | -5.5 |
| True Value | -5.8 |
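The whole comparison can be reproduced in a few lines of Python. The rounding of the group means mirrors the hand calculations above, and the stratified estimate simply averages the two per-stratum ATEs (the strata are equally sized here):

```python
from statistics import mean

# observed data (X, Z, Y) for the twelve self-selected patients
rows = [
    (1, 1, 18), (1, 1, 17), (1, 1, 15), (1, 1, 16), (1, 0, 23), (1, 0, 21),
    (0, 1, 9), (0, 1, 8), (0, 0, 14), (0, 0, 13), (0, 0, 14), (0, 0, 15),
]

def ate(subset):
    """Difference of rounded group means: mean(Y | Z=1) - mean(Y | Z=0)."""
    treated = [y for _, z, y in subset if z == 1]
    control = [y for _, z, y in subset if z == 0]
    return round(mean(treated), 1) - round(mean(control), 1)

naive = ate(rows)  # ignores X: 13.8 - 16.7, i.e. about -2.9
stratified = mean(ate([r for r in rows if r[0] == x]) for x in (0, 1))  # -5.5
```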

This is a work in progress.

```
# load libraries and import mpg data
import seaborn.objects as so
import polars as pl
mpg = pl.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv")
```

there’s also `so.Dots()`, which sometimes looks nicer, but I stick to `so.Dot()` to keep it similar to ggplot.

**ggplot**

```
ggplot(mpg, aes(displ, hwy)) +
  geom_point()
```

**seaborn**

```
(
    so.Plot(mpg, x="displ", y="hwy")
    .add(so.Dot())
)
```

map the class variable to colour

**ggplot**

```
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
```

**seaborn**

```
(
    so.Plot(mpg, x="displ", y="hwy", color="class")
    .add(so.Dot())
)
```

map the class variable to pointsize

**ggplot**

```
ggplot(mpg, aes(displ, hwy, size = class)) +
  geom_point()
```

**seaborn**

```
(
    so.Plot(mpg, x="displ", y="hwy", pointsize="class")
    .add(so.Dot())
)
```

map the class variable to alpha

**ggplot**

```
ggplot(mpg, aes(displ, hwy, alpha = class)) +
  geom_point()
```

**seaborn**

```
(
    so.Plot(mpg, x="displ", y="hwy", alpha="class")
    .add(so.Dot())
)
```

map the class variable to shape

**ggplot**

```
ggplot(mpg, aes(displ, hwy, shape = class)) +
  geom_point()
```

**seaborn**

```
(
    so.Plot(mpg, x="displ", y="hwy", marker="class")
    .add(so.Dot())
)
```

**ggplot**

```
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)
```

**seaborn**

the `wrap=3` argument limits it to 3 plots per column, like the ggplot example:

```
(
    so.Plot(mpg, x="displ", y="hwy")
    .add(so.Dot())
    .facet("class", wrap=3)
)
```

**ggplot**

```
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
```

**seaborn**

Unfortunately, seaborn objects does not have an option for a confidence band yet (see the StackOverflow discussion), and the smoother is not LOESS (as in ggplot) but a polynomial fit:

```
(
    so.Plot(mpg, x="displ", y="hwy")
    .add(so.Dot(color="black"))
    .add(so.Line(), so.PolyFit(order=5))
)
```

seaborn objects does not yet support boxplots and violinplots

**ggplot**

```
ggplot(mpg, aes(hwy)) + geom_histogram()
```

**seaborn**

```
(
    so.Plot(mpg, x="hwy")
    .add(so.Bars(), so.Hist(bins=30))
)
```

seaborn does not support frequency polygons, so we use KDE instead

**ggplot**

```
ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)
```

**seaborn**

```
(
    so.Plot(mpg, x="hwy")
    .add(so.Area(), so.KDE(bw_adjust=0.2))
)
```

**ggplot**

```
ggplot(mpg, aes(manufacturer)) +
  geom_bar()
```

**seaborn**

```
(
    so.Plot(mpg, x="manufacturer")
    .add(so.Bar(), so.Hist())
)
```

**ggplot**

```
ggplot(economics, aes(date, uempmed)) +
  geom_line()
```

**seaborn**

```
economics = pl.read_csv(
    "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/economics.csv",
    try_parse_dates=True,
    dtypes={"pop": pl.Float32},
)
(
    so.Plot(economics.to_pandas(), x="date", y="uempmed")
    .add(so.Path())
)
```

However, if you are a Python user, you may not have a clear and easy choice for data manipulation. The (by far) most popular package is **pandas**, but I always found its syntax confusing, especially if you have a strong dplyr background.

That is, until now. After reading Emily Riederer’s excellent blog I discovered **polars**, a new Python library for data manipulation that aims to fill the gap between dplyr and pandas.
Polars is an expressive library that offers a syntax similar to the tidyverse, while being extremely fast and scalable.
It leverages Apache Arrow as its underlying data structure and is written in Rust, which enables efficient memory management and interoperability with other tools.
It also supports lazy evaluation, parallel processing, and query optimization, which make it suitable for working with large and complex data sets.
According to some benchmarks, polars is one of the fastest tools for handling and manipulating data.

In this blog post, we will compare the dplyr and polars libraries and see how they can help us perform common data manipulation tasks.
We will use the simple data set from the **palmerpenguins** package, the “new Iris data set”. We will see how to load, filter, group, join, and reshape the data using both libraries, and compare their syntax and output.

Reading a csv file in **polars** is extremely easy (and similar to pandas).
The following code reads in the *palmerpenguins* data set (from a GitHub page).
In polars, **NA** values are **null**, so it’s important to specify this using the **null_values** parameter.

```
import polars as pl
df = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA")
```

polars also supports a `scan_csv` function for lazy loading; this is extremely useful for large datasets.

In **R**, for convenience, we use the **palmerpenguins** package, which already contains the penguins data set:

```
library(tidyverse)
library(palmerpenguins)
attach(penguins)
```

By entering `df` we get the first few rows of the data set, similar to the tidyverse:

```
> df
shape: (344, 9)
┌───────┬───────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 181 ┆ 3750 ┆ male ┆ 2007 │
│ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 186 ┆ 3800 ┆ female ┆ 2007 │
│ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 195 ┆ 3250 ┆ female ┆ 2007 │
│ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2007 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 202 ┆ 3400 ┆ female ┆ 2009 │
│ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 193 ┆ 3775 ┆ male ┆ 2009 │
│ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 210 ┆ 4100 ┆ male ┆ 2009 │
│ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 198 ┆ 3775 ┆ female ┆ 2009 │
└───────┴───────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
```

The following table compares the main functions of *polars* with the R package *dplyr*:

| | dplyr | polars |
| --- | --- | --- |
| first `n` rows | `head(df, n)` | `df.head(n)` |
| pick column | `select(df, x)` | `df.select(pl.col("x"))` |
| pick multiple columns | `select(df, x, y)` | `df.select(pl.col("x", "y"))` |
| pick rows | `filter(df, x > 4)` | `df.filter(pl.col("x") > 4)` |
| sort column | `arrange(df, x)` | `df.sort("x")` |

As you can see, these commands are essentially the same in *dplyr* and *polars*.

**For example**, we want to get the `bill_length_mm` of all penguins with `body_mass_g` below 3800:

```
> df.filter(pl.col("body_mass_g") < 3800).select(pl.col("bill_length_mm"))
shape: (129, 1)
┌────────────────┐
│ bill_length_mm │
│ --- │
│ f64 │
╞════════════════╡
│ 39.1 │
│ 40.3 │
│ 36.7 │
│ 39.3 │
│ … │
│ 45.7 │
│ 43.5 │
│ 49.6 │
│ 50.2 │
└────────────────┘
```

Like in dplyr, polars’ `filter` and `select` have many more capabilities:

| | dplyr | polars |
| --- | --- | --- |
| select all columns except x | `select(df, -x)` | `df.select(pl.exclude("x"))` |
| select all columns that start with “str” | `select(df, starts_with("str"))` | `df.select(pl.col("^str.*$"))` or `df.select(cs.starts_with("str"))` [1] |
| select numeric columns | `select(df, where(is.numeric))` | `df.select(cs.float(), cs.integer())` [1, 2] |
| filter range of values | `filter(df, between(x, lo, hi))` | `df.filter(pl.col("x").is_between(lo, hi))` |

[1] requires `import polars.selectors as cs`

[2] Note that you can also `cast()` columns from one type to another (e.g. Float to Int).

**For example**, return all columns starting with “bill” for the penguin species “Gentoo”:

```
> df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
shape: (124, 2)
┌────────────────┬───────────────┐
│ bill_length_mm ┆ bill_depth_mm │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞════════════════╪═══════════════╡
│ 46.1 ┆ 13.2 │
│ 50.0 ┆ 16.3 │
│ 48.7 ┆ 14.1 │
│ 50.0 ┆ 15.2 │
│ … ┆ … │
│ 46.8 ┆ 14.3 │
│ 50.4 ┆ 15.7 │
│ 45.2 ┆ 14.8 │
│ 49.9 ┆ 16.1 │
└────────────────┴───────────────┘
```

One of the most used commands in my dplyr workflow is the `mutate` function for creating columns.
The polars equivalent is called `with_columns` and works similarly:

| | dplyr | polars |
| --- | --- | --- |
| create new column | `mutate(df, x_mean = mean(x))` | `df.with_columns(pl.col("x").mean().alias("x_mean"))` |
| rename column | `rename(df, new_x = x)` | `df.rename({"x": "new_x"})` |

**For example**, let’s create a new variable with the bill/flipper ratio, called `bill_flipper_ratio`:

```
> df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio"))
shape: (344, 10)
┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐
│ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex ┆ year ┆ bill_flipper_ratio │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ str ┆ i64 ┆ f64 │
╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡
│ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 3750 ┆ male ┆ 2007 ┆ 0.216022 │
│ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 3800 ┆ female ┆ 2007 ┆ 0.212366 │
│ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 3250 ┆ female ┆ 2007 ┆ 0.206667 │
│ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ 2007 ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 3400 ┆ female ┆ 2009 ┆ 0.215347 │
│ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 3775 ┆ male ┆ 2009 ┆ 0.256995 │
│ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 4100 ┆ male ┆ 2009 ┆ 0.241905 │
│ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 3775 ┆ female ┆ 2009 ┆ 0.253535 │
└───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
```

Aggregating and grouping data are essential skills for data analysis, as they allow you to summarize, transform, and manipulate data in meaningful ways.
Again, these commands are very similar between *dplyr* and *polars*:

| | dplyr | polars |
| --- | --- | --- |
| group | `group_by(df, x)` | `df.group_by("x")` |
| summarize | `summarize(df, x_n = n())` | `df.agg(pl.count().alias("x_n"))` |

**For example**, group the data by species and count the number of penguins of each species, then sort in descending order:

```
> df.group_by("species").agg(pl.count().alias("counts")).sort("counts", descending=True)
shape: (3, 2)
┌───────────┬────────┐
│ species ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════╪════════╡
│ Adelie ┆ 152 │
│ Gentoo ┆ 124 │
│ Chinstrap ┆ 68 │
└───────────┴────────┘
```

Another example: for each species, find the penguin with the lowest body mass:

```
> df.group_by("species").agg(pl.all().sort_by("body_mass_g").first())
shape: (3, 9)
┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ 4 ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2007 │
│ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │
│ Gentoo ┆ 272 ┆ Biscoe ┆ null ┆ … ┆ null ┆ null ┆ null ┆ 2009 │
└───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
```

You can see that this result still contains null values, so let’s remove them:

```
> df.group_by("species").agg(pl.all().sort_by("body_mass_g").drop_nulls().first())
shape: (3, 9)
┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
│ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ 4 ┆ Torgersen ┆ 36.5 ┆ … ┆ 181 ┆ 2850 ┆ female ┆ 2007 │
│ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │
│ Gentoo ┆ 272 ┆ Biscoe ┆ 42.7 ┆ … ┆ 208 ┆ 3950 ┆ female ┆ 2009 │
└───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
```

The dplyr equivalent would be something like this:

```
> penguins |>
group_by(species) |>
arrange(body_mass_g) |>
summarize(body_mass_g = first(body_mass_g))
# A tibble: 3 × 2
species body_mass_g
<fct> <int>
1 Adelie 2850
2 Chinstrap 2700
3 Gentoo 3950
```

polars has the same `left_join`, `right_join`, and `inner_join` functionality as dplyr; for examples, please refer to the docs.

| | dplyr | polars |
| --- | --- | --- |
| join dataframes | `left_join(df1, df2, by = "x")` | `df1.join(df2, on="x", how="left")` |

In conclusion, **polars** is a fast and expressive DataFrame library for Python that can handle large-scale data analysis with ease and efficiency. It offers a familiar and intuitive interface, similar to R’s dplyr. Whether you are a data scientist, a data analyst, or a data enthusiast, polars can help you unleash the power of your data.

Read in the data (available on the challenge website):

```
library(tidyverse)
noahs_customers <- read_csv("../noahs-csv/5784/noahs-customers.csv")
```

Clean the data, especially the names, then split them into first and last name:

```
noahs_customers <- noahs_customers |>
mutate(name = str_remove_all(name, " Jr.")) |>
mutate(name = str_remove_all(name, " III")) |>
mutate(name = str_remove_all(name, " II")) |>
mutate(name = str_remove_all(name, " IV")) |>
mutate(phone = str_remove_all(phone, "-")) |>
extract(name, into = c('FirstName', 'LastName'), '(.*)\\s+([^ ]+)$')
```

A function to map letters to their phone buttons:

```
replace_chars <- function(input_string) {
mapping <- letters
replacement <- c("2", "2", "2", "3", "3", "3", "4",
"4", "4", "5", "5", "5", "6", "6", "6", "7", "7",
"7", "7", "8", "8", "8", "9", "9", "9", "9")
for (i in seq_along(mapping)) {
input_string <- gsub(mapping[i], replacement[i], tolower(input_string))
}
return(input_string)
}
```
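The same mapping could be sketched in Python (assuming the standard phone keypad layout; the sample name is hypothetical):

```python
# Map each letter to its digit on a standard phone keypad.
KEYPAD = {
    **dict.fromkeys("abc", "2"), **dict.fromkeys("def", "3"),
    **dict.fromkeys("ghi", "4"), **dict.fromkeys("jkl", "5"),
    **dict.fromkeys("mno", "6"), **dict.fromkeys("pqrs", "7"),
    **dict.fromkeys("tuv", "8"), **dict.fromkeys("wxyz", "9"),
}

def replace_chars(s: str) -> str:
    # Lowercase, then replace letters with digits; anything else passes through.
    return "".join(KEYPAD.get(c, c) for c in s.lower())

print(replace_chars("Tannenbaum"))  # 8266362286
```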

Apply this function:

```
noahs_customers$LastName_Number <- sapply(noahs_customers$LastName, replace_chars)
```

Find the customer whose phone number matches their last name:

```
noahs_customers |> filter(LastName_Number == phone)
```

Here’s another solution in `Julia`:

```
using CSV, DataFrames
noahs_customer = CSV.File("../5784/noahs-customers.csv") |> DataFrame
transform!(noahs_customer, :name => ByRow(x -> join(split(x, " ")[2:end], " ")) => :last_name)
transform!(noahs_customer, :last_name => ByRow(x -> replace(x, " III" => "", " IV" => "", " Jr." => "")) => :last_name)
function replace_chars(input_string)
letters = join(Char.(97:122))
mapping = Dict(zip(letters, ["2", "2", "2", "3", "3", "3", "4", "4", "4", "5", "5", "5", "6", "6", "6", "7", "7", "7", "7", "8", "8", "8", "9", "9", "9", "9"]))
for (k, v) in mapping
input_string = replace(lowercase(input_string), k => v)
end
return input_string
end
transform!(noahs_customer, :last_name => ByRow(x -> replace_chars(x)) => :last_name_transformed)
transform!(noahs_customer, :phone => ByRow(x -> replace(x, "-" => "")) => :phone)
filter(row -> row.phone == row.last_name_transformed, noahs_customer)
```

`R` solution:

```
noahs_customers <- read_csv("../5784/noahs-customers.csv")
noahs_orders <- read_csv("../5784/noahs-orders.csv")
noahs_orders_items <- read_csv("../5784/noahs-orders_items.csv")
noahs_products <- read_csv("../5784/noahs-products.csv")
noahs_customers |>
mutate(name = str_remove_all(name, " Jr.")) |>
mutate(name = str_remove_all(name, " III")) |>
mutate(name = str_remove_all(name, " II")) |>
mutate(name = str_remove_all(name, " IV")) |>
extract(name, into = c('FirstName', 'LastName'), '(.*)\\s+([^ ]+)$') |>
mutate(initials = str_c(str_sub(FirstName, 1, 1), str_sub(LastName, 1, 1))) |>
left_join(noahs_orders, by="customerid") |>
left_join(noahs_orders_items, by="orderid") |>
left_join(noahs_products, by="sku") |>
mutate(order_year = year(shipped)) |>
filter(initials == "JP" & order_year == 2017 & str_detect(desc, "Bagel")) |>
select(phone)
```

`Julia` solution:

```
@chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
transform(:name => ByRow(x -> join(split(x, " ")[2:end], " ")) => :last_name)
transform(:name => ByRow(x -> join(split(x, " ")[1], " ")) => :first_name)
transform(:last_name => ByRow(x -> replace(x, " III" => "", " IV" => "", " Jr." => "")) => :last_name)
@rsubset occursin.("Bagel", coalesce.(:desc, ""))
transform(:ordered => ByRow(x -> DateTime(x, dateformat"yyyy-mm-dd HH:MM:SS")) => :ordered)
filter(row -> startswith(row.first_name, "J") && startswith(row.last_name, "P") && year(row.ordered) == 2017, _)
end
# Joshua Peterson 332-274-4185
```

`R` solution:

```
noahs_customers <- read_csv("../5784/noahs-customers.csv")
noahs_customers |>
# year of rabbit
filter(year(birthdate) %in% c(1927, 1939, 1951, 1963, 1975, 1987, 1999, 2011, 2023)) |>
# cancer
filter(month(birthdate) %in% c(6,7)) |>
# address of person from puzzle 2
filter(str_detect(citystatezip, "Jamaica, NY 11435"))
# select first one because male
```
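As a sanity check, the hard-coded Year-of-the-Rabbit list follows a simple 12-year cycle and could be generated instead (shown here in Python, for brevity):

```python
# Years of the Rabbit repeat every 12 years; 1927 is one such year.
rabbit_years = list(range(1927, 2024, 12))
print(rabbit_years)
```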

`Julia` solution:

```
using CSV, DataFrames, DataFramesMeta, Dates
noahs_customers = CSV.File("../5784/noahs-customers.csv") |> DataFrame
@chain noahs_customers begin
@rsubset(year(:birthdate) in [1927, 1939, 1951, 1963, 1975, 1987, 1999, 2011, 2023])
@rsubset(month(:birthdate) in [6, 7])
@rsubset occursin.("Jamaica, NY 11435", :citystatezip)
end
```

This one was a bit tricky; the result still contains a list of potential candidates that I tried out manually.

`R` solution:

```
noahs_customers <- read_csv("../5784/noahs-customers.csv")
noahs_orders <- read_csv("../5784/noahs-orders.csv")
noahs_orders_items <- read_csv("../5784/noahs-orders_items.csv")
noahs_products <- read_csv("../5784/noahs-products.csv")
noahs_customers |>
left_join(noahs_orders, by="customerid") |>
left_join(noahs_orders_items, by="orderid") |>
left_join(noahs_products, by="sku") |>
# filter bakery items by SKU number
filter(str_detect(sku, "BKY")) |>
# the person likes to shop early
filter(hour(ordered) < 5 & hour(shipped) < 5) |>
# I assumed that the person is not too old because she likes to bike a lot
filter(year(birthdate) > 1970) |>
# I assumed the person lives in a neighbouring area of Jamaica, NY, because she biked there
filter(str_detect(citystatezip, "Brooklyn") | str_detect(citystatezip, "Queens"))
```

`Julia` solution:

```
using CSV, DataFrames, DataFramesMeta, Dates
noahs_customers = CSV.File("../5784/noahs-customers.csv") |> DataFrame
noahs_orders = CSV.File("../5784/noahs-orders.csv") |> DataFrame
noahs_orders_items = CSV.File("../5784/noahs-orders_items.csv") |> DataFrame
noahs_products = CSV.File("../5784/noahs-products.csv") |> DataFrame
@chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
@rsubset occursin.("BKY", coalesce.(:sku, ""))
@rsubset(hour(DateTime(:ordered, "yyyy-mm-dd HH:MM:SS")) < 5)
@rsubset(hour(DateTime(:shipped, "yyyy-mm-dd HH:MM:SS")) < 5)
@rsubset(year(:birthdate) > 1970)
@rsubset occursin.("Brooklyn", coalesce.(:citystatezip, "")) || occursin.("Manhattan", coalesce.(:citystatezip, ""))
end
# Renee Harmon 607-231-3605
```

`Julia` solution:

```
@chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
@rsubset occursin.("Cat", coalesce.(:desc, ""))
@rsubset occursin.("Staten Island", coalesce.(:citystatezip, ""))
@rsubset :qty >= 10
end
# Nicole Wilson 631-507-6048
```

`R` solution:

```
noahs_customers |>
left_join(noahs_orders, by="customerid") |>
left_join(noahs_orders_items, by="orderid") |>
left_join(noahs_products, by="sku") |>
group_by(orderid) |>
mutate(total_cost = sum(wholesale_cost),
profit = total - total_cost,
negative_profit_count = sum(profit < 0)) |>
arrange(desc(negative_profit_count)) |>
first()
# Sherri Long 585-838-9161
```

`Julia` solution:

```
@chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
groupby(:orderid)
@transform(:total_cost = sum(:wholesale_cost); ungroup=false)
@transform(:profit = :total .- :total_cost; ungroup=false)
@transform(:negative_profit_count = sum(:profit .< 0))
@orderby -:negative_profit_count
end
# Sherri Long 585-838-9161
```

`Julia` solution:

```
df = @chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
@rsubset occursin.("COL", coalesce.(:sku, ""))
transform(:ordered => ByRow(x -> DateTime(x, dateformat"yyyy-mm-dd HH:MM:SS")) => :ordered)
# remove colors in parenthesis
transform(:desc => ByRow(x -> replace(x, r"\(\w+\)" => "")) => :desc_clean)
end
sherri_items = @chain df begin
subset(:name => ByRow(x -> x == "Sherri Long"))
_.desc_clean
end
sherri_dates = @chain df begin
subset(:name => ByRow(x -> x == "Sherri Long"))
_.ordered
end
@chain df begin
@rsubset(Date(:ordered) in Date.(sherri_dates))
@rsubset(:desc_clean in sherri_items)
end
# Carlos Myers 838-335-7157
```

`Julia` solution:

```
@chain noahs_customers begin
leftjoin(noahs_orders, on=:customerid, matchmissing=:equal)
leftjoin(noahs_orders_items, on=:orderid, matchmissing=:equal)
leftjoin(noahs_products, on=:sku, matchmissing=:equal)
@rsubset occursin.("COL", coalesce.(:sku, ""))
groupby(:name)
transform(:name => length)
@orderby -:name_length
first()
end
# James Smith 212-547-3518
```

**How is that possible?**

It’s because compression and language models are both grounded in information theory. Both try to encode text efficiently, but with different goals: compression wants to make the file as small as possible without losing any information, while a language model wants to capture meaningful linguistic regularities without having to reconstruct the original text exactly.

So gzip is a decent language model, even though it doesn’t care about meaning. It can pass many language benchmarks because it encodes text using Huffman coding, a common compression technique.

Well, the point is not to replace neural language models with gzip. Gzip is not trainable or adaptable, so it can’t handle different domains or tasks, whereas neural networks can learn to encode text in different ways depending on the data. But it’s still fascinating to see how well a simple compression algorithm can do on complex language tasks.

Gzip is a software application that can compress and decompress files using the Deflate algorithm. Deflate is a combination of two techniques: LZ77 and Huffman coding. LZ77 reduces redundancy by finding repeated sequences of bytes and replacing them with references to their previous occurrences. Huffman coding assigns shorter codes to more frequent symbols and longer codes to less frequent ones, thus minimizing the number of bits needed to represent the data.

Huffman coding is based on information theory, which studies how to measure and communicate information. One of the key concepts in information theory is entropy, which measures the uncertainty or randomness of a source of data. The more predictable the data is, the lower its entropy is, and the more it can be compressed.
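A quick way to see this in Python: compress a highly redundant string and a block of random bytes and compare the sizes (a rough illustration, not a formal entropy measurement):

```python
import gzip
import os

repetitive = b"the quick brown fox " * 500  # 10,000 bytes, highly redundant
random_ish = os.urandom(10_000)             # 10,000 bytes, near-maximal entropy

compressed_rep = len(gzip.compress(repetitive))
compressed_rand = len(gzip.compress(random_ish))

# The predictable text shrinks dramatically; the random bytes barely budge.
print(compressed_rep, compressed_rand)
```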

A language model is a probabilistic model that assigns probabilities to sequences of words or symbols. A good language model should capture the regularities and patterns of natural language, such as syntax, semantics and pragmatics. A good language model should also have low entropy, meaning that it can predict the next word or symbol with high accuracy.

**So how does gzip use Huffman coding as a language model?**

The answer is that Huffman coding implicitly encodes some linguistic information in the compressed data. For example, if a word or symbol is very frequent in the data, it will have a short code in the Huffman tree. This means that it will also have a high probability in the language model. Conversely, if a word or symbol is very rare in the data, it will have a long code in the Huffman tree, and a low probability in the language model.
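To make this concrete, here is a minimal Huffman code-length sketch in Python (illustrative only; this is not how gzip is implemented internally): frequent symbols end up with short codes, rare ones with long codes.

```python
import heapq
from collections import Counter

def huffman_lengths(text: str) -> dict:
    """Return the Huffman code length for each symbol in `text`."""
    freq = Counter(text)
    # Heap entries: (weight, tie-breaker, {symbol: code_length_so_far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper, so its code grows by one bit.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_lengths("aaaaaaaabbbcdd")
# 'a' (8 occurrences) gets a shorter code than the rare 'c' (1 occurrence).
print(lengths)
```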

This way, gzip can capture some basic features of natural language, such as word frequency, Zipf’s law and n-gram statistics. However, gzip does not care about meaning or context, so it cannot handle more complex linguistic phenomena, such as synonyms, antonyms, idioms or metaphors.

Therefore, gzip is not a very sophisticated language model, but it is still surprisingly effective for some tasks.
