Using Clustering to Learn More About Penguins

Courtesy: S Richter et al/Journal of Physics D: Applied Physics

Clustering is a machine learning technique used to group similar data togehter into subsets based on inherent similarities and patterns. It is a techineque that is wildly used in explaratory data analysis in order to reveal insights about data. It has a significant overlap with classification as many algorithms that are used in clustering are also used for classification. Such algorithms include K-Means-Clustering, K-Nearest-Neighbor, and many others.

In this blog, we will be exploring clustering as applied to the penguins dataset.

About the Data

The penguins dataset is a very popular dataset used for both clustering and classification algorithm testing and learning. It was collected as part of a study on penguins’ sexual dimorphism. The data contains the following columns:

species: (Chinstrap, Adélie, or Gentoo).
culmen_length_mm: the length of the dorsal ridge of the penguin’s bill in millimeters.
culmen_depth_mm: the depth of the dorsal ridge of the penguin’s bill in millimeters.
flipper_length_mm: the length of the penguin’s flippers in millimeters.
body_mass_g: the mass of the penguin in grams.
island: the island on which the penguin was located (Dream, Torgersen, or Biscoe).
sex: the sex of the penguin.

Let us load the data:

Loading and Preprocessing

import matplotlib.pyplot as plt
import pandas as pd

penguins_data = pd.read_csv("penguins_size.csv")
penguins_data.head()

	species	island	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE

Right of the bat, we can see that that there are rows in the dataset that contain NaNs. One way we could deal with it is by imputing the values. However, it would be much simpler to just drop them as there are only a few such rows.

penguins_data = penguins_data.dropna()

Next, we want to encode the said labels for simplicity.

penguins_data["species"], species_scheme = pd.factorize(penguins_data["species"])

In this case species_scheme is an array that contains the previous labels with index that corresponds to their new label. If it sounded like a bunch of jibberish, here is contents:

species_scheme

Index(['Adelie', 'Chinstrap', 'Gentoo'], dtype='object')

Learning more about the data

Let’s take a look at the distribution of all the numeric values in the dataset:

numeric_cols = [
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]
graph = pd.plotting.scatter_matrix(
    penguins_data[numeric_cols],
    figsize=(10, 10),
    c=penguins_data["species"],
    label=species_scheme,
    diagonal="kde",
)

We can see that there exists a pretty clear separation between the distributions of all the subsets. This matrix is a lot of data to look at - at once. Instead we can condense most of the data into two dimentions using principal component analysis (PCA). This will reduce the dimensionality of the data into two, dummy, dimentions. In essence, it will be a projection of the data from the four dimentional space into two dimentions.

First, thing we want to do before that is to standardize the data. This is done fairly easily by using StandardScaler from Scikit-Learn.

from sklearn.preprocessing import StandardScaler
scaled_numeric_penguins_data = StandardScaler().fit_transform(penguins_data[numeric_cols])

Performing PCA is pretty straight forward as well.

from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
two_component_penguins_data = pca.fit_transform(scaled_numeric_penguins_data)

Now that we have the projection of the data, let’s graph it.

for i, target_class in enumerate(species_scheme):
    indices = penguins_data["species"] == i
    plt.scatter(two_component_penguins_data[indices, 0], two_component_penguins_data[indices, 1], label=target_class)

plt.legend()

<matplotlib.legend.Legend at 0x2925394d0>

We can see that Gentoo are pretty distinct from the other two types of penguins. There is also some seperation between Adelies and Chinstraps’s, but it is smaller in comparison. This will make it more difficult for our clustering algorithm to distinguish between them.

Clustering

We will try to distinguish between the penguins using K-Means-Clustering (K-Means). K-Means works by randomly initiating centers of each cluster (in our case it will initiate 3). It will then assign each point to a cluster based on which center the point is closest to. The centers of each cluster are then recalculated, and the steps are repeated until convergence or a step limit is reached. This is a good clustering algorithm for this dataset as the datapoints resemble unordered blobs with no specific shape that are somewhat distant from one another.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
predictions = kmeans.fit_predict(scaled_numeric_penguins_data)

/Users/danielsabanov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)

Let’s just verify that the labels that we got were in the range from 0 to 2.

set(predictions)

{0, 1, 2}

Now, we can plot the projection of the data again, this time using the predicted labels.

for i, target_class in enumerate(set(predictions)):
    indices = predictions == i
    plt.scatter(two_component_penguins_data[indices, 0], two_component_penguins_data[indices, 1], label=target_class)

plt.legend()

<matplotlib.legend.Legend at 0x296cd94d0>

It looks fairly similar. It also gives us an understanding which label corresponds to which species based on their distribution in the graph. We can see that 0 corresponds to chinstrap, 1 corresponds to gentoo, and 2 corresponds to adelie.

class_to_prediction = {"adelie": 2, "chinstrap": 0, "gentoo": 1}

We can store the data back into the dataframe for easier access.

penguins_data["predictions"] = predictions

Let’s take a closer look at the distribtions in the data.

fig1, ax1 = plt.subplots(ncols=2, figsize=(15, 7))
fig2, ax2 = plt.subplots(ncols=2, figsize=(15, 7))
fig3, ax3 = plt.subplots(ncols=2, figsize=(15, 7))
fig4, ax4 = plt.subplots(ncols=2, figsize=(15, 7))

penguins_data.groupby("species")["culmen_length_mm"].plot.kde(ax=ax1[0], title="Culmen Length in mm - True")
penguins_data.groupby("species")["culmen_depth_mm"].plot.kde(ax=ax2[0], title="Culmen Depth in mm - True")
penguins_data.groupby("species")["flipper_length_mm"].plot.kde(ax=ax3[0], title="Flipper Length in mm - True")
penguins_data.groupby("species")["body_mass_g"].plot.kde(ax=ax4[0], title="Body Mass in g - True")

penguins_data.groupby("predictions")["culmen_length_mm"].plot.kde(ax=ax1[1], title="Culmen Length in mm - Predictions")
penguins_data.groupby("predictions")["culmen_depth_mm"].plot.kde(ax=ax2[1], title="Culmen Depth in mm - Predictions")
penguins_data.groupby("predictions")["flipper_length_mm"].plot.kde(ax=ax3[1], title="Flipper Length in mm - Predictions")
penguins_data.groupby("predictions")["body_mass_g"].plot.kde(ax=ax4[1], title="Body Mass in g - Predictions")


plt.plot()

[]

Alternatively, we can also overlay the distributions for easier comparison.

for col in numeric_cols:
    fig1, axes = plt.subplots(ncols=3, figsize=(15,7))
    ax1, ax2, ax3 = axes
    
    penguins_data.loc[penguins_data["species"] == 0][col].plot.kde(ax=ax1, linestyle="-.", title=f"{col}- Adelie vs. Prediction")
    penguins_data.loc[penguins_data["predictions"] == class_to_prediction["adelie"]][col].plot.kde(ax=ax1, linestyle="--")
    
    penguins_data.loc[penguins_data["species"] == 1][col].plot.kde(ax=ax2, linestyle="-.", title=f"{col} - Chinstrap vs. Prediction")
    penguins_data.loc[penguins_data["predictions"] == class_to_prediction["chinstrap"]][col].plot.kde(ax=ax2, linestyle="--")
    
    penguins_data.loc[penguins_data["species"] == 2][col].plot.kde(ax=ax3, linestyle="-.", title=f"{col} - Gentoo vs. Prediction")
    penguins_data.loc[penguins_data["predictions"] == class_to_prediction["gentoo"]][col].plot.kde(ax=ax3, linestyle="--")

    plt.plot()

We can see pretty clearly in the data that one group of penguins is predicted perfectly, the genttos, as the distributions are identical, while the two others were not predicted as perfectly as the distributions do not match as well.

Trying to improve the resutls

We could try to improve the previous results by also introducing the categorical data into the dataset, as previously we were primarily working with only the continous data. This will hopefully give the algorithm more data to distinguish between the groups.

Once again, we will encode the data.

encoded = pd.get_dummies(penguins_data, columns=["island", "sex"])
encoded.head()

	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	predictions	island_Biscoe	island_Dream	island_Torgersen	sex_.	sex_FEMALE	sex_MALE
0	39.1	18.7	181.0	3750.0	2	False	False	True	False	False	True
1	39.5	17.4	186.0	3800.0	2	False	False	True	False	True	False
2	40.3	18.0	195.0	3250.0	2	False	False	True	False	True	False
4	36.7	19.3	193.0	3450.0	2	False	False	True	False	True	False
5	39.3	20.6	190.0	3650.0	2	False	False	True	False	False	True

We actually do not need as many columns for the categorical data. We can remove a few columns since the data for these columns can be extrapolated from the other columns.

encoded = encoded.drop(columns=["sex_.", "sex_MALE", "island_Torgersen", "predictions"])
encoded.head()

	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	island_Biscoe	island_Dream	sex_FEMALE
0	39.1	18.7	181.0	3750.0	False	False	False
1	39.5	17.4	186.0	3800.0	False	False	True
2	40.3	18.0	195.0	3250.0	False	False	True
4	36.7	19.3	193.0	3450.0	False	False	True
5	39.3	20.6	190.0	3650.0	False	False	False

Once again, we want to scale the data.

scaled_penguins_data = encoded
scaled_penguins_data[numeric_cols] = StandardScaler().fit_transform(scaled_penguins_data[numeric_cols])
scaled_penguins_data["island_Biscoe"] = scaled_penguins_data["island_Biscoe"].astype(int)
scaled_penguins_data["island_Dream"] = scaled_penguins_data["island_Dream"].astype(int)
scaled_penguins_data.head()

	culmen_length_mm	culmen_depth_mm	flipper_length_mm	body_mass_g	sex_FEMALE
0	-0.897653	0.783487	-1.429521	-0.571229	False
1	-0.824290	0.121896	-1.072408	-0.509011	True
2	-0.677564	0.427246	-0.429605	-1.193405	True
4	-1.337831	1.088836	-0.572450	-0.944535	True
5	-0.860972	1.750427	-0.786718	-0.695664	False

Once again, let’s perform PCA to understand how the overall data was affected by the introduction of the new variables.

pca = PCA(n_components=2, random_state=42)
two_component_penguins_data = pca.fit_transform(scaled_penguins_data)

for i, target_class in enumerate(species_scheme):
    indices = penguins_data["species"] == i
    plt.scatter(two_component_penguins_data[indices, 0], two_component_penguins_data[indices, 1], label=target_class)

plt.legend()

<matplotlib.legend.Legend at 0x29fc214d0>

This time, the difference between Adelies and Chinstraps is slightly more distinct as we can see that the Chistrap distribution is located slightly to the left of the Adelie distribution. This was not as apparent in the previous PCA run. Hopefully, we can achieve better clustering results due to this.

kmeans = KMeans(n_clusters=3, random_state=42)
predictions = kmeans.fit_predict(scaled_penguins_data.drop(columns=["species"]))

/Users/danielsabanov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)

Let us graph the resutls.

for i, target_class in enumerate(set(predictions)):
    indices = predictions == i
    plt.scatter(two_component_penguins_data[indices, 0], two_component_penguins_data[indices, 1], label=target_class)

plt.legend()

class_to_prediction = {"adelie": 0, "chinstrap": 2, "gentoo": 1}
encoded["predictions"] = predictions

This time the label 0 correspnds to adelie, 1 corresponds to gentoo, and 2 corresponds to chinstrap. It appears that the classification this time went sligtly better as there is a more significant overlap between the adelie and chinstrap distributions. Be sure let’s examine the distributions in more details.

fig1, ax1 = plt.subplots(ncols=2, figsize=(20, 10))
fig2, ax2 = plt.subplots(ncols=2, figsize=(20, 10))
fig3, ax3 = plt.subplots(ncols=2, figsize=(20, 10))
fig4, ax4 = plt.subplots(ncols=2, figsize=(20, 10))

encoded.groupby("species")["culmen_length_mm"].plot.kde(ax=ax1[0], title="Culmen Length in mm - True")
encoded.groupby("species")["culmen_depth_mm"].plot.kde(ax=ax2[0], title="Culmen Depth in mm - True")
encoded.groupby("species")["flipper_length_mm"].plot.kde(ax=ax3[0], title="Flipper Length in mm - True")
encoded.groupby("species")["body_mass_g"].plot.kde(ax=ax4[0], title="Body Mass in g - True")

encoded.groupby("predictions")["culmen_length_mm"].plot.kde(ax=ax1[1], title="Culmen Length in mm - Predictions")
encoded.groupby("predictions")["culmen_depth_mm"].plot.kde(ax=ax2[1], title="Culmen Depth in mm - Predictions")
encoded.groupby("predictions")["flipper_length_mm"].plot.kde(ax=ax3[1], title="Flipper Length in mm - Predictions")
encoded.groupby("predictions")["body_mass_g"].plot.kde(ax=ax4[1], title="Body Mass in g - Predictions")


plt.plot()

[]

for col in numeric_cols:
    fig1, axes = plt.subplots(ncols=3, figsize=(15,7))
    ax1, ax2, ax3 = axes
    
    encoded.loc[encoded["species"] == 0][col].plot.kde(ax=ax1, linestyle="-.", title=f"{col}- Adelie vs. Prediction")
    encoded.loc[encoded["predictions"] == class_to_prediction["adelie"]][col].plot.kde(ax=ax1, linestyle="--")
    
    encoded.loc[encoded["species"] == 1][col].plot.kde(ax=ax2, linestyle="-.", title=f"{col} - Chinstrap vs. Prediction")
    encoded.loc[encoded["predictions"] == class_to_prediction["chinstrap"]][col].plot.kde(ax=ax2, linestyle="--")
    
    encoded.loc[encoded["species"] == 2][col].plot.kde(ax=ax3, linestyle="-.", title=f"{col} - Gentoo vs. Prediction")
    encoded.loc[encoded["predictions"] == class_to_prediction["gentoo"]][col].plot.kde(ax=ax3, linestyle="--")

    plt.plot()

This time it appears that there is a much better overlap between the real and predicted distributions of the chinstrap and adelie respenctive populations. This is actually a sensible results since sex makes a difference in the sizes of the penguins, so given that data, we are better able to tell the difference if a penguinqu is of a certain species.

Refernces

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081