# Comparison of ‘Partition Around Medoids’ R Programming Implementations

Back in September 2016 I implemented the ClusterR package. One of the
algorithms included in *ClusterR* was the ‘Partition Around Medoids’
(Cluster_Medoids) algorithm, which was based on the paper “Anja Struyf,
Mia Hubert, Peter J. Rousseeuw, (Feb. 1997), Clustering in an
Object-Oriented Environment, Journal of Statistical Software, Vol 1,
Issue 4” (at that time I didn’t have access to the book of *Kaufman and
Rousseeuw, Finding Groups in Data (1990)*, where the exact algorithm is
described). I therefore implemented the code and compared my results
with the output of the cluster::pam() function, which was available at
that time; my method was thus not an *exact* but an *approximate* one.
Recently, a user of the ClusterR package opened an issue mentioning
that the results were not optimal compared to the *cluster::pam()*
function, which prompted me to go through my code once again and to
compare my results with new R packages that did not exist back then.
Most of these R packages include a new version of the ‘Partition Around
Medoids’ algorithm, based on “Erich Schubert, Peter J. Rousseeuw,
‘Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS
Algorithms’, 2019, <doi:10.1007/978-3-030-32047-8_16>”.

In this blog-post, I’ll use the R packages loaded below to compare the
(as of December 2022) existing ‘Partition Around Medoids’
implementations in terms of output **dissimilarity cost** and
**elapsed time**.

```
# required R packages
require(ClusterR)
```

```
## Loading required package: ClusterR
## Loading required package: gtools
```

```
require(cluster)
```

```
## Loading required package: cluster
```

```
require(kmed)
```

```
## Loading required package: kmed
```

```
require(fastkmedoids)
```

```
## Loading required package: fastkmedoids
##
## Attaching package: 'fastkmedoids'
## The following object is masked from 'package:cluster':
##
## pam
```

```
require(sf)
```

```
## Loading required package: sf
## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2; sf_use_s2() is TRUE
```

```
require(data.table)
```

```
## Loading required package: data.table
```

```
require(geodist)
```

```
## Loading required package: geodist
```

```
require(glue)
```

```
## Loading required package: glue
```

```
require(magrittr)
```

```
## Loading required package: magrittr
```

```
require(ggplot2)
```

```
## Loading required package: ggplot2
```

```
require(mapview)
```

```
## Loading required package: mapview
```

```
require(knitr)
```

```
## Loading required package: knitr
```

### Datasets

For comparison purposes, I’ll use the following datasets,

- a 2-column dataset using values of the *Normal Distribution* (the *standard deviation* is equal to 0.25 and the *mean* parameter takes the value of 0 or 1)
- the *‘dietary_survey_IBS’* dataset which exists in the *ClusterR* package
- the *‘soybean’* dataset which exists in the *ClusterR* package
- the *‘agriculture’* dataset which exists in the *cluster* package
- a *‘geospatial’* dataset that shows how nicely the ‘Partition Around Medoids’ algorithm can cluster coordinate points based on a *‘geodesic’* distance matrix (assuming we can visually decide on the number of clusters)

The next function contains the previously mentioned datasets,

```
datasets_clust = function(data_name) {
  if (data_name == 'rnorm_data') {
    n = 100
    set.seed(1)
    x = rbind(matrix(rnorm(n, mean = 0, sd = 0.25), ncol = 2),
              matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2))
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'dietary_survey_IBS') {
    data(dietary_survey_IBS, package = 'ClusterR')
    x = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'soybean') {
    data(soybean, package = 'ClusterR')
    x = soybean[, -ncol(soybean)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'agriculture') {
    data(agriculture, package = 'cluster')
    x = agriculture
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'geospatial') {
    wkt_lst = list(black_sea = "POLYGON ((31.47957 43.64944, 31.47957 42.82356, 36.17885 42.82356, 36.17885 43.64944, 31.47957 43.64944))",
                   caspian_sea = "POLYGON ((49.23243 42.75324, 49.23243 41.99335, 51.54848 41.99335, 51.54848 42.75324, 49.23243 42.75324))",
                   mediterranean_sea = "POLYGON ((16.55062 35.08059, 16.55062 34.13804, 20.20771 34.13804, 20.20771 35.08059, 16.55062 35.08059))",
                   red_sea = "POLYGON ((36.61694 23.61829, 36.61694 22.60845, 38.18655 22.60845, 38.18655 23.61829, 36.61694 23.61829))")
    nam_wkts = names(wkt_lst)
    CRS_wkt = 4326
    sample_point_size = 250       # sample that many random points from each sea WKT polygon
    count = 1
    all_dfs = list()
    for (nam in nam_wkts) {
      WKT = wkt_lst[[nam]]
      read_wkt_inp = sf::st_as_sfc(WKT, crs = sf::st_crs(CRS_wkt))
      # to sample random points see: https://stackoverflow.com/a/70073632/8302386
      random_lat_lons = sf::st_sample(x = read_wkt_inp, size = sample_point_size, type = "random", crs = sf::st_crs(CRS_wkt))
      random_df = sf::st_coordinates(random_lat_lons)
      random_df = data.table::as.data.table(random_df)
      colnames(random_df) = c('lon', 'lat')
      random_df$sea = rep(x = nam, times = nrow(random_df))
      random_df$code = rep(x = count, times = nrow(random_df))
      all_dfs[[count]] = random_df
      count = count + 1
    }
    dat = data.table::rbindlist(all_dfs)
    dat_sf = sf::st_as_sf(dat, coords = c('lon', 'lat'), crs = CRS_wkt)
    # add also an outlier which is between the 'mediterranean', 'black' and 'red' sea
    # and will be assigned to the one that is closest in terms of distance
    outlier_lon = 34.8988917972
    outlier_lat = 35.0385655983
    dat_sf_update = data.frame(lon = outlier_lon, lat = outlier_lat)
    x_mt_update = geodist::geodist(x = dat[, 1:2], y = dat_sf_update, measure = 'geodesic')
    x_mt_update = data.table::data.table(x_mt_update)
    x_mt_update$sea = dat$sea
    x_mt_update = x_mt_update[order(x_mt_update$V1, decreasing = F), ]
    dat_sf_update_obs = data.frame(lon = outlier_lon,
                                   lat = outlier_lat,
                                   sea = x_mt_update$sea[1],
                                   code = unique(subset(dat, sea == x_mt_update$sea[1])$code)) # it is assigned (based on distance) to the 'black' sea although it lies in the mediterranean
    dat_use = rbind(dat_sf_update_obs, dat)
    # leaflet map
    dat_sf_upd = sf::st_as_sf(dat_use, coords = c('lon', 'lat'), crs = CRS_wkt)
    mp = mapview::mapview(dat_sf_upd, zcol = 'sea', legend = TRUE)
    x = dat_use[, 1:2]
    lst_out = list(x = x,
                   dataset = data_name,
                   sea_names = dat_use$sea,
                   code = dat_use$code,
                   leaflet_map = mp)
  }
  else {
    stop(glue::glue("The dataset '{data_name}' does not exist!"), call. = FALSE)
  }
  return(lst_out)
}
```
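
As a quick sanity check, here is a minimal usage sketch of this function for one of the datasets,

```
# example call: the 'rnorm_data' option returns a 100 x 2 matrix
# (two stacked 50 x 2 blocks drawn around means 0 and 1)
rnorm_lst = datasets_clust(data_name = 'rnorm_data')
dim(rnorm_lst$x)     # 100 2
rnorm_lst$dataset    # "rnorm_data"
```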

### ‘Partition Around Medoids’ R package implementations

The next function includes the following implementations,

- **cluster::pam(do.swap = TRUE, variant = ‘original’)**
- **cluster::pam(do.swap = TRUE, variant = ‘faster’)**
- **kmed::skm()**
- **fastkmedoids::pam()**
- **fastkmedoids::fastpam(initializer = “LAB”)**
- **ClusterR::Cluster_Medoids(swap_phase = TRUE)**

and allows comparing the *dissimilarity cost* and *elapsed time* for
the mentioned datasets. In the **ClusterR** package, I included the
ClusterR::cost_clusters_from_dissim_medoids()
function that takes

- a *dissimilarity matrix*
- the *output medoids* of each implementation

and returns the *total dissimilarity cost*.
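
As a minimal sketch of how this cost function is used (the toy data and the medoid indices below are arbitrary, picked only for illustration),

```
# toy example: dissimilarity cost of two arbitrarily chosen medoids (observations 1 and 5)
set.seed(1)
toy = matrix(rnorm(20), nrow = 10, ncol = 2)
toy_dissim = ClusterR::distance_matrix(toy, method = "euclidean", upper = TRUE, diagonal = TRUE)
res = ClusterR::cost_clusters_from_dissim_medoids(data = toy_dissim, medoids = c(1, 5))
res$cost    # sum of dissimilarities between each observation and its closest medoid
```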

```
compare_medoid_implementations = function(x,
                                          y = NULL,
                                          max_k = 15,
                                          geodesic_dist = FALSE,
                                          round_digits = 5,
                                          compute_rand_index = FALSE) {
  if (!geodesic_dist) {
    x = ClusterR::center_scale(data = x, mean_center = TRUE, sd_scale = TRUE)
    x_mt = ClusterR::distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
  }
  else {
    x_mt = geodist::geodist(x = x, measure = 'geodesic')
  }
  dv = as.vector(stats::dist(x))    # compute distance matrix for 'fastkmedoids' (lower triangular part)
  results = cluster_out = list()
  for (k in 2:max_k) {
    # cluster [ 'original' ]
    t_start = proc.time()
    set.seed(k)
    pm = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'original')
    pm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm$id.med)$cost
    # cluster [ 'faster' ]
    t_start = proc.time()
    set.seed(k)
    pm_fast = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'faster')
    pm_fast_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_fast_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm_fast$id.med)$cost
    # kmed
    t_start = proc.time()
    km = kmed::skm(distdata = x_mt, ncluster = k, seeding = k, iterate = 10)
    km_secs = as.numeric((proc.time() - t_start)["elapsed"])
    km_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = km$medoid)$cost
    # fastkmedoids ('pam' function)
    t_start = proc.time()
    set.seed(k)
    fkm = fastkmedoids::pam(rdist = dv, n = nrow(x), k = k)
    fkm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm@medoids + 1)$cost         # output indices correspond to Cpp indexing, thus add 1
    # fastkmedoids ('fastpam' function with 'initializer' set to 'LAB')
    t_start = proc.time()
    fkm_lab = fastkmedoids::fastpam(rdist = dv, n = nrow(x), k = k, initializer = "LAB", seed = k)
    fkm_lab_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_lab_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm_lab@medoids + 1)$cost # output indices correspond to Cpp indexing, thus add 1
    # ClusterR
    t_start = proc.time()
    clst = ClusterR::Cluster_Medoids(data = x_mt, clusters = k, verbose = FALSE, swap_phase = TRUE, seed = k)
    clst_secs = as.numeric((proc.time() - t_start)["elapsed"])
    clst_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = clst$medoids)$cost
    clust_algos = c('cluster_pam',
                    'cluster_pam_fast',
                    'kmed_skm',
                    'fastkmedoids_pam',
                    'fastkmedoids_fastpam',
                    'ClusterR_Medoids')
    dissim_costs = c(round(pm_cost, digits = round_digits),
                     round(pm_fast_cost, digits = round_digits),
                     round(km_cost, digits = round_digits),
                     round(fkm_cost, digits = round_digits),
                     round(fkm_lab_cost, digits = round_digits),
                     round(clst_cost, digits = round_digits))
    time_bench = c(round(pm_secs, digits = round_digits),
                   round(pm_fast_secs, digits = round_digits),
                   round(km_secs, digits = round_digits),
                   round(fkm_secs, digits = round_digits),
                   round(fkm_lab_secs, digits = round_digits),
                   round(clst_secs, digits = round_digits))
    dtbl = list(pam_R_function = clust_algos,
                dissim_cost = dissim_costs,
                timing = time_bench,
                k = rep(k, length(clust_algos)))
    if (compute_rand_index) {
      # rand-index (or accuracy)
      clust_acc = list(k = k,
                       cluster_pam = ClusterR::external_validation(true_labels = y, clusters = pm$clustering, method = 'rand_index'),
                       cluster_pam_fast = ClusterR::external_validation(true_labels = y, clusters = pm_fast$clustering, method = 'rand_index'),
                       kmed_skm = ClusterR::external_validation(true_labels = y, clusters = km$cluster, method = 'rand_index'),
                       fastkmedoids_pam = ClusterR::external_validation(true_labels = y, clusters = fkm@assignment, method = 'rand_index'),
                       fastkmedoids_fastpam = ClusterR::external_validation(true_labels = y, clusters = fkm_lab@assignment, method = 'rand_index'),
                       ClusterR_Medoids = ClusterR::external_validation(true_labels = y, clusters = clst$clusters, method = 'rand_index'))
      data.table::setDT(clust_acc)
      cluster_out[[k]] = clust_acc
    }
    data.table::setDT(dtbl)
    results[[k]] = dtbl
  }
  results = data.table::rbindlist(results)
  if (compute_rand_index) {
    return(list(results = results, cluster_out = cluster_out))
  }
  else {
    return(results)
  }
}
```

In a for-loop, we’ll iterate over the datasets, run the ‘Partition Around Medoids’ implementations, and save the output results to a list object,

```
dataset_names = c('rnorm_data', 'dietary_survey_IBS', 'soybean', 'agriculture', 'geospatial')
lst_all = list()
geo_flag = FALSE
y = NULL

for (dat_name in dataset_names) {
  iter_dat = datasets_clust(data_name = dat_name)
  cat(glue::glue("Dataset: '{dat_name}' Number of rows: {nrow(iter_dat$x)}"), '\n')
  if (dat_name == 'geospatial') {
    mp_view = iter_dat$leaflet_map
    y = iter_dat$code
    geo_flag = TRUE
  }
  else {
    geo_flag = FALSE
  }
  iter_out = suppressWarnings(compare_medoid_implementations(x = iter_dat$x,
                                                             y = y,
                                                             max_k = 10,
                                                             geodesic_dist = geo_flag,
                                                             round_digits = 5,
                                                             compute_rand_index = geo_flag))
  lst_all[[dat_name]] = iter_out
}
```

```
## Dataset: 'rnorm_data' Number of rows: 100
## Dataset: 'dietary_survey_IBS' Number of rows: 400
## Dataset: 'soybean' Number of rows: 307
## Dataset: 'agriculture' Number of rows: 12
## Dataset: 'geospatial' Number of rows: 1001
```
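
A quick (illustrative) peek at the structure of the saved results,

```
# each non-geospatial entry of 'lst_all' is a data.table with one row
# per implementation and per 'k'
head(lst_all[['rnorm_data']])
```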

then we’ll use a *ggplot2* visualization function to visualize the
*dissimilarity cost* and *elapsed time* for each *k* from *2* to *10*,

```
multi_plot_data = function(data_index, y_column, digits = 2, size_geom_bar_text = 7) {
  data_index$k = factor(data_index$k)                                # convert the 'k' column to a factor
  levels(data_index$k) = glue::glue("k = {levels(data_index$k)}")    # prepend 'k = ' to the facet labels
  data_index[[y_column]] = round(data_index[[y_column]], digits = digits)
  plt = ggplot2::ggplot(data_index, ggplot2::aes(x = factor(pam_R_function), y = .data[[y_column]], fill = factor(pam_R_function))) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::facet_wrap(~k, scales = "free_y") +
    ggplot2::geom_text(ggplot2::aes(label = .data[[y_column]]), position = ggplot2::position_dodge(width = 0.9), col = 'black', fontface = 'bold', size = size_geom_bar_text) +
    ggplot2::theme(panel.background = ggplot2::element_rect(fill = "white", colour = "#6D9EC1", size = 3, linetype = "solid"),
                   strip.text.x = ggplot2::element_text(size = 20, colour = 'black', face = 'bold'),
                   strip.background = ggplot2::element_rect(fill = 'orange', colour = 'black'),
                   plot.title = ggplot2::element_text(size = 19, hjust = 0.5),
                   axis.text = ggplot2::element_text(size = 19, face = 'bold'),
                   axis.text.x = ggplot2::element_text(size = 25, face = 'bold', angle = 35, hjust = 1),
                   axis.title = ggplot2::element_text(size = 19, face = 'bold'),
                   axis.title.x = ggplot2::element_blank(),
                   panel.grid = ggplot2::element_blank(),
                   legend.position = "none")
  return(plt)
}
```

### Visualization of the *dissimilarity cost* for each one of the datasets

**rnorm_data** dataset

```
y_axis = 'dissim_cost'
data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**dietary_survey_IBS** dataset

```
data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**soybean** dataset

```
data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**agriculture** dataset

```
data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**geospatial** dataset

```
data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis, size_geom_bar_text = 5)
```

For the *‘geospatial’* data I also computed the *Rand Index* (or
*accuracy*), and for **k = 4** (which is the actual number of clusters,
as the following leaflet map shows) we have a perfect clustering (a
rand-index of 1.0). I also included a single outlier (a pair of
coordinates) between the ‘mediterranean’, ‘black’ and ‘red’ seas, which
is assigned to the ‘black’ sea based on distance (this happens in all
implementations).
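
As a minimal sketch of this geodesic-distance check (the two reference coordinates below are illustrative, not taken from the sampled data; only the outlier coordinates come from the dataset function above),

```
# geodesic distances (in metres) from the outlier to one illustrative point
# inside the 'black' sea polygon and one inside the 'red' sea polygon;
# the 'black' sea point is the closer of the two
outlier = data.frame(lon = 34.8988917972, lat = 35.0385655983)
refs = data.frame(lon = c(33.8, 37.4), lat = c(43.2, 23.1))
geodist::geodist(x = refs, y = outlier, measure = 'geodesic')
```

The Rand index for each *k* is computed and displayed with,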

```
dtbl_rand_index = data.table::rbindlist(lst_all[[data_name]]$cluster_out)
knitr::kable(round(dtbl_rand_index, digits = 3))
```

| k | cluster_pam | cluster_pam_fast | kmed_skm | fastkmedoids_pam | fastkmedoids_fastpam | ClusterR_Medoids |
|---|---|---|---|---|---|---|
| 2 | 0.624 | 0.624 | 0.624 | 0.624 | 0.624 | 0.624 |
| 3 | 0.874 | 0.874 | 0.875 | 0.846 | 0.846 | 0.875 |
| 4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 5 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 |
| 6 | 0.938 | 0.938 | 0.937 | 0.938 | 0.938 | 0.938 |
| 7 | 0.927 | 0.906 | 0.907 | 0.927 | 0.927 | 0.927 |
| 8 | 0.917 | 0.896 | 0.896 | 0.917 | 0.917 | 0.917 |
| 9 | 0.886 | 0.886 | 0.865 | 0.885 | 0.885 | 0.885 |
| 10 | 0.875 | 0.854 | 0.855 | 0.875 | 0.875 | 0.855 |
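
As a reminder of how to read these values, here is a toy sketch of ClusterR::external_validation(), the function used to compute the table above,

```
# identical partitions give a Rand index of 1.0;
# re-assigning observations lowers it
truth = c(1, 1, 2, 2, 3, 3)
ClusterR::external_validation(true_labels = truth, clusters = truth, method = 'rand_index')               # 1.0
ClusterR::external_validation(true_labels = truth, clusters = c(1, 1, 2, 3, 3, 3), method = 'rand_index') # lower than 1.0
```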

The following Leaflet map shows the 4 clusters of the **geospatial**
dataset, for which we get a rand-index of 1.0,

```
mp_view
```

### Visualization of the *elapsed time* for each one of the datasets

**rnorm_data** dataset

```
y_axis = 'timing'
data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**dietary_survey_IBS** dataset

```
data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**soybean** dataset

```
data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**agriculture** dataset

```
data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
```

**geospatial** dataset

```
data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis)
```

### Conclusions

- Using different datasets and a range of values for *k* (from 2 to 10), we see that even the exact algorithm does not always give the lowest *dissimilarity cost*. Moreover, the *dissimilarity cost* differs between the various implementations for the different *‘k’* values and the included *‘datasets’*. This probably has to do with the fact that the majority of these implementations are *approximate* and also implemented differently.
- Regarding the *elapsed time*, the bar-plots show that the *fastkmedoids::fastpam()* function returns the results faster (compared to the other implementations) in all cases, and this is more obvious in the ‘geospatial’ dataset, where we have a *medium sized dataset* consisting of 1000 observations.

### Notes

All ‘Partition Around Medoids’ functions take a dissimilarity matrix as input rather than the initial data; therefore, the elapsed time does not include the computation of the dissimilarity (or distance) matrix.
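
For completeness, here is a minimal sketch of what the timings exclude, i.e. building the dissimilarity matrix separately (shown for the ‘dietary_survey_IBS’ data, reusing the functions defined earlier),

```
# the dissimilarity matrix is computed once, outside the timed sections
dat_ibs = datasets_clust(data_name = 'dietary_survey_IBS')$x
t_start = proc.time()
x_sc = ClusterR::center_scale(data = dat_ibs, mean_center = TRUE, sd_scale = TRUE)
x_mt = ClusterR::distance_matrix(x_sc, method = "euclidean", upper = TRUE, diagonal = TRUE)
as.numeric((proc.time() - t_start)["elapsed"])   # seconds spent on the distance matrix alone
```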
