The spread of COVID-19 has hit the whole world, including the community of mathematicians, statisticians, machine learning researchers and people in related fields.
These experts want to help understand, for example, future trends (forecasts) of the coronavirus spread.
My motivation was to create an interactive dashboard about COVID-19 that informs about various scenarios in every country and compares them through data mining methods.
I created CoronaDash, a shinydashboard application that is hosted on the petolau.shinyapps.io RStudio platform.
The dashboard provides various data mining/ visualization techniques for comparing countries’ COVID-19 data statistics, such as:
extrapolating total confirmed cases by exponential smoothing model,
trajectories of cases/ deaths spread,
multidimensional clustering of countries’ data/ statistics - with dendrogram and table of clusters averages,
aggregated views for the whole World,
hierarchical clustering of countries’ trajectories based on DTW distance and preprocessing by SMA (+ normalization), for fast comparison of a large number of countries’ COVID-19 magnitudes and trends.
This blog post is about the last item of the list above - clustering of countries’ trajectories.
This use case is challenging because the time series being clustered have different lengths.
CovidR contest
I also submitted my shiny application to an interesting initiative of the eRum 2020 organizers - the CovidR Contest.
Preprocessing COVID-19 open-data
First, load all the packages needed for the analysis.
I prepared time series data as a snapshot from 2020-05-24, with various statistics also computed per 1 million population (which makes countries much easier to compare), so let’s read them.
Since I want to analyze (cluster) trajectories of the spread of countries’ active cases, I need to set a starting point for every country’s time series - in this case (as in many other analyses out there), the 100th cumulative confirmed case is the starting point.
I will also use only the top 82 affected countries (+ Slovakia as my home country) for the whole analysis.
Let’s transform our data into ‘since the 100th case’ trajectories for each country (with the same lengths!).
You can see that we nicely got time series of the same length for every country.
Now comes the preparation of the trajectories’ data for clustering…
We have to remove missing rows/ columns, if there are any, and I will preprocess the time series with a Simple Moving Average (SMA) to smooth our trajectories a little bit (it removes noise) - the function repr_sma is implemented in my TSrepr package.
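The alignment step can be sketched in base R. This is only an illustration on made-up numbers, not the code behind the dashboard: the column names (Country, DateRep, Cases_cumsum, Days_since_100) are assumptions, and padding shorter trajectories with NA to get a table of equal-length columns is my assumption about what "the same lengths" means here.

```r
# Toy data (made up for illustration): cumulative confirmed cases per day
cases <- data.frame(
  Country = rep(c("A", "B"), each = 6),
  DateRep = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 6), times = 2),
  Cases_cumsum = c(20, 60, 110, 180, 260, 400,
                   5, 30, 80, 95, 130, 210)
)

# Keep only the days on which a country already has at least 100 cumulative cases
since_100 <- cases[cases$Cases_cumsum >= 100, ]

# Number the days since the threshold was reached, separately for each country
since_100$Days_since_100 <- ave(seq_along(since_100$Country),
                                since_100$Country, FUN = seq_along)

# Wide table: one row per day since the 100th case, one column per country;
# countries with shorter trajectories are padded with NA
wide <- tapply(since_100$Cases_cumsum,
               list(since_100$Days_since_100, since_100$Country),
               function(v) v[1])
print(wide)
```

In this toy example country A has four days of data after the threshold and country B only two, so the last two rows of column B are NA.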
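To show what the smoothing does, here is a minimal base R sketch of a simple moving average. It is not the TSrepr implementation: the helper name sma and the toy vector are made up, and it assumes the input has no missing values.

```r
# Simple moving average of a numeric vector (hypothetical helper, base R only).
# Each output value is the mean of the current and the previous (order - 1) values,
# so the result is (order - 1) values shorter than the input.
sma <- function(x, order = 3) {
  smoothed <- stats::filter(x, rep(1 / order, order), sides = 1)
  as.numeric(smoothed[!is.na(smoothed)])
}

# Made-up noisy series
x <- c(10, 14, 9, 15, 11, 16, 12)
x_smooth <- sma(x, order = 3)
print(x_smooth)
```

A larger order smooths more strongly but also shortens the series more, which matters for countries that have only a few days of data.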
Clustering trajectories with the hierarchical method with DTW distance
Since we use data with different lengths, we have to use a distance measure other than Euclidean (or Manhattan, etc.).
Here the Dynamic Time Warping (DTW) distance measure comes in very handy: it can compute distances between time series with various lags and different lengths.
Let’s define a clustering function with DTW distance, with the additional data preprocessing steps necessary for the dtwclust package. I also allow the user to vary the number of clusters and the normalization of time series before clustering.
Let’s cluster the data into 14 clusters, with normalization of countries’ trajectories, in order to extract clusters with the same trends (curves) - not magnitudes! Normalization is a very important step before every clustering/ classification task.
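To make the idea concrete, below is a minimal textbook version of DTW in base R, using the absolute difference as the local cost and equal weights for all three steps. The function name dtw_distance is made up for this sketch, and the implementations in the dtwclust package may use different step patterns and norms, so the numbers are not expected to match them exactly.

```r
# Minimal DTW distance between two numeric vectors of possibly different lengths
# (illustrative sketch, not the dtwclust implementation)
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # cost[i + 1, j + 1] holds the cheapest alignment cost of a[1..i] with b[1..j]
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d <- abs(a[i] - b[j])
      cost[i + 1, j + 1] <- d + min(cost[i, j + 1],  # a[i] also matched b[j + 1]... previous a
                                    cost[i + 1, j],  # previous b
                                    cost[i, j])      # both advance
    }
  }
  cost[n + 1, m + 1]
}

# Same shape, different lengths: the repeated value is absorbed by the warping
d_same <- dtw_distance(c(1, 2, 3), c(1, 2, 2, 3))
# Shorter series that starts later: only the first point cannot be matched for free
d_shift <- dtw_distance(c(1, 2, 3), c(2, 3))
print(c(d_same, d_shift))
```

The first distance is 0 and the second is 1, while a Euclidean distance could not even be computed for these pairs because the vectors have different lengths.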
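The effect of normalization can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the four toy trajectories, the helper dtw_distance, the average linkage and k = 2. It is not the dtwclust-based function used for the real data.

```r
# Minimal DTW distance (absolute-difference cost, equal step weights)
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost[i + 1, j + 1] <- abs(a[i] - b[j]) +
        min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, m + 1]
}

# Made-up trajectories with different lengths and very different magnitudes
series <- list(
  up_small   = c(1, 2, 3, 4, 5),
  up_large   = c(10, 20, 30, 40, 50, 60, 70),
  down_small = c(5, 4, 3, 2, 1),
  down_large = c(60, 50, 40, 30, 20, 10)
)

# z-score normalization, so that only the shape of each curve matters
series_norm <- lapply(series, function(x) (x - mean(x)) / sd(x))

# Pairwise DTW distance matrix
n <- length(series_norm)
dist_mat <- matrix(0, n, n, dimnames = list(names(series), names(series)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- dtw_distance(series_norm[[i]], series_norm[[j]])
  }
}

# Hierarchical clustering on the DTW distances, cut into 2 clusters
hc <- hclust(as.dist(dist_mat), method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)
```

After normalization the two increasing series end up in one cluster and the two decreasing series in the other, regardless of their magnitudes.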
Let’s prepare clustered data for visualization:
You can also search for your preferred country in the datatable.
Finally, here comes the plot of cluster members with the ggplot2 package (a log scale is used for better comparison of trends):
We can see nicely distinguishable clusters with various trends of active cases (settled, rapid/ steady increase/ decrease).
Here, I picked two clusters (2 and 6) with nice decreasing trends - they contain countries mostly from Central/ Western Europe.
Let’s also see clusters with increasing trends of active cases per 1 mil. population:
We can see that, on this day, the countries with an increasing trend of active cases are mostly in Western Asia, South America, and Africa.
Post-analysis visualizations with dendrograms and MDS
In order to see the whole connectivity between countries’ clusters as a tree, we can use, for example, a dendrogram.
Here, we can simply use the clustering result object to generate the tree:
In order to see, for example, connections between countries in a 2D scatter plot, we can use the dimensionality reduction method Multidimensional scaling (MDS). It uses a (stored) distance matrix between objects - and we have it in our clustering result object (Yay, clust_res@distmat!). For country labels, I use the great package ggrepel.
In both graphs (dendrogram and MDS scatter plot), we can clearly see how far (or close) countries are from each other based on DTW distance.
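As a rough base R illustration of the same idea, a dendrogram can be drawn from any hierarchical clustering result. The distance matrix below is made up and the country names A to D are placeholders; with the real data the tree would come from the clustering result object instead.

```r
# Made-up symmetric distance matrix between four placeholder countries
countries <- c("A", "B", "C", "D")
dist_mat <- matrix(c(0, 1, 4, 5,
                     1, 0, 3, 4,
                     4, 3, 0, 1,
                     5, 4, 1, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(countries, countries))

# Hierarchical clustering, then conversion to a dendrogram object
hc <- hclust(as.dist(dist_mat), method = "average")
dend <- as.dendrogram(hc)

# Horizontal layout keeps labels readable when there are many countries
plot(dend, horiz = TRUE, main = "Toy dendrogram")
```

The height at which two branches merge corresponds to the distance between the groups, so A and B (and C and D) join low in the tree and the two pairs join much higher.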
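A minimal base R sketch of classical MDS with cmdscale is shown below. The distance matrix is made up (the distances of a 4 x 1 rectangle, chosen so that they embed exactly in two dimensions), the country names are placeholders, and base graphics are used instead of ggplot2 and ggrepel to keep the example self-contained.

```r
# Made-up distance matrix between four placeholder countries
countries <- c("A", "B", "C", "D")
dist_mat <- matrix(c(0,        1,        4,        sqrt(17),
                     1,        0,        sqrt(17), 4,
                     4,        sqrt(17), 0,        1,
                     sqrt(17), 4,        1,        0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(countries, countries))

# Classical MDS: find 2D coordinates whose Euclidean distances
# approximate the given distances as closely as possible
coords <- cmdscale(as.dist(dist_mat), k = 2)

# Scatter plot with country labels
plot(coords, type = "n", xlab = "MDS dimension 1", ylab = "MDS dimension 2")
text(coords, labels = rownames(coords))
```

With real DTW distances the 2D picture is only an approximation, because such distances generally cannot be reproduced exactly in a plane.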
Summary
In this blog post, I showed you how to cluster time series of different lengths with the DTW distance and a hierarchical method, and how to visualize the results of such an analysis.
As a use case, I picked data of countries’ COVID-19 active cases trajectories computed per 1 mil. population to see trends of the disease spread.