Python Data Analysis: covid-19 data, part 1 [data wrangling]
This article is originally published at https://www.sharpsightlabs.com
If you’re like most people around the world, you’re probably concerned with the coronavirus. The virus itself is called SARS-CoV-2, and the disease it causes is COVID-19, which for the sake of simplicity, I’ll refer to as covid-19.
As you’re reading this, there’s a reasonable chance that you’re stuck working from home or otherwise quarantined in some way.
Welcome to 2020 … it’s crazy.
Assuming that you’re quarantined right now, you probably have some extra time on your hands.
Great.
This is a perfect time to learn a new skill or do something productive.
In fact, it’s a great time to master data science in Python.
With that in mind, I’m putting together a series of tutorials to show you how to analyze covid-19 data in Python.
Want to learn Pandas? Want to learn data science in Python?
If you’re stuck at home, you’ve got no more excuses.
Let’s get after it.
A quick review of the process
In this series of tutorials, we’ll analyze some covid-19 data with Python.
Specifically, we’re going to use a few toolkits.
Ultimately, over the course of this series of tutorials, we’ll use Pandas, matplotlib, seaborn, and possibly a little Numpy.
You’ll probably want to be familiar with all of them, but if not, that’s okay.
You’ll still be able to run the code, and I’ll link to some other tutorials that explain individual techniques in more depth.
That being said, you’ll be able to play with the code regardless. But if you’re really serious about this, you’ll want to eventually master data science in Python.
What we’ll cover in this tutorial
As I mentioned, we’re going to work with this data over a series of tutorials.
This tutorial is the first.
And in this tutorial, we’re just going to start with some simple data wrangling.
We’re going to get our data and “wrangle” it into shape. Here’s a quick table of contents, so you can navigate to different sections.
Table of Contents:
- Import packages
- Get data
- Rename columns
- Reshape data
- Convert dates
- Rearrange columns and sort data
- Set index
- Check data
- Next steps
Import Packages
First, we’re just going to import some packages.
For the time being, we’ll just import Pandas, although we’ll need to import other packages in future tutorials.
#================
# IMPORT PACKAGES
#================

import pandas as pd
Get raw data
Next, we’re going to get the raw csv data.
This data comes from a GitHub repository of covid-19 data, created by Johns Hopkins.
We’re going to use the “raw” data, which is easier to retrieve.
To get the raw data, we’re going to use the Pandas read_csv function. The read_csv function does exactly what it sounds like … it reads in csv data.
#=============
# GET RAW DATA
#=============

url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv'

covid_data_RAW = pd.read_csv(url)
Inspect
Now, let’s just inspect a few rows.
First, we’ll take a look at the columns:
covid_data_RAW.columns
OUT:
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20'], dtype='object')
Ok. This is frankly hard to look at. We’ll have to clean these up.
We can also print out a few rows with the head() method.
covid_data_RAW.head()
OUT:
  Province/State Country/Region      Lat  ...  3/16/20  3/17/20  3/18/20
0            NaN       Thailand  15.0000  ...      147      177      212
1            NaN          Japan  36.0000  ...      825      878      889
2            NaN      Singapore   1.2833  ...      243      266      313
3            NaN          Nepal  28.1667  ...        1        1        1
4            NaN       Malaysia   2.5000  ...      566      673      790

[5 rows x 61 columns]
Keep in mind that your data might look a little different. I’m retrieving this data on March 19, 2020, so the dataset has records up to 3/18/20. If you run this code on a different date, your data will be more up-to-date.
Rename Columns
Now, we’re going to rename the columns.
To do this, we’re going to use the Pandas rename method. Notice that in order to do this, we need to specify new “key/value pairs” with a Python dictionary. Each “key” in the dictionary is the old column name, and the “value” in the dictionary is the new name.
So, you can read the items in the dictionary as {"old_name":"new_name"}.
#---------------
# RENAME COLUMNS
#---------------

covid_data = covid_data_RAW.rename(
    columns = {'Province/State':'subregion'
               ,'Country/Region':'country'
               ,'Lat':'lat'
               ,'Long':'long'
               }
)
Notice that we’re saving the output of this method with the name covid_data.

That means that after running this code, our original dataset, covid_data_RAW, will be in its original state. I’m doing that just to keep the raw data, in case we make any mistakes and need to start over with the raw data. (Although, that might not be a huge concern, since we can re-read the data with read_csv.)

The data that we’ll be working with going forward is covid_data.
Reshape Data
Next, we’re going to reshape the data.
Why?
One thing that you’ll notice about the data is that the dates all exist as separate columns. This is not what we need.
When we use many data analysis or visualization techniques, like groupby, agg, and Seaborn functions, we need the dates to exist as values underneath a single column.
So right now we need to reshape the data so that those dates exist under one single “date” column.
To do this, we need to use the Pandas melt function. Pandas has two complementary functions for reshaping data, melt and pivot; melt is the one we need when transforming data from wide form to long form.
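If melt is new to you, here’s a minimal sketch of the wide-to-long transformation on a tiny made-up dataset (the frame and its values are hypothetical, purely for illustration):

```python
import pandas as pd

# A tiny wide-form example: one column per date (made-up values)
wide = pd.DataFrame({
    'country': ['Thailand', 'Japan'],
    '1/22/20': [2, 2],
    '1/23/20': [3, 1],
})

# Melt the date columns into a single 'date_RAW' column
long = wide.melt(id_vars=['country'], var_name='date_RAW', value_name='confirmed')

print(long)
```

Each former date column becomes a row value under date_RAW, which is exactly the reshaping we need for the covid-19 data.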
So here, we’re going to use the melt function to reshape our data:
#==================================
# RESHAPE THE DATA
# - melt the data into 'tidy' shape
#==================================

covid_data = (covid_data
    .melt(id_vars = ['country','subregion','lat','long']
          ,var_name = 'date_RAW'
          ,value_name = 'confirmed'
          )
)
Notice that when we did this, we created two new variables: date_RAW and confirmed.

The first is a date field (which we’ll need to wrangle further), and the second is the number of confirmed cases.
Now that we’re finished with that step, let’s print some data to take a look:
print(covid_data)
OUT:
              country   subregion      lat      long date_RAW  confirmed
0            Thailand         NaN  15.0000  101.0000  1/22/20          2
1               Japan         NaN  36.0000  138.0000  1/22/20          2
2           Singapore         NaN   1.2833  103.8333  1/22/20          0
3               Nepal         NaN  28.1667   84.2500  1/22/20          0
4            Malaysia         NaN   2.5000  112.5000  1/22/20          0
...               ...         ...      ...       ...      ...        ...
26329     Netherlands       Aruba  12.5186  -70.0358  3/18/20          4
26330          Zambia         NaN -15.4167   28.2833  3/18/20          2
26331        Djibouti         NaN  11.8251   42.5903  3/18/20          1
26332     Gambia, The         NaN  13.4432  -15.3101  3/18/20          1
26333  United Kingdom  Montserrat  16.7425  -62.1874  3/18/20          1
Take a look at the output. We now have two new variables named date_RAW and confirmed.

The confirmed variable is the number of confirmed covid-19 cases, for a particular place, on a particular date.

The date_RAW variable is a string-based date variable, which means that it’s not in the form that we need it to be in.
So next, we’ll convert the dates into proper date/time data.
Convert Dates
First, let’s just print out some dates.
Here, we’re going to do this with the Pandas filter method to select the date_RAW column.
#--------------
# CONVERT DATES
#--------------

# INSPECT DATES
(covid_data
    .filter(['date_RAW'])
)
OUT:
      date_RAW
0      1/22/20
1      1/22/20
2      1/22/20
3      1/22/20
4      1/22/20
...        ...
26329  3/18/20
26330  3/18/20
26331  3/18/20
26332  3/18/20
26333  3/18/20
The dates are strings with a numeric month, a numeric day, and a two-digit year (for example, 1/22/20).
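To see how that string form maps onto a datetime format code, here’s a minimal sketch on a couple of hand-written date strings (the sample values are just for illustration):

```python
import pandas as pd

# '%m/%d/%y' means month/day/two-digit-year; to_datetime also
# accepts non-zero-padded months and days like '1/22/20'
dates = pd.Series(['1/22/20', '3/18/20'])
converted = pd.to_datetime(dates, format='%m/%d/%y')

print(converted)
```

This is the same conversion we’ll apply to the date_RAW column in the next step.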
Let’s try to convert these dates to proper datetime data.
Test date conversion
Instead of directly converting the data, I actually want to test this out first.
Here, we’re going to use several Pandas methods in a “chain” to test our date conversion and give us a preview of the results. (Note that this is similar to dplyr pipes in R.)
Keep in mind, because we’re not storing the output with the equal sign, this will not modify the original DataFrame. That’s good … it will give us the ability to test the operation first.
#---------------------
# TEST DATE CONVERSION
#---------------------

(covid_data
    .assign(date = pd.to_datetime(covid_data.date_RAW, format='%m/%d/%y'))
    .filter(['date','date_RAW','confirmed'])
    .groupby(['date','date_RAW'])
    .agg('sum')
    .sort_values('date')
)
And here are a few rows of output:
                     confirmed
date       date_RAW
2020-01-22 1/22/20         555
2020-01-23 1/23/20         653
2020-01-24 1/24/20         941
2020-01-25 1/25/20        1434
2020-01-26 1/26/20        2118
2020-01-27 1/27/20        2927
2020-01-28 1/28/20        5578
2020-01-29 1/29/20        6166
2020-01-30 1/30/20        8234
2020-01-31 1/31/20        9927
2020-02-01 2/1/20        12038
2020-02-02 2/2/20        16787
2020-02-03 2/3/20        19881
2020-02-04 2/4/20        23892
2020-02-05 2/5/20        27635
What happened here?
The code created a new variable in the output called date.

Then we retrieved only a few columns (date, date_RAW, confirmed) using the filter method.

Then we grouped and aggregated the data. The grouping is actually the critical step. What I want to see here is all of the unique combinations of date and date_RAW, side by side. The way I chose to do that is by using groupby, then agg.
So again, why did I do this?
I did it because I want to see the old date (date_RAW) and the new date (date) side by side. I want to compare and “spot check” to make sure that the assign operation using pd.to_datetime worked properly.

When you compare the old date and new date side by side, it looks like they match. That is, it looks like our new datetime field, date, was created properly.
But remember: the code we just ran did NOT change the original dataset yet. It was just a “test”.
Now that we tested this date conversion, we’ll run it properly and save the output so that we have our new date.
Create date variable
Now, we’ll actually convert our date.
Before we do that though, we’ll create a copy of our date, just to back it up.
This might be unnecessary and will take up extra storage. But I sometimes do this at intermediate points in my scripts, just to create a backup before I make changes.
You can choose to skip the backup if you want.
# BACKUP
#covid_data_backup_BEFOREDATE = covid_data.copy()
Ok. Now that we backed up our data, we’ll modify covid_data and create our new variable, date.
#-------------
# CONVERT DATE
#-------------

covid_data = covid_data.assign(
    date = pd.to_datetime(covid_data.date_RAW, format='%m/%d/%y')
)
At this point, covid_data contains a date variable that’s formatted as a proper datetime.
Rearrange columns and sort
Now, we’re just going to clean up our data a little.
I’m going to rearrange the columns with filter, and sort the data with sort_values.
#======================
# SORT & REARRANGE DATA
#======================

covid_data = (covid_data
    .filter(['country', 'subregion', 'date', 'lat', 'long', 'confirmed'])
    .sort_values(['country','subregion','lat','long','date'])
)
And let’s print the data:
print(covid_data)
OUT:
           country subregion       date      lat     long  confirmed
26     Afghanistan       NaN 2020-01-22  33.0000  65.0000          0
488    Afghanistan       NaN 2020-01-23  33.0000  65.0000          0
950    Afghanistan       NaN 2020-01-24  33.0000  65.0000          0
1412   Afghanistan       NaN 2020-01-25  33.0000  65.0000          0
1874   Afghanistan       NaN 2020-01-26  33.0000  65.0000          0
...            ...       ...        ...      ...      ...        ...
24482       Zambia       NaN 2020-03-14 -15.4167  28.2833          0
24944       Zambia       NaN 2020-03-15 -15.4167  28.2833          0
25406       Zambia       NaN 2020-03-16 -15.4167  28.2833          0
25868       Zambia       NaN 2020-03-17 -15.4167  28.2833          0
26330       Zambia       NaN 2020-03-18 -15.4167  28.2833          2
Alright, alright, alright.
This is starting to look goooood.
Set Index
Let’s do one last thing.
Here, we’ll set the index to ‘country’. This will probably be temporary (we’ll probably change the index in future tutorials), but for the time being, this will give us the ability to retrieve data based on country name.
To set the index for the DataFrame, we’ll use the Pandas set_index method.
# SET INDEX
covid_data.set_index('country', inplace = True)
Notice that when we do this, we’re setting inplace = True. This will directly modify our dataset, covid_data.
Check Data
Finally, I’m just going to look at a few things.
I want to get a list of the country names, because we’re currently using country as our index.
To do this, I’m going to chain together several Pandas methods, including reset_index, filter, and drop_duplicates.
Essentially, we’re creating a list of the unique values of country.
# GET COUNTRY NAMES
pd.set_option('display.max_rows', 155)

(covid_data
    .reset_index()
    .filter(['country'])
    .drop_duplicates()
    .head(n = 200)
)

pd.reset_option('display.max_rows')
Here is a truncated list of the countries.
                    country
0               Afghanistan
57                  Albania
114                 Algeria
171                 Andorra
228     Antigua and Barbuda
285               Argentina
...                     ...
11571                    US
25650               Ukraine
25707  United Arab Emirates
25764        United Kingdom
26049               Uruguay
26106            Uzbekistan
26163             Venezuela
26220               Vietnam
26277                Zambia
Note that the official name for the United States in the data is “US”.
Pull records for United States
Ok. Now, one last thing.
We’ll pull the records for the United States using the Pandas loc method.
# PULL DATA FOR UNITED STATES
covid_data.loc['US']
OUT:
               subregion       date      lat      long  confirmed
country
US             Adams, IN 2020-01-22  39.8522  -77.2865          0
US             Adams, IN 2020-01-23  39.8522  -77.2865          0
US             Adams, IN 2020-01-24  39.8522  -77.2865          0
US             Adams, IN 2020-01-25  39.8522  -77.2865          0
US             Adams, IN 2020-01-26  39.8522  -77.2865          0
...                  ...        ...      ...       ...        ...
US       Yolo County, CA 2020-03-14  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-15  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-16  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-17  38.7646 -121.9018          0
US       Yolo County, CA 2020-03-18  38.7646 -121.9018          0
Here, you can see that we have data for the United States at the country level and the subregion level (individual counties and cities). We have multiple dates, and the latitude/longitude for the data. And ultimately, we have the number of confirmed cases.
This gives us a lot that we can potentially do.
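For example, with the data in this long “tidy” shape, a daily case total for a country is just a groupby away. Here’s a quick sketch on a tiny synthetic frame shaped like covid_data (the subregions and counts below are made up, not real values from the dataset):

```python
import pandas as pd

# Synthetic data in the same shape as covid_data (values are made up)
demo = pd.DataFrame({
    'subregion': ['Adams, IN', 'Adams, IN', 'Yolo County, CA', 'Yolo County, CA'],
    'date': pd.to_datetime(['2020-03-17', '2020-03-18', '2020-03-17', '2020-03-18']),
    'confirmed': [1, 2, 3, 5],
}, index=pd.Index(['US'] * 4, name='country'))

# Total confirmed cases per day, summed across all subregions
daily_totals = (demo
    .groupby('date')
    .agg({'confirmed': 'sum'})
)

print(daily_totals)
```

The same chain, pointed at the real covid_data after .loc['US'], would produce a country-level daily time series.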
Next steps
There’s probably more data wrangling that we need to do. And after that, we’ll want to analyze the data with Pandas, and also with data visualization tools like matplotlib or Seaborn.
In terms of data manipulation, we’ll probably want to add the data for covid-19 deaths and recovered cases. (Remember, the data that we prepared in this tutorial only has confirmed cases.)
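As a rough preview of that step: assuming the deaths data gets melted into the same long shape, we could line the two metrics up with a merge on the location and date keys. The frames and numbers below are synthetic, just to illustrate the join:

```python
import pandas as pd

# Two synthetic long-form frames, already melted (values are made up)
confirmed = pd.DataFrame({
    'country': ['Zambia', 'Zambia'],
    'date': pd.to_datetime(['2020-03-17', '2020-03-18']),
    'confirmed': [0, 2],
})
deaths = pd.DataFrame({
    'country': ['Zambia', 'Zambia'],
    'date': pd.to_datetime(['2020-03-17', '2020-03-18']),
    'dead': [0, 0],
})

# Align the two metrics on the location/date keys
combined = pd.merge(confirmed, deaths, on=['country', 'date'], how='left')

print(combined)
```

With the real data, the merge keys would also include subregion, lat, and long.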
I probably also want to redo the index. Right now, the index is country name, but the names are hard to remember. And some of the names are long, or contain commas. That all makes for a bad index. Instead, I’ll probably want to change the index to the three-letter ISO code (i.e., ‘GBR’ instead of ‘United Kingdom’).
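As a sketch of that future change, here’s one way to map country names to ISO codes and re-index. The mini-dictionary here is hand-built and only covers three countries; a real script would use a complete name-to-ISO lookup table:

```python
import pandas as pd

# Hypothetical mini-mapping from dataset names to ISO alpha-3 codes
iso_map = {'US': 'USA', 'United Kingdom': 'GBR', 'Zambia': 'ZMB'}

# Synthetic frame standing in for covid_data (counts are made up)
demo = pd.DataFrame({
    'country': ['US', 'United Kingdom', 'Zambia'],
    'confirmed': [7, 1, 2],
})

# Map the names to codes, then use the codes as the index
demo = (demo
    .assign(country_iso = demo.country.map(iso_map))
    .set_index('country_iso')
)

print(demo)
```

After a change like this, we could pull records with short labels like demo.loc['GBR'].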
And as I said, there’s a lot more data analysis and data visualization to do after we finalize the dataset.
Sign up to learn more
Do you want to see part 2 and the other tutorials in this series?
Sign up for our email list now.
When you sign up, you’ll get our tutorials delivered directly to your inbox.