Using Google Places data to analyze changes in mobility during the COVID-19 pandemic
This article is originally published at https://datascience.blog.wzb.eu
During the COVID-19 pandemic, it’s apparent that location data gathered by private IT companies and telcos is a primary source for many studies about the effect of mobility restrictions on people’s behaviors and movements. In this blog post, I’d like to have a look at the “popular times” data provided by Google Places. I explain the limitations of this data, show how to gather it and provide some results from data that I fetched during March and April.
I first stumbled upon the possibility of using the Google place popularity data to measure the effect of social distancing efforts on mobility via Philipp Kreißel’s take on a “social distancing dashboard” for Germany (this later evolved into the EveryoneCounts project, which uses several publicly available data sources). The basic idea is that Google generates a “live popularity” measure for some places, because they track the positions of people’s Android smartphones via GPS and/or WiFi (as long as you don’t turn positioning off). You may have noticed these bars in Google Maps:
This live data is an hourly percentage value that is relative to the usual popularity of that place at that hour of that weekday. The usual popularity is an average value calculated from historical data for that place. It reaches its 100% peak at some hour of each day. For example, a bar’s usual popularity may be at 70% at 6pm and 100% at 8pm on Tuesdays, but on Saturdays it may be at 100% at 11pm – this doesn’t say anything about whether more people visit this bar on Saturdays or Tuesdays, it only tells you something about the hourly distribution for a given day. By comparing this to the live popularity, you can see whether a given place is more or less busy than usual at the current hour and weekday.
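The relationship between usual and live popularity can be illustrated with a small calculation (all numbers here are invented for illustration):

```python
# Hypothetical hourly "usual popularity" profile for a bar on Tuesdays,
# expressed as percentages relative to the day's peak hour (here 8pm = 100%).
usual_tuesday = {18: 70, 19: 85, 20: 100, 21: 90}

def busyness_ratio(current_pct, usual_pct):
    """Ratio of live popularity to usual popularity at the same hour/weekday.
    > 1 means busier than usual, < 1 means quieter than usual."""
    return current_pct / usual_pct

# A live value of 35% at 6pm against a usual 70% means the bar is
# currently half as busy as it usually is at that hour on a Tuesday.
print(busyness_ratio(35, usual_tuesday[18]))  # 0.5
```

Note that the ratio compares against the place's own hourly baseline, so it says nothing about absolute visitor numbers.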
I started gathering this popularity data on March 22nd and stopped on April 15th. At the beginning I queried about 700 places world-wide daily, every two hours during usual business hours (depending on the type of place). I added more places at the beginning of April and ended up with 3650 places, for which I then queried the data every three hours on Tuesdays, Thursdays and Saturdays (it’s all a matter of cost, as I will explain later).
Google, sitting directly at the source, could of course do all this much more easily, more accurately and with a much larger scope. They released their COVID-19 Community Mobility Reports on April 3rd, which are based on data from people who allowed Google to store their location history. The data, however, is aggregated at country, federal state and – for the US – county level, whereas you can query data from the Places API for individual localities, which allows you to go down to neighborhood level. Furthermore, the Mobility Reports were released in PDF format, which is the opposite of “open data.” Luckily, the UK Office for National Statistics has freed the mobility data from its PDF prison. (Update: By now, Google also provides a CSV file of the data and Apple offers similar mobility data.)
There are several limitations and problems with both the location history data used in the Mobility Reports and the Places API data. Google points out some of these limitations in their reports. They caution that “[l]ocation accuracy and the understanding of categorized places will vary between regions” and advise against using the data to compare changes between countries, or between regions with different characteristics (e.g. rural versus urban areas).
Regarding the popular times data, I couldn’t find precise information about how the popularity is measured (I will come back to that), but from the experiments I made with the API, it is clear that there’s definitely a strong difference between urban and rural areas and between the types of localities that you query. There’s a strong bias in which places give you a current popularity measurement and which don’t. You will likely get a current popularity measure for a much frequented restaurant in a city, but not in a rural area. On the other hand, you will likely get these measurements for many supermarkets in both urban and rural areas.
Whether popularity data is provided for a place or not depends on the “privacy threshold” that is mentioned, but not further explained, in the Mobility Reports. A similar threshold is applied to the popular times data, so you will only get these measurements “if Google has sufficient visit data for your business.” This also brings me to another potential source of bias: since Google doesn’t include the data for places that are currently visited by only a few people, the live popularity data is biased towards more positive outcomes. I’m sure Google adjusted for this in their reports, but I think it’s impossible to mitigate this in the popular times data as long as Google doesn’t document what “sufficient visit data” exactly means.
Apart from that, there’s a selection bias that is of course always present in this kind of data: the sample can only include people with smartphones (more specifically, Android smartphones).
As I already said, it is unclear how exactly the popular times data is measured. For example, do only real “visits” to a place count (e.g. you’re inside a bar or supermarket for a certain time), or is it enough to spend some time nearby or even to just walk past a place? Unfortunately, I couldn’t get any information on that from the documentation or from Google support. Another important question is how the hourly usual popularity values are calculated, and on this there is at least some vague information: “Popular times are based on average popularity over the last several weeks.” This is important information when you want to compare current with usual popularity values (as explained in the beginning) during an event such as the coronavirus crisis, since the usual popularity values will slowly adapt to the “new normal.”
Additionally, there may be seasonal effects on the popularity of certain types of places, e.g. when looking at data about tourist attractions, parks or transit stations (holidays!). You may adjust for this by looking at historical data. However, you will need a different data source for that, since you can’t get historical popular times data from the API. You will only get the current popularity with respect to the local time of that place.
So we can see that this data comes with quite a few limitations. Still, it may be useful and valid under some circumstances, for example when you compare data that was collected in similar time frames within comparable spatial areas, e.g. neighborhoods in a city. The big bonus is that the data you can get from the Places API has much better spatial and temporal resolution than what you can get from the Mobility Reports.
Gathering the data: API requests and costs
Gathering the popularity data from the Google APIs is a three-step process:
Step 1: You need to identify potential “places of interest”. That is, you essentially make automated Google Maps search queries for the type of places in the respective cities, city districts, etc. An example query would be “fast food in Kreuzberg, Berlin” and you would repeat that for other cities and their districts. Of course you can be more or less fine-grained in your queries, e.g. only query at city level and not district level, but when you want to find many places in a big city it makes sense to break it down to city districts or even smaller administrative units. This first step can be implemented with Google’s official Python client library for the Maps API. More specifically, you could use the places() function from this package. It makes sense to augment this function with a search location (latitude/longitude) and a search radius to get more precise results. So you may additionally need to geo-code your search spots (e.g. by using the Geocode API).
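A minimal sketch of this first step could look as follows. The places() and geocode() calls follow the official googlemaps Python client; build_queries() and find_places() are my own hypothetical helpers, and the radius is an arbitrary choice:

```python
# Sketch of step 1, assuming the official "googlemaps" Python client.

def build_queries(place_type, districts):
    """Compose text search queries like "fast food in Kreuzberg, Berlin"."""
    return [f"{place_type} in {district}" for district in districts]

def find_places(gmaps, place_type, districts):
    """Run one text search per district, constrained by location/radius."""
    results = []
    for query, district in zip(build_queries(place_type, districts), districts):
        # Geocode the district to get a search center (lat/lng) for more
        # precise results, then search within a radius around that point.
        geo = gmaps.geocode(district)
        latlng = geo[0]["geometry"]["location"]
        resp = gmaps.places(query,
                            location=(latlng["lat"], latlng["lng"]),
                            radius=3000)  # meters; choose per district size
        results.extend(resp.get("results", []))
    return results

# Usage (requires a valid API key and network access):
# import googlemaps
# gmaps = googlemaps.Client(key="YOUR_API_KEY")
# places = find_places(gmaps, "fast food",
#                      ["Kreuzberg, Berlin", "Neukölln, Berlin"])
```

Passing the client object into find_places() also makes the function easy to test with a stub before spending API credits.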
Step 2: You need to find out for which of these potential places of interest you can get a popularity measure. As pointed out in the previous section, not all places provide this information because of the privacy threshold, i.e. they need a minimum level of activity. So you should probe the places you found in step one for this popularity measure at a sensible local time (e.g. you won’t get a popularity level for a supermarket when it’s midnight there). The popular times data is not accessible via the official Google API, but there is a Python package, populartimes, that crawls this data for a given place. You can repeat steps one and two several times and store the set of places that returned popularity values. This set of places of interest will be the input to the next step.
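Step two could be sketched like this. The populartimes package's get_id() function returns a dict for a place that contains a "current_popularity" key only when Google reports a live value; the filtering scaffolding around it is my own, with the fetch function injectable so the logic can be run without API access:

```python
# Sketch of step 2: keep only places that return a live popularity value.

def has_live_popularity(place_data):
    """True if Google reported a live value for this place."""
    return "current_popularity" in place_data

def probe_places(place_ids, fetch_place):
    """Probe candidate places; return the IDs that yielded live data."""
    places_of_interest = []
    for pid in place_ids:
        data = fetch_place(pid)
        if has_live_popularity(data):
            places_of_interest.append(pid)
    return places_of_interest

# With the real crawler (requires an API key and network access):
# import populartimes
# fetch = lambda pid: populartimes.get_id("YOUR_API_KEY", pid)
# poi = probe_places(candidate_place_ids, fetch)
```

Repeating this probe at several sensible local times before discarding a place reduces the chance of dropping places that were merely closed at probe time.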
Step 3: You can now repeatedly fetch the popularity values from the places of interest at different times of the day and different weekdays. You can store the current popularity along with the usual popularity at this weekday and hour of the day. If you do this globally, keep in mind that you need to use the local time according to the timezone of a given place!
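For the local-time caveat in step three, Python's standard zoneinfo module (3.9+) can decide whether it is currently a sensible hour at a place; the timezone name per place would have to come from your own records or a lookup library, and the business-hours window below is an arbitrary assumption:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def is_within_business_hours(utc_now, tz_name, open_hour=8, close_hour=22):
    """Check whether a UTC timestamp falls into local business hours."""
    local = utc_now.astimezone(ZoneInfo(tz_name))
    return open_hour <= local.hour < close_hour

# Example: 23:30 UTC is past closing in Berlin (01:30 local during CEST),
# but mid-afternoon in Los Angeles (16:30 local during PDT).
t = datetime(2020, 4, 7, 23, 30, tzinfo=timezone.utc)
print(is_within_business_hours(t, "Europe/Berlin"))       # False
print(is_within_business_hours(t, "America/Los_Angeles")) # True
```

Scheduling the fetches in UTC and converting per place like this avoids querying (and paying for) places in the middle of their night.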
All this doesn’t come for free. Google charges for every API request, and every request may consist of different “SKUs”. See this price list. For the search queries in step one, you need the “Find Place” SKU, which is currently at $17 per 1000 requests. For getting the popularity data in steps two and three, you need the SKUs “Basic”, “Place Details”, “Atmosphere” and “Contact Data”, which sum up to $25 per 1000 requests. It doesn’t sound like much, but it adds up quickly. Luckily, Google provided me with some free credits after I applied for them.
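To get a feeling for how quickly this adds up, here is a back-of-the-envelope calculation using the per-1000-request prices quoted above (prices as of the time of writing; the assumed 12-hour query window, i.e. four queries per day at a three-hour interval, is my own illustration):

```python
# Rough cost estimate for repeatedly fetching popularity data (step 3).
FIND_PLACE_PER_1K = 17.0  # USD, step 1 search queries ("Find Place" SKU)
DETAILS_PER_1K = 25.0     # USD, steps 2/3 (Basic + Place Details +
                          # Atmosphere + Contact Data SKUs combined)

def popularity_query_cost(n_places, queries_per_day, n_days):
    """Cost in USD of repeatedly fetching popularity for n_places."""
    n_requests = n_places * queries_per_day * n_days
    return n_requests * DETAILS_PER_1K / 1000

# 3650 places, queried every three hours in an assumed 12-hour window
# (4 queries/day), on 3 days per week over 2 weeks (6 days):
print(popularity_query_cost(3650, 4, 6))  # 2190.0
```

At over $2000 for two weeks of a modest sampling schedule, it becomes clear why the query frequency and the number of places had to be limited.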
Comparison with mobility reports
I used a linear mixed model to estimate the geometric mean of the ratio of current to usual popularity per place category in Germany. The following plot shows the results:
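As a side note on the method: a geometric mean of ratios is conveniently estimated on the log scale, where it becomes an arithmetic mean of log ratios that is exponentiated afterwards (in the mixed model, the exponentiated fixed effect plays this role). A stripped-down version without the random effects, i.e. a plain geometric mean, with invented numbers:

```python
import math

def geometric_mean_ratio(current, usual):
    """Geometric mean of current/usual ratios via the mean of log ratios."""
    logs = [math.log(c / u) for c, u in zip(current, usual)]
    return math.exp(sum(logs) / len(logs))

# Invented example: three park observations, all busier than usual,
# yields a geometric mean ratio of about 1.48 (i.e. roughly +48%).
print(geometric_mean_ratio([150, 180, 120], [100, 100, 100]))
```

The geometric mean is the natural average here because the ratios are multiplicative: a halving and a doubling should cancel out, which they do on the log scale but not on the raw scale.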
You can see that between March 22nd and April 15th only parks had an increase in popularity, in this case by about 70%. Public transport went down to about a third of its usual popularity. This is quite similar to gastronomy, which in the dataset, however, only consisted of places that could still operate (mostly fast food places). The commerce category also mainly contains supermarkets and grocery stores, because other types of stores had to close. For these stores, popularity decreased by only about 25%, because people still did their necessary grocery shopping.
We can try to compare this with the Google Mobility Report for Germany from April 17th. It’s not completely comparable, since Google used different categories and probably sampled many more cities than the five that I could cover (Berlin, Cologne, Hamburg, Munich and Dresden). We can directly compare the categories parks and public transport, which show similar trends but more extreme values in the popularity data that I collected. Commerce can be partly compared with the “grocery & pharmacy” category in the Mobility Reports. Here we have a larger discrepancy (-25% vs. -4%). Note, however, that Google doesn’t report any uncertainty measure.
Due to the high temporal resolution of the data (I started gathering it every two hours, then switched to every three hours, interpolating the values in between), we can also have a look at daily patterns for different kinds of places on weekends and working days:
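The interpolation of the missing hours can be done with simple linear interpolation between neighboring measurements. A minimal sketch (in the actual analysis this would be applied per place and day):

```python
def interpolate_hours(samples):
    """Linearly fill hourly gaps between measurements.
    samples: sorted list of (hour, popularity) tuples."""
    filled = {}
    for (h0, v0), (h1, v1) in zip(samples, samples[1:]):
        for h in range(h0, h1):
            frac = (h - h0) / (h1 - h0)
            filled[h] = v0 + frac * (v1 - v0)
    # the last measurement has no successor, so copy it over directly
    filled[samples[-1][0]] = samples[-1][1]
    return filled

# Measurements at 9am, 12pm and 3pm; the hours in between get filled in:
hourly = interpolate_hours([(9, 60), (12, 90), (15, 30)])
print({h: round(v, 1) for h, v in hourly.items()})
# {9: 60.0, 10: 70.0, 11: 80.0, 12: 90.0, 13: 70.0, 14: 50.0, 15: 30}
```

Linear interpolation is of course a simplification; sub-hourly peaks between two measurements are invisible to it.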
Again, parks stand out, with an increase in visits throughout the day that peaks at more than a threefold increase in the late afternoon. Unsurprisingly, public transport is especially low during typical commute times on working days and in the evening, when people stayed home instead of going out. Accordingly, evening shopping in the commerce category is also low on working days. On weekends, however, there is a slight increase in the morning and evening. One may speculate whether this is due to people visiting supermarkets at atypical times in order to avoid crowds. Gastronomy (again: mostly fast food places) also saw an increase in visits in the morning and evening hours; however, this estimate comes with some uncertainty.
Although Google advises against drawing comparisons between countries due to differences in location accuracy and place categorization, we may assume that these differences are comparatively small among most European countries. Comparing the overall popularity change (i.e. setting aside the different place categories), we can see the following:
A map of the mean values gives a better sense of the geographic pattern, which shows a strong difference between Northern and Southern Europe, with Germany and the Netherlands being exceptions in Central Europe:
The comparatively high value for Germany, however, is especially driven by the exceptionally high popularity for parks during the given time period. We can see this when further investigating the place categories in each country:
Or as a map to show the geographic patterns (click to enlarge image):
For these kinds of comparisons, however, the Google Mobility Reports data is probably better suited since it is more accurate at this aggregated level.
Although the Google Places popularity data comes with several limitations, it may be useful, e.g. to understand the effect of social distancing measures on mobility. Another use case may be the attempt to explain differences in social distancing practices in different spatial or temporal contexts. Since the Google Places data provides high resolution in these dimensions, it may be applied successfully there.
Datasets and code for this post are available on GitHub.