COVID-19 Time Series

This tutorial will visualize COVID-19 data as a time series, and along the way, show what a workflow might look like when using ahlive.

Click here to see the full animation.

load data

To start, we import ahlive and abbreviate it as ah. Then, we open the COVID-19 global cases dataset and display it.

raw is set to True for demonstration purposes, i.e. to show how to preprocess a “wide” dataset into a “tidy” dataset (this is done automatically if raw=False).

verbose is set to True to display the direct URL where the data is retrieved from, in addition to the source and base URL.

[1]:
import ahlive as ah
import pandas as pd
df = ah.tutorial.open_dataset('covid19_global_cases', raw=True, verbose=True)
display(df)
COVID19 GLOBAL CASES | Source: JHU CSSE COVID-19 | https://github.com/CSSEGISandData/COVID-19
Data: https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
     Province/State      Country/Region         Lat        Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  1/28/21  1/29/21  1/30/21  1/31/21  2/1/21  2/2/21  2/3/21  2/4/21  2/5/21  2/6/21
0               NaN         Afghanistan   33.939110   67.709953        0        0        0        0        0        0  ...    54891    54939    55008    55023   55059   55121   55174   55231   55265   55330
1               NaN             Albania   41.153300   20.168300        0        0        0        0        0        0  ...    75454    76350    77251    78127   78992   79934   80941   81993   83082   84212
2               NaN             Algeria   28.033900    1.659600        0        0        0        0        0        0  ...   106610   106887   107122   107339  107578  107841  108116  108381  108629  108629
3               NaN             Andorra   42.506300    1.521800        0        0        0        0        0        0  ...     9779     9837     9885     9937    9972   10017   10070   10137   10172   10206
4               NaN              Angola  -11.202700   17.873900        0        0        0        0        0        0  ...    19672    19723    19782    19796   19829   19900   19937   19996   20030   20062
...             ...                 ...         ...         ...      ...      ...      ...      ...      ...      ...  ...      ...      ...      ...      ...     ...     ...     ...     ...     ...     ...
268             NaN             Vietnam   14.058324  108.277199        0        2        2        2        2        2  ...     1651     1657     1767     1817    1850    1882    1948    1957    1976    1985
269             NaN  West Bank and Gaza   31.952200   35.233200        0        0        0        0        0        0  ...   157593   158168   158559   158962  159443  159956  160426  161087  161559  162029
270             NaN               Yemen   15.552727   48.516388        0        0        0        0        0        0  ...     2120     2120     2120     2121    2122    2122    2122    2122    2124    2127
271             NaN              Zambia  -13.133897   27.849332        0        0        0        0        0        0  ...    50319    51624    53352    54217   55042   56233   57489   59003   60427   61427
272             NaN            Zimbabwe  -19.015438   29.154857        0        0        0        0        0        0  ...    32646    32952    33273    33388   33548   33814   33964   34171   34331   34487

273 rows × 386 columns

transform data

This data is in wide form, but ahlive expects input data to be in “tidy” form, which is defined below:

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each type of observational unit forms a table.

Fortunately, it’s easy to convert to “tidy” form using pd.melt, called here as the df.melt method.

[2]:
df_tidy = df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='Date', value_name='Cases')
display(df_tidy)
        Province/State      Country/Region         Lat        Long     Date   Cases
0                  NaN         Afghanistan   33.939110   67.709953  1/22/20       0
1                  NaN             Albania   41.153300   20.168300  1/22/20       0
2                  NaN             Algeria   28.033900    1.659600  1/22/20       0
3                  NaN             Andorra   42.506300    1.521800  1/22/20       0
4                  NaN              Angola  -11.202700   17.873900  1/22/20       0
...                ...                 ...         ...         ...      ...     ...
104281             NaN             Vietnam   14.058324  108.277199   2/6/21    1985
104282             NaN  West Bank and Gaza   31.952200   35.233200   2/6/21  162029
104283             NaN               Yemen   15.552727   48.516388   2/6/21    2127
104284             NaN              Zambia  -13.133897   27.849332   2/6/21   61427
104285             NaN            Zimbabwe  -19.015438   29.154857   2/6/21   34487

104286 rows × 6 columns
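
To see what melt does at a smaller scale, here is a minimal sketch on a toy wide-form table. The country names and counts are made up for illustration; only the column layout mirrors the dataset above.

```python
import pandas as pd

# Toy wide-form table (made-up values): one row per country, one column per date.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# melt keeps the id_vars columns and stacks every other column into
# (Date, Cases) pairs, giving one row per country per date.
tidy = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="Cases")
print(tidy)

# Each cell of the wide table becomes exactly one row of the tidy table.
n_date_columns = wide.shape[1] - 1
print(len(tidy) == len(wide) * n_date_columns)
```

The same arithmetic applies to the real data: 273 rows times 382 date columns (386 columns minus the 4 id columns) gives the 104286 rows shown above.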

subset data

We can now use this “tidy” pd.DataFrame with ahlive. However, it’s often best to subset the data and focus on a few data points, which also keeps this tutorial simple.

[3]:
countries = ['US', 'China', 'New Zealand', 'United Kingdom',
             'Brazil', 'India', 'Zambia', 'Pakistan']
df_subset = df_tidy.loc[df_tidy['Country/Region'].isin(countries)]
display(df_subset)
                  Province/State  Country/Region         Lat        Long     Date    Cases
30                           NaN          Brazil  -14.235000  -51.925300  1/22/20        0
58                         Anhui           China   31.825700  117.226400  1/22/20        1
59                       Beijing           China   40.182400  116.414200  1/22/20       14
60                     Chongqing           China   30.057200  107.874000  1/22/20        6
61                        Fujian           China   26.078900  117.987400  1/22/20        1
...                          ...             ...         ...         ...      ...      ...
104273               Isle of Man  United Kingdom   54.236100   -4.548100   2/6/21      434
104274                Montserrat  United Kingdom   16.742498  -62.187366   2/6/21       15
104275  Turks and Caicos Islands  United Kingdom   21.694000  -71.797900   2/6/21     1654
104276                       NaN  United Kingdom   55.378100   -3.436000   2/6/21  3929835
104284                       NaN          Zambia  -13.133897   27.849332   2/6/21    61427

19100 rows × 6 columns
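
As a quick sanity check on a subset like this, it can help to count how many rows each country contributes. The sketch below uses a small made-up frame with the same column names, not the real data, to show how isin filters rows while leaving any per-province rows in place.

```python
import pandas as pd

# Toy tidy table (made-up rows) with the same column names as above.
tidy = pd.DataFrame({
    "Province/State": [None, "Anhui", "Beijing", None],
    "Country/Region": ["Brazil", "China", "China", "Albania"],
    "Cases": [0, 1, 14, 0],
})

countries = ["Brazil", "China"]

# isin builds a boolean mask that is True where the country is in the list.
mask = tidy["Country/Region"].isin(countries)
subset = tidy.loc[mask]

# Countries reported by province still have several rows per date.
rows_per_country = subset.groupby("Country/Region").size()
print(rows_per_country)
```

In this toy frame China keeps two rows, one per province, which is the kind of duplication that a later aggregation step has to collapse.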

Since this dataset was originally grouped by Province/State, we can group by Country/Region instead to further simplify it and reduce crowding. Also, since testing mostly started in the March to April timeframe, we will begin the animation in March.

[4]:
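df_countries = df_subset.groupby(['Date', 'Country/Region'])[['Cases']].sum().reset_index()
# 'Date' still holds strings, so this comparison is lexicographic, not chronological;
# with this data it keeps the dates from 3/1/20 through 9/30/20.
df_countries = df_countries.loc[df_countries['Date'] >= '3/01/20']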
display(df_countries)
        Date  Country/Region    Cases
1344  3/1/20          Brazil        2
1345  3/1/20           China    79932
1346  3/1/20           India        3
1347  3/1/20     New Zealand        1
1348  3/1/20        Pakistan        4
...      ...             ...      ...
3051  9/9/20     New Zealand     1792
3052  9/9/20        Pakistan   300030
3053  9/9/20              US  6361638
3054  9/9/20  United Kingdom   357613
3055  9/9/20          Zambia    13112

1712 rows × 3 columns
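
Note that the Date column still holds strings at this point, which is why the last row shown is 9/9/20 rather than a later date: strings are sorted and compared character by character. The sketch below contrasts a string comparison with a parsed-date comparison on four sample dates. The cutoff values are chosen for illustration, and the date format string is an assumption based on how the dates are printed above.

```python
import pandas as pd

dates = pd.Series(["3/1/20", "9/9/20", "10/1/20", "2/6/21"])

# Lexicographic comparison: "10/1/20" and "2/6/21" start with a character
# that sorts before "3", so they are dropped even though they are later dates.
as_text = dates >= "3/01/20"

# Chronological comparison after parsing the month/day/year strings.
parsed = pd.to_datetime(dates, format="%m/%d/%y")
as_dates = parsed >= pd.Timestamp("2020-03-01")

print(pd.DataFrame({"date": dates, "as_text": as_text, "as_dates": as_dates}))
```

Parsing the dates before filtering avoids this pitfall, at the cost of one extra conversion step.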

serialize data

Pass the preprocessed pd.DataFrame, the xs column name (what we want plotted on the x-axis), and the ys column name (what we want plotted on the y-axis) to instantiate the ah.DataFrame class.

[5]:
ah_df = ah.DataFrame(df_countries, 'Date', 'Cases')
print(ah_df)
<ahlive.Data>
Subplot:         (1, 1)
Dimensions:      (item: 1, state: 1712)
Data variables:
    chart    (item) <U4 'line'
    label    (item) <U1 ''
    group    (item) <U1 ''
    interp   (item) <U6 'linear'
    ease     (item) <U6 'in_out'
    x        (item, state) object '3/1/20' '3/1/20' ... '9/9/20' '9/9/20'
    y        (item, state) int64 2 79932 3 1 4 ... 300030 6361638 357613 13112


However, if we examine the output, we notice an abnormally large number of states (the number of frames in the animation) and only 1 item. In other words, ahlive is unaware of the different Country/Region values; thus we need to pass that column in as the label.

[6]:
ah_df = ah.DataFrame(df_countries, 'Date', 'Cases', label='Country/Region')
print(ah_df)
<ahlive.Data>
Subplot:         (1, 1)
Dimensions:      (item: 8, state: 214)
Data variables:
    chart    (item) <U4 'line' 'line' 'line' 'line' 'line' 'line' 'line' 'line'
    label    (item) <U14 'Brazil' 'China' 'India' ... 'United Kingdom' 'Zambia'
    group    (item) <U1 '' '' '' '' '' '' '' ''
    interp   (item) <U6 'linear' 'linear' 'linear' ... 'linear' 'linear'
    ease     (item) <U6 'in_out' 'in_out' 'in_out' ... 'in_out' 'in_out'
    x        (item, state) object '3/1/20' '3/10/20' ... '9/8/20' '9/9/20'
    y        (item, state) int64 2 31 38 52 151 ... 12776 12836 12952 13112


With that done, we can almost generate the first animation by calling the render method. Before doing so, it’s good to set animate to a list of states. This limits the number of frames in the animation so we can preview what the full animation looks like and check that everything looks correct. We can also specify fps to prevent the animation from flashing.

[7]:
ah_df = ah.DataFrame(df_countries, 'Date', 'Cases', label='Country/Region',
                     animate=[20, 50, 250, 300], fps=1)
ah_df.render()
[########################################] | 100% Completed | 19.9s
[7]:

tweak animation

By doing so, we can immediately notice that:

  1. The ylabel is cut off.

  2. The dates are crowded.

  3. COVID tests did not start until March.

  4. It’s hard to intuitively grasp the large numbers.

  5. The legend isn’t sorted by max.

To remedy this:

  1. Increase the width of the figure through figsize.

  2. Replace date strs with np.datetime64 objects.

  3. Slice dataframe to begin around March.

  4. Divide the values by a million (1e6) and rerun.

  5. Set sortby='y' in config.

Also, to reduce the size of this tutorial page, we will resample the data to weekly intervals.

[8]:
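df_dts = df_countries.copy()
# Parse the date strings into datetimes so they sort and plot chronologically.
df_dts['Date'] = pd.to_datetime(df_dts['Date'])
df_dts = df_dts.loc[df_dts['Date'] >= '2020-03-01']
# Resample to 7-day bins, keeping the last cumulative count in each bin.
df_dts = df_dts.groupby([
    pd.Grouper(key='Date', freq='7D'), 'Country/Region'
])['Cases'].last().reset_index()
df_scale = df_dts.sort_values('Date')
# Express the case counts in millions.
df_scale['Cases'] /= 1e6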
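
To make the weekly resampling step concrete, here is a minimal sketch on 14 days of made-up cumulative counts for a single hypothetical country "A". It shows that pd.Grouper with freq='7D' starts its bins on the first date in the data and that last() keeps the final value in each bin.

```python
import pandas as pd

# Two weeks of made-up cumulative counts (1, 2, ..., 14) for one country.
daily = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=14, freq="D"),
    "Country/Region": ["A"] * 14,
    "Cases": list(range(1, 15)),
})

# Group into 7-day bins per country and keep the last value in each bin.
weekly = daily.groupby(
    [pd.Grouper(key="Date", freq="7D"), "Country/Region"]
)["Cases"].last().reset_index()

# Express the counts in millions, as in the cell above.
weekly["Cases_million"] = weekly["Cases"] / 1e6
print(weekly)
```

Taking last() rather than sum() matters here because the counts are cumulative: summing them would count the same cases several times.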

We can rerun with the newly preprocessed pd.DataFrame. We can also pass a str to animate, such as 'test', to render only a subset of states; doing so also sets fps automatically.

[9]:
ah_df = ah.DataFrame(
    df_scale, 'Date', 'Cases', label='Country/Region',
    figsize=(10, 7), animate='test'
).config('legend', sortby='y')
ah_df.render()
[########################################] | 100% Completed | 11.2s
[9]:

We now note that there are extraneous decimal places, so we can use config to set the tick format. We may also want to add state_labels and inline_labels to directly see the x and y values.

[10]:
ah_df = ah.DataFrame(
    df_scale, 'Date', 'Cases', label='Country/Region',
    state_labels='Date', inline_labels='Cases',
    figsize=(10, 7), ylabel='Confirmed Cases [million]',
    animate='test'
).config('yticks', format='%.0f').config('legend', sortby='y')
ah_df.render()
[########################################] | 100% Completed | 13.3s
[10]:
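
The format string '%.0f' uses printf-style syntax. As a rough illustration, the sketch below applies it with plain Python's % operator to three of the 9/9/20 case counts shown earlier, expressed in millions. Whether ahlive applies the string in exactly this way is an assumption, so treat this only as a guide to what the format specifier itself does.

```python
# Case counts from 9/9/20 (Zambia, United Kingdom, US), expressed in millions.
values_in_millions = [13112 / 1e6, 357613 / 1e6, 6361638 / 1e6]

# '%.0f' rounds to zero decimal places; '%.2f' keeps two.
labels_0f = ["%.0f" % v for v in values_in_millions]
labels_2f = ["%.2f" % v for v in values_in_millions]

print(labels_0f)
print(labels_2f)
```

With zero decimal places the smaller countries all round to 0, which is fine for tick labels on a shared axis but is worth keeping in mind if the same format were reused for per-country labels.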

We can also set animate as head_28 to animate the first 28 frames.

[11]:
ah_df = ah.DataFrame(
    df_scale, 'Date', 'Cases', label='Country/Region',
    state_labels='Date', inline_labels='Cases',
    figsize=(10, 7), ylabel='Confirmed Cases [million]',
    animate='head_28'
).config('yticks', format='%.0f').config('legend', sortby='y')
ah_df.render()
[########                                ] | 22% Completed | 10.4s
/mnt/c/Users/Solactus/GOOGLE~1/Bash/ahlive/ahlive/animation.py:1352: UserWarning: AutoDateLocator was unable to pick an appropriate interval for this date range. It may be necessary to add an interval value to the AutoDateLocator's intervald dictionary. Defaulting to 12.
  plt.savefig(buf, **savefig_kwds)
[########################################] | 100% Completed | 35.6s
[11]:

Since there’s not much to see in the first 28 frames, we can also animate from the tail using tail_28.

[12]:
ah_df = ah.DataFrame(
    df_scale, 'Date', 'Cases', label='Country/Region',
    state_labels='Date', inline_labels='Cases',
    figsize=(10, 7), ylabel='Confirmed Cases [million]',
    animate='tail_28'
).config('yticks', format='%.0f').config('legend', sortby='y')
ah_df.render()
[########################################] | 100% Completed | 33.3s
[12]:
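
To reason about which states 'head_28' and 'tail_28' would cover, the following sketch mimics list slicing. The select_states helper is hypothetical and written only for this illustration; it is not part of ahlive and may not match how ahlive parses these strings.

```python
def select_states(n_states, animate):
    """Return the 0-based state positions a 'head_N' or 'tail_N' string would pick.

    Hypothetical helper for illustration only; not part of ahlive.
    """
    kind, _, count = animate.partition("_")
    n = int(count)
    if n <= 0:
        raise ValueError("N must be positive")
    positions = list(range(n_states))
    if kind == "head":
        return positions[:n]
    if kind == "tail":
        return positions[-n:]
    raise ValueError(f"unsupported animate string: {animate!r}")


# With 31 states (an arbitrary example size), the two selections overlap heavily.
head = select_states(31, "head_28")
tail = select_states(31, "tail_28")
print(head[0], head[-1])
print(tail[0], tail[-1])
```

When the total number of states is close to N, as in this example, the head and tail previews show mostly the same frames, so a smaller N gives a more distinct comparison.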