COVID-19 Bar Chart Race

This tutorial visualizes COVID-19 data as a bar chart race using a preprocessed dataset. For a step-by-step walkthrough of processing and serializing a dataset, see the COVID-19 Time Series tutorial (the preprocessed dataset used here differs slightly).

Click here to see the full animation.

load data

Here, we import ahlive and open a partially preprocessed dataset.

To keep this tutorial’s file size small, the animation covers only January 2021.

[1]:
import ahlive as ah
import pandas as pd
df = ah.tutorial.open_dataset('covid19_global_cases')
df = df.loc[
    (df['date'] >= '2021-01-01') &
    (df['date'] < '2021-02-01')
]
display(df)
COVID19 GLOBAL CASES | Source: JHU CSSE COVID-19 | https://github.com/CSSEGISandData/COVID-19
        province_state      country_region         lat        long        date   cases
 94185             NaN         Afghanistan   33.939110   67.709953  2021-01-01   51526
 94186             NaN             Albania   41.153300   20.168300  2021-01-01   58316
 94187             NaN             Algeria   28.033900    1.659600  2021-01-01   99897
 94188             NaN             Andorra   42.506300    1.521800  2021-01-01    8117
 94189             NaN              Angola  -11.202700   17.873900  2021-01-01   17568
   ...             ...                 ...         ...         ...         ...     ...
102643             NaN             Vietnam   14.058324  108.277199  2021-01-31    1817
102644             NaN  West Bank and Gaza   31.952200   35.233200  2021-01-31  158962
102645             NaN               Yemen   15.552727   48.516388  2021-01-31    2121
102646             NaN              Zambia  -13.133897   27.849332  2021-01-31   54217
102647             NaN            Zimbabwe  -19.015438   29.154857  2021-01-31   33388

8463 rows × 6 columns

preprocess data

To add some variety compared with the previous tutorial, we preprocess the dataset into a 7-day rolling average of new daily confirmed cases. We also convert the values to integers, because 0.75 of a case doesn’t really make sense. Finally, we subset the timeframe to dates on or after March 1, 2020; since the data loaded above already starts in January 2021, this filter keeps every row.

[2]:
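# Pivot to one column per country, then difference the cumulative counts
# to get new cases per day.
df_diff = df.pivot_table(
    'cases', columns='country_region', index='date'
).diff()
# 7-day rolling mean; drop the NaN rows left by diff() before casting to int.
df_roll = df_diff.rolling('7D').mean().dropna().astype(int)
# Melt back to long form, one row per (date, country).
df_melt = df_roll.reset_index().melt(
    'date', value_name='new_cases'
).sort_values('date')
df_new = df_melt.loc[df_melt['date'] >= '2020-03-01']
display(df_new)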
            date        country_region  new_cases
   0  2021-01-02           Afghanistan          0
 570  2021-01-02                Bhutan         21
5430  2021-01-02  United Arab Emirates       1963
3060  2021-01-02             Lithuania       1507
4800  2021-01-02          South Africa      15002
 ...         ...                   ...        ...
3599  2021-01-31            Montenegro        477
3569  2021-01-31              Mongolia         24
3539  2021-01-31                Monaco         18
3809  2021-01-31             Nicaragua          7
5759  2021-01-31              Zimbabwe        295

5760 rows × 3 columns
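
To see what each preprocessing step does, here is a minimal sketch on made-up data. The regions "A" and "B", their counts, and the shorter 3-day window are assumptions chosen so the arithmetic is easy to follow; only pandas is used, not ahlive.

```python
import pandas as pd

# Hypothetical cumulative case counts for two made-up regions.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 2
    ),
    "country_region": ["A"] * 4 + ["B"] * 4,
    "cases": [10, 13, 19, 20, 100, 100, 105, 112],
})

# Wide table: one row per date, one column per region.
wide = toy.pivot_table(values="cases", index="date", columns="country_region")

# Cumulative totals -> new cases per day; the first day has no predecessor,
# so it becomes NaN.
daily = wide.diff()

# Offset-based window ("3D" here, "7D" in the tutorial): each row averages
# the non-missing values from the 3 days ending on that date.
rolled = daily.rolling("3D").mean().dropna().astype(int)

# Back to long form, one row per (date, region) pair.
long_form = rolled.reset_index().melt(id_vars="date", value_name="new_cases")
print(long_form)
```

Two details are worth keeping in mind: an offset window such as '7D' requires a datetime-like index, and astype(int) truncates toward zero (4.5 becomes 4), so call round() first if rounding is preferred.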

serialize data

Then we serialize country_region as the x and new_cases as the y, also setting country_region as the label so that each country is grouped as a separate item. For this tutorial, we plot the new confirmed cases with chart='bar' and preset='race', i.e. a bar chart race. We can do a test run by setting animate='test' and calling render.

[3]:
ah_df = ah.DataFrame(
    df_new, 'country_region', 'new_cases', label='country_region',
    chart='bar', preset='race', scheduler='processes', workers=4,
    animate='test'
)
print(ah_df)
ah_df.render()
<ahlive.Data>
Subplot:         (1, 1)
Dimensions:      (item: 192, state: 30)
Data variables:
    chart    (item) <U3 'bar' 'bar' 'bar' 'bar' ... 'bar' 'bar' 'bar' 'bar'
    label    (item) <U32 'Afghanistan' 'Bhutan' ... 'Antigua and Barbuda'
    group    (item) <U1 '' '' '' '' '' '' '' '' '' ... '' '' '' '' '' '' '' ''
    interp   (item) <U6 'linear' 'linear' 'linear' ... 'linear' 'linear'
    ease     (item) <U6 'in_out' 'in_out' 'in_out' ... 'in_out' 'in_out'
    x        (item, state) object 'Afghanistan' ... 'Antigua and Barbuda'
    y        (item, state) int64 0 0 495 394 315 280 258 267 ... 1 1 1 3 3 3 3 5


[########################################] | 100% Completed |  6.6s
[3]:

add labels

Not too shabby for a test run, but we can make some improvements.

  1. Divide new_cases by 1000 so the values are easier to read, and update ylabel.

  2. Use barh instead of bar.

  3. Set ylims='explore' to lessen crowding of bar_label.

  4. Add state_labels to show the date.

  5. Add inline_labels to show the cases.

  6. Add title to highlight the data shown is a 7-day rolling mean.

  7. Add a note to cite the data.

  8. Increase figsize to prevent the left side from being cut off.

  9. Add a “k” suffix to the inline labels through config.

[4]:
df_scale = df_new.copy()
df_scale['new_cases'] /= 1000

ah_df = ah.DataFrame(
    df_scale, 'country_region', 'new_cases', label='country_region',
    chart='barh', preset='race', ylabel='New Cases [thousand]',
    ylims='explore', state_labels='date', inline_labels='new_cases',
    title='New Confirmed COVID-19 Cases per Day, 7-Day Rolling Average',
    note='Source: JHU CSSE COVID-19', figsize=(15, 10),
    scheduler='processes', workers=4, animate='tail'
).config('inline', suffix='k')
ah_df.render()
[########################################] | 100% Completed | 11.8s
[4]:

The bar labels jump around instantaneously here. Setting frames to a higher number would produce a smoother animation, but to keep this tutorial’s file size small, that is left until the end.

tweak further

We can normalize by population (per 100k) as well.

[5]:
df_pop = ah.tutorial.open_dataset('covid19_population')[['combined_key', 'population']]
df_norm = df_scale.merge(df_pop, left_on='country_region', right_on='combined_key')
# Undo the earlier division by 1000, then convert to cases per 100k people.
df_norm['new_cases'] = df_norm['new_cases'] * 1000 / df_norm['population']
df_norm['new_cases'] *= 1e5
COVID19 POPULATION | Source: JHU CSSE COVID-19 | https://github.com/CSSEGISandData/COVID-19
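
One thing to watch with the merge above: pandas merges are inner joins by default, so any country whose name has no match in the population table is dropped without warning. Whether that happens with this dataset is not checked here. The sketch below uses made-up regions and populations (all values are assumptions) to show one way to surface unmatched rows before normalizing.

```python
import pandas as pd

# Hypothetical inputs: average daily new cases and populations.
cases = pd.DataFrame({
    "country_region": ["A", "B", "C"],
    "new_cases": [500.0, 30.0, 12.0],
})
pop = pd.DataFrame({
    "combined_key": ["A", "B"],
    "population": [1_000_000, 50_000],
})

# A left join keeps every row of `cases`; indicator=True adds a `_merge`
# column that flags rows without a population match.
merged = cases.merge(
    pop, how="left", left_on="country_region", right_on="combined_key",
    indicator=True,
)
unmatched = merged.loc[
    merged["_merge"] == "left_only", "country_region"
].tolist()
print("No population found for:", unmatched)

# Keep the matched rows and express cases per 100,000 residents.
norm = merged.loc[merged["_merge"] == "both"].copy()
norm["per_100k"] = norm["new_cases"] / norm["population"] * 1e5
print(norm[["country_region", "per_100k"]])
```

If the unmatched list is not empty, the names can be reconciled before merging, or the missing rows can be dropped deliberately.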

We can also increase the number of bars shown with limit and fix the lower x-axis limit at zero with xlim0s=0.

[6]:
ah_df = ah.DataFrame(
    df_norm, 'country_region', 'new_cases', label='country_region',
    chart='barh', preset='race', ylabel='New Cases / 100k People',
    ylims='explore', state_labels='date', inline_labels='new_cases',
    title='New Confirmed COVID-19 Cases per Day, 7-Day Rolling Average',
    note='Source: JHU CSSE COVID-19', figsize=(15, 10), xlim0s=0,
    scheduler='processes', workers=4, animate='tail'
).config('inline', suffix='/ 100k').config('preset', limit=7)
ah_df.render()
[########################################] | 100% Completed | 13.4s
[6]:
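
To make the effect of a bar limit concrete, here is a rough pandas sketch that ranks made-up values within each date and keeps the top entries. It assumes that limit caps the chart at the highest-ranked items in each state; it is not ahlive's implementation, and the names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical values for four made-up regions on two dates.
frame = pd.DataFrame({
    "date": ["2021-01-02"] * 4 + ["2021-01-03"] * 4,
    "country_region": ["A", "B", "C", "D"] * 2,
    "new_cases": [5.0, 9.0, 1.0, 7.0, 8.0, 2.0, 6.0, 3.0],
})

limit = 2  # number of bars to keep per date

# Rank within each date, largest value first; method="first" breaks ties
# by order of appearance so every row gets a distinct rank.
frame["rank"] = frame.groupby("date")["new_cases"].rank(
    method="first", ascending=False
)

# Keep only the top `limit` rows for each date.
top = frame.loc[frame["rank"] <= limit].sort_values(["date", "rank"])
print(top)
```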

We can truncate the country_region labels to 20 characters and manually edit the finalized xr.Dataset, hiding bar_label where values are 10 or less.

[7]:
df_short = df_norm.copy()
df_short['country_region'] = df_short['country_region'].str[:20]

ah_df = ah.DataFrame(
    df_short, 'country_region', 'new_cases', label='country_region',
    chart='barh', preset='race', ylabel='New Cases / 100k People',
    ylims='explore', state_labels='date', inline_labels='new_cases',
    title='New Confirmed COVID-19 Cases per Day, 7-Day Rolling Average',
    note='Source: JHU CSSE COVID-19', figsize=(15, 10), xlim0s=0,
    scheduler='processes', workers=4, animate='test'
).config('inline', suffix='/ 100k').config('preset', limit=7)
ah_df = ah_df.finalize()
ds = ah_df.data[1, 1]
ds['bar_label'] = ds['bar_label'].where(ds['y'] > 10, '')
ah_df.data[1, 1] = ds
ah_df.render()
[########################################] | 100% Completed | 13.9s
[7]:
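
The two tricks used in the last cell, string slicing and where, can be tried on a small pandas example. pandas is used here as a stand-in because Series.where follows the same convention as the xarray call above: entries where the condition is True are kept and the rest are replaced. The names and values are made up.

```python
import pandas as pd

values = pd.Series([3.2, 10.0, 27.5, 64.1])
labels = pd.Series(["Aland", "Borduria", "Cascadia", "Dorne"])  # made-up names

# .str[:5] keeps at most the first five characters of each label
# (the tutorial uses 20).
short = labels.str[:5]

# `where` keeps labels where the condition is True and substitutes the
# second argument everywhere else. The comparison is strict, so a value
# of exactly 10 is hidden as well.
masked = labels.where(values > 10, "")

print(short.tolist())
print(masked.tolist())
```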