Automated Coronavirus-Related Data Collection with Python

This post is based on an article by Federico Riveroll: https://towardsdatascience.com/gather-all-the-coronavirus-data-with-python-19aa22167dea

The exponential growth of the coronavirus epidemic has been highly alarming. Here I try to collect quality data for research and provide links to the sources used.

Sources: CSSE COVID-19 live map 

https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Objective

The objective of this article is to gather the data needed for research and to gain proactive visibility into COVID-19.

Steps

  • Step 1: Set up technical prerequisites
  • Step 2: Gather COVID-19 confirmed cases
  • Step 3: Gather News of COVID-19
  • Step 4: Gather Financial and other Indicators
  • Step 5: Blend all the data together

Step 1. Prerequisites

  • Have Python 2.6+ or 3.1+ installed
  • Install pandas, matplotlib, OpenBlender and wordcloud (with pip)

$ pip install pandas OpenBlender matplotlib wordcloud
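
These are the imports used by all the snippets below:

import json
import pandas as pd
import OpenBlender
from matplotlib import pyplot as plt
from wordcloud import WordCloud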

Step 2. Gather COVID-19 confirmed cases

The CSSE is doing an amazing job of uploading the daily data. However, it is untidy and spread across many different datasets, so it was sorted and uploaded as a single OpenBlender dataset:

Let's pull the data into a pandas dataframe by running this script:

action = 'API_getObservationsFromDataset'

parameters = {
    'token':'YOUR_TOKEN_HERE',
    'id_dataset':'5e6ac97595162921fda18076',
    'date_filter':{
        "start_date":"2020-01-01T06:00:00.000Z",
        "end_date":"2020-03-11T06:00:00.000Z"}
}

df_confirmed = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
df_confirmed.reset_index(drop=True, inplace=True)
df_confirmed.head(10)

  • Note: To get a token you need to create a (free) account on openblender.io; you'll find it in the 'Account' tab under your profile icon. You can keep it out of your scripts, as shown in the snippet below.
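
A minimal sketch for loading the token from an environment variable (the variable name OPENBLENDER_TOKEN is my own choice, not an OpenBlender convention):

import os

# Load the token from the environment so it never lands in source control
# (OPENBLENDER_TOKEN is an assumed variable name; export it in your shell first)
token = os.environ.get('OPENBLENDER_TOKEN', 'YOUR_TOKEN_HERE')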

So now we have the number of confirmed cases, deaths and recoveries, aggregated by day and by location.

Here we can see the outburst of confirmed cases in Iran, Italy and Korea. We can also see Spain, France and Germany starting to rise.
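
To reproduce that view, here is a minimal sketch; 'countryregion' appears later in this post as a feature of the dataset, but the 'confirmed' column name is an assumption, so verify both against df_confirmed.columns:

# Plot daily confirmed cases for a few countries of interest
# ('confirmed' as the numeric cases column is an assumption; check df_confirmed.columns)
for country in ['Iran', 'Italy', 'Korea', 'Spain', 'France', 'Germany']:
    sub = df_confirmed[df_confirmed['countryregion'].str.contains(country, case=False, na=False)]
    daily = sub.groupby('timestamp')['confirmed'].sum().sort_index()
    plt.plot(daily.index, daily.values, label=country)
plt.legend()
plt.show()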

Step 3. Gather News of COVID-19

We'll gather COVID news and texts from these sources: ABC News, Wall Street Journal, CNN News and USA Today Twitter (you can look for other sources).

So let’s fetch the data.

action = 'API_getOpenTextData'

parameters = {
    'token':'YOUR_TOKEN_HERE',
    'date_filter':{"start_date":"2020-01-01T06:00:00.000Z",
                   "end_date":"2020-03-10T06:00:00.000Z"},
    'sources':[
        # Wall Street Journal
        {'id_dataset' : '5e2ef74e9516294390e810a9',
         'features' : ['text']},
        # ABC News Headlines
        {'id_dataset' : "5d8848e59516294231c59581",
         'features' : ["headline", "title"]},
        # USA Today Twitter
        {'id_dataset' : "5e32fd289516291e346c1726",
         'features' : ["text"]},
        # CNN News
        {'id_dataset' : "5d571b9e9516293a12ad4f5c",
         'features' : ["headline", "title"]}
    ],
    'aggregate_in_time_interval' : {
        'time_interval_size' : 60 * 60 * 24
    },
    'text_filter_search':['covid', 'coronavirus', 'ncov']
}
df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)

Above, we specified the following:

  • We selected our 4 sources to gather data from, specifically the feature columns containing the text
  • We specified that we want data from January 1st to today (March 10th)
  • We asked for the news to be aggregated into 24-hour groups, or observations
  • We filtered for news that mention 'covid', 'coronavirus' or 'ncov'

# Let's take a look
df.head(20)

Every observation is the aggregation of one day's news: source is the concatenation of all the news into a single string, and source_lst is a list of the individual news items for that interval.
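
A quick way to see that structure:

# Inspect one observation: the concatenated string and the list form
print(df['source'][0][:300])     # first 300 characters of that day's concatenated news
print(df['source_lst'][0][:3])   # first three individual items of that day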

The timestamp (returned as a unix timestamp) means the news happened in the interval from the prior timestamp up to (strictly before) the current one.
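
Since the timestamps come back as unix seconds, a quick conversion makes them easier to read:

# Convert the unix timestamps to readable UTC dates
df['date'] = pd.to_datetime(df['timestamp'], unit='s')
df[['date', 'source']].head()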

Now let’s get the mentions count for some countries of interest.

interest_countries = ['China', 'Iran', 'Korea', 'Italy', 'France', 'Germany', 'Spain']

df_news = df   # keep the news data under its own name for later steps
for country in interest_countries:
    df_news['count_news_' + country] = [len([text for text in daily_lst if country.lower() in text]) for daily_lst in df_news['source_lst']]

# Plot the daily mention counts per country (oldest to newest)
df_news.reindex(index=df_news.index[::-1]).plot(x='timestamp', y=[col for col in df_news.columns if 'count' in col], figsize=(17, 7), kind='area')

Let’s look at a word cloud from the last 20 days of news:

plt.figure()
plt.imshow(WordCloud(max_font_size=50, max_words=80, background_color="white").generate(' '.join([val for val in df['source'][0: 20]])), interpolation="bilinear")
plt.axis("off")
plt.show()

[Image: word clouds of COVID-19 news over successive periods; the code above generates the most recent one.]

Keywords in the early news such as "mystery", "investigating" and "dozens" contrast with later keywords such as "china", "coronavirus" and "global", and even more with the most recent ones: "new coronavirus", "coronavirus epidemic", "outbreak", "cruise", "novel coronavirus", etc.
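
To check that contrast quantitatively, here is a rough sketch that compares raw word frequencies at the two ends of the period (simple whitespace split, no stemming or stop-word removal):

from collections import Counter

# Crude frequency comparison: newest 20 observations vs oldest 20
# (df is sorted with the newest observations first)
recent_words = Counter(' '.join(df['source'].iloc[:20]).lower().split())
early_words = Counter(' '.join(df['source'].iloc[-20:]).lower().split())
print(recent_words.most_common(15))
print(early_words.most_common(15))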

Step 4. Gather Financial and other Indicators

For this we can start from the Dow Jones dataset and blend several others onto it, such as exchange rates (Yen, Euro, Pound), material prices (Crude Oil, Corn, Platinum, Tin) or stocks (Coca-Cola, Dow Jones).

action = 'API_getObservationsFromDataset'

parameters = {
    'token':'YOUR_TOKEN_HERE',
    'id_dataset':'5d4c14cd9516290b01c7d673',   # Dow Jones dataset (anchor)
    'aggregate_in_time_interval':{"output":"avg","empty_intervals":"impute","time_interval_size":86400},
    'blends':[
        # Yen vs USD
        {"id_blend":"5d2495169516290b5fd2cee3","restriction":"None","blend_type":"ts","drop_features":[]},
        # Euro vs USD
        {"id_blend":"5d4b3af1951629707cc1116b","restriction":"None","blend_type":"ts","drop_features":[]},
        # Pound vs USD
        {"id_blend":"5d4b3be1951629707cc11341","restriction":"None","blend_type":"ts","drop_features":[]},
        # Corn Price
        {"id_blend":"5d4c23b39516290b01c7feea","restriction":"None","blend_type":"ts","drop_features":[]},
        # Coca-Cola Price
        {"id_blend":"5d4c72399516290b02fe7359","restriction":"None","blend_type":"ts","drop_features":[]},
        # Platinum Price
        {"id_blend":"5d4ca1049516290b02fee837","restriction":"None","blend_type":"ts","drop_features":[]},
        # Tin Price
        {"id_blend":"5d4caa429516290b01c9dff0","restriction":"None","blend_type":"ts","drop_features":[]},
        # Crude Oil Price
        {"id_blend":"5d4c80bf9516290b01c8f6f9","restriction":"None","blend_type":"ts","drop_features":[]}
    ],
    'date_filter':{"start_date":"2020-01-01T06:00:00.000Z","end_date":"2020-03-10T06:00:00.000Z"}
}

df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)
df.shape
df.head()

So now we have a single dataset with daily price observations blended by time. If we want to compare them, it helps to normalize them between 0 and 1 so the patterns are easier to appreciate:

# Let's compress all numeric columns into the (0, 1) domain
df_compress = df.dropna(axis=0).select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).apply(lambda x: (x - x.min()) / (x.max() - x.min()))
df_compress['timestamp'] = df['timestamp']

# Now we select the columns that interest us
cols_of_interest = ['timestamp', 'PLATINUM_PRICE_price', 'CRUDE_OIL_PRICE_price', 'COCACOLA_PRICE_price', 'open', 'CORN_PRICE_price', 'TIN_PRICE_price']
df_compress = df_compress[cols_of_interest]
df_compress.rename(columns={'open':'DOW_JONES_price'}, inplace=True)

# And now let's plot them
fig, ax = plt.subplots(figsize=(17, 7))
df_compress.plot(x='timestamp', y=['PLATINUM_PRICE_price', 'CRUDE_OIL_PRICE_price', 'COCACOLA_PRICE_price', 'DOW_JONES_price', 'CORN_PRICE_price', 'TIN_PRICE_price'], ax=ax)

It’s interesting how almost all of them (except for the Tin price) follow a similar pattern.
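
One way to quantify that similarity is a correlation matrix over the normalized series:

# Pairwise correlation of the normalized price series
price_cols = [col for col in df_compress.columns if col != 'timestamp']
print(df_compress[price_cols].corr().round(2))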

Step 5. Blend all the data together

Now we will align the COVID-19 confirmed cases, the coronavirus news and the economic indicators into a single dataset blended by time.

To blend the data let’s upload the datasets we created to OpenBlender:

# First the News Dataset
action = 'API_createDataset'

parameters = { 
    'token':'YOUR_TOKEN_HERE',
    'name':'Coronavirus News',
    'description':'YOUR_DATASET_DESCRIPTION',
    'visibility':'private',
    'tags':[],
    'insert_observations':'on',
    'select_as_timestamp' : 'timestamp',
    'dataframe':df_news.to_json()
}

OpenBlender.call(action, parameters)

# And now the Financial Indicators
action = 'API_createDataset'

parameters = { 
    'token':'YOUR_TOKEN_HERE',
    'name':'Financial Indicators for COVID',
    'description':'YOUR_DATASET_DESCRIPTION',
    'visibility':'private',
    'tags':[],
    'insert_observations':'on',
    'select_as_timestamp' : 'timestamp',
    'dataframe':df_compress.to_json()
}

OpenBlender.call(action, parameters)

And now we just pull the initial COVID-19 dataset and blend the new datasets onto it, using the "id_blend" values of the datasets we just created.

action = 'API_getObservationsFromDataset'
# ANCHOR: 'COVID19 Confirmed Cases'
# BLENDS: 'Coronavirus News', 'Financial Indicators for COVID'

parameters = {
    'token':'YOUR_TOKEN_HERE',
    'id_dataset':'5e6ac97595162921fda18076',
    'date_filter':{
        "start_date":"2020-01-01T06:00:00.000Z",
        "end_date":"2020-03-11T06:00:00.000Z"},
    'filter_select' : {'feature' : 'countryregion', 'categories' : ['Italy']},
    'aggregate_in_time_interval':{"output":"avg","empty_intervals":"impute","time_interval_size":86400},
    'blends':[
        # Coronavirus News
        {"id_blend":"5e6b19ce95162921fda1b6a7","restriction":"None","blend_type":"ts","drop_features":[]},
        # Financial Indicators for COVID
        {"id_blend":"5e6b1aa095162921fda1b751","restriction":"None","blend_type":"ts","drop_features":[]}
    ]
}

df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)

Above we selected the observations for Italy, aggregated them by day (86400-second intervals) and blended on the new data.

df.head()

And now we have a single dataset with all the daily information blended by time!

There is an enormous range of directions to go with this.
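
For instance, a first sketch could check how the daily news mentions track the case counts (the exact column names after blending are assumptions; inspect df.columns first):

# Example direction: correlate daily news mentions with confirmed cases
# (column names after blending are assumptions; verify with df.columns)
cols = [col for col in df.columns if 'count_news' in col or 'confirmed' in col.lower()]
print(df[cols].corr().round(2))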