Covid-19 EDA: Man vs Disease

A deep dive into the war against the pandemic.

Pawan Bhandarkar
8 min read · Feb 6, 2021
Image Source: CNN

Introduction

It has been over a full year since we started this difficult battle against Covid-19, and it has cost us dearly. These fast-spreading microscopic creatures, too small to even be seen with the naked eye, have somehow shaken the entire world and changed life as we know it. People all over the world are racing against the clock to fight this pandemic. We, the bright minds of the internet and aspiring Data Scientists, can do our part by using our skills to generate insights that can make the lives of those affected a little better.

Every day, an abundance of data is made publicly available on the internet, with each dataset covering different statistics surrounding the Covid-19 pandemic. In this article (and by extension, the Kaggle Notebook), I aim to bring two such datasets together and visualise the war against Covid-19.

If you like this article and are interested in taking a closer look at the data, I encourage you to check out my Kaggle notebook for the code used to generate the plots you see here.

Data Involved:

The only reason I decided to take up this analytics project was two intriguing datasets I found on Kaggle:

Covid-19 Global Dataset contains data scraped from worldometers.info and presented in two CSV files. The first file stores day-to-day information about active cases, death rates and recovery statistics while the other summarises this data by country.

Covid-19 World Vaccination Progress Dataset contains data collected daily from the Our World in Data GitHub Repository for Covid-19.

At the time, many talented Kagglers before me had already published insightful notebooks analysing the datasets separately. But none had worked on bringing them together and generating a unified view of what I like to call “Man vs Disease.”

Technical Background

In this part of the article, I will give a brief overview of some of the technical work involved in generating these visualisations. Unless you are interested in programming with Python and the subject of Data Science, feel free to skip this section and jump straight into the visualisations.

Data Preprocessing

Any reputable data scientist will tell you that the most critical step in any data-driven project is data preprocessing. By taking these datasets from Kaggle instead of scraping them ourselves, we cut down on a lot of the hard work that usually goes into making data available for analysis.

However, we still need to perform some minor cleanup to make the data suitable (and convenient!) for our use case. As you will see in the notebook, I perform the following steps to prepare the data:

  1. Resolve conflicting country names: For example, one dataset stores the American data under “The United States” while the other stores it as “USA”, and the same is true for the UK. We also have some inconsistent casing to resolve, as in the case of “Isle Of Man” and “Isle of Man”.
  2. Data summarisation: As mentioned in the Data section above, we have both daily and summary data available for the Covid-19 dataset, but only the daily data for the Vaccination dataset. We will generate the vaccination summaries ourselves so that they can be visualised together with the Covid-19 summaries.

Additionally, we will generate some new features from the existing ones, such as log-scaled values that make the graphs easier to interpret and percentage statistics that express the available features as a share of each country's population.
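The two steps above can be sketched with pandas as follows. This is a minimal illustration rather than the notebook's actual code: the alias table, the helper names and the column names (`country`, `total_vaccinations`, `people_vaccinated`) are assumptions, as is the idea that the vaccination counts are cumulative.

```python
import pandas as pd

# Hypothetical alias table; extend it with whatever conflicts the data shows.
ALIASES = {"the united states": "USA", "united kingdom": "UK"}


def normalise_countries(names: pd.Series, canonical: pd.Series) -> pd.Series:
    """Map country names onto the spellings used in `canonical`."""
    # Case-insensitive lookup so "Isle Of Man" resolves to "Isle of Man".
    lookup = {name.casefold(): name for name in canonical.unique()}
    lookup.update(ALIASES)

    def fix(name: str) -> str:
        cleaned = name.strip()
        return lookup.get(cleaned.casefold(), cleaned)

    return names.map(fix)


def summarise_vaccinations(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse daily rows into one row per country.

    The counts are assumed to be cumulative, so the per-country maximum
    is the latest reported total (missing values are skipped).
    """
    return daily.groupby("country", as_index=False).agg(
        total_vaccinations=("total_vaccinations", "max"),
        people_vaccinated=("people_vaccinated", "max"),
    )


# Tiny made-up sample standing in for the two Kaggle datasets.
covid_countries = pd.Series(["USA", "Isle of Man", "India"])
daily_vaccinations = pd.DataFrame({
    "country": ["The United States", "The United States", "Isle Of Man"],
    "total_vaccinations": [100.0, 250.0, 10.0],
    "people_vaccinated": [90.0, None, 10.0],
})
daily_vaccinations["country"] = normalise_countries(
    daily_vaccinations["country"], covid_countries
)
vaccination_summary = summarise_vaccinations(daily_vaccinations)
print(vaccination_summary)
```

Using one dataset's spellings as the canonical form keeps the alias table short, since only genuinely different names (not casing differences) need an explicit entry.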
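As a rough sketch of what such derived features might look like, the helper below adds a log-scaled column and a percentage-of-population column. The function name, the `population` column and the choice of `log10(x + 1)` are assumptions made for illustration; the notebook may use a different transform.

```python
import numpy as np
import pandas as pd


def add_derived_features(summary: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `summary` with log-scaled and percentage columns."""
    out = summary.copy()
    # log10(x + 1) keeps zero counts finite while compressing large values.
    out[f"{column}_log"] = np.log10(out[column] + 1)
    # Express the metric as a percentage of the country's population.
    out[f"{column}_pct"] = 100 * out[column] / out["population"]
    return out


# Made-up numbers, purely to show the shape of the result.
summary = pd.DataFrame({
    "country": ["India", "Iceland"],
    "population": [1_000_000, 1_000],
    "total_cases": [999, 50],
})
features = add_derived_features(summary, "total_cases")
print(features)
```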

Technology Stack

In this analysis, I used Python as the primary programming language because of its rich palette of tools that make data analysis a cinch. Some of the Python packages I used are:

  1. Plotly is an extremely versatile library of tools for generating interactive plots that are easy to interpret and customise.
  2. NumPy is a popular library used for array manipulation and vector operations. It is used extensively across Python projects that require scientific computing.
  3. Pandas is another library for data science that is just as popular as NumPy. It provides easy-to-use data structures and functions to manipulate structured data.

These tools are well documented and come with several examples that make it easy to start using them. You can check out the linked documentation pages for more information.

Visualisations

This section is the crux of the project, where we take the cleaned and processed data and turn it into intuitive graphs. Note that since Medium does not support interactive graphing, I have uploaded screenshots from my Kaggle Notebook. To get the full Plotly experience, you can check out the notebook linked in the introduction.

Visualising Summaries

Let’s kick things off by visualising the summaries we computed back in the Data Preprocessing section. Since the data is summarised by country, the visualisations can get quite skewed when we sort the values. So for the sake of comprehension, we will only compare the top 20 countries in each section.

At first glance, it might seem that French Polynesia is suffering the worst, since almost 70% of those tested for Covid-19 came back positive. However, we must consider the possibility that only those exhibiting severe symptoms were tested in the first place. I will leave this investigation to the readers.

Above is perhaps the scariest graph in my analysis and the one that is the largest cause for concern. These are the numbers that we are fighting to reduce. As we have heard from multiple sources online, these numbers mainly represent the old and those with pre-existing medical conditions fighting for their lives against this terrible virus.

We see that the USA and China are leading in terms of the total number of vaccinations administered. Note that this is an indicator of people who have received at least the first dose, since the vaccination process generally consists of multiple doses administered over time.

The above plot might seem confusing at first, but it makes sense once we consider the people vaccinated as a percentage of the country’s population. The top 5 countries graphed above have smaller populations than those that come after.
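Restricting a summary to its top 20 rows can be sketched with pandas' `nlargest`. The helper name `top_n` and the `total_deaths` column are assumptions for illustration, and the data is synthetic.

```python
import pandas as pd


def top_n(summary: pd.DataFrame, metric: str, n: int = 20) -> pd.DataFrame:
    """Keep the n countries with the largest value of `metric`."""
    # nlargest returns the rows already sorted in descending order.
    return summary.nlargest(n, metric).reset_index(drop=True)


# Synthetic summary with 25 countries so the cut-off is visible.
summary = pd.DataFrame({
    "country": [f"Country {i}" for i in range(25)],
    "total_deaths": list(range(25)),
})
top20 = top_n(summary, "total_deaths")
print(top20.head())
```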

Visualising Global Statistics

Plotly comes with some quick and easy ways to plot global data on a world map, which helps us conveniently understand the metrics for the countries of our interest. If you visit the notebook, you can hover the mouse cursor over each country and get a complete breakdown of Active Cases, Total Deaths and Total Recovered.

The above plots show how the virus has impacted the world on two separate scales: Country and Continent. These are the numbers highly discussed in news articles, instant messaging apps and on social media. We see that overall, the USA and Europe are the worst-hit regions, with Asia close behind.

Visualising Vaccination

We have seen how the virus has manifested in different parts of the world. Now in this section, we will visualise how the vaccination has been progressing since its inception in late 2020.

Different countries use different combinations of the available vaccines.

Plotly also enables us to visualise how a metric changes over time, in the form of animations. If this looks interesting as a GIF, you will be pleased to know that in the notebook, you can stop at each time step and examine the values for each country individually by hovering your mouse cursor.

The following line chart is another perspective of the same data. For the sake of readability, we only plot the top-10 countries here.

Visualising the race against Covid

In this final section, we will emphasise the premise of this analysis: Man vs Disease. We will combine both of these datasets to generate plots that directly contrast the virus spreading and the efforts made to contain it.

The above graphs show how, slowly but surely, the vaccines are being administered in increasingly large numbers each day. If we look carefully, we can also identify a slight downward trend in the number of new cases each day as the vaccinations progress. Humanity is on its way to victory!
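Combining the two datasets could look roughly like the pandas sketch below. It assumes both daily tables share `country` and `date` columns after the name cleanup, and that the metrics are called `daily_new_cases` and `daily_vaccinations`; these names, the helper and the sample numbers are illustrative assumptions.

```python
import pandas as pd


def combine_daily(cases: pd.DataFrame, vaccinations: pd.DataFrame) -> pd.DataFrame:
    """Join the two daily tables and total both metrics per date."""
    # The inner join keeps only country/date pairs present in both datasets.
    merged = cases.merge(vaccinations, on=["country", "date"], how="inner")
    return merged.groupby("date", as_index=False)[
        ["daily_new_cases", "daily_vaccinations"]
    ].sum()


# Made-up sample rows for two countries over two days.
cases = pd.DataFrame({
    "country": ["India", "India", "Brazil", "Brazil"],
    "date": ["2021-01-16", "2021-01-17", "2021-01-16", "2021-01-17"],
    "daily_new_cases": [100, 90, 50, 40],
})
vaccinations = pd.DataFrame({
    "country": ["India", "India", "Brazil"],
    "date": ["2021-01-16", "2021-01-17", "2021-01-17"],
    "daily_vaccinations": [10, 20, 5],
})
global_daily = combine_daily(cases, vaccinations)
print(global_daily)
```

An inner join drops days on which a country reported cases but no vaccinations; a left join would keep them at the cost of missing values in the vaccination column.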

Conclusion

COVID-19 has taken a heavy toll on humankind. We have lost far too many people and suffered too much for too long. Now is the time to fight back. Let 2021 be the year we reclaim what 2020 took from us. Regardless of what people might say, always wear a mask when out in public and maintain social distancing. DO NOT give in to hearsay! Only when all of the graphs plotted in the innumerable notebooks posted by the talented people on the internet point heavily in our favour, having dwarfed the damage this pandemic has already done, will we be able to call it a victory.

TOGETHER, WE CAN!

Pawan Bhandarkar

23 | A sucker for strongly typed code. I tweet about Python, GraphQL & Typescript with the intent of making you a better developer. 日本語勉強中です。