Covid-19 EDA: Man vs Disease

A deep dive into the war against the pandemic.

Image Source: CNN


Every day, an abundance of data about the Covid-19 pandemic is made publicly available on the internet, with each dataset covering different statistics. In this article (and by extension, the Kaggle Notebook), I aim to bring two such datasets together and visualise the war against Covid-19.

If you like this article and are interested in taking a closer look at the data, I encourage you to check out my Kaggle notebook for the code used to generate the plots you see here.

Data Involved:

Covid-19 Global Dataset contains data scraped from and presented in two CSV files. The first file stores day-to-day information about active cases, death rates and recovery statistics while the other summarises this data by country.

Covid-19 World Vaccination Progress Dataset contains data collected daily from the Our World in Data Github Repository for Covid-19.

At the time, many talented Kagglers before me had already published insightful notebooks analysing the datasets separately. But none had brought them together to generate a unified view of what I like to call “Man vs Disease.”

Technical Background

Data Preprocessing

The datasets still need some minor cleanup to make them suitable (and convenient!) for our use case. As you will see in the notebook, I perform the following steps to prepare the data:

  1. Resolve conflicting country names: For example, one dataset stores the American data under “The United States” while the other stores it as “USA”, and the same goes for the UK. We also need to resolve some inconsistent casing, as in the case of “Isle Of Man” and “Isle of Man”.
  2. Data summarization: As mentioned in the Data Involved section above, we have both daily and summary data available for the Covid-19 dataset, but only the daily data for the Vaccination dataset. We will generate the vaccination summaries ourselves so that they can be visualized together with the Covid-19 summaries.
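The two steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`country`, `total_vaccinations`, `people_vaccinated`), the `NAME_FIXES` mapping and the "United Kingdom" spelling are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical mapping from variant spellings to one canonical name.
# Only "The United States"/"USA" and the "Isle Of Man" casing come from
# the article; the UK entry is an assumed example.
NAME_FIXES = {
    "The United States": "USA",
    "United Kingdom": "UK",
    "Isle Of Man": "Isle of Man",
}

def harmonise_countries(df: pd.DataFrame, column: str = "country") -> pd.DataFrame:
    """Return a copy with country names trimmed and mapped to one spelling."""
    out = df.copy()
    out[column] = out[column].str.strip().replace(NAME_FIXES)
    return out

def summarise_vaccinations(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse daily rows into one summary row per country.

    The columns are assumed to be cumulative, so the maximum per country
    is also the most recent reported value.
    """
    return daily.groupby("country", as_index=False).agg(
        total_vaccinations=("total_vaccinations", "max"),
        people_vaccinated=("people_vaccinated", "max"),
    )

# Toy daily data standing in for the vaccination CSV.
daily = pd.DataFrame({
    "country": ["The United States", "The United States", "Isle Of Man"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01"],
    "total_vaccinations": [100, 250, 10],
    "people_vaccinated": [90, 200, 10],
})

summary = summarise_vaccinations(harmonise_countries(daily))
print(summary)
```

Applying the same `harmonise_countries` step to both datasets before joining them means a country is not silently dropped just because its name is spelled differently.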

We will also generate some new features from the existing ones, such as log-scaled values that make the graphs easier to interpret and percentage statistics of the available features (as a share of the population).
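As a rough sketch of what such derived features could look like, the snippet below adds one log-scaled column and one percentage column. The column names and the choice of `log1p` (rather than, say, a base-10 logarithm) are assumptions for illustration; the notebook may do this differently.

```python
import numpy as np
import pandas as pd

def add_derived_features(summary: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a log-scaled and a percentage feature added."""
    out = summary.copy()
    # log1p(x) = log(1 + x) stays finite for countries with zero cases.
    out["log_total_cases"] = np.log1p(out["total_cases"])
    # Share of the population that has been vaccinated, in percent.
    out["pct_vaccinated"] = 100 * out["people_vaccinated"] / out["population"]
    return out

# Toy summary table with assumed column names.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "total_cases": [0, 999],
    "people_vaccinated": [50, 10],
    "population": [200, 1000],
})

features = add_derived_features(toy)
print(features[["country", "log_total_cases", "pct_vaccinated"]])
```

Log scaling compresses the gap between very large and very small countries, which is why it tends to make colour scales on maps easier to read.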

Technology Stack

  1. Plotly is an extremely versatile library of tools for generating interactive plots that are easy to interpret and customise.
  2. NumPy is a popular library used for array manipulation and vector operations. It is used extensively across Python projects that require scientific computing.
  3. Pandas is another library for data science that is just as popular as NumPy. It provides easy-to-use data structures and functions to manipulate structured data.

These tools are well documented and come with several examples that make it easy to start using them. You can check out the linked documentation pages for more information.


Visualising Summaries

At first glance, it might seem that French Polynesia is suffering the worst, since almost 70% of those tested for Covid-19 came back positive. However, we must consider the possibility that only those exhibiting severe symptoms were tested in the first place. I will leave this investigation to the readers.

Above is perhaps the scariest graph in my analysis and the one that gives the greatest cause for concern. These are the numbers that we are fighting to reduce. As we have heard from multiple sources online, these numbers mainly represent the old and those with pre-existing medical conditions fighting for their lives against this terrible virus.

We see that the USA and China are leading in terms of the total number of vaccinations administered. Note that this is an indicator of people who have received at least the first dose since the vaccination process generally consists of multiple doses administered over time.

The above plot might seem confusing at first, but it makes sense once we consider that it shows the people vaccinated as a percentage of each country’s population. The top 5 countries graphed above have smaller populations than those that come after.

Visualising Global Statistics

The above plots show how the virus has impacted the world on two separate scales: country and continent. These are the numbers widely discussed in news articles, instant messaging apps and on social media. We see that overall, the USA and Europe are the worst-hit regions, with Asia close behind.

Visualising Vaccination

Different countries use different combinations of the available vaccines.

Plotly also enables us to visualize how a metric changes over time, in the form of animations. If this looks interesting as a GIF, you will be pleased to know that in the notebook, you can stop at each time step and examine the values for each country individually by hovering your mouse cursor over it.

The following line chart is another perspective of the same data. For the sake of readability, we only plot the top-10 countries here.
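Restricting a chart to the top countries could be done along these lines. This is only a sketch: the helper `keep_top_n` and the column names are hypothetical, and I rank countries by their peak value, which may not be the ranking used in the notebook.

```python
import pandas as pd

def keep_top_n(daily: pd.DataFrame, value: str = "people_vaccinated", n: int = 10) -> pd.DataFrame:
    """Keep only the rows of the n countries with the highest peak value."""
    peak = daily.groupby("country")[value].max()
    top_countries = peak.nlargest(n).index
    return daily[daily["country"].isin(top_countries)]

# Toy data: three countries, two days each.
daily = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 3),
    "people_vaccinated": [1, 5, 2, 3, 10, 40],
})

top2 = keep_top_n(daily, n=2)
print(top2)
```

The filtered frame can then be passed to a line plot such as `plotly.express.line` with `x="date"`, `y="people_vaccinated"` and `color="country"`.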

Visualising the race against Covid

The above graphs show how, slowly but surely, the vaccines are being administered in increasingly large numbers each day. If we look carefully, we can also identify a slight downward trend in the number of new cases each day as the vaccinations progress. Humanity is on its way to victory!
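To put new cases and vaccinations on the same timeline, the two daily datasets have to be joined on country and date. The snippet below is a minimal sketch with made-up numbers and assumed column names (`daily_new_cases`, `daily_vaccinations`), not the notebook's actual code.

```python
import pandas as pd

# Toy stand-ins for the two daily datasets.
cases = pd.DataFrame({
    "country": ["USA", "USA", "USA", "India", "India"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-02", "2021-01-03"]
    ),
    "daily_new_cases": [300, 280, 250, 100, 90],
})
vaccinations = pd.DataFrame({
    "country": ["USA", "USA", "India"],
    "date": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-03"]),
    "daily_vaccinations": [1000, 1500, 200],
})

# An inner join keeps only the country/date pairs present in both datasets.
race = cases.merge(vaccinations, on=["country", "date"], how="inner")

# Sum over countries to get one worldwide row per day.
world = race.groupby("date", as_index=False)[
    ["daily_new_cases", "daily_vaccinations"]
].sum()
print(world)
```

Note that the inner join drops days on which a country reported cases but no vaccinations, so the worldwide totals here only cover the overlap between the two datasets.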



I'm interested in Data Science, video games and art! I'm looking forward to travelling the world, eating food and writing about it.
