Plotting multiple grouped scatter graphs with regression lines on the same axes

34 views Asked by At

I'm trying to create a data vis which is essentially a scattergraph.

A snip of the data looks like this:

Location Year A B
Afghanistan 1990 48.3973 1377.85863
Afghanistan 2020 65.432 620.40751
Albania 1990 76.4334 32.87494
Albania 2020 79.6756 8.27645
Algeria 1990 68.5167 205.81979
Algeria 2020 75.9125 77.69479
Angola 1990 82.6756 1101.50579
Angola 2020 83.1922 221.90649
Antigua and Barbuda 1990 73.6497 58.0267
Antigua and Barbuda 2020 80.7452 21.24795
Argentina 1990 75.9699 87.81427
Argentina 2020 81.3155 44.90008

That is, each country has a pair of points. I'd like to plot A against B, and then ideally have each country's pair of points (in a perfect world) in a different colour (ideally with labels, showing which country they are and which year).

I then would like to plot a regression line of all the 1990 points, and a regression line on the same axes of the 2020 points.

In Excel would be ideal, but I could potentially use something else if it would be more appropriate. Any help would be very gratefully received.

1

There are 1 answers

0
Tino D On

Since you are open to other options, I thought I can answer your question in python, if that's ok. Here's how you can do it with seaborn, a nice data visualization package that has a lot of helpful functions for statistics and ML.

First, to get you started with python, I would recommend anaconda, as their installation already includes a tonne of nice packages which are all compatible with each other.

After getting acquainted, you can import the following packages for the example:

import seaborn as sns # nice plotting
%matplotlib notebook
import matplotlib.pyplot as plt # package also for plottingm the above line is a magic command to define the backend
import pandas as pd # for data handling

I defined the data without Afghanistan for now, as it seems to be a bad outlier that negatively affects the regression models, I'll show that at the end. For now, here are the results without AFG:

data = {
    'Location': ['Albania', 'Albania', 'Algeria', 'Algeria', 'Angola', 'Angola', 'Antigua and Barbuda', 'Antigua and Barbuda', 'Argentina', 'Argentina'],
    'Year': [1990, 2020, 1990, 2020, 1990, 2020, 1990, 2020, 1990, 2020],
    'A': [76.4334, 79.6756, 68.5167, 75.9125, 82.6756, 83.1922, 73.6497, 80.7452, 75.9699, 81.3155],
    'B': [32.87494, 8.27645, 205.81979, 77.69479, 1101.50579, 221.90649, 58.0267, 21.24795, 87.81427, 44.90008]
} # define the data without afghanistan
df = pd.DataFrame(data) # generate dataframe
plt.figure() # figure
sns.scatterplot(data=df, x='A', y='B', hue='Location', style='Year') # scatter plot the data, changing style for year and colour with country
sns.regplot(data=df[df['Year'] == 1990], x='A', y='B', scatter=False, color='blue') # plot model for 1990
sns.regplot(data=df[df['Year'] == 2020], x='A', y='B', scatter=False, color='orange') # plot model for 2000
plt.grid() # add grid
plt.legend(loc = "lower right", ncols = 3) # and legend

This leads to the following plot:

regression plot without AFG

Of course, if you have the data in a better way, you can use pd.read_excel() or pd.read_csv() to get the data from saved files.

Check the data with Afghanistan here:

Plots with AFG

As you see, the regression primarily captures the variance found between AFG and the other groups. This can be solves, for example, by using robust regression instead.

I hope this answer encourages you to start learning python and doing this. The advantage would be that there are a lot of resources these days to get started and to even master python. Also, chatGPT can help you get started (you will come back complaining about it anyway ;) )