Worked Examples of Correlation Calculations [TimeWeb]


Correlation Coefficients: Examples

Spearman's Rank Correlation Coefficient

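In calculating this coefficient, we use the Greek letter 'rho' (ρ) or r.
The formula used to calculate this coefficient is:

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where d is the difference between the two ranks given to each item and n is the number of items in the sample.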

To illustrate this, consider the following worked example:
Researchers at the European Centre for Road Safety Testing are trying to find out how the age of cars affects their braking capability. They test a group of ten cars of differing ages and find out the minimum stopping distances that the cars can achieve. The results are set out in the table below:

Table 1: Car ages and stopping distances

Car Age (months) Minimum Stopping at 40 kph (metres)
A 9 28.4
B 15 29.3
C 24 37.6
D 30 36.2
E 38 36.5
F 46 35.3
G 53 36.2
H 60 44.1
I 64 44.8
J 76 47.2

These figures form the basis for the scatter diagram, below, which shows a reasonably strong positive correlation - the older the car, the longer the stopping distance.


Graph 1: Car age and Stopping distance (data from Table 1 above)

To process this information we must first rank the ten cars, once according to their age and once according to their ability to stop. It is then possible to process these ranks.

Table 2: Ranked data from Table 1 above

Car Age (months) Minimum Stopping at 40 kph (metres) Age rank Stopping rank
A 9 28.4 1 1
B 15 29.3 2 2
C 24 37.6 3 7
D 30 36.2 4 4.5
E 38 36.5 5 6
F 46 35.3 6 3
G 53 36.2 7 4.5
H 60 44.1 8 8
I 64 44.8 9 9
J 76 47.2 10 10

Notice that the ranking is done here in such a way that the youngest car and the best stopping performance are rated top and vice versa. There is no strict rule here other than the need to be consistent in your rankings. Notice also that two of the cars tested had the same stopping performance. They occupy 'tied ranks' and must share, in this case, ranks 4 and 5. This means they are each ranked as 4.5, which is the mean of the two ranking places. It is important to remember that this works regardless of the number of items sharing tied ranks. For instance, if five items shared ranks 5, 6, 7, 8 and 9, then they would each be ranked 7 - the mean of the tied ranks.
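The tied-rank rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name average_ranks is hypothetical, and it assumes that rank 1 goes to the smallest value (the shortest stopping distance).

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward.

    Tied values each receive the mean of the rank places they occupy.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend 'end' across the run of values equal to the one at 'pos'.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Mean of the 1-based rank places pos+1 .. end+1.
        shared = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Stopping distances for cars A to J, taken from Table 1.
stopping = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]
stopping_ranks = average_ranks(stopping)
print(stopping_ranks)
```

Cars D and G (both 36.2 metres) come out with rank 4.5 each, matching the stopping ranks in Table 2.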

Now we can start to process these ranks to produce the following table:

Table 3: Differential analysis of data from Table 2

Car Age (mths) Stopping distance Age rank Stopping rank d d²
A 9 28.4 1 1 0 0
B 15 29.3 2 2 0 0
C 24 37.6 3 7 4 16
D 30 36.2 4 4.5 0.5 0.25
E 38 36.5 5 6 1 1
F 46 35.3 6 3 -3 9
G 53 36.2 7 4.5 -2.5 6.25
H 60 44.1 8 8 0 0
I 64 44.8 9 9 0 0
J 76 47.2 10 10 0 0
Σd² 32.5

Note that the two extra columns introduced into the new table are Column 6, 'd', the difference between stopping distance rank and age rank; and Column 7, 'd²', the Column 6 entries squared. These squared figures are summed at the foot of Column 7.
Calculation of Spearman Rank Correlation Coefficient (r) is as follows, with the number in the sample (n) = 10:

$$
\begin{aligned}
r &= 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \\
  &= 1 - \frac{6 \times 32.5}{10(10^2 - 1)} \\
  &= 1 - \frac{195}{10 \times 99} \\
  &= 1 - 0.197 \\
  &= 0.803
\end{aligned}
$$
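The same calculation can be checked in Python. The sketch below simply applies the formula to the two rank columns of Table 3; the function name spearman_from_ranks is hypothetical, and the ranks are typed in by hand rather than computed.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman coefficient from two lists of ranks, using 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    # d is the difference between the two ranks of each item.
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Rank columns for cars A to J, copied from Table 3.
age_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
stopping_rank = [1, 2, 7, 4.5, 6, 3, 4.5, 8, 9, 10]

r = spearman_from_ranks(age_rank, stopping_rank)
print(round(r, 3))  # 0.803
```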

What does this tell us? When interpreting the Spearman Rank Correlation Coefficient, it is usually enough to say that:

  • for values of r of 0.9 to 1, the correlation is very strong.
  • for values between 0.7 and 0.9, correlation is strong.
  • and for values between 0.5 and 0.7, correlation is moderate.

This is the case whether r is positive or negative.
In our case of car ages and stopping distance performance, we can say that there is a strong correlation between the two variables.
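The guideline bands above can be written as a small lookup. This is only a sketch: the function name describe_strength is hypothetical, the label returned below 0.5 is an assumption (the guideline does not name that range), and a value falling exactly on a boundary is placed in the stronger band.

```python
def describe_strength(r):
    """Map a correlation coefficient to the guideline labels, ignoring its sign."""
    size = abs(r)
    if size >= 0.9:
        return "very strong"
    if size >= 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    # Assumption: the guideline gives no label for this range.
    return "below the guideline bands"


print(describe_strength(0.803))   # strong
print(describe_strength(-0.803))  # strong (the sign does not affect the label)
```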

Pearson's or Product-Moment Correlation Coefficient

The Pearson Correlation Coefficient is denoted by the symbol r. Its formula is based on the standard deviations of the x-values and the y-values:

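$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right)\left(n \sum y^2 - \left(\sum y\right)^2\right)}}$$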

Going back to the original data we recorded from the European Centre for Road Safety Testing, the calculation needed for us to work out the Product-Moment Correlation Coefficient is best set out as in the table that follows.

Note that in the table below,
x = age of car
y = stopping distance

From this, the other notation should be obvious.

x y x² y² xy
9 28.4 81 806.56 255.6
15 29.3 225 858.49 439.5
24 37.6 576 1413.76 902.4
30 36.2 900 1310.44 1086
38 36.5 1444 1332.25 1387
46 35.3 2116 1246.09 1623.8
53 36.2 2809 1310.44 1918.6
60 44.1 3600 1944.81 2646
64 44.8 4096 2007.04 2867.2
76 47.2 5776 2227.84 3587.2
Totals 415 375.6 21623 14457.72 16713.3

x-bar = 415/10 = 41.5
y-bar = 375.6/10 = 37.56

$$
\begin{aligned}
r &= \frac{10 \times 16713.3 - 415 \times 375.6}{\sqrt{(10 \times 21623 - 415^2)(10 \times 14457.72 - 375.6^2)}} \\
  &= \frac{11259}{\sqrt{44005 \times 3501.84}} \\
  &= \frac{11259}{12413.64} \\
  &= 0.91
\end{aligned}
$$
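The column totals and the final figure can be checked with a short Python sketch. It follows the same sums-based layout as the table (Σx, Σy, Σx², Σy², Σxy); the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation computed from the column totals."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# x = age of car (months), y = stopping distance (metres), cars A to J.
age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
stopping = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

r = pearson_r(age, stopping)
print(round(r, 2))  # 0.91
```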

What does this tell us?

To interpret the value of r you need to follow these guidelines:

  • r always lies in the range -1 to +1. If it lies close to either of these two values, then the dispersion of the scattergram points is small and therefore a strong correlation exists between the two variables.
  • If r equals exactly -1 or +1, the correlation is perfect and all the points on the scattergram lie on the line of best fit (otherwise known as the regression line). If r is close to 0, the dispersion is large and the variables are uncorrelated. The positive or negative sign on the value of r indicates positive or negative correlation.

So in the above case, there is evidence of strong positive correlation between stopping distance and age of car; in other words, the older the car, the longer the distance we could expect it to take to stop.
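To see the two extreme cases from the guidelines in action, the sketch below feeds the coefficient some made-up points that lie exactly on a straight line. The data and the function name pearson_r are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4]
rising = [3, 5, 7, 9]    # y = 2x + 1: every point on a rising line
falling = [9, 7, 5, 3]   # y = 11 - 2x: every point on a falling line

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
print(r_rising, r_falling)
```

The first result is +1 and the second is -1 (up to floating-point rounding), the perfect positive and perfect negative cases described above.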

 

Illustration:

Let's say that we want to track the progress of a group of new employees of a large service organisation. We think we can judge the effectiveness of our induction and initial training scheme by analysing employee competence in week one, in week four and at the end of six months.

Let's say that Human Resource managers in the organisation have been urging the company to commit more resources to induction and basic training. The company now wishes to know which of the two assessments - the new employee's skills on entry or after week four - provides a better guide to the employee's performance after six months. Although there is a small sample here, let's assume that it is accurate.

The raw data is given in the table below:

Name Skills on entry (% score) Skills at week 4 (% score) Skills at 6 mths (% score)
ab 75 75 75
bc 72 69 76
cd 82 76 83
de 78 77 65
ef 86 79 85
fg 76 65 79
gh 86 82 65
hi 89 78 75
ij 83 70 80
jk 65 71 70

Copy this information onto a fresh Excel worksheet, putting the names in Column A, the entry test results in Column B, the week four test marks in Column C, and the six month test scores in Column D.

When you have entered the information, select the three number columns (do not include any cells with words in them). Go to the Data Analysis option on the Tools menu, select from that Data Analysis menu the item Correlation (note that if the Data Analysis option is not on the Tools menu you have to add it in).

When you get the Correlation menu, enter in the first Input Range box the column of cells containing the dependent variables you wish to analyze (D3 to D12 if your spreadsheet looks like TimeWeb's). Next, enter into the second input box the column of cells that contain the independent variables (B3 to B12, again if your sheet resembles TimeWeb's).

Then click the mouse pointer in the circle to the left of the Output Range label (unless there is a black dot in it already), and click the left mouse button in the Output Range box. Then enter the name of the cell where you want the top left corner of the correlation table to appear (e.g., $A$14). Then click OK.

After a second or two, the Correlation Table should appear giving you the correlation between all the different pairs of data. We are interested in the correlation between Column B (the first column in the Table) and Column D (the third column in the Table). The correlation between Column C (the second column in the Table) and Column D can be approached in the same way.

Which of these two is the better predictor of success according to this study? How reliable is it?

Expected Answer:

The correlation between the Entry Mark and the Final Mark is 0.23; the correlation between the four week test and the Final Mark, on the figures tabulated above, is -0.28. Thus the entry test has a weak positive correlation with the Final (6 month) Test, while the Four Week Test has a weak negative correlation of slightly greater size. However, both figures are so low that the correlation is minimal. The Entry test accounts for about 5 per cent of the variation in the Six Month Mark. This figure is obtained by squaring r (the R-Squared result) and expressing it as a percentage.
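For readers without the Excel add-in, the same figures can be reproduced with a short Python sketch. It assumes the scores exactly as printed in the table above; the function and variable names are hypothetical. On those figures it returns roughly 0.23 for the entry test and roughly -0.28 for the week four test (note the negative sign on the second figure), and an R-Squared of about 5 per cent for the entry test.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# % scores for employees ab to jk, as printed in the table above.
entry = [75, 72, 82, 78, 86, 76, 86, 89, 83, 65]
week4 = [75, 69, 76, 77, 79, 65, 82, 78, 70, 71]
six_months = [75, 76, 83, 65, 85, 79, 65, 75, 80, 70]

r_entry = pearson_r(entry, six_months)
r_week4 = pearson_r(week4, six_months)

# R-Squared is the square of r, shown here as a percentage.
print(f"entry vs six months:  r = {r_entry:.2f}, R-Squared = {r_entry ** 2:.1%}")
print(f"week 4 vs six months: r = {r_week4:.2f}, R-Squared = {r_week4 ** 2:.1%}")
```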

There is an Excel spreadsheet available with a possible solution. Did you manage to get similar results to these?

Beware!

It's vital to remember that a correlation, even a very strong one, does not mean we can make a conclusion about causation. If, for example, we find a very high correlation between the weight of a baby at birth and educational achievement at age 25, we may make some predictions about the numbers of people staying on at university to study for post-graduate qualifications. Or we may urge mothers-to-be to take steps to boost the weight of the unborn baby, because the heavier their baby the higher their baby's educational potential, but we should be aware that the correlation, in itself, is no proof of these assertions.

This is a really important principle: correlation is not necessarily proof of causation. It indicates a relationship which may be based on cause and effect, but then again, it may not be. If weight at birth is a major cause of academic achievement, then we can expect that variations in birth weight will cause changes in achievement. The reverse, however, is not necessarily true. If any two variables are correlated, we cannot automatically assume that one is the cause of the other.

The point about causation is perhaps best illustrated using the example of AIDS.

A very high correlation exists between HIV infection and cases of AIDS. Taken on its own, that correlation would not prove that HIV causes AIDS, and some people have argued vehemently that something else is the real cause and that research money should be directed there instead.

What settles the question is evidence beyond the correlation itself. HIV infection consistently precedes the onset of AIDS, and treatments that suppress the virus prevent the disease from developing. It is this additional evidence, not the correlation alone, that establishes HIV as the cause of AIDS.

Why not practice your understanding of the meaning of correlation coefficients with a worksheet?