TimeWeb
Illustration


Data Collation

One of the best ways of collating data is to construct a frequency distribution. To illustrate this section we'll use the data in the following table.

Variable (x)     1    2    3    4
Frequency (f)   31   37   20   12

This data refers to the following case:

A large campsite in southern France is considering offering a children's playscheme for customers' use. Parents or carers would be able to use the scheme to leave their child/ren with qualified playworkers for up to half a day at a time.

Before taking the idea further, the site management decides to collect information on how many children might be involved. If we treat the number of children as a statistical variable and denote it by the symbol x, then each value of x has an associated count of the number of times it was observed: its frequency (f).

So if we look at the tabulated data, we can see that a value of x = 2 has an associated frequency of 37. The table therefore contains all the frequencies for the values of x. This is known as a frequency distribution. All the table shows is the number of times that a particular value of x was observed.
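As a rough sketch of how such a table is built, the snippet below tallies a list of raw observations into a frequency distribution. The `observations` list is hypothetical: it is reconstructed from the table so that the counts match, since the original survey answers are not given.

```python
from collections import Counter

# Hypothetical raw answers, rebuilt to match the table above:
# 31 observations of x = 1, 37 of x = 2, 20 of x = 3 and 12 of x = 4.
observations = [1] * 31 + [2] * 37 + [3] * 20 + [4] * 12

# Counter tallies how many times each value of x was observed.
frequency = Counter(observations)

print("x  f")
for x in sorted(frequency):
    print(f"{x}  {frequency[x]}")
print("Total observations:", sum(frequency.values()))
```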



Central Tendency: A Worked Example

Table 1 contains information on the levels of unemployment in the UK between 1992 and 1996.

      1992   1993   1994   1995   1996
Q1    n.a.   10.6   10.0    8.9    8.3
Q2     9.8   10.4    9.7    8.7    8.3
Q3    10.0   10.3    9.4    8.7   n.a.
Q4    10.4   10.3    9.0    8.4   n.a.

Table 1: ILO unemployment rate 1992 (Q2) - 1996 (Q2) : UK: All: Aged 16 and over: %: SA
Source: National Statistics

Taking this data as our source, the modal value or the most frequently occurring number can be observed best by ordering the data, as follows:

8.3
8.3
8.4
8.7
8.7
8.9
9.0
9.4
9.7
9.8
10.0
10.0
10.3
10.3
10.4
10.4
10.6

Table 2: 1992 - 1996 unemployment rates (%) placed in value order.


You can see that there are actually 5 modal values: 8.3, 8.7, 10.0, 10.3, and 10.4. This does not convey much useful information to us as students of labour market patterns in the UK.

The mean is probably the best central measure in this example because there are no outlying observations.

X-bar (the mean unemployment rate in the UK 1992 - 1996) = 161.2/17
Mean = 9.5 (correct to 2 s.f.)

Like the modal value, the median value is best viewed from the ordered data. You can see that the middle value of the ordered data is 9.7 as there are eight observations above and below this point.
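The three measures of central tendency can be checked with a short sketch using Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The variable names are illustrative; the figures are the 17 quarterly rates from Table 1.

```python
from statistics import mean, median, multimode

# Quarterly ILO unemployment rates (%) from Table 1, 1992 Q2 to 1996 Q2.
rates = [9.8, 10.0, 10.4, 10.6, 10.4, 10.3, 10.3, 10.0, 9.7,
         9.4, 9.0, 8.9, 8.7, 8.7, 8.4, 8.3, 8.3]

modes = sorted(multimode(rates))   # every value tied for most frequent
mean_rate = mean(rates)            # sum of the rates divided by 17
median_rate = median(rates)        # 9th of the 17 ordered values

print("Modes: ", modes)
print("Mean:  ", round(mean_rate, 1))
print("Median:", median_rate)
```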

With continuous data series such as can be seen in the following table, the median can be found by calculation or graphically:

Monthly Pay (£)              Number of Staff   Cumulative Frequency
700 but less than 800              20          20 staff get less than £800
800 but less than 900              29          49 staff get less than £900
900 but less than 1000             35          84 staff get less than £1000
1000 but less than 1100            42          126 staff get less than £1100
1100 but less than 1200            24          150 staff get less than £1200
Total staff = 150

The position of the median for a continuous series is f / 2, where f is the total frequency. In the above table the position of the median is 150 / 2 = 75, so the 75th staff member's pay is the median wage. This falls within the class interval '£900 but less than £1000' a month: the first 49 staff in order earn less than £900 a month, so the 75th staff member earns at least £900, but less than £1000.

The only way to get to a closer calculation of the salary of this 75th staff member is to assume that the earnings are evenly distributed within the class interval. The 75th member of staff is the twenty-sixth person in the group earning between £900 and £1000 a month. We know this because the previous group included the 49th employee and 75 - 49 = 26.

If it is assumed that the £100 per month in the class interval of this grouped data is divided evenly between the 35 people in the group, then the 75th employee's salary will be 26/35ths of £100 plus the starting point of £900:

Median = £900 + 26/35 * 100
= £974

The median monthly salary is that of the 75th member of staff, which is £974. This divides the distribution in half, so that of the 150 employees, half will earn less than £974 and half will earn more. A manager may be able to compare this with a similar group of staff. If another company were shown to have a median salary far in excess of the earlier figure, then the manager should be able to see why there is a recruitment problem in his/her organisation.
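The interpolation above can be sketched as a small function. This is only an illustration under the same assumption the text makes, namely that pay is spread evenly within each class; the function name `value_at_position` is invented for this example.

```python
# Pay classes from the table: (lower bound, upper bound, number of staff).
classes = [(700, 800, 20), (800, 900, 29), (900, 1000, 35),
           (1000, 1100, 42), (1100, 1200, 24)]

def value_at_position(position, classes):
    """Estimate the pay of the employee at a given rank in the ordered list,
    assuming pay is spread evenly within each class."""
    cumulative = 0
    for lower, upper, count in classes:
        if position <= cumulative + count:
            # Fraction of the way through this class, e.g. 26/35 for the median.
            share = (position - cumulative) / count
            return lower + share * (upper - lower)
        cumulative += count
    raise ValueError("position exceeds the total frequency")

total_staff = sum(count for _, _, count in classes)
median_pay = value_at_position(total_staff / 2, classes)
print(f"Median pay: £{median_pay:.0f}")
```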

Finding the median graphically:

The median can be found graphically by constructing the cumulative frequency curve or 'ogive'.

More information on 'ogives'

'Ogive' is a term used to describe a shape that has the appearance of an S or a flattened S. It is the shape that a cumulative frequency curve typically takes.

Ogive curve

Using the data from the above table (from which the cumulative frequency and median were arrived at by calculation), we plot the cumulative frequency against the monthly salaries data.

A horizontal line is drawn from the position of the median on the y-axis (the 75th employee) across to the ogive. Where the horizontal line and the ogive intersect, a vertical line is dropped to the x-axis. You can see that the value of the median is somewhere between £900 and £1000. Clearly, the larger the scale and the more accurate the chart, the closer we would get to our calculated figure for the median of £974.

Let's now consider the interquartile range in greater detail. As we have seen, the median divides an ordered distribution in half. In a similar way, we can split distributions into quarters. Three values divide an ordered distribution into four equal parts: the lower quartile (Q1), the middle quartile or median (Q2), and the upper quartile (Q3).

In a distribution of 100 items the quartiles will be the values of the 25th, 50th and 75th items, representing Q1, Q2 and Q3 respectively. But in the above table there are 150 items, so the calculation is more difficult:

The position of the lower quartile (Q1) = n / 4. For this data, the position is therefore 150 / 4 = 37.5. The lower quartile pay is that received by the 38th staff member (the 'half a person' being rounded up). This person is in the pay class '£800 but less than £900', shared by 29 members of staff. The previous class ends with the 20th employee, so the 38th employee is the 18th person in this class (38 - 20 = 18).

Q1 = £800 + 18/29 * 100
= £862

The position of the upper quartile (Q3) = 3n / 4. For this data, the position is 450 / 4 = 112.5. The upper quartile pay is that received by the 113th staff member. This person occupies the pay class '£1000 but less than £1100', shared by 42 employees. The previous classes account for 84 employees, so the 113th employee is the 29th person in this class (113 - 84 = 29).

Q3 = £1000 + 29/42 * 100
= £1069

The distribution can be split up as follows:
Q1 = 38th employee earning £862
M = 75th employee earning £974
Q3 = 113th employee earning £1069

The interquartile range is therefore:
Q3 - Q1 = £1069 - £862 = £207
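The quartile calculation can be sketched as linear interpolation along the cumulative frequency line, which is what reading a value off the ogive amounts to. The helper name `pay_at_rank` is invented for this illustration, and the positions are rounded up to whole employees as in the text.

```python
import math

boundaries = [700, 800, 900, 1000, 1100, 1200]   # class limits (£)
cumulative = [0, 20, 49, 84, 126, 150]            # staff earning less than each limit

def pay_at_rank(rank):
    """Interpolate linearly between the two class limits that bracket the rank."""
    for i in range(1, len(cumulative)):
        if rank <= cumulative[i]:
            in_class = cumulative[i] - cumulative[i - 1]
            fraction = (rank - cumulative[i - 1]) / in_class
            return boundaries[i - 1] + fraction * (boundaries[i] - boundaries[i - 1])
    raise ValueError("rank exceeds the number of staff")

n = cumulative[-1]
q1 = pay_at_rank(math.ceil(n / 4))       # 37.5 rounded up to the 38th employee
q3 = pay_at_rank(math.ceil(3 * n / 4))   # 112.5 rounded up to the 113th employee
iqr = q3 - q1

print(f"Q1 = £{q1:.0f}, Q3 = £{q3:.0f}, interquartile range = £{iqr:.0f}")
```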

We can now describe the distribution by stating that the median average monthly pay is £974, with a range of £500 (£1200 - £700) and an interquartile range of £207. The interquartile range describes the pay of the middle half of the workforce, where there is a relatively small range compared with the full distribution. Equally, we can say that a quarter of staff earn less than £862 and a quarter more than £1069.

This is clearly a very useful tool to an organisational manager for summarising and analysing wage distributions, which can be compared quickly to national, regional, or local statistics. The great value of the interquartile range is that it is not influenced by extreme values. In this example, for instance, it would remain the same even if the 24 highest paid staff earned £3000 per month. But very high or very low earnings figures would distort the salary list in an organisation, and affect decisions about future pay levels, if only the range and the arithmetic mean were used.



Weighted Average: A Worked Example

The Weighted Average is a useful calculation of the average when you have data which is grouped, such as in the following example:

Cars / household (Xi)   % (fi)   fi Xi
0                         19        0
1                         43       43
2                         28       56
3                          8       24
4                          2        8
Sigma                    100      131

Table 3: Cars per household, %, 1996, UK.

In Table 3, above, the percentage of households owning 0 to 4 cars is shown. To find the average number of cars per household, we multiply each possible number of cars by the percentage in that category (the frequency). Column 3 shows this calculation, where Xi is the number of cars and fi is the percentage in that category. Note that households with no car contribute 0 x 19 = 0 to the total.

The average is the sum of this column, divided by the total frequency (100 as the data is in percentages).

131/100 = 1.31

The formula for the weighted average is as follows:
X-bar = Sigma fi Xi / n, where n = Sigma fi is the total frequency
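A minimal sketch of the weighted average follows, using the percentages from the cars-per-household table as weights. The variable names are illustrative. Note that the zero-car households add 0 x 19 = 0 to the sum of products.

```python
# Cars per household (x) and percentage of households (f) from the table.
cars = [0, 1, 2, 3, 4]
percent = [19, 43, 28, 8, 2]

# Multiply each value by its weight, add the products,
# then divide by the total weight.
weighted_sum = sum(f * x for x, f in zip(cars, percent))
total_weight = sum(percent)
average_cars = weighted_sum / total_weight

print(f"Sum of f*x = {weighted_sum}, total f = {total_weight}")
print(f"Average cars per household = {average_cars:.2f}")
```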



Measures of Dispersion: A Worked Example

To illustrate the concepts of the range, mean absolute deviation, and the standard deviation we will continue to use the data in Table 1 above.

The range, as we saw in the explanation section, is simply the difference between the lowest and highest observation. The data in Table 1 run from 8.3 to 10.6, so the range, that is the difference between the highest and lowest unemployment rates in the UK between 1992 and 1996, is 10.6 - 8.3 = 2.3.

In order to calculate the mean absolute deviation (MAD), we must construct another table from which we will be able to read off this statistic.

Table 3: Unemployment Rates in the UK

Year/Qtr   Rate of unemp't (X)   X - X-bar   |X - X-bar|   (X - X-bar)^2
1992 Q2           9.8               0.3          0.3           0.09
1992 Q3          10.0               0.5          0.5           0.25
1992 Q4          10.4               0.9          0.9           0.81
1993 Q1          10.6               1.1          1.1           1.21
1993 Q2          10.4               0.9          0.9           0.81
1993 Q3          10.3               0.8          0.8           0.64
1993 Q4          10.3               0.8          0.8           0.64
1994 Q1          10.0               0.5          0.5           0.25
1994 Q2           9.7               0.2          0.2           0.04
1994 Q3           9.4              -0.1          0.1           0.01
1994 Q4           9.0              -0.5          0.5           0.25
1995 Q1           8.9              -0.6          0.6           0.36
1995 Q2           8.7              -0.8          0.8           0.64
1995 Q3           8.7              -0.8          0.8           0.64
1995 Q4           8.4              -1.1          1.1           1.21
1996 Q1           8.3              -1.2          1.2           1.44
1996 Q2           8.3              -1.2          1.2           1.44
Total                                           12.3          10.73

Source: Statbase


In Table 3 above, column 3 gives the difference between the observed value and the mean (9.5, as calculated earlier). Column 4 states the values when the sign is ignored. These 'absolute' values are summed at the foot of this column. This figure is divided by the number of observations, which in this case is 17:

MAD = 12.3/17 = 0.72 (correct to 2 s.f.)

To interpret this statistic we consider its size. A larger value implies a larger dispersion.

The standard deviation can be calculated from the data in Table 3 too. Column 5 gives the squared value of the difference between the observed value and the mean. These values are summed at the foot of the column. This sum is divided by the number of observations (17) to give the variance, and the square root of the variance gives the standard deviation.

Variance (s^2) = 10.73 / 17 = 0.63
Standard deviation (s.d.) = 0.79

Again, the larger the value, the larger the deviation. The standard deviation carries the same units of measurement as the original data, so here the standard deviation is 0.79 percentage points.

What does this statistic tell us? This is a hard question to answer. On its own, the standard deviation is useful mainly as a means of comparison with other standard deviations. Because the deviations are squared before they are averaged, the larger deviations within the data carry more weight in the result.

Elsewhere in Timeweb we will be using standard deviations for more advanced statistical work, but for now we must merely say that s.d. is the most frequently used measure of dispersion and is used when more formal statistical testing is required.


Coefficient of Variation: A Worked Example

The coefficient of variation is a summary measure used to give an indication of the amount of variability present in the data. It is calculated by expressing the standard deviation as a percentage of the mean.

Using the above example:

Coefficient of Variation = 0.79 / 9.5 x 100
= 0.083 x 100
= 8.3%

This statistic would be used to compare against other datasets and a judgement would be made of the amount of variability in one dataset as against another.

Coefficient of Skewness

This summary statistic indicates the tendency for values in a dataset to bunch at one end of a distribution. Using the second formula given, 3(mean - median) / standard deviation, and applying it to the unemployment data given above, we can calculate that:

Coefficient of skewness within the data for UK unemployment 1992 - 96
= 3(9.5 - 9.7) / 0.79
= -0.6 / 0.79
= -0.76
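The dispersion measures and the two coefficients can be sketched together as follows. This assumes the population form of the variance (dividing by n, as the text does) and works from the unrounded mean, so the results can differ somewhat from a hand calculation that rounds the mean to 9.5 first; the skewness coefficient is the most sensitive to this. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, median

# Quarterly ILO unemployment rates (%) from Table 1, 1992 Q2 to 1996 Q2.
rates = [9.8, 10.0, 10.4, 10.6, 10.4, 10.3, 10.3, 10.0, 9.7,
         9.4, 9.0, 8.9, 8.7, 8.7, 8.4, 8.3, 8.3]

n = len(rates)
x_bar = mean(rates)   # unrounded mean, about 9.48

# Mean absolute deviation: average distance from the mean, sign ignored.
mad = sum(abs(x - x_bar) for x in rates) / n

# Population variance and standard deviation (divide by n, not n - 1).
variance = sum((x - x_bar) ** 2 for x in rates) / n
sd = sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = sd / x_bar * 100

# Coefficient of skewness, second formula: 3 * (mean - median) / s.d.
skewness = 3 * (x_bar - median(rates)) / sd

print(f"Range    = {max(rates) - min(rates):.1f}")
print(f"MAD      = {mad:.2f}")
print(f"Variance = {variance:.2f}")
print(f"s.d.     = {sd:.2f}")
print(f"CV       = {cv:.1f}%")
print(f"Skewness = {skewness:.2f}")
```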



Moving Averages: A Worked Example

A moving average allows us to 'smooth' out a series of data so that the underlying movement over time can be seen. Using the new car registrations data contained in Table 4 below, we can illustrate how to construct a moving average for this highly seasonal data.

Table 4:
UK New Car Registrations
1994 (Q4) - 1999 (Q3)

Year/Qtr   Cars (000s)
1994 Q4      400.6
1995 Q1      613.9
1995 Q2      511.7
1995 Q3      748.3
1995 Q4      432.7
1996 Q1      621.2
1996 Q2      563.7
1996 Q3      769.5
1996 Q4      455.6
1997 Q1      646.3
1997 Q2      606.1
1997 Q3      849.0
1997 Q4      498.1
1998 Q1      738.2
1998 Q2      638.8
1998 Q3      848.4
1998 Q4      514.9
1999 Q1      761.6
1999 Q2      692.2
1999 Q3      781.4

Source: National Statistics


The first decision is a matter of judgement: over what period to calculate the averaging process? The longer the length of the average, the smoother the series becomes, as each individual piece of information becomes less significant on its own. But the drawback with calculating the averaging over a long period is that changes in the underlying trend are not picked up quickly.

With this new car data, we want to remove the seasonality from the data whilst still seeing the trend year-on-year. Averaging over four consecutive quarters, so that every average contains exactly one observation from each season, should enable us to achieve this. Having decided this, the first step is to average the first four observations. This gives us the moving average for the mid-point of these four observations.

In this example this is:

(400.6 + 613.9 + 511.7 + 748.3) / 4 = 568.6

The next step is to move the average along the dataset, by dropping the first observation and including the next period's data, then averaging over the four observations again:

(613.9 + 511.7 + 748.3 + 432.7) / 4 = 576.7

These calculations are shown in Table 5 below:

Table 5: New Car Registrations 1994 - 99 Moving Average

Year/Qtr   Calculation                       Moving Average
1994 Q4    -
1995 Q1    -
1995 Q2    (400.6+613.9+511.7+748.3) / 4          569
1995 Q3    (613.9+511.7+748.3+432.7) / 4          577
1995 Q4    (511.7+748.3+432.7+621.2) / 4          578
1996 Q1    (748.3+432.7+621.2+563.7) / 4          591
1996 Q2    (432.7+621.2+563.7+769.5) / 4          597
1996 Q3    (621.2+563.7+769.5+455.6) / 4          603
1996 Q4    (563.7+769.5+455.6+646.3) / 4          609
1997 Q1    (769.5+455.6+646.3+606.1) / 4          619
1997 Q2    (455.6+646.3+606.1+849.0) / 4          639
1997 Q3    (646.3+606.1+849.0+498.1) / 4          650
1997 Q4    (606.1+849.0+498.1+738.2) / 4          673
1998 Q1    (849.0+498.1+738.2+638.8) / 4          681
1998 Q2    (498.1+738.2+638.8+848.4) / 4          681
1998 Q3    (738.2+638.8+848.4+514.9) / 4          685
1998 Q4    (638.8+848.4+514.9+761.6) / 4          691
1999 Q1    (848.4+514.9+761.6+692.2) / 4          704
1999 Q2    (514.9+761.6+692.2+781.4) / 4          688
1999 Q3    -
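The sliding calculation can be sketched as a short function. This is an illustration only: the function name `moving_average` and its `window` parameter are invented here, and the sketch simply returns the 17 four-quarter averages in order without assigning them to particular quarters.

```python
# New car registrations (000s), 1994 Q4 to 1999 Q3, from Table 4.
registrations = [400.6, 613.9, 511.7, 748.3, 432.7, 621.2, 563.7, 769.5,
                 455.6, 646.3, 606.1, 849.0, 498.1, 738.2, 638.8, 848.4,
                 514.9, 761.6, 692.2, 781.4]

def moving_average(values, window=4):
    """Average each run of `window` consecutive values, sliding one step at a time."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

smoothed = moving_average(registrations)
for value in smoothed:
    print(f"{value:.1f}")
```

Twenty observations with a window of four give 20 - 4 + 1 = 17 averages, which is why three quarters in the table have no moving average.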

Why not try the moving averages worksheet to see that you understand this?



Normal Curves: Summary

The mathematical analysis of the normal curve holds good for the frequencies of any value within a population that is normally distributed (such as mortality rates, some test scores, people's heights and weights and so on).

Once you know the mean and the standard deviation, you can state the probability that the value for any member of the population falls within a given range.

When you know the mean and the standard deviation values you are in a position to draw a number of conclusions about the probable distribution of the entire population.

Normal curves may have different heights and widths but whatever the dimensions, the mean, median and mode coincide at the high point of the curve and divide the results into two equal and symmetrical halves.

Of all the scores in a normal distribution, approximately 34% will lie between the mean and one SD above the mean, and approximately 34% will lie between the mean and one standard deviation below the mean (in total, 68% of all scores fall within one SD below and one SD above the mean).

Of all the scores approximately 95% will fall between the lines representing two SDs from the mean (27% of all scores fall between one and two SDs, with 13.5% on either side of the curve).

Of all the scores approximately 99% will lie between the lines indicating three SDs from the mean (about 4% of the sample will fall between two and three SDs, about 2% on either side of the mean).
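The rounded percentages above can be checked against the exact normal curve with a brief sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `share_within` is invented for this illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal population lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} SD of the mean: {share_within(k) * 100:.1f}%")
```

The exact figures are slightly above the rounded 68%, 95% and 99% used in the text.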


Review of confidence interval analysis of a population from a single sample

It may be wise here, to review the steps we should take in making generalisations within confidence levels about an entire population from a single sample:

  1. Firstly, we select a sampling strategy, which usually means a random sample, and select our sample, making sure that we have at least 30 observations within it.
  2. Then we collect the information from the sample and process it (using Excel or similar spreadsheet package) in order to find out the mean and the standard error of the sample.
  3. Finally, we draw conclusions at the different confidence levels: 68% for a range within plus or minus 1 standard error of the mean of the sample; 95% for a range within plus or minus 2 standard errors of the mean of the sample; and 99% for a range within plus or minus 3 standard errors of the mean of the sample.

Suppose we want to know the average daily wage rate for a post-graduate qualified economist: following the steps reviewed above, we complete steps 1 and 2 for a sample of 30 economists. The mean daily wage rate is £112; the standard deviation is £12.

From these figures we can calculate the standard error (the standard deviation divided by the square root of the number in the sample). This comes to 12 divided by 5.48, or £2.19.

Now we can make conclusions at the different confidence intervals as follows:

  1. We are 68% certain that the average daily wage rate for post-graduate qualified economists is within the range of plus or minus 1 standard error of the sample mean. That is between £109.81 and £114.19.
  2. We can be 95% certain that the average daily wage rate for post-graduate qualified economists is within the range of plus or minus 2 standard errors of the sample mean. That is between £107.62 and £116.38.
  3. We are 99% certain that the average daily wage rate for a post-graduate qualified economist is within the range of plus or minus 3 standard errors of the sample mean. That is between £105.43 and £118.57.
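The three ranges above can be reproduced with a minimal sketch. It uses the same rule of thumb as the text (1, 2 and 3 standard errors for the 68%, 95% and 99% levels); the function name `interval` is invented for this example.

```python
from math import sqrt

sample_mean = 112.0   # mean daily wage rate (£)
sample_sd = 12.0      # standard deviation of the sample (£)
sample_size = 30      # number of economists in the sample

# Standard error: standard deviation divided by the square root of the sample size.
standard_error = sample_sd / sqrt(sample_size)

def interval(multiple):
    """Range of plus or minus `multiple` standard errors around the sample mean."""
    margin = multiple * standard_error
    return sample_mean - margin, sample_mean + margin

print(f"Standard error: £{standard_error:.2f}")
for level, multiple in (("68%", 1), ("95%", 2), ("99%", 3)):
    low, high = interval(multiple)
    print(f"{level}: £{low:.2f} to £{high:.2f}")
```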

This information could be used in Higher Education Institutions to guide undergraduates through their career choices or in taxation and expenditure decisions by government. One word of warning though, don't necessarily expect this level of income if you do indeed achieve this level of qualification! It's meant purely as an illustrative example!!

Try the worksheet.



Normal Distribution Curve: A Worked Example

Comparing examination performance.

Using no more than what we already know about normal distribution curves, we can look at different exam results and compare performance.

'Big deal!' I hear you say; and you may feel that you know enough already about numbers to be able to compare your performance in, say, a marketing exam with the results of your fellow students in Marketing. But do you?

And what if you wanted to know how good your result was in Marketing, compared to your friend's performance in Human Resource Management (HRM)? A bit trickier, perhaps?

Imagine that you have scored 77% in Marketing and your friend only gets 63% in HRM. Should you be thinking of taking next year's marketing option on the strength of this relatively good result? The answer must be to wait and use what is known as the Z-score to prevent your being misled by your apparent success or your friend's apparent (relative) failure.

The Z-score is the way to compare different scores across different populations. It turns differently measured things into a standard measurement. We can compare scores by comparing values in the distribution with the standard deviation of the distribution. To compare two marks you need to know the mean mark of the Marketing group who took the exam with you and the standard deviation of the marks for that group.

Let's say that you know that your 77% compares with a mean result in Marketing of 58%, and that the standard deviation was 12.

You will be able to tell that, based on what you know about normal distribution, 68% of the whole group scored between 46 and 70%, because 58% plus or minus 12 (one standard deviation) gives that range. It follows that only 16% of the group scored above 70%, and that you are in the top 16%.

But also, 95% of the whole group scored between 34 and 82% (58% plus or minus two standard deviations), so your result is not quite good enough to put you in the top 2.5% of your group. You would need to have scored 5 percentage points more in your exam to have reached these dizzying heights!

OK, now let's see how your Marketing result compares with your friend's HRM mark.

Imagine that your friend's 63% mark in HRM compares to a mean mark of 49% and the standard deviation of 4. You would be able to tell from this information that your high 77% in Marketing was not as good a performance as your friend's 63% in HRM.

Your 77% is less than two standard deviations above the mean in Marketing (58 + 12 + 12 = 82%), while your friend's HRM mark is more than three standard deviations above the mean score (49 + 4 + 4 + 4 = 61%). In fact, your friend's result puts them well inside the top 1% of the HRM group.

On the basis of the raw marks alone you could have concluded that you are naturally pre-disposed to a job in marketing, and your friend could have been very downcast about their chances of pursuing a career in personnel work. Without comparing both your results against the rest of your respective group's scores, any fast response would be unwise. The Z-scores, however, allow you to see your performance in relation to other results across the group and, indeed, to compare your performance in a chosen subject with another's performance in a completely different subject.

In fact we can calculate exactly the Z-scores in both of these cases by subtracting the average class mark from the individual's mark and dividing by the standard deviation.

(77 - 58) / 12 = 1.6

(63 - 49) / 4 = 3.5

Your friend's 3.5 Z-score is clearly bigger than your 1.6, so your friend's performance is the more impressive of the two results.
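The comparison can be sketched in a few lines. The `z_score` function name is invented for this illustration, and the final step assumes that both sets of exam marks are normally distributed, which the example takes for granted.

```python
from statistics import NormalDist

def z_score(mark, group_mean, group_sd):
    """Number of standard deviations a mark lies above (+) or below (-) the group mean."""
    return (mark - group_mean) / group_sd

marketing_z = z_score(77, 58, 12)
hrm_z = z_score(63, 49, 4)

# Assuming normally distributed marks, the share of each group
# expected to score higher than the mark in question.
share_above_marketing = 1 - NormalDist().cdf(marketing_z)
share_above_hrm = 1 - NormalDist().cdf(hrm_z)

print(f"Marketing: z = {marketing_z:.1f}, about {share_above_marketing:.1%} of the group scored higher")
print(f"HRM:       z = {hrm_z:.1f}, about {share_above_hrm:.2%} of the group scored higher")
```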

Why not try the Z-test question to see how well you understand this?
