Worked Examples to Illustrate Common Data-Related Terms [TimeWeb]


Z Charts example

The following table contains sales data for a sole trader who runs a domestic service operation.

Month   1998 Takings (£)   1999 Takings (£)   Cumulative totals 1999 (£)   M.A.T. (£)
Jan     1680               1990               1990                         18530
Feb     1410               2260               4250                         19380
Mar     1600               1780               6030                         19560
Apr     1540               1780               7810                         19800
May     1610               1530               9340                         19720
Jun     1070               1360               10700                        20010
Jul     920                1170               11870                        20260
Aug     730                1140               13010                        20670
Sep     1870               1480               14490                        20280
Oct     1880               1530               16020                        19930
Nov     1550               1590               17610                        19970
Dec     2360               2160               19770                        19770
Total   18220

The table is based on monthly sales figures, given in columns 2 and 3. The 1998 figures are included to help us to calculate the Moving Annual Total (M.A.T.).

The cumulative totals for 1999 (column 4) are calculated by adding the current month's sales figure to the previous month's total. So, in January there is only one figure to be entered. This is the start of the year and the total sales at the end of the first month stood at £1990. During February 1999, sales of £2260 were made. This figure is added to the January total to give the cumulative total of (£1990 + £2260 =) £4250. The other figures in this column are derived in the same way.

The Moving Annual Totals (MAT) are found by totalling the monthly sales figures over a twelve month period. As each new month's sales figure becomes known, it is added to the MAT and the figure for the corresponding month of the previous year is removed. So, the MAT at the end of December 1998 is the total of the monthly sales from the start of January 1998 to the end of December 1998, which comes to £18 220. The MAT at the end of January 1999 is the total of the monthly sales from the start of February 1998 to the end of January 1999. You can get this figure by subtracting the January 1998 sales figure from £18 220 and adding the January 1999 figure (18 220 - 1680 + 1990 = 18 530).

Notice that the cumulative total and the MAT for December 1999 must come to the same figure.
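The two derived columns can be reproduced with a short Python sketch. The variable names are illustrative only; the figures are the monthly takings from the table above.

```python
from itertools import accumulate

# Monthly takings (£) copied from the table above, January to December.
takings_1998 = [1680, 1410, 1600, 1540, 1610, 1070, 920, 730, 1870, 1880, 1550, 2360]
takings_1999 = [1990, 2260, 1780, 1780, 1530, 1360, 1170, 1140, 1480, 1530, 1590, 2160]

# Cumulative total: each month's sales added to the running total for the year.
cumulative_1999 = list(accumulate(takings_1999))

# Moving annual total: start from the 1998 total, then for each month of 1999
# add the new month's figure and drop the same month of 1998.
mat_1999 = []
running = sum(takings_1998)
for old, new in zip(takings_1998, takings_1999):
    running = running - old + new
    mat_1999.append(running)

print(cumulative_1999)
print(mat_1999)
```

The last entries of both lists coincide, which is the December check mentioned above.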

The data in columns 3, 4 and 5 of the table can now be built into a graph, so that the fluctuations can be seen more easily.

Z Chart for Sole Trader 1999

Cross-section data example

UK Labour Market Participation of Ethnic Groups (1996)

The UK workforce included the following numbers of workers from ethnic minority groups:

Of Indian origin 590 000
Of Pakistani origin 330 000
Of Black Caribbean origin 320 000
Of Black African origin 200 000
Of Bangladeshi origin 110 000
Of Chinese origin 100 000

Using the table giving data for Labour Market Participation rates of Ethnic Minority Groups, we can calculate that Black Caribbean workers in 1996 accounted for 19.4 % of the total ethnic minority workforce in the UK.

Extension to percentages in cross-sectional data material

If we had incomplete information, for instance if the total number of people in all categories was unknown, we could still find this out, as long as we knew the total number of Black Caribbean workers in the UK and what percentage of the total ethnic minority working population this comprised.

Let the total ethnic minority working population in the UK be x. Then:

$$\frac{19.4}{100}\,x = 320\,000$$

This is the same as:

$$0.194x = 320\,000$$

Dividing both sides by 0.194:

$$x = \frac{320\,000}{0.194} \approx 1\,649\,485$$

$$x \approx 1\,650\,000 \text{ (rounded to the nearest ten thousand)}$$
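As a rough check, both directions of this calculation can be sketched in Python. The dictionary keys and variable names are illustrative; the figures are those listed above.

```python
# Workforce figures from the list above.
workforce = {
    "Indian": 590_000,
    "Pakistani": 330_000,
    "Black Caribbean": 320_000,
    "Black African": 200_000,
    "Bangladeshi": 110_000,
    "Chinese": 100_000,
}

# Forward direction: one group's share of the total, as a percentage.
total = sum(workforce.values())
share_pct = round(workforce["Black Caribbean"] / total * 100, 1)

# Reverse direction: recover the total from one group's count and its share.
estimated_total = workforce["Black Caribbean"] / (share_pct / 100)

print(total, share_pct, round(estimated_total, -4))
```

The recovered total differs slightly from the true total before rounding, because the percentage itself was rounded to one decimal place.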


Ratios: A Worked Example

Alice, Ben and Charlie enter into a partnership.
Alice provides £30 000 capital
Ben provides £18 000
Charlie provides £24 000

Profits from the partnership are to be divided amongst the partners in ratio to their capital. In a year when the profit amounts to £30 000, how much does each receive?

Ratio of Alice's, Ben's and Charlie's capital =
30 000:18 000:24 000 = 15:9:12
= 5:3:4

There are 12 parts in total
Alice has 5/12, Ben has 3/12 (1/4) and Charlie has 4/12 (1/3).

Alice will receive 5/12 x 30 000 = £12 500
Ben will receive 3/12 x 30 000 = £7 500
Charlie will receive 4/12 x 30 000 = £10 000

(Confirm answer by adding these figures: 12 500 + 7 500 + 10 000 = 30 000)
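The same division can be sketched in Python. The names `capital`, `ratio` and `shares` are illustrative; exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

capital = {"Alice": 30_000, "Ben": 18_000, "Charlie": 24_000}
profit = 30_000

# Reduce the capital amounts to their simplest ratio (5:3:4).
divisor = reduce(gcd, capital.values())
ratio = {name: amount // divisor for name, amount in capital.items()}

# Each partner receives (own parts / total parts) of the profit.
parts = sum(ratio.values())
shares = {name: Fraction(r, parts) * profit for name, r in ratio.items()}

for name, share in shares.items():
    print(name, ratio[name], float(share))
```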


Exponents: A Worked Example

Remember that we are illustrating exponents (see the explanation of exponents if you are not sure) by using a, m and n, where m = 4 and n = 2.

$$a^m \cdot a^n = a^{m+n}$$
(to check this with m = 4 and n = 2)

$$a^4 \cdot a^2 = (a \cdot a \cdot a \cdot a)(a \cdot a) = a \cdot a \cdot a \cdot a \cdot a \cdot a = a^6 = a^{4+2}$$


Sequence of Operations: A Worked Example

$$4 \times 4^2 - (2 + 1 \times 2^2)^2 / 3$$
$$= 4 \times 16 - (2 + 1 \times 4)^2 / 3$$
$$= 64 - 6^2 / 3$$
$$= 64 - 36 / 3$$
$$= 64 - 12 = 52$$
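Python applies the same precedence rules (brackets, then exponents, then multiplication and division, then addition and subtraction), so the expression can be checked directly. This is only a sketch; the intermediate names are illustrative.

```python
# The whole expression, evaluated with Python's standard operator precedence.
result = 4 * 4**2 - (2 + 1 * 2**2) ** 2 / 3

# The same calculation broken into the steps used above.
bracket = 2 + 1 * 2**2          # exponent, then multiply, then add
first_term = 4 * 4**2           # exponent before multiplication
second_term = bracket**2 / 3    # exponent before division
stepwise = first_term - second_term

print(result, stepwise)
```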


Central Tendency: A Worked Example

Table 1 contains information on the levels of unemployment in the UK between 1992 and 1996.

1992 1993 1994 1995 1996
Q1 n.a. 10.6 10.0 8.9 8.3
Q2 9.8 10.4 9.7 8.7 8.3
Q3 10.0 10.3 9.4 8.7 n.a.
Q4 10.4 10.3 9.0 8.4 n.a.

Table 1: ILO unemployment rate 1992 (Q2) - 1996 (Q2) : UK: All: Aged 16 and over: %: SA
Source: Statbase

Taking this data as our source, the modal value or the most frequently occurring number can be observed best by ordering the data, as follows:

8.3
8.3
8.4
8.7
8.7
8.9
9.0
9.4
9.7
9.8
10.0
10.0
10.3
10.3
10.4
10.4
10.6

Table 2: 1992 - 1996 unemployment rates (%) placed in value order.


You can see that there are actually 5 modal values: 8.3, 8.7, 10.0, 10.3, and 10.4. This does not convey much useful information to us as students of labour market patterns in the UK.

The median value is also best viewed from the ordered data. You can see that the middle value of the ordered data is 9.7 as there are eight observations above and below this point.

The mean is probably the best central measure because there are no outlying observations (see digging/meaning/explanation).

X-bar (the mean unemployment rate in the UK 1992 - 1996) = 161.2/17

Mean = 9.5 (correct to 2 s.f.)
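The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch; `rates` simply holds the 17 available observations from Table 1 in date order.

```python
import statistics

# ILO unemployment rates (%) from Table 1, 1992 Q2 to 1996 Q2.
rates = [9.8, 10.0, 10.4,
         10.6, 10.4, 10.3, 10.3,
         10.0, 9.7, 9.4, 9.0,
         8.9, 8.7, 8.7, 8.4,
         8.3, 8.3]

modes = sorted(statistics.multimode(rates))  # every value tied for most frequent
median = statistics.median(rates)            # middle value of the ordered data
mean = statistics.mean(rates)                # sum of the values divided by 17

print(modes)
print(median)
print(round(mean, 1))
```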


Weighted Average: A Worked Example

The Weighted Average was described in the 'Explanation' section (see the explanation of indices if you are not sure). It is a useful calculation of the average when you have data which is grouped, such as in the following example:

Cars / household (Xi)   % (fi)   fi Xi
0                       19       0
1                       43       43
2                       28       56
3                       8        24
4                       2        8
Σ                                131

Table 3: Cars per household, %, 1996, UK.

In Table 3, above, the percentage of households owning 0 to 4 cars is shown. To find the average number of cars per household, we multiply each possible number of cars by the percentage in that category (the frequency). Column 3 shows this calculation, where Xi is the number of cars and fi is the percentage in that category.

The average is the sum of this column, divided by the total frequency (100 as the data is in percentages).

131/100 = 1.31

The formula for the weighted average is as follows:

$$\bar{X} = \frac{\sum f_i X_i}{n}$$
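A short Python sketch of the same calculation, using the figures from Table 3. The list names are illustrative, and n is taken to be the total frequency (the sum of the percentages).

```python
# Table 3: number of cars per household and the percentage of households.
cars = [0, 1, 2, 3, 4]
percent = [19, 43, 28, 8, 2]

# Multiply each value by its frequency and sum the products.
weighted_sum = sum(f * x for f, x in zip(percent, cars))

# Divide by the total frequency to get the weighted average.
total_frequency = sum(percent)
mean_cars = weighted_sum / total_frequency

print(weighted_sum, total_frequency, mean_cars)
```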


Indices: A Worked Example

An index is a means of comparing something that is numerically measurable over time, in order to quantify the changes that have occurred. Indices use a base year as the point of comparison.

If we consider the performance of a company's share price over a year, we could find that the price dipped at the midpoint of the year before rallying at the year end. This might be expressed as follows:

Index of share price for Company X (Jan = 100)

Jan 100
Jun 84
Dec 123

To show how easy it is to change the base with indexed data, let's re-base this company's share performance to December.

Jan ??
Jun ???
Dec 100

Now, in order to work out the value of January's and June's indexed performance, we perform the following calculation:

?? = 100/123 x 100 = 81.3

??? = 84/123 x 100 = 68.3

So, the new indexed share price table appears as follows:

Jan 81.3
Jun 68.3
Dec 100
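Re-basing can be written as a small Python function. The function name `rebase` is hypothetical; it divides every index value by the value in the new base period and multiplies by 100.

```python
def rebase(index, new_base):
    """Return the index re-expressed so that the new_base period equals 100."""
    base_value = index[new_base]
    return {period: round(value / base_value * 100, 1)
            for period, value in index.items()}

# Share price index for Company X with January as the base.
share_index = {"Jan": 100, "Jun": 84, "Dec": 123}

print(rebase(share_index, "Dec"))
```

The same function would re-base to June by passing "Jun" as the second argument.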

Why not try the worksheet on index numbers to check you understand this?


Measures of Dispersion: Worked Examples

To illustrate the concepts of the range, mean absolute deviation, and the standard deviation we will continue to use the data in Table 1 above.

The range, as we saw in the explanation section, is simply the difference between the lowest and highest observations. The data in Table 1 run from 8.3 to 10.6, so the range, that is the difference between the highest and lowest unemployment rates in the UK between 1992 and 1996, is 10.6 - 8.3 = 2.3.

In order to calculate the mean absolute deviation (MAD), we must construct another table from which we will be able to read off this statistic.

Table 4: Unemployment Rates in the UK

Year/Qtr   Rate of unemp't (%)   X - X-bar   |X - X-bar|   (X - X-bar)^2
1992 Q2 9.8 0.3 0.3 .09
Q3 10.0 0.5 0.5 .25
Q4 10.4 0.9 0.9 .81
1993 Q1 10.6 1.1 1.1 1.21
Q2 10.4 0.9 0.9 .81
Q3 10.3 0.8 0.8 .64
Q4 10.3 0.8 0.8 .64
1994 Q1 10.0 0.5 0.5 .25
Q2 9.7 0.2 0.2 .04
Q3 9.4 -0.1 0.1 .01
Q4 9.0 -0.5 0.5 .25
1995 Q1 8.9 -0.6 0.6 .36
Q2 8.7 -0.8 0.8 .64
Q3 8.7 -0.8 0.8 .64
Q4 8.4 -1.1 1.1 1.21
1996 Q1 8.3 -1.2 1.2 1.44
Q2 8.3 -1.2 1.2 1.44
12.3 10.73

Source: National Statistics


In Table 4 above, column 3 gives the difference between the observed value and the mean (9.5, as calculated earlier). Column 4 states the values when the sign is ignored. These "absolute" values are summed at the foot of this column. This figure is divided by the number of observations, which in this case is 17:

MAD = 12.3/17 = 0.72 (correct to 2 s.f.)

To interpret this statistic we consider its size. A larger value implies a larger dispersion.

Standard deviation

The standard deviation can be calculated from the data in Table 4 as well. Column 5 gives the squared value of the difference between the observed value and the mean. These values are summed at the foot of the column. This sum is divided by the number of observations (17) to give the variance, and the square root of the variance gives the standard deviation.

$$s^2 = \frac{10.73}{17} = 0.63$$
$$\text{s.d.} = \sqrt{0.63} = 0.79$$

Again, the larger the value, the larger the deviation. The standard deviation carries the same units of measurement as the original data, so here the standard deviation is 0.79 percentage points.

What does this statistic tell us? This is a hard question to answer. On its own, the standard deviation is mainly useful as a means of comparison with the standard deviations of other datasets. Because the deviations are squared before they are averaged, larger deviations within the data carry more weight in the result.

Elsewhere in TimeWeb we will be using standard deviations for more advanced statistical work, but for now we must merely say that s.d. is the most frequently used measure of dispersion and is used when more formal statistical testing is required.

Coefficient of Variation: A Worked Example

As you can see in the 'Explanation' part of this section, the coefficient of variation is a summary measure used to give an indication of the amount of variability present in the data. It is calculated by expressing the standard deviation as a percentage of the mean.

Using the above example:

Coefficient of Variation = 0.79 / 9.5 x 100
= 0.083 x 100
= 8.3

This statistic would be used to compare against other datasets and a judgement would be made of the amount of variability in one dataset as against another.

Coefficient of Skewness:

As discussed in the 'Explanation' section, this summary statistic indicates the tendency for values in a dataset to bunch at one end of a distribution. Using the second formula given and applying it to the unemployment data given above, we can calculate that:

Coefficient of skewness within the data for UK unemployment 1992 - 96 = 3(9.5 - 9.7) / 0.79
= -0.6 / 0.79
= -0.76
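The dispersion measures in this and the preceding sections can be recomputed from the raw Table 1 data with a short Python sketch. It assumes, as Table 4 does, that deviations are taken from the mean rounded to 9.5, and that the skewness formula is 3(mean - median) / standard deviation. No other intermediate value is rounded, so the later results can differ a little from hand-worked figures. All variable names are illustrative.

```python
import math

# ILO unemployment rates (%) from Table 1, 1992 Q2 to 1996 Q2.
rates = [9.8, 10.0, 10.4,
         10.6, 10.4, 10.3, 10.3,
         10.0, 9.7, 9.4, 9.0,
         8.9, 8.7, 8.7, 8.4,
         8.3, 8.3]

n = len(rates)
mean = round(sum(rates) / n, 1)      # rounded to 9.5, as in Table 4
median = sorted(rates)[n // 2]       # middle value (n is odd)

data_range = max(rates) - min(rates)

# Column totals of Table 4.
abs_dev_sum = sum(abs(x - mean) for x in rates)
sq_dev_sum = sum((x - mean) ** 2 for x in rates)

mad = abs_dev_sum / n                # mean absolute deviation
variance = sq_dev_sum / n            # population variance (divide by n)
sd = math.sqrt(variance)             # standard deviation

cv = sd / mean * 100                 # coefficient of variation, in %
skewness = 3 * (mean - median) / sd  # negative when the mean is below the median

print(round(data_range, 1), round(mad, 2), round(variance, 2), round(sd, 2))
print(round(cv, 1), round(skewness, 2))
```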


Moving Averages: A Worked Example

As we saw in the 'Explanation' section, a moving average allows us to "smooth" out a series of data so that the underlying movement over time can be seen. Using the new car registrations data contained in Table 5 below, we can illustrate how to construct a moving average for this highly seasonal data.

Table 5: UK New Car Registrations 1994 (Q4) - 1999 (Q3)

Year/Qtr 000s Cars
1994 Q4 400.6
1995 Q1 613.9
Q2 511.7
Q3 748.3
Q4 432.7
1996 Q1 621.2
Q2 563.7
Q3 769.5
Q4 455.6
1997 Q1 646.3
Q2 606.1
Q3 849.0
Q4 498.1
1998 Q1 738.2
Q2 638.8
Q3 848.4
Q4 514.9
1999 Q1 761.6
Q2 692.2
Q3 781.4

Source: National Statistics


The first decision is a matter of judgement: over what period to calculate the averaging process? The longer the length of the average, the smoother the series becomes, as each individual piece of information becomes less significant on its own. But the drawback with calculating the averaging over a long period is that changes in the underlying trend are not picked up quickly.

With this new car data, we want to remove the seasonality from the data whilst still seeing the trend year-on-year. Averaging over four periods at a time (one full year of quarterly data) should enable us to achieve this. Having decided this, the first step is to average the first four observations. This gives us the moving average for the mid-point of these four observations.

In this example this is:

(400.6 + 613.9 + 511.7 + 748.3) / 4 = 568.6

The next step is to move the average along the dataset, by dropping the first observation and including the next period's data, then averaging over the four observations again:

(613.9 + 511.7 + 748.3 + 432.7) / 4 = 576.7

These calculations are shown in Table 6 below:

Year Calculation Moving Average
1994 Q4 -
1995 Q1 -
Q2 (400.6+613.9+511.7+748.3) / 4 569
Q3 (613.9+511.7+748.3+432.7) / 4 577
Q4 (511.7+748.3+432.7+621.2) / 4 578
1996 Q1 (748.3+432.7+621.2+563.7) / 4 591
Q2 (432.7+621.2+563.7+769.5) / 4 597
Q3 (621.2+563.7+769.5+455.6) / 4 603
Q4 (563.7+769.5+455.6+646.3) / 4 609
1997 Q1 (769.5+455.6+646.3+606.1) / 4 619
Q2 (455.6+646.3+606.1+849.0) / 4 639
Q3 (646.3+606.1+849.0+498.1) / 4 650
Q4 (606.1+849.0+498.1+738.2) / 4 673
1998 Q1 (849.0+498.1+738.2+638.8) / 4 681
Q2 (498.1+738.2+638.8+848.4) / 4 681
Q3 (738.2+638.8+848.4+514.9) / 4 685
Q4 (638.8+848.4+514.9+761.6) / 4 691
1999 Q1 (848.4+514.9+761.6+692.2) / 4 704
Q2 (514.9+761.6+692.2+781.4) / 4 688
Q3 -
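The sliding calculation can be sketched in Python. The function name `moving_average` is hypothetical; it averages each run of four consecutive quarters from Table 5, without the rounding to whole numbers used in the table above.

```python
def moving_average(values, window=4):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Table 5: UK new car registrations (thousands), 1994 Q4 to 1999 Q3.
registrations = [400.6,
                 613.9, 511.7, 748.3, 432.7,
                 621.2, 563.7, 769.5, 455.6,
                 646.3, 606.1, 849.0, 498.1,
                 738.2, 638.8, 848.4, 514.9,
                 761.6, 692.2, 781.4]

averages = moving_average(registrations)
for value in averages:
    print(round(value, 1))
```

Twenty observations give seventeen four-quarter averages, which is why some rows at the start and end of the table have no entry.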

Why not try the worksheet on moving averages to see how well you understand this?


Sampling Methods and Survey Types:

One of the world's best-known polling organisations, Gallup, say that one of the most frequently asked questions they get from Americans is why they've never been interviewed for a survey.

In an adult population of almost two hundred million, Americans express scepticism about the scientific reliability of sampling. In particular, they do not believe that a survey of 1500 - 2000 people can represent the views of all citizens.

Gallup's sampling principle is that selecting a sample of a small proportion of the whole population can represent the opinions of all the people, provided that the sample is properly selected.

  • So how do Gallup select a sample?
    Firstly, they have to locate a place where all or most Americans can be found. This isn't in the shopping mall, but at home. From the 1930s to mid 1980s, poll respondents were interviewed face-to-face in their homes. But by the 1990s, with approximately 95% of all U.S. homes having a telephone, the vast majority of surveys use this medium. Of course, this has the benefit of being a substantially less expensive method.
  • Identifying and describing the population.
    Gallup is often asked to carry out polls on behalf of an organisation with the aim of learning more about the population's attitudes and beliefs. Let's imagine that an American national newspaper wants a poll done about U.S. golf fans; the target population may be all Americans aged at least 18 who say that they're fans of golf. But if the poll was conducted on behalf of the U.S. PGA (Professional Golfers' Association), the target audience might be more specific; for instance, all people over the age of 16 who watch at least 5 hours of golf (during the major tournaments) each week. These are two surveys about the same sport, including many of the same target respondents, but with very different target populations.
  • Choosing a method to sample the target population randomly.
    The polling organisations have lists of all household telephone numbers in continental USA. A computerised system uses random digit dialling (RDD) to create a new list of all possible American telephone numbers, then selects a subset of numbers from that new list for the polling organisation to call. This is important because approximately 30% of American residential numbers are unlisted, according to recent estimates. The exclusion of these "hidden" numbers would introduce bias into the sample.
  • Sample Accuracy.
    With a sample size of 1000 adults, using the random selection process outlined above, Gallup can be statistically certain that 95 times out of one hundred, continued polling would produce the same result within a margin of error of +/- 3%. If the sample size was doubled to 2000 adults, Gallup would incur roughly twice the cost in conducting the survey, but the margin of error would decrease only to +/- 2%.
  • Interviewing the selected sample.
    What if the people randomly selected to survey are not in?
    What if some of the target population are busy on other phone calls when the pollsters call? In these cases the target respondent's phone number is stored and recalled later at regular times throughout the survey period.

    Excluding people who don't answer the phone the first time Gallup calls them would introduce bias into the survey sample: for instance, young single adults, who are frequently out or using the phone, would be less likely to be included in the sample than more sedentary people who are less frequent phone users.

    In a household with more than one adult in residence, Gallup randomly select an adult, either by asking for the person with the latest birthday or by asking the person who answers the phone to list all the adults who live there. The pollster then selects one of these adults at random.
  • Asking the "right" questions.
    Gallup assess that the greatest source of bias or error in survey data is probably the wording of the questions themselves.

    For example, you may have thought that conducting a pre-election poll of voting intentions would be a simple process. But the question "Who will you vote for in the next election?" can be equally as open to bias as any other survey. Does the polling organisation list the vice-presidential candidates along with the names of the presidential candidates? Should the party represented by the candidate be listed or should there be no indication of party affiliation?

    In these cases, Gallup tries to mimic the format and content of the ballot paper and reads the names of the presidential and vice-presidential candidates and gives the name of the party represented by them.

    Questions to do with policy issues can also be very tricky: are things like food stamps or housing grants to be called "welfare" or "programs for the poor"? If members of the armed services are going abroad should this be termed "sending" troops or "contributing" to a UN force? These are emotive topics and the wording of the question can "slant" the answer received from poll respondents.
  • The oldest one in the book.
    One of the oldest question wordings concerns presidential job approval. Since the 1930s and Roosevelt's presidency, Gallup has used the following question: "Do you approve or disapprove of the job .... is doing as president?"

    This means that there is a reliable trend line provided by the continuity of the question asked. If, for example, George W. Bush has a job approval rating of 48% after one year of his presidency, what can be learned from such a rating? What the trend line allows is for analysts to look into history and compare this figure with ratings recorded earlier in the presidential term. Additionally, an analysis can be made of this figure compared to ratings recorded during previous presidents' terms. In this case the question may be asked: did previous presidents with this approval rating at this stage in their term tend to get re-elected or not?
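The margins of error quoted under "Sample Accuracy" above can be approximated with the usual formula for a proportion from a simple random sample. This is a sketch under stated assumptions, not Gallup's own procedure: it assumes a 95% confidence level (z = 1.96) and the worst case of a 50/50 split of opinion (p = 0.5). The function name is hypothetical.

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes a 95% confidence level (z = 1.96) and, by default, the
    worst case p = 0.5. Returned as a proportion, not a percentage.
    """
    return z * math.sqrt(p * (1 - p) / sample_size)

for n in (1000, 2000):
    print(n, round(margin_of_error(n) * 100, 1))
```

Because the sample size sits under a square root, doubling the sample does not halve the margin of error; the sample would have to be four times as large to do that.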

Sampling: Further examples

  1. Surveys usually involve considerable expenditure of time, effort and cost. It is vital to clarify at the outset what you want to find out in the survey, before starting to use precious resources.

    The Trendy Tea and Coffee Company (TTCC) are set to launch a new premium brand of tea and want to get the packaging right. Four different designs are created from a traditional dark green colour, to a flashy black, silver and yellow look. TTCC employ a market research organisation who survey 1000 people to find out which design they prefer.

    On the basis of the reported survey findings, TTCC launch the new tea in the flashy design, and sales of the new product nosedive after the initial period. It becomes clear upon review that no research was carried out on the drinking habits of those people surveyed. If this work had been done, it would have shown that the regular tea drinkers in the sample population all preferred the dark green packaging.
  2. A Goods-In Inspector at a large drinks manufacturer in South-West England has to deal with a consignment of 1000 cases of grape juice. In the past, the drinks company has been affected by minor contamination in its fermenting process that has led to the loss of some batches of its best-selling line: "UK - the British Sherry for British Tastes".

    The inspector has neither the time nor the staff to open all the cases to check for possible sources of the contamination, but she wants to have an idea of what the whole consignment is like. She decides to open twenty cases of the grape juice - one case in every fifty delivered. She could just open every fiftieth case in turn, but this seems to be too standard an approach. She wants to introduce a more random method.

    So instead, the inspector imagines that the cases are numbered one to one thousand and then uses her computer to generate at random, twenty 4-figure numbers, ignoring all those that exceed one thousand. This gives the inspector her sample population. As a result, there is no bias in her choice of cases to inspect.
  3. The sampling method outlined above will be very labour intensive to carry out. The inspector may have to open case 972, followed by case 23, then case 427. She realises this will be very tedious work and tries to think of a different solution - one that combines random and multi-stage sampling methods:

    She decides to split the consignment into batches of twenty-five, giving forty batches in total. From each of these she chooses one case by selecting a random number from one to twenty-five.
    This multi-stage sampling approach saves the inspector time, cost and effort.
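The inspector's two approaches in examples 2 and 3 can be sketched in Python with the standard `random` module. The seed is fixed only so that the sketch is repeatable, and the variable names are illustrative.

```python
import random

rng = random.Random(42)  # fixed seed, used here only for repeatability

# Example 2: a simple random sample of 20 cases from cases numbered 1 to 1000.
cases = range(1, 1001)
simple_sample = sorted(rng.sample(cases, 20))

# Example 3: split the consignment into batches of 25 consecutive cases
# and pick one case at random from each batch.
batch_size = 25
batch_sample = [start + rng.randrange(batch_size)
                for start in range(1, 1001, batch_size)]

print(simple_sample)
print(batch_sample)
```

The second list is already in delivery order, which is what spares the inspector from jumping around the consignment. With batches of twenty-five it contains forty cases, one per batch.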

Correlation between variables

Let's start by looking at how a scatter diagram can illustrate these relationships:

  • Scattergrams

    The scattergram or XY chart can be a useful way of representing the relationship between two variables. The usual conventions of dependent and independent variable position on the axes are followed. Points on the diagram are not connected as they are on a line graph. The relationship between the two variables displayed on the chart may be positive, negative or non-existent.

    In Chart 1 there is a very strong negative correlation shown between disposable income levels and the number of discount stores in existence. You may feel that this makes sense as a hypothesis, in that as income levels fall, more discount retail enterprises emerge.

    Chart 1: Scattergram (XY Chart) showing a negative association
    (Data for display purposes only)


    In Chart 2, disposable income is plotted against number of overseas holidays taken. It shows that in this case there is a strong positive correlation between income and "luxury" items, such as foreign holidays. You may not agree that a foreign holiday is a luxury, but may feel that, in general, the higher the income level the greater the number of overseas holidays taken will be.

    Chart 2: Scattergram showing a positive association
    (Data for display purposes only)


    The charts shown here are meant to illustrate the concept of a scattergram. In practice, of course, the points on a scattergram are likely to be scattered more widely across the chart, although a strong association between the two variables is likely to allow us to draw a straight line through the points shown. Such a straight line, the one that seems to fit the points on the diagram best, is known as the "line of best fit".

    The line of best fit is usually drawn by eye, but there are more sophisticated ways of making the line more accurate. One aid is the known result that, for a set of points on a scattergram, the line of best fit always passes through the point (x-bar, y-bar), where x-bar is the mean of the x (horizontal) values and y-bar is the mean of the y (vertical) values.

    Chart 3 illustrates a lack of a statistical relationship. There is little or no correlation between disposable income and amount of rainfall, unless of course we're looking at the long term effect on the global climate of taking all these extra foreign holidays and driving all these new cars that our higher incomes can afford!

    Chart 3: Scattergram showing little or no association
    (Data for display purposes only)


    But we can go further than just representing the correlation between two separate variables; we can formally measure the strength of the association between them.

  • The Correlation Coefficient

    As indicated, the idea behind the correlation coefficient is that we can give a number value to the strength of relationship between one variable and another. There are two main measures commonly used: Spearman's Rank Correlation Coefficient and Pearson's Product-moment Correlation Coefficient. The former of these two is the less complicated to calculate and works on ranked data, which allows us to assess the aesthetic or qualitative characteristics of data. The latter allows us to measure the strength of the association between two variables by working out the dispersion of the scattergram points.

    There is an illustration of correlation coefficient measures in the 'Crunching' section on TimeWeb.
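As a minimal sketch of the ideas above, the Python below fits a least-squares line (one common way of making the line of best fit precise, assumed here for illustration) and computes Pearson's coefficient. The data are hypothetical and for display purposes only; they are not read from the charts.

```python
import math

# Hypothetical data, for display purposes only.
income = [10, 15, 20, 25, 30, 35]            # disposable income (arbitrary units)
holidays = [0.5, 0.9, 1.1, 1.8, 2.0, 2.6]    # overseas holidays per year

n = len(income)
x_bar = sum(income) / n
y_bar = sum(holidays) / n

# Sums of squares and cross-products of deviations from the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, holidays))
sxx = sum((x - x_bar) ** 2 for x in income)
syy = sum((y - y_bar) ** 2 for y in holidays)

# Least-squares line of best fit: y = intercept + slope * x.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Pearson's product-moment correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# The fitted line passes through the point (x-bar, y-bar).
fitted_at_mean = intercept + slope * x_bar

print(round(slope, 3), round(intercept, 3), round(r, 3))
print(round(fitted_at_mean, 3), round(y_bar, 3))
```

A value of r close to +1 corresponds to the strong positive association of Chart 2, a value close to -1 to Chart 1, and a value near 0 to Chart 3.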


Normal Distribution Curve illustration

The chart below illustrates a normally distributed population. You will notice that the curve conforms to the characteristics outlined in the explanation section: the most frequent value is at the centre; there is symmetry about the central value; there is diminishing frequency as you move away from the centre.

A line is drawn from each of the two points of inflexion (one on either side of the mean) to the X-axis. The distance from that point to the mean point on the X-axis is equal to the standard deviation.

Four separate areas are now identifiable from the chart:

Normal distribution curve

Area A shows the area between the mean and one standard deviation above the mean.

Area B shows the area between the mean and one standard deviation below the mean.

Area C indicates the area to the right of one standard deviation above the mean.

Area D indicates the area to the left of one standard deviation below the mean.

Because the normal curve is symmetrical, Area A equals Area B. Areas C and D are also equal. The total of A, B, C and D equals the total area under the curve, or the entire population.

Mathematical calculations show that in any normal distribution, approximately 68% of all observations fall within one standard deviation (SD) of the mean (Areas A plus B). So, about 34% of observations lie between the mean and one standard deviation above the mean (Area A) and 34% lie between the mean and one standard deviation below the mean (Area B). By subtraction, we can tell that in a normal distribution 32% of the observations fall outside one standard deviation, 16% on either side (16% in Area C and 16% in Area D).

Let's now put this into the language of probability: In any normal distribution, there is a .68 probability that a particular value will fall within one standard deviation of the mean; there is approximately a .34 probability that a value will lie between the mean and one SD above the mean (Area A) and a .34 probability that a value will lie between the mean and one SD below the mean (Area B).

Also, there is a .16 probability that a particular value will lie above one SD from the mean (Area C) and a .16 probability that the value will lie below one SD from the mean (Area D).

Using this knowledge, we can re-draw our normal curve chart, now putting in six separate areas:

Sections of the normal curve

The vertical lines from the curve to the X-axis represent the mean (at the centre) and distances of one and two SDs on either side of the mean.

Areas A and B have the same characteristics as in the first chart; each being equal and each containing approximately 34% of all the values in the normal distribution.

Areas C and D are also equal and are defined by the vertical lines indicating one and two SDs from the mean (on either side). Each of these areas contains approximately 13.5% of all the values in the normal distribution.

Areas E and F at the extreme ends of the curve are defined by the vertical lines indicating two SDs from the mean and the tail ends of the distribution. Each of these areas contains approximately 2.5% of all the values. In other words, in a normal distribution, about 5% of a population will be more than two SDs from the mean: 2.5% above and 2.5% below.

Let's restate this information in the language of probability:

  1. In any normal distribution, there is a .34 probability that any particular value will fall between the mean and one SD above the mean (Area A) and the same probability of the value falling between the mean and one SD below the mean (Area B).

  2. There is a .135 probability of any value falling between one and two SDs above the mean (Area C) and the same probability of the value falling between one and two SDs below the mean (Area D).

  3. There is a .475 probability that any value will fall between the mean and two SDs above the mean (Areas A and C together) and the same probability of the value falling between the mean and two SDs below the mean (Areas B and D together).

  4. The mathematics of normal curves shows that the area contained by the vertical lines representing three SDs from the mean contains 99.7% of the area under the curve and 99.7% of all the values in the data set. There is, therefore, a probability of .997 that in any normal distribution any particular value will fall within three SDs from the mean.
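These probabilities can be checked numerically. The sketch below uses the error function from Python's `math` module to compute the area of a standard normal curve within k standard deviations of the mean. The function and variable names are illustrative, with the area names matching the six-area chart.

```python
import math

def prob_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

within_1 = prob_within(1)
within_2 = prob_within(2)
within_3 = prob_within(3)

area_a = within_1 / 2                 # mean to one SD above (same as Area B)
area_c = (within_2 - within_1) / 2    # one to two SDs above (same as Area D)
area_e = (1 - within_2) / 2           # beyond two SDs above (same as Area F)

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
print(round(area_a, 4), round(area_c, 4), round(area_e, 4))
```

The exact areas are slightly different from the rounded figures quoted in the text: about 95.45% of values lie within two SDs, so each tail beyond two SDs holds a little under 2.5%.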

Why not try the "What samples tell us" worksheet to check that you understand this?


Random Sampling:

Random sampling is usually the preferred method of sampling, because of the lack of built-in bias that is involved.

This method requires that a list of every member of the population is available. There are times when this will be impossible, for instance when an entire national or regional population is involved, or for example if you are studying the whole population of small businesses in the UK. In these cases, the simple random sampling method outlined below will not be appropriate.

In a simple random sample, with a list of the entire population being studied, the sampler gives a number to every item on the list and selects the sample by using a random number generator or a table of random numbers.
Here's how it works.

Car storage at Avonmouth Docks

Imagine you want to study all the cars being stored in a warehousing complex, but you don't have the time or other resources to deal with them all. You might decide to work with a sample of 30 cars out of a total warehouse population of 1000.

So, you begin by assigning a number to every member of the total population. As the largest number you need (1000) has four digits, every car in the warehouse is given a four digit number, beginning with 0001, 0002, 0003 and so on, up to 1000.

You look at your list of random numbers, which looks like the following:

A TABLE OF RANDOM NUMBERS
00 10097 32533 76520 13586 34673 54876 80959 09117 39292 74945
01 37542 04805 64894 74296 24805 24037 20636 10402 00822 91665
02 08422 68953 19645 09303 23209 02560 15953 34764 35080 33606
03 99019 02529 09376 70715 38311 31165 88676 74397 04436 27659
04 12807 99970 80157 36147 64032 36653 98951 16877 12171 76833
05 66065 74717 34072 76850 36697 36170 65813 39885 11199 29170
06 31060 10805 45571 82406 35303 42614 86799 07439 23403 09732
07 85269 77602 02051 65692 68665 74818 73053 85247 18623 88579
08 63573 32135 05325 47048 90553 57548 28468 28709 83491 25624
09 73796 45753 03529 64778 35808 34282 60935 20344 35273 88435
10 98520 17767 14905 68607 22109 40558 60970 93433 50500 73998
11 11805 05431 39808 27732 50725 68248 29405 24201 52775 67851
12 83452 99634 06288 98083 13746 70078 18475 40610 68711 77817
13 88685 40200 86507 58401 36766 67951 90364 76493 29609 11062
14 99594 67348 87517 64969 91826 08928 93785 61368 23478 34113
15 65481 17674 17468 50950 58047 76974 73039 57186 40218 16544
16 80124 35635 17727 08015 45318 22374 21115 78253 14385 53763
17 74350 99817 77402 77214 43236 00210 45521 64237 96286 02655
18 69916 26803 66252 29148 36936 87203 76621 13990 94400 56418
19 09893 20505 14225 68514 46427 56788 96297 78822 54382 14598
20 91499 14523 68479 27686 46162 83554 94750 89923 37089 20048
21 80336 94598 26940 36858 70297 34135 53140 33340 42050 82341
22 44104 81949 85157 47954 32979 26575 57600 40881 22222 06413
23 12550 73742 11100 02040 12860 74697 96644 89439 28707 25815
24 63606 49329 16505 34484 40219 52563 43651 77082 07207 31790
25 61196 90446 26457 47774 51924 33729 65394 59593 42582 60527
26 15474 45266 95270 79953 59367 83848 82396 10118 33211 59466
27 94557 28573 67897 54387 54622 44431 91190 42592 92927 45973
28 42481 16213 97344 08721 16868 48767 03071 12059 25701 46670
29 23523 78317 73208 89837 68935 91416 26252 29663 05522 82562
30 04493 52494 75246 33824 45862 51025 61962 79335 65337 12472
31 00549 97654 64051 88159 96119 63896 54692 82391 23287 29529
32 35963 15307 26898 09354 33351 35462 77974 50024 90103 39333
33 59808 08391 45427 26842 83609 49700 13021 24892 78565 20106
34 46058 85236 01390 92286 77281 44077 93910 83647 70617 42941
35 32179 00597 87379 25241 05567 07007 86743 17157 85394 11838
36 69234 61406 20117 45204 15956 60000 18743 92423 97118 96338
37 19565 41430 01758 75379 40419 21585 66674 36806 84962 85207
38 45155 14938 19476 07246 43667 94543 59047 90033 20826 69541
39 94864 31994 36168 10851 34888 81553 01540 35456 05014 51176
40 98086 24826 45240 28404 44999 08896 39094 73407 35441 31880
41 33185 16232 41941 50949 89435 48581 88695 41994 37548 73043
42 80951 00406 96382 70774 20151 23387 25016 25298 94624 61171
43 79752 49140 71961 28296 69861 02591 74852 20539 00387 59579
44 18633 32537 98145 06571 31010 24674 05455 61427 77938 91936
45 74029 43902 77557 32270 97790 17119 52527 58021 80814 51748
46 54178 45611 80993 37143 05335 12969 56127 19255 36040 90324
47 11664 49883 52079 84827 59381 71539 09973 33440 88461 23356
48 48324 77928 31249 64710 02295 36870 32307 57546 15020 09994
49 69074 94138 87637 91976 35584 04401 10518 21615 01848 76938

You begin the selection by pointing (with your eyes closed) to an area in the table. Imagine you point to line 10 (the lines are numbered down the left-hand side of the table). Reading along the line digit by digit, the first four digit number between 0001 and 1000 is 0177. It is made up of the fifth digit of the first number in line 10 (98520) and the first three digits of the second (17767). As the table is simply a run of random digits printed in groups of five, it's acceptable for a four digit number to begin part-way through one group and run on into the next.

The second four digit number is 0568.
The third number is 0722.
The fourth is 0940.
The fifth is 0970.
The sixth is 0500.

You would continue down the table, gathering four digit numbers until you had collected thirty numbers between 0001 and 1000. Each of these would represent one car in the warehouse, chosen at random to form a sample of thirty cars.
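
The reading of line 10 described above can be sketched in Python. This is only one interpretation of the procedure: it treats the line as a single run of digits, moves along one digit at a time, and accepts a four digit window whenever it lies between 0001 and 1000. The function name pick_from_digits is made up for this illustration.

```python
# Line 10 of the table of random numbers, copied from above
LINE_10 = "98520 17767 14905 68607 22109 40558 60970 93433 50500 73998"

def pick_from_digits(row, population_size=1000, width=4):
    """Scan a row of random digits and collect the four digit numbers
    that fall between 1 and population_size."""
    digits = row.replace(" ", "")  # ignore the gaps between groups
    picks = []
    i = 0
    while i + width <= len(digits):
        value = int(digits[i:i + width])
        if 1 <= value <= population_size:
            picks.append(value)
            i += width  # these digits are used up, carry on after them
        else:
            i += 1      # out of range, move along one digit
    return picks

print(pick_from_digits(LINE_10))  # [177, 568, 722, 940, 970, 500]
```

The six numbers printed match the six car numbers listed above. In a real sample you would also skip any number that had already been chosen, so that no car appears twice.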

There is less bias in this selection method because every member of the population has an equal chance of being selected, and represented in the sample. You have made no attempt to organise the population into sections, so the selection process is free from your direction.
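
If you have a computer to hand, a random number generator can take the place of the printed table. The sketch below assumes the cars are numbered 0001 to 1000, as in the warehouse example. The function name sample_car_numbers and the optional seed are made up for this illustration.

```python
import random

def sample_car_numbers(population_size=1000, sample_size=30, seed=None):
    """Choose sample_size different cars from a numbered population.
    Every car has the same chance of being chosen."""
    rng = random.Random(seed)  # a seed makes the choice repeatable
    chosen = rng.sample(range(1, population_size + 1), sample_size)
    # Show each number with four digits, as on the warehouse list
    return [f"{n:04d}" for n in sorted(chosen)]

print(sample_car_numbers())
```

Because the sample is drawn without replacement, no car can be chosen twice.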


Probability

Jacques (Jacob) Bernoulli was the first to prove what is now known as the 'law of large numbers', which grew out of his work on probability. Imagine that you have a container holding thousands of pebbles. You don't know how many there are, nor do you know that of the 5000 pebbles, 3000 are white and 2000 are black. The ratio of white to black pebbles is therefore 3:2.

Bernoulli asked how many pebbles you would need to draw from the container before you could make an estimate of the actual ratio of white to black pebbles. Of course you would begin to get a fairly clear idea pretty soon, as you picked out a pebble, noted its colour and then replaced it in the container. But the key to Bernoulli's theorem is whether or not you can repeat the experiment often enough for it to become ten, or one hundred, times more probable that your estimate lies close to the true 3:2 ratio than that it does not.

Bernoulli showed that this is the case: the more draws are made, the more likely it is that the estimated ratio will lie close to the true ratio.
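
You can try Bernoulli's experiment for yourself with a short simulation. The sketch below assumes the container described above (3000 white and 2000 black pebbles) and draws with replacement. The function name estimate_white_share and the fixed seed are made up for this illustration, and the exact figures printed will depend on the random numbers used.

```python
import random

def estimate_white_share(draws, white=3000, black=2000, seed=0):
    """Draw pebbles one at a time, replacing each after noting its
    colour, and return the share of draws that were white."""
    rng = random.Random(seed)
    total = white + black
    white_seen = 0
    for _ in range(draws):
        # Pebbles numbered 0 to white-1 are white, the rest are black
        if rng.randrange(total) < white:
            white_seen += 1
    return white_seen / draws

# The true share of white pebbles is 3000/5000 = 0.6 (a 3:2 ratio)
for n in (10, 100, 1000, 10000):
    print(n, estimate_white_share(n))
```

With only a few draws the estimate can be some way from 0.6. With many draws it should settle close to it.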


Time series

Apart from drawing a trend curve onto a graph freehand, there are two common methods of identifying trends in time series data:

  • using moving averages.
  • using regression analysis to find the line of 'best fit'.
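
Both methods can be sketched in a few lines of Python. The example below uses the 1999 takings from the Z chart table earlier on this page. The three month window, the numbering of the months from 1 to 12 and the function names are all choices made for this illustration, not part of any fixed method.

```python
# 1999 monthly takings (£), January to December, from the Z chart table
TAKINGS_1999 = [1990, 2260, 1780, 1780, 1530, 1360,
                1170, 1140, 1480, 1530, 1590, 2160]

def moving_average(values, window=3):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def best_fit_line(values):
    """Least squares line through the points (1, y1), (2, y2), ...
    Returns the slope and the intercept."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(moving_average(TAKINGS_1999))
print(best_fit_line(TAKINGS_1999))
```

The first moving average is (1990 + 2260 + 1780) / 3 = 2010. A longer window smooths the series more but gives fewer points to plot.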