ILLUSTRATION
- Z Charts example
- Cross-section data example
- Ratios: A Worked Example
- Exponents: A Worked Example
- Sequence of Operations: A Worked Example
- Central Tendency: A Worked Example
- Weighted Average: A Worked Example
- Indices: A Worked Example
- Measures of Dispersion: Worked Examples
- Coefficient of Variation: A Worked Example
- Coefficient of Skewness
- Moving Averages: A Worked Example
- Sampling Methods and Survey Types
- Sampling: Further Examples
- Correlation between variables
- Normal Distribution Curve illustration
- Random Sampling
- Probability
- Time series
Z Charts example
The following table contains sales data for a sole trader who runs a domestic service operation.
| Month | 1998 Takings (£) | 1999 Takings (£) | Cumulative totals 1999 (£) | M.A.T. (£) |
| Jan | 1680 | 1990 | 1990 | 18530 |
| Feb | 1410 | 2260 | 4250 | 19380 |
| Mar | 1600 | 1780 | 6030 | 19560 |
| Apr | 1540 | 1780 | 7810 | 19800 |
| May | 1610 | 1530 | 9340 | 19720 |
| Jun | 1070 | 1360 | 10700 | 20010 |
| Jul | 920 | 1170 | 11870 | 20260 |
| Aug | 730 | 1140 | 13010 | 20670 |
| Sep | 1870 | 1480 | 14490 | 20280 |
| Oct | 1880 | 1530 | 16020 | 19930 |
| Nov | 1550 | 1590 | 17610 | 19970 |
| Dec | 2360 | 2160 | 19770 | 19770 |
| Total | 18220 | | | |
The table is based on monthly sales figures, given in columns 2 and 3. The 1998 figures are included to help us to calculate the Moving Annual Total (M.A.T.).
The cumulative totals for 1999 (column 4) are calculated by adding the current month's sales figure to the previous month's total. So, in January there is only one figure to be entered. This is the start of the year and the total sales at the end of the first month stood at £1990. At the end of February 1999, £2260 sales were made. This figure is added to the January total to give the cumulative total of (£1990 + £2260 =) £4250. The other figures in this column are derived in the same way.
The Moving Annual Totals (M.A.T.) are found by totalling the monthly sales figures over a twelve-month period. As each new month's sales figure becomes known, it is added to the M.A.T. and the figure for the corresponding month of the previous year is dropped. So, the M.A.T. at the end of December 1998 is the total of the monthly sales from the start of January 1998 to the end of December 1998, which comes to £18 220. The M.A.T. at the end of January 1999 is the total of the monthly sales from the start of February 1998 to the end of January 1999. You can get this figure by subtracting the January 1998 sales figure from £18 220 and adding the January 1999 figure (18 220 - 1680 + 1990 = 18 530).
Notice that the cumulative total and the MAT for December 1999 must come to the same figure.
The data in columns 3, 4 and 5 of the table can now be plotted on a graph, so that the fluctuations can be noticed more easily.
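The two derived columns can be recomputed from the monthly takings alone. The following is a minimal Python sketch of that calculation; the variable names are illustrative and are not part of the original material.

```python
# Monthly takings (£), copied from columns 2 and 3 of the table above.
takings_1998 = [1680, 1410, 1600, 1540, 1610, 1070, 920, 730, 1870, 1880, 1550, 2360]
takings_1999 = [1990, 2260, 1780, 1780, 1530, 1360, 1170, 1140, 1480, 1530, 1590, 2160]

cumulative = []              # running total of 1999 takings (column 4)
mat = []                     # moving annual total (column 5)
running = 0
annual = sum(takings_1998)   # M.A.T. at the end of December 1998

for old, new in zip(takings_1998, takings_1999):
    running += new           # add this month's takings to the running total
    annual += new - old      # add the new month, drop the same month of 1998
    cumulative.append(running)
    mat.append(annual)

print(cumulative)
print(mat)
```

Plotting the 1999 takings, `cumulative` and `mat` against the month gives the three lines of the Z chart.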
Cross-section data example
UK Labour Market Participation of Ethnic Groups (1996)
The UK workforce included the following numbers of workers from ethnic minority groups:
| Of Indian origin | 590 000 |
| Of Pakistani origin | 330 000 |
| Of Black Caribbean origin | 320 000 |
| Of Black African origin | 200 000 |
| Of Bangladeshi origin | 110 000 |
| Of Chinese origin | 100 000 |
Using the table giving data for Labour Market Participation rates of Ethnic Minority Groups, we can calculate that Black Caribbean workers in 1996 accounted for 19.4% of the total ethnic minority workforce in the UK (320 000 out of the 1 650 000 workers listed).
Extension to percentages in cross-sectional data material
If we had incomplete information, for instance if the total number of people in all categories was unknown, we could still find this out, as long as we knew the total number of Black Caribbean workers in the UK and what percentage of the total ethnic minority working population this comprised.
Let the total ethnic minority working population in the UK = x
$\frac{19.4}{100}\,x = 320\,000$
This is the same as:
$0.194x = 320\,000$
Dividing both sides by 0.194:
$\frac{0.194x}{0.194} = \frac{320\,000}{0.194}$
$x = \frac{320\,000}{0.194} \approx 1\,649\,485$
x = 1 650 000 (rounded to the nearest ten thousand)
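Both directions of this percentage calculation can be checked in a few lines. This is only a sketch using the figures from the table above; the variable names are illustrative.

```python
# Workforce counts from the table above, in the order listed.
counts = [590_000, 330_000, 320_000, 200_000, 110_000, 100_000]
black_caribbean = 320_000

# Forward: share of the total, as a percentage.
share = black_caribbean / sum(counts) * 100

# Backward: recover the total from the count and its rounded percentage.
estimated_total = black_caribbean / 0.194

print(round(share, 1))
print(round(estimated_total, -4))  # rounded to the nearest ten thousand
```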
Ratios: A Worked Example
Alice, Ben and Charlie enter into a partnership.
Alice provides £30 000 capital
Ben provides £18 000
Charlie provides £24 000
Profits from the partnership are to be divided amongst the partners in proportion to their capital. In a year when the profit amounts to £30 000, how much does each receive?
Ratio of Alice's, Ben's and Charlie's capital =
30 000:18 000:24 000 = 15:9:12
= 5:3:4
There are 12 parts in total
Alice has 5/12, Ben has 3/12 (1/4) and Charlie has 4/12 (1/3).
Alice will receive 5/12 x 30 000 = £12 500
Ben will receive 3/12 x 30 000 = £7 500
Charlie will receive 4/12 x 30 000 = £10 000
(Confirm answer by adding these figures: 12 500 + 7 500 + 10 000 = 30 000)
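The same split can be sketched in Python. Exact fractions are used here so that no rounding creeps in; the dictionary layout is an illustrative choice.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

capital = {"Alice": 30_000, "Ben": 18_000, "Charlie": 24_000}
profit = 30_000

# Reduce the capital figures to their simplest ratio.
divisor = reduce(gcd, capital.values())
ratio = {name: amount // divisor for name, amount in capital.items()}
parts = sum(ratio.values())

# Each partner receives their number of parts out of the total.
shares = {name: Fraction(r, parts) * profit for name, r in ratio.items()}

for name, share in shares.items():
    print(name, int(share))
```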
Exponents: A Worked Example
Remember that we are illustrating exponents (see the explanation of exponents if you are not sure) by using a, m and n, where m = 4 and n = 2.
$a^m \cdot a^n = a^{m+n}$
(to illustrate this with m = 4 and n = 2)
$a^4 \cdot a^2 = (a \cdot a \cdot a \cdot a)(a \cdot a) = a \cdot a \cdot a \cdot a \cdot a \cdot a = a^6 = a^{4+2}$
Sequence of Operations: A Worked Example
$4 \times 4^2 - (2 + 1 \times 2^2)^2 / 3$
$= 4 \times 16 - (2 + 1 \times 4)^2 / 3$
$= 64 - 6^2 / 3$
$= 64 - 36 / 3$
$= 64 - 12 = 52$
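Python applies the same order of operations (brackets, then exponents, then multiplication and division, then addition and subtraction), so the expression can be checked step by step. This is a small sketch; the intermediate names are illustrative.

```python
# Evaluate the bracket first: exponent, then multiplication, then addition.
bracket = 2 + 1 * 2 ** 2

# Then the rest: exponents, then multiplication and division, then subtraction.
step_by_step = 4 * 4 ** 2 - bracket ** 2 / 3

# The whole expression written on one line gives the same value.
one_line = 4 * 4 ** 2 - (2 + 1 * 2 ** 2) ** 2 / 3

print(bracket, step_by_step, one_line)
```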
Central Tendency: A Worked Example
Table 1 contains information on the levels of unemployment in the UK between 1992 and 1996.
| Quarter | 1992 | 1993 | 1994 | 1995 | 1996 |
| Q1 | n.a. | 10.6 | 10.0 | 8.9 | 8.3 |
| Q2 | 9.8 | 10.4 | 9.7 | 8.7 | 8.3 |
| Q3 | 10.0 | 10.3 | 9.4 | 8.7 | n.a. |
| Q4 | 10.4 | 10.3 | 9.0 | 8.4 | n.a. |
Table 1: ILO unemployment rate 1992 (Q2) - 1996 (Q2) : UK: All: Aged 16 and over: %: SA
Source: Statbase
Taking this data as our source, the modal value or the most frequently occurring number can be observed best by ordering the data, as follows:
| 8.3 |
| 8.3 |
| 8.4 |
| 8.7 |
| 8.7 |
| 8.9 |
| 9.0 |
| 9.4 |
| 9.7 |
| 9.8 |
| 10.0 |
| 10.0 |
| 10.3 |
| 10.3 |
| 10.4 |
| 10.4 |
| 10.6 |
Table 2: 1992 - 1996 unemployment rates (%) placed in value order.
You can see that there are actually 5 modal values: 8.3, 8.7, 10.0, 10.3, and 10.4. This does not convey much useful information to us as students of labour market patterns in the UK.
The median value is also best viewed from the ordered data. You can see that the middle value of the ordered data is 9.7 as there are eight observations above and below this point.
The mean is probably the best central measure because there are no outlying observations (see the 'Explanation' section).
X-bar (the mean unemployment rate in the UK 1992 - 1996) = 161.2/17
Mean = 9.5 (correct to 2 s.f.)
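The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch; the list simply holds the 17 available rates from Table 1 in date order.

```python
from statistics import mean, median, multimode

# ILO unemployment rates (%), 1992 Q2 to 1996 Q2, from Table 1.
rates = [9.8, 10.0, 10.4, 10.6, 10.4, 10.3, 10.3, 10.0, 9.7,
         9.4, 9.0, 8.9, 8.7, 8.7, 8.4, 8.3, 8.3]

modes = sorted(multimode(rates))   # every value tied for most frequent
middle = median(rates)             # 9th of the 17 ordered values
average = mean(rates)

print(modes)
print(middle)
print(round(average, 1))
```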
Weighted Average: A Worked Example
The Weighted Average was described in the 'Explanation' section (see the explanation of indices if you are not sure). It is a useful calculation of the average when you have data which is grouped, such as in the following example:
| Cars / household (Xi) | % (fi) | fi Xi |
| 0 | 19 | 0 |
| 1 | 43 | 43 |
| 2 | 28 | 56 |
| 3 | 8 | 24 |
| 4 | 2 | 8 |
| Total | 100 | 131 |
Table 3: Cars per household, %, 1996, UK.
In Table 3, above, the percentage of households owning 0 to 4 cars is shown. To find the average number of cars per household, we multiply each possible number of cars by the percentage in that category (the frequency). Column 3 shows this calculation, where Xi is the number of cars and fi is the percentage in that category.
The average is the sum of this column, divided by the total frequency (100 as the data is in percentages).
131/100 = 1.31
The formula for the weighted average is as follows:
$\bar{X} = \frac{\sum f_i X_i}{n}$, where n is the total frequency (here 100).
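The calculation in Table 3 can be sketched in Python as follows. The list names are illustrative.

```python
cars = [0, 1, 2, 3, 4]         # Xi: cars per household
percent = [19, 43, 28, 8, 2]   # fi: percentage of households in each category

weighted_sum = sum(f * x for f, x in zip(percent, cars))   # sum of fi * Xi
total_frequency = sum(percent)                             # n
weighted_average = weighted_sum / total_frequency

print(weighted_sum, total_frequency, weighted_average)
```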
Indices: A Worked Example
An index is a means of comparing something that is numerically measurable over time, in order to quantify the changes that have occurred. Indices use a base period (often a base year) as the point of comparison.
If we consider the performance of a company's share price over a year, we could find that the price dipped at the midpoint of the year before rallying at the year end. This might be expressed as follows:
Index of share price for Company X (Jan = 100)
| Jan | 100 |
| Jun | 84 |
| Dec | 123 |
To show how easy it is to change the base with indexed data, let's re-base this company's share performance to December.
| Jan | ?? |
| Jun | ??? |
| Dec | 100 |
Now, in order to work out the value of January's and June's indexed performance, we perform the following calculation:
?? = 100/123 x 100 = 81.3
??? = 84/123 x 100 = 68.3
So, the new indexed share price table appears as follows:
| Jan | 81.3 |
| Jun | 68.3 |
| Dec | 100 |
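Re-basing amounts to dividing every index value by the value in the new base period and multiplying by 100. A short sketch follows; the function name `rebase` is hypothetical.

```python
def rebase(index, new_base):
    """Return the index re-based so that the `new_base` period equals 100."""
    base_value = index[new_base]
    return {period: round(value / base_value * 100, 1)
            for period, value in index.items()}

share_index = {"Jan": 100, "Jun": 84, "Dec": 123}   # Jan = 100
print(rebase(share_index, "Dec"))
```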
Why not try the worksheet on index numbers to check you understand this?
Measures of Dispersion: Worked Examples
To illustrate the concepts of the range, the mean absolute deviation and the standard deviation, we will continue to use the data in Table 1 above.
The range, as we saw in the explanation section, is simply the difference between the lowest and highest observations. The data in Table 1 run from 8.3 to 10.6, so the range, that is the difference between the highest and lowest unemployment rates in the UK between 1992 and 1996, is 10.6 - 8.3 = 2.3.
In order to calculate the mean absolute deviation (MAD), we must construct another table from which we will be able to read off this statistic.
Table 4: Unemployment Rates in the UK
| Year/Qtr | Rate of unemp't (%) | X - X-bar | Absolute value of X - X-bar | (X - X-bar)^2 |
| 1992 Q2 | 9.8 | 0.3 | 0.3 | .09 |
| Q3 | 10.0 | 0.5 | 0.5 | .25 |
| Q4 | 10.4 | 0.9 | 0.9 | .81 |
| 1993 Q1 | 10.6 | 1.1 | 1.1 | 1.21 |
| Q2 | 10.4 | 0.9 | 0.9 | .81 |
| Q3 | 10.3 | 0.8 | 0.8 | .64 |
| Q4 | 10.3 | 0.8 | 0.8 | .64 |
| 1994 Q1 | 10.0 | 0.5 | 0.5 | .25 |
| Q2 | 9.7 | 0.2 | 0.2 | .04 |
| Q3 | 9.4 | -0.1 | 0.1 | .01 |
| Q4 | 9.0 | -0.5 | 0.5 | .25 |
| 1995 Q1 | 8.9 | -0.6 | 0.6 | .36 |
| Q2 | 8.7 | -0.8 | 0.8 | .64 |
| Q3 | 8.7 | -0.8 | 0.8 | .64 |
| Q4 | 8.4 | -1.1 | 1.1 | 1.21 |
| 1996 Q1 | 8.3 | -1.2 | 1.2 | 1.44 |
| Q2 | 8.3 | -1.2 | 1.2 | 1.44 |
| Total | | | 12.3 | 10.73 |
Source: National Statistics
In Table 4 above, column 3 gives the difference between each observed value and the mean (9.5, as calculated earlier). Column 4 gives the same values with the sign ignored. These "absolute" values are summed at the foot of the column. This figure is divided by the number of observations, which in this case is 17:
MAD = 12.3/17 = 0.72 (correct to 2 s.f.)
To interpret this statistic we consider its size. A larger value implies a larger dispersion.
Standard deviation
The standard deviation can be calculated from the data in Table 4 as well. Column 5 gives the squared value of the difference between the observed value and the mean. These values are summed at the foot of the column. This sum is divided by the number of observations (17) to give the variance, and the square root of the variance gives the standard deviation.
$s^2 = 10.73 / 17 = 0.63$
$\text{s.d.} = \sqrt{0.63} = 0.79$
Again, the larger the value, the larger the deviation. The standard deviation carries the same units of measurement as the original data, so here the standard deviation = 0.79%
What does this statistic tell us? This is a hard question to answer. On its own, the standard deviation is useful mainly as a means of comparison with other standard deviations, bearing in mind that squaring gives the larger deviations within the data more weight.
Elsewhere in TimeWeb we will be using standard deviations for more advanced statistical work, but for now we must merely say that s.d. is the most frequently used measure of dispersion and is used when more formal statistical testing is required.
Coefficient of Variation: A Worked Example
As you can see in the 'Explanation' part of this section, the coefficient of variation is a summary measure used to give an indication of the amount of variability present in the data. It is calculated by expressing the standard deviation as a percentage of the mean.
Using the above example:
Coefficient of Variation = 0.79 / 9.5 x 100
= 0.083 x 100
= 8.3%
This statistic would be used to compare against other datasets and a judgement would be made of the amount of variability in one dataset as against another.
Coefficient of Skewness
As discussed in the 'Explanation' section, this summary statistic indicates the tendency for values in a dataset to bunch at one end of a distribution. Using the second formula given and applying it to the unemployment data given above, we can calculate that:
Coefficient of skewness within the data for UK unemployment 1992 - 96 = 3(9.5 - 9.7) / 0.79
= -0.6 / 0.79
= -0.76
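The dispersion and skewness measures in the sections above can be recomputed directly from the 17 rates in Table 4. This is a minimal sketch that follows the worked example in using the mean rounded to 9.5 and dividing by the number of observations; the variable names are illustrative. Because the code carries unrounded intermediate values, its last digit can differ slightly from a hand calculation that rounds at each step.

```python
from math import sqrt

# Unemployment rates (%), 1992 Q2 to 1996 Q2, from Table 4.
rates = [9.8, 10.0, 10.4, 10.6, 10.4, 10.3, 10.3, 10.0, 9.7,
         9.4, 9.0, 8.9, 8.7, 8.7, 8.4, 8.3, 8.3]
n = len(rates)

mean = round(sum(rates) / n, 1)    # rounded to one decimal place, as in the text
median = sorted(rates)[n // 2]     # middle value of an odd number of observations

data_range = max(rates) - min(rates)
mad = sum(abs(x - mean) for x in rates) / n        # mean absolute deviation
variance = sum((x - mean) ** 2 for x in rates) / n
sd = sqrt(variance)                                # standard deviation
cv = sd / mean * 100                               # coefficient of variation (%)
skewness = 3 * (mean - median) / sd                # 3(mean - median) / s.d.

for label, value in [("range", data_range), ("MAD", mad), ("variance", variance),
                     ("s.d.", sd), ("CV %", cv), ("skewness", skewness)]:
    print(f"{label}: {value:.2f}")
```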
Moving Averages: A Worked Example
As we saw in the 'Explanation' section, a moving average allows us to "smooth" out a series of data so that the underlying movement over time can be seen. Using the new car registrations data contained in Table 5 below, we can illustrate how to construct a moving average for this highly seasonal data.
Table 5: UK New Car Registrations 1994 (Q4) - 1999 (Q3)
| Year/Qtr | 000s Cars |
| 1994 Q4 | 400.6 |
| 1995 Q1 | 613.9 |
| Q2 | 511.7 |
| Q3 | 748.3 |
| Q4 | 432.7 |
| 1996 Q1 | 621.2 |
| Q2 | 563.7 |
| Q3 | 769.5 |
| Q4 | 455.6 |
| 1997 Q1 | 646.3 |
| Q2 | 606.1 |
| Q3 | 849.0 |
| Q4 | 498.1 |
| 1998 Q1 | 738.2 |
| Q2 | 638.8 |
| Q3 | 848.4 |
| Q4 | 514.9 |
| 1999 Q1 | 761.6 |
| Q2 | 692.2 |
| Q3 | 781.4 |
Source: National Statistics
The first decision is a matter of judgement: over what period to calculate the averaging process? The longer the length of the average, the smoother the series becomes, as each individual piece of information becomes less significant on its own. But the drawback with calculating the averaging over a long period is that changes in the underlying trend are not picked up quickly.
With this new car data, we want to remove the seasonality from the data whilst still seeing the trend year-on-year. Averaging over four quarters at a time should enable us to achieve this. Having decided this, the first step is to average the first four observations. This gives us the moving average for the mid-point of these four observations.
In this example this is:
(400.6 + 613.9 + 511.7 + 748.3) / 4 = 568.6
The next step is to move the average along the dataset, by dropping the first observation and including the next period's data, then averaging over the four observations again:
(613.9 + 511.7 + 748.3 + 432.7) / 4 = 576.7
These calculations are shown in Table 6 below:
| Year/Qtr | Calculation | Moving Average |
| 1994 Q4 | - | |
| 1995 Q1 | - | |
| Q2 | (400.6+613.9+511.7+748.3) / 4 | 569 |
| Q3 | (613.9+511.7+748.3+432.7) / 4 | 577 |
| Q4 | (511.7+748.3+432.7+621.2) / 4 | 578 |
| 1996 Q1 | (748.3+432.7+621.2+563.7) / 4 | 591 |
| Q2 | (432.7+621.2+563.7+769.5) / 4 | 597 |
| Q3 | (621.2+563.7+769.5+455.6) / 4 | 603 |
| Q4 | (563.7+769.5+455.6+646.3) / 4 | 609 |
| 1997 Q1 | (769.5+455.6+646.3+606.1) / 4 | 619 |
| Q2 | (455.6+646.3+606.1+849.0) / 4 | 639 |
| Q3 | (646.3+606.1+849.0+498.1) / 4 | 650 |
| Q4 | (606.1+849.0+498.1+738.2) / 4 | 673 |
| 1998 Q1 | (849.0+498.1+738.2+638.8) / 4 | 681 |
| Q2 | (498.1+738.2+638.8+848.4) / 4 | 681 |
| Q3 | (738.2+638.8+848.4+514.9) / 4 | 685 |
| Q4 | (638.8+848.4+514.9+761.6) / 4 | 691 |
| 1999 Q1 | (848.4+514.9+761.6+692.2) / 4 | 704 |
| Q2 | (514.9+761.6+692.2+781.4) / 4 | 688 |
| Q3 | - | |
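The sliding four-quarter window used in Table 6 can be sketched in Python as below. The list holds the Table 5 figures in date order; `window` and the other names are illustrative.

```python
# UK new car registrations (000s), 1994 Q4 to 1999 Q3, from Table 5.
registrations = [400.6, 613.9, 511.7, 748.3, 432.7, 621.2, 563.7, 769.5,
                 455.6, 646.3, 606.1, 849.0, 498.1, 738.2, 638.8, 848.4,
                 514.9, 761.6, 692.2, 781.4]
window = 4   # four quarters, to average out the seasonal pattern

# Slide the window along the series one quarter at a time.
moving_average = [
    sum(registrations[i:i + window]) / window
    for i in range(len(registrations) - window + 1)
]

for value in moving_average:
    print(f"{value:.1f}")
```

Twenty observations and a window of four give seventeen averages, one for each completed row of Table 6.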
Why not try the worksheet on moving averages to see how well you understand this?
Sampling Methods and Survey Types
One of the world's best-known polling organisations, Gallup, say that one of the most frequently asked questions they get from Americans is why they've never been interviewed for a survey.
In an adult population of almost two hundred million, Americans express scepticism about the scientific reliability of sampling. In particular, they do not believe that a survey of 1500 - 2000 people can represent the views of all citizens.
Gallup's sampling principle is that selecting a sample of a small proportion of the whole population can represent the opinions of all the people, provided that the sample is properly selected.
- So how do Gallup select a sample?
Firstly, they have to locate a place where all or most Americans can be found. This isn't in the shopping mall, but at home. From the 1930s to mid 1980s, poll respondents were interviewed face-to-face in their homes. But by the 1990s, with approximately 95% of all U.S. homes having a telephone, the vast majority of surveys use this medium. Of course, this has the benefit of being a substantially less expensive method.
- Identifying and describing the population.
Gallup is often asked to carry out polls on behalf of an organisation with the aim of learning more about the population's attitudes and beliefs. Let's imagine that an American national newspaper wants a poll done about U.S. golf fans; the target population may be all Americans aged at least 18 who say that they're fans of golf. But if the poll was conducted on behalf of the U.S. PGA (Professional Golfers' Association), the target audience might be more specific; for instance, all people over the age of 16, who watch at least 5 hours of golf (during the major tournaments) each week. Two surveys about the same sport, including many of the same target respondents, but with very different sample populations.
- Choosing a method to sample the target population randomly.
The polling organisations have lists of all household telephone numbers in continental USA. A computerised system uses random digit dialling (RDD) to create a new list of all possible American telephone numbers, then selects a subset of numbers from that new list for the polling organisation to call. This is important because approximately 30% of American residential numbers are unlisted, according to recent estimates. The exclusion of these "hidden" numbers would introduce bias into the sample.
- Sample Accuracy.
With a sample size of 1000 adults, using the random selection process outlined above, Gallup can be statistically certain that 95 times out of one hundred, continued polling would produce the same result within a margin of error of +/- 3%. If the sample size was doubled to 2000 adults, Gallup would incur roughly twice the cost in conducting the survey, but the margin of error would decrease only to +/- 2%.
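The figures quoted above are close to what the usual textbook approximation gives. The sketch below assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5; these are standard assumptions made for illustration, not a description of Gallup's own procedure, and the function name is hypothetical.

```python
from math import sqrt

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * sqrt(p * (1 - p) / sample_size)

for n in (1000, 2000):
    print(n, f"+/- {margin_of_error(n) * 100:.1f}%")
```

Because the margin shrinks with the square root of the sample size, doubling the sample reduces the margin by a factor of about 1.4 rather than 2.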
- Interviewing the selected sample.
What if the people randomly selected to survey are not in?
What if some of the target population are busy on other phone calls when the pollsters call? In these cases the target respondent's phone number is stored and recalled later at regular times throughout the survey period.
Excluding people who don't answer the phone the first time Gallup calls them would introduce bias into the survey sample: for instance, young single adults, who are frequently out or using the phone, would be less likely to be included in the sample population than more sedentary people who are less frequent phone users.
In a household with more than one adult in residence, Gallup randomly select an adult, either by asking for the person with the latest birthday or by asking the person who answers the phone to list all the adults who live there. The pollster then selects one of these adults at random.
- Asking the "right" questions.
Gallup assess that the greatest source of bias or error in survey data is probably the wording of the questions themselves.
For example, you may have thought that conducting a pre-election poll of voting intentions would be a simple process. But the question "Who will you vote for in the next election?" can be just as open to bias as any other survey question. Does the polling organisation list the vice-presidential candidates along with the names of the presidential candidates? Should the party represented by the candidate be listed or should there be no indication of party affiliation?
In these cases, Gallup tries to mimic the format and content of the ballot paper and reads the names of the presidential and vice-presidential candidates and gives the name of the party represented by them.
Questions to do with policy issues can also be very tricky: are things like food stamps or housing grants to be called "welfare" or "programs for the poor"? If members of the armed services are going abroad should this be termed "sending" troops or "contributing" to a UN force? These are emotive topics and the wording of the question can "slant" the answer received from poll respondents.
- The oldest one in the book.
One of the oldest question wordings concerns presidential job approval. Since the 1930s and Roosevelt's presidency, Gallup has used the following question: "Do you approve or disapprove of the job .... is doing as president?"
This means that there is a reliable trend line provided by the continuity of the question asked. If, for example, George W. Bush has a job approval rating of 48% after one year of his presidency, what can be learned from such a rating? What the trend line allows is for analysts to look into history and compare this figure with ratings recorded earlier in the presidential term. Additionally, an analysis can be made of this figure compared to ratings recorded during previous presidents' terms. In this case the question may be asked: did previous presidents with this approval rating at this stage in their term tend to get re-elected or not?
Sampling: Further examples
- Surveys usually involve considerable expenditure of time, effort and cost. It is vital to clarify at the outset what you want to find out in the survey, before starting to use precious resources.
The Trendy Tea and Coffee Company (TTCC) are set to launch a new premium brand of tea and want to get the packaging right. Four different designs are created from a traditional dark green colour, to a flashy black, silver and yellow look. TTCC employ a market research organisation who survey 1000 people to find out which design they prefer.
On the basis of the reported survey findings, TTCC launch the new tea in the flashy design, and sales of the new product nosedive after the initial period. It becomes clear upon review that no research was carried out on the drinking habits of those people surveyed. If this work had been done, it would have shown that the regular tea drinkers in the sample population all preferred the dark green packaging.
- A Goods-In Inspector at a large drinks manufacturer in South-West England has to deal with a consignment of 1000 cases of grape juice. In the past, the drinks company has been affected by minor contamination in its fermenting process that has led to the loss of some batches of its best-selling line: "UK - the British Sherry for British Tastes".
The inspector has neither the time nor the staff to open all the cases to check for possible sources of the contamination, but she wants to have an idea of what the whole consignment is like. She decides to open twenty cases of the grape juice - one case in every fifty delivered. She could just open every fiftieth case in turn, but this seems to be too standard an approach. She wants to introduce a more random method.
So instead, the inspector imagines that the cases are numbered one to one thousand and then uses her computer to generate 4-figure numbers at random, ignoring any repeats and any that fall outside the range 0001 to 1000, until she has twenty numbers. This gives the inspector her sample. As a result, there is no bias in her choice of cases to inspect.
- The sampling method outlined above will be very labour intensive to carry out. The inspector may have to open case 972, followed by case 23, then case 427. She realises this will be very tedious work and tries to think of a different solution - one that combines random and multi-stage sampling methods:
She decides to split the consignment into batches of twenty-five, giving forty batches in total. From each of these she chooses one case by selecting a random number from one to twenty-five.
This multi-stage sampling approach saves the inspector time, cost and effort.
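The inspector's two approaches can be sketched with Python's `random` module. This is an illustration only: the case numbering, the fixed seed (used just to make the sketch repeatable) and the variable names are assumptions, not part of the original example.

```python
import random

rng = random.Random(42)   # fixed seed so the sketch gives the same cases each run

# Approach 1: simple random sample of 20 cases from cases numbered 1 to 1000.
simple_sample = sorted(rng.sample(range(1, 1001), 20))

# Approach 2: split the consignment into 40 batches of 25 consecutive cases
# and pick one case at random from each batch, working through them in order.
batch_size = 25
batch_sample = [
    rng.randint(start, start + batch_size - 1)
    for start in range(1, 1001, batch_size)
]

print(simple_sample)
print(batch_sample)
```

The second list is already in case order, which is what saves the inspector from moving back and forth across the consignment.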
Correlation between variables
Let's start by looking at how a scatter diagram can illustrate these relationships:
- Scattergrams
The scattergram or XY chart can be a useful way of representing the relationship between two variables. The usual conventions of dependent and independent variable position on the axes are followed. Points on the diagram are not connected as they are on a line graph. The relationship between the two variables displayed on the chart may be positive, negative or non-existent.
In Chart 1 there is a very strong negative correlation shown between disposable income levels and the number of discount stores in existence. You may feel that this makes sense as a hypothesis, in that as income levels fall, more discount retail enterprises emerge.
Chart 1: Scattergram (XY Chart) showing a negative association
(Data for display purposes only)
In Chart 2, disposable income is plotted against number of overseas holidays taken. It shows that in this case there is a strong positive correlation between income and "luxury" items, such as foreign holidays. You may not agree that a foreign holiday is a luxury, but may feel that, in general, the higher the income level the greater the number of overseas holidays taken will be.
Chart 2: Scattergram showing a positive association
(Data for display purposes only)
The charts shown here are meant to illustrate the concept of a scattergram. In practice, of course, the points on a scattergram are likely to lie around the chart, although a strong association between the two variables is likely to allow us to draw a straight line through the points shown. Such a straight line is known as the "line of best fit". This is a straight line that seems to fit the points on the diagram best.
The line of best fit is usually drawn by eye, but there are more sophisticated ways of making the line more accurate. One is to use the fact that, for a set of points on a scattergram, the least-squares line of best fit always passes through the point (x-bar, y-bar), where x-bar is the mean of the x values and y-bar is the mean of the y values.
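The property can be checked numerically. The sketch below assumes that "line of best fit" means the ordinary least-squares line, and the income and holiday figures are made up purely for display, like the charts themselves.

```python
# Made-up data for illustration only.
income = [10, 15, 20, 25, 30, 35]            # x: disposable income (£000s)
holidays = [0.5, 0.9, 1.2, 1.8, 2.1, 2.5]    # y: overseas holidays per year

n = len(income)
x_bar = sum(income) / n
y_bar = sum(holidays) / n

# Ordinary least-squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, holidays))
         / sum((x - x_bar) ** 2 for x in income))
intercept = y_bar - slope * x_bar

# The fitted line evaluated at x-bar gives y-bar.
fitted_at_x_bar = intercept + slope * x_bar
print(slope, intercept)
print(fitted_at_x_bar, y_bar)
```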
Chart 3 illustrates a lack of a statistical relationship. There is little or no correlation between disposable income and amount of rainfall, unless of course we're looking at the long term effect on the global climate of taking all these extra foreign holidays and driving all these new cars that our higher incomes can afford!
Chart 3: Scattergram showing little or no association
(Data for display purposes only)
But we can go further than just representing the correlation between two separate variables; we can formally measure the strength of the association between them.
- The Correlation Coefficient
As indicated, the idea behind the correlation coefficient is that we can give a number value to the strength of relationship between one variable and another. There are two main measures commonly used: Spearman's Rank Correlation Coefficient and Pearson's Product-moment Correlation Coefficient. The former is the less complicated to calculate and, because it works on ranks, allows us to assess the aesthetic or qualitative characteristics of data. The latter allows us to measure the strength of the association between two variables by working out the dispersion of the scattergram points.
There is an illustration of correlation coefficient measures in the 'Crunching' section on TimeWeb.
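As a rough sketch of how the two measures are calculated, the following Python computes both for a small made-up dataset with a negative association. The data are invented for display purposes, the function names are hypothetical, and the rank formula used for Spearman's coefficient assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upwards; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation coefficient (no ties)."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up data: disposable income against number of discount stores.
income = [12, 18, 21, 27, 33, 40]
stores = [30, 26, 27, 19, 15, 11]

print(round(pearson(income, stores), 2))
print(round(spearman(income, stores), 2))
```

Both coefficients lie between -1 and +1; values near -1 indicate a strong negative association.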
Normal Distribution Curve illustration
The chart below illustrates a normally distributed population. You will notice that the curve conforms to the characteristics outlined in the explanation section: the most frequent value is at the centre; there is symmetry about the central value; there is diminishing frequency as you move away from the centre.
A line is drawn from each of the two points of inflexion (one on either side of the mean) to the X-axis. The distance from that point to the mean point on the X-axis is equal to the standard deviation.
Four separate areas are now identifiable from the chart:
Area A shows the area between the mean and one standard deviation above the mean.
Area B shows the area between the mean and one standard deviation below the mean.
Area C indicates the area to the right of one standard deviation above the mean.
Area D indicates the area to the left of one standard deviation below the mean.
Because the normal curve is symmetrical, Area A equals Area B. Areas C and D are also equal. The total of A, B, C and D equals the total area under the curve, or the entire population.
Mathematical calculations show that in any normal distribution, approximately 68% of all observations fall within one standard deviation (SD) of the mean (Areas A plus B). So, about 34% of observations lie between the mean and one standard deviation above the mean (Area A) and 34% lie between the mean and one standard deviation below the mean (Area B). By subtraction, we can tell that in a normal distribution 32% of the observations fall outside one standard deviation, 16% on either side (16% in Area C and 16% in Area D).
Let's now put this into the language of probability: In any normal distribution, there is a .68 probability that a particular value will fall within one standard deviation of the mean; there is approximately a .34 probability that a value will lie between the mean and one SD above the mean (Area A) and a .34 probability that a value will lie between the mean and one SD below the mean (Area B).
Also, there is a .16 probability that a particular value will lie above one SD from the mean (Area C) and a .16 probability that the value will lie below one SD from the mean (Area D).
Using this knowledge, we can re-draw our normal curve chart, now putting in six separate areas:
The vertical lines from the curve to the X-axis represent the mean (at the centre) and distances of one and two SDs on either side of the mean.
Areas A and B have the same characteristics as in the first chart; each being equal and each containing approximately 34% of all the values in the normal distribution.
Areas C and D are also equal and are defined by the vertical lines indicating one and two SDs from the mean (on either side). Each of these areas contains approximately 13.5% of all the values in the normal distribution.
Areas E and F at the extreme ends of the curve are defined by the vertical lines indicating two SDs from the mean and the tail ends of the distribution. Each of these areas contains approximately 2.5% of all the values. In other words, in a normal distribution, about 5% of a population will be beyond two SDs: 2.5% above the mean and 2.5% below.
Let's restate this information in the language of probability:
- In any normal distribution, there is a .34 probability that any particular value will fall between the mean and one SD above the mean (Area A) and the same probability of the value falling between the mean and one SD below the mean (Area B).
- There is a .135 probability of any value falling between one and two SDs above the mean (Area C) and the same probability of the value falling between one and two SDs below the mean (Area D).
- There is a .475 probability that any value will fall between the mean and two SDs above the mean (Areas A plus C) and the same probability of the value falling between the mean and two SDs below the mean (Areas B plus D).
- The mathematics of normal curves shows that the area contained by the vertical lines representing three SDs from the mean contains 99.7% of the area under the curve and 99.7% of all the values in the data set. There is, therefore, a probability of .997 that in any normal distribution any particular value will fall within three SDs from the mean.
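The probabilities listed above can be checked against the normal curve itself. The sketch below uses the standard relationship between the normal distribution and the error function `erf`; the function and variable names are illustrative. The exact values differ slightly from the rounded figures quoted in the text, for example the area beyond two SDs on one side is nearer 2.3% than 2.5%.

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a normally distributed value lies within k SDs of the mean."""
    return erf(k / sqrt(2))

within_1 = prob_within(1)
within_2 = prob_within(2)
within_3 = prob_within(3)

area_a = within_1 / 2                # mean to one SD above (same as Area B)
area_c = (within_2 - within_1) / 2   # one to two SDs above (same as Area D)
area_e = (1 - within_2) / 2          # beyond two SDs above (same as Area F)

for label, value in [("within 1 SD", within_1), ("within 2 SDs", within_2),
                     ("within 3 SDs", within_3), ("Area A", area_a),
                     ("Area C", area_c), ("Area E", area_e)]:
    print(f"{label}: {value:.4f}")
```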
Why not try the 'What samples tell us' worksheet to check that you understand this?
Random Sampling
Random sampling is usually the preferred method of sampling, because of the lack of built-in bias that is involved.
This method requires that a list of every member of the population is available. There are times when this will be impossible, for instance when an entire national or regional population is involved, or for example if you are studying the whole population of small businesses in the UK. In these cases, the simple random sampling method outlined below will not be appropriate.
In a simple random sample, with a list of the entire population being studied, the sampler gives a number to every item on the list and selects the sample by using a random number generator or a table of random numbers.
Here's how it works.
Imagine you want to study all the cars being stored in a warehousing complex, but you don't have the time or other resources to deal with them all. You might decide to work with a sample of 30 cars out of a total warehouse population of 1000.
So, you begin by assigning a number to every member of the total population. As the largest number you need (1000) has four digits, every car in the warehouse is given a four digit number, beginning with 0001, 0002, 0003 and so on, up to 1000.
You look at your list of random numbers, which looks like the following:
| Line | A TABLE OF RANDOM NUMBERS | | | | | | | | | |
| 00 | 10097 | 32533 | 76520 | 13586 | 34673 | 54876 | 80959 | 09117 | 39292 | 74945 |
| 01 | 37542 | 04805 | 64894 | 74296 | 24805 | 24037 | 20636 | 10402 | 00822 | 91665 |
| 02 | 08422 | 68953 | 19645 | 09303 | 23209 | 02560 | 15953 | 34764 | 35080 | 33606 |
| 03 | 99019 | 02529 | 09376 | 70715 | 38311 | 31165 | 88676 | 74397 | 04436 | 27659 |
| 04 | 12807 | 99970 | 80157 | 36147 | 64032 | 36653 | 98951 | 16877 | 12171 | 76833 |
| 05 | 66065 | 74717 | 34072 | 76850 | 36697 | 36170 | 65813 | 39885 | 11199 | 29170 |
| 06 | 31060 | 10805 | 45571 | 82406 | 35303 | 42614 | 86799 | 07439 | 23403 | 09732 |
| 07 | 85269 | 77602 | 02051 | 65692 | 68665 | 74818 | 73053 | 85247 | 18623 | 88579 |
| 08 | 63573 | 32135 | 05325 | 47048 | 90553 | 57548 | 28468 | 28709 | 83491 | 25624 |
| 09 | 73796 | 45753 | 03529 | 64778 | 35808 | 34282 | 60935 | 20344 | 35273 | 88435 |
| 10 | 98520 | 17767 | 14905 | 68607 | 22109 | 40558 | 60970 | 93433 | 50500 | 73998 |
| 11 | 11805 | 05431 | 39808 | 27732 | 50725 | 68248 | 29405 | 24201 | 52775 | 67851 |
| 12 | 83452 | 99634 | 06288 | 98083 | 13746 | 70078 | 18475 | 40610 | 68711 | 77817 |
| 13 | 88685 | 40200 | 86507 | 58401 | 36766 | 67951 | 90364 | 76493 | 29609 | 11062 |
| 14 | 99594 | 67348 | 87517 | 64969 | 91826 | 08928 | 93785 | 61368 | 23478 | 34113 |
| 15 | 65481 | 17674 | 17468 | 50950 | 58047 | 76974 | 73039 | 57186 | 40218 | 16544 |
| 16 | 80124 | 35635 | 17727 | 08015 | 45318 | 22374 | 21115 | 78253 | 14385 | 53763 |
| 17 | 74350 | 99817 | 77402 | 77214 | 43236 | 00210 | 45521 | 64237 | 96286 | 02655 |
| 18 | 69916 | 26803 | 66252 | 29148 | 36936 | 87203 | 76621 | 13990 | 94400 | 56418 |
| 19 | 09893 | 20505 | 14225 | 68514 | 46427 | 56788 | 96297 | 78822 | 54382 | 14598 |
| 20 | 91499 | 14523 | 68479 | 27686 | 46162 | 83554 | 94750 | 89923 | 37089 | 20048 |
| 21 | 80336 | 94598 | 26940 | 36858 | 70297 | 34135 | 53140 | 33340 | 42050 | 82341 |
| 22 | 44104 | 81949 | 85157 | 47954 | 32979 | 26575 | 57600 | 40881 | 22222 | 06413 |
| 23 | 12550 | 73742 | 11100 | 02040 | 12860 | 74697 | 96644 | 89439 | 28707 | 25815 |
| 24 | 63606 | 49329 | 16505 | 34484 | 40219 | 52563 | 43651 | 77082 | 07207 | 31790 |
| 25 | 61196 | 90446 | 26457 | 47774 | 51924 | 33729 | 65394 | 59593 | 42582 | 60527 |
| 26 | 15474 | 45266 | 95270 | 79953 | 59367 | 83848 | 82396 | 10118 | 33211 | 59466 |
| 27 | 94557 | 28573 | 67897 | 54387 | 54622 | 44431 | 91190 | 42592 | 92927 | 45973 |
| 28 | 42481 | 16213 | 97344 | 08721 | 16868 | 48767 | 03071 | 12059 | 25701 | 46670 |
| 29 | 23523 | 78317 | 73208 | 89837 | 68935 | 91416 | 26252 | 29663 | 05522 | 82562 |
| 30 | 04493 | 52494 | 75246 | 33824 | 45862 | 51025 | 61962 | 79335 | 65337 | 12472 |
| 31 | 00549 | 97654 | 64051 | 88159 | 96119 | 63896 | 54692 | 82391 | 23287 | 29529 |
| 32 | 35963 | 15307 | 26898 | 09354 | 33351 | 35462 | 77974 | 50024 | 90103 | 39333 |
| 33 | 59808 | 08391 | 45427 | 26842 | 83609 | 49700 | 13021 | 24892 | 78565 | 20106 |
| 34 | 46058 | 85236 | 01390 | 92286 | 77281 | 44077 | 93910 | 83647 | 70617 | 42941 |
| 35 | 32179 | 00597 | 87379 | 25241 | 05567 | 07007 | 86743 | 17157 | 85394 | 11838 |
| 36 | 69234 | 61406 | 20117 | 45204 | 15956 | 60000 | 18743 | 92423 | 97118 | 96338 |
| 37 | 19565 | 41430 | 01758 | 75379 | 40419 | 21585 | 66674 | 36806 | 84962 | 85207 |
| 38 | 45155 | 14938 | 19476 | 07246 | 43667 | 94543 | 59047 | 90033 | 20826 | 69541 |
| 39 | 94864 | 31994 | 36168 | 10851 | 34888 | 81553 | 01540 | 35456 | 05014 | 51176 |
| 40 | 98086 | 24826 | 45240 | 28404 | 44999 | 08896 | 39094 | 73407 | 35441 | 31880 |
| 41 | 33185 | 16232 | 41941 | 50949 | 89435 | 48581 | 88695 | 41994 | 37548 | 73043 |
| 42 | 80951 | 00406 | 96382 | 70774 | 20151 | 23387 | 25016 | 25298 | 94624 | 61171 |
| 43 | 79752 | 49140 | 71961 | 28296 | 69861 | 02591 | 74852 | 20539 | 00387 | 59579 |
| 44 | 18633 | 32537 | 98145 | 06571 | 31010 | 24674 | 05455 | 61427 | 77938 | 91936 |
| 45 | 74029 | 43902 | 77557 | 32270 | 97790 | 17119 | 52527 | 58021 | 80814 | 51748 |
| 46 | 54178 | 45611 | 80993 | 37143 | 05335 | 12969 | 56127 | 19255 | 36040 | 90324 |
| 47 | 11664 | 49883 | 52079 | 84827 | 59381 | 71539 | 09973 | 33440 | 88461 | 23356 |
| 48 | 48324 | 77928 | 31249 | 64710 | 02295 | 36870 | 32307 | 57546 | 15020 | 09994 |
| 49 | 69074 | 94138 | 87637 | 91976 | 35584 | 04401 | 10518 | 21615 | 01848 | 76938 |
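You begin the selection by pointing (with your eyes closed) to an area in the table. Imagine you point to line 10 (the lines are numbered down the left-hand side of the table). Read the digits on that line from left to right as one continuous string, ignoring the gaps between the five digit numbers, and look for the first run of four digits that forms a number between 0001 and 1000. The first number on line 10 is 98520; its fifth digit, 0, followed by the first three digits of the next number, 17767, gives 0177. Because the table contains five digit numbers, it is acceptable to start part-way through a number in this way.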
The second four digit number is 0568.
The third number is 0722.
The fourth is 0940.
The fifth is 0970.
The sixth is 0500.
You would continue down the table, gathering four digit numbers until you had collected thirty numbers between 0001 and 1000. Each of these would represent one car in the warehouse, chosen at random to form a sample of thirty cars.
There is less bias in this selection method because every member of the population has an equal chance of being selected, and represented in the sample. You have made no attempt to organise the population into sections, so the selection process is free from your direction.
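Where a random number generator is used instead of a printed table, the same selection takes only a few lines. The sketch below uses `random.sample` from Python's standard library; the seed value is an arbitrary assumption, included only so that a run can be repeated.

```python
import random

POPULATION_SIZE = 1000  # cars in the warehouse
SAMPLE_SIZE = 30        # cars to be studied

# Fixed seed (an arbitrary choice) so that the same sample can be reproduced.
rng = random.Random(2024)

# Cars are numbered 1 to 1000. random.sample draws without replacement,
# so no car can appear in the sample twice.
sample = rng.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE)

# Show the chosen cars as four digit labels, in ascending order.
labels = sorted(f"{car:04d}" for car in sample)
print(labels)
```

Every car has the same chance of being chosen, which is the property that makes the sample "simple random".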
Probability
Jacques (Jacob) Bernoulli was the first to prove what is now known as the 'law of large numbers', which grew out of his work on probability. Imagine that you have a container holding thousands of pebbles. You don't know how many there are, nor do you know that there are in fact 5000 pebbles, of which 3000 are white and 2000 black. The ratio of white to black pebbles is therefore 3:2.
Bernoulli asked how many pebbles you would need to draw from the container before you could estimate the actual ratio of white to black pebbles. Of course, you would begin to get a fairly clear idea quite soon, as you picked out a pebble, noted its colour and then replaced it in the container. But the key to the theorem is whether you can repeat the experiment often enough to make it ten, or one hundred, times more probable than not that your estimate lies close to the true 3:2 ratio.
Bernoulli showed that this is the case: the more experiments are carried out, the more likely it is that the estimated ratio will be close to the true ratio.
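The pebble experiment is easy to simulate. The sketch below draws pebbles one at a time with replacement and reports the share that were white; the function name `estimated_white_share` and the seed are assumptions made for illustration.

```python
import random

WHITE, BLACK = 3000, 2000              # contents of the container, from the text
TRUE_SHARE = WHITE / (WHITE + BLACK)   # 0.6, i.e. a white-to-black ratio of 3:2

def estimated_white_share(draws, rng):
    """Draw pebbles one at a time, replacing each, and return the share that were white."""
    whites = sum(1 for _ in range(draws) if rng.random() < TRUE_SHARE)
    return whites / draws

rng = random.Random(1)  # arbitrary seed so the run can be repeated
for draws in (10, 100, 1_000, 10_000, 100_000):
    share = estimated_white_share(draws, rng)
    print(f"{draws:>7} draws: estimated share of white = {share:.3f} (true {TRUE_SHARE:.3f})")
```

With only ten draws the estimate can be well away from 0.6; with many thousands it will usually be very close, which is the behaviour the paragraph above describes.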
Time series
Apart from drawing a trend curve onto a graph freehand, there are two common methods for identifying trends in time series data:
- using moving averages.
- using regression analysis to find the line of 'best fit'.
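Both methods can be sketched briefly using the 1999 monthly takings from the Z chart table. The three-month window, the numbering of months as 1 to 12 and the function names are assumptions chosen for illustration; a straight line is only a rough summary of data with a seasonal pattern like this.

```python
# 1999 monthly takings (£), January to December, from the Z chart table.
takings = [1990, 2260, 1780, 1780, 1530, 1360, 1170, 1140, 1480, 1530, 1590, 2160]

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def least_squares_line(ys):
    """Slope and intercept of the line of best fit, with x = 1, 2, 3, ... for the periods."""
    n = len(ys)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

three_month = moving_average(takings, 3)
slope, intercept = least_squares_line(takings)

print("3-month moving averages:", [round(v) for v in three_month])
print(f"trend line: takings = {intercept:.1f} + {slope:.1f} * month")
```

The moving average smooths out month-to-month variation, while the regression line summarises the overall direction of the series in a single slope.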

