Some Basics About Data Analysis [TimeWeb]

Explanation

Statistical Analysis: the fundamentals.

Whether or not you realise it, you will almost certainly have performed some kind of statistical analysis at some time or another. Statistics is about making sense of and bringing order to collections of data observations. So, your monthly bank statement, end of semester test marks, or quarterly phone bill are all examples of data observations that can be collected together, over different time periods, and form what we call data sets. These can be looked at as the 'currency' or basic unit of statistics.

Bearing this in mind, basic statistical analysis has a number of main objectives:

  • Data collection
  • Data collation
  • Visual description of the data's properties
  • Summary measures of the data
  • Conclusions based on the above
  1. Data collection

    Often the problem that you are investigating is such that there are no original data available. In this case, you would have to use collection techniques such as interviews or questionnaires to extract the data from a group of respondents.

    At other times, though, you may be able to use data that has already been collected. It may be useful to you in its original form, or you may have to change its format to fit your needs. Access to this kind of secondary data is increasingly available in electronic form. The TimeWeb data and sample data are examples of this kind of availability.

    With data that is available in electronic form, you have the opportunity to download the raw material directly into an Excel (or other spreadsheet or statistical package) workbook.

  2. Data collation

    Once you have collected the data from whatever source you selected, you will then have to bring the data together and present it in a manageable form. This process is called collation.

    In order to enable easy interpretation and analysis, collation will usually involve summarising and tabulating the information.

    Have a look at the Illustration section in 'Crunching' for more on data collection.

  3. Visual description

    Soon after collating the data, it is often useful to think of ways to display effectively the main characteristics of the information. It is quite possible to make use of your spreadsheet package to produce graphs to help you at this stage. The important point is to select the most appropriate chart that will present the data's properties clearly.

    Have a look at the picturing data section for more on this aspect of data description.

  4. Summary measures

    When dealing with your dataset you will normally have to produce summary statistics that provide a measure of the data. The two main kinds of measure are:

    • measurement of the data's average value, often referred to as measures of central tendency
    • measurement of the degree to which the data is dispersed around its central value

    An explanation of Summary measures.

  5. Arriving at conclusions

    Having carried out all the above stages, you should now be ready to put the data to use. What this means is using data to make decisions. These will have to be based on the information that the data provides.



 

Summations of Data:

This explanation of the process is based on the following information:

Suppose it's your turn to buy drinks and snacks in your student canteen for your friends. You buy 6 hot chocolates at 70p each, 4 vegetable rolls at 90p each, 8 pieces of toast at 20p each and 2 packs of strong mints at 30p each. OK, so you're popular and wealthy!

How do you show this information in a Table?

The first decision to make is to create a key for your data. In the table that follows, 'price' is represented by the letter 'p' and the quantity bought is shown by the letter 'q'. Any other letter or symbol could be chosen, it just seems to make more sense to use 'p' and 'q' here.

The number of items bought is the total or sum of the values shown in the 'q' column: 6 + 4 + 8 + 2 = 20. This total is labelled as Σq.

Σ is the Greek capital letter 'sigma' and is an instruction to sum all the values of the variable concerned. Σq means 'sum of the q-values'.

We calculate the total amount spent on this visit to the canteen by multiplying each price, or p-value, by the amount of that item bought, or q-value. These results are shown in the third column of the table. The amounts in column three are summed to show how much you have spent on being generous to your friends.

Notice that the correct sequence for this calculation is, firstly, to multiply each p-value by the corresponding q-value and, secondly, to sum the results. Remember the BEDMAS rule from earlier. Mistakes are often made in trying to calculate Σpq by multiplying Σp (equal to 210 in this example) by Σq (which equals 20 here). Note that this does not give Σpq, but Σp x Σq (which equals 210 x 20 = 4200).

Price (pence) (p)   Quantity (q)   Price x Quantity (pq)   q²          pq²
70                  6              420                     36          2520
90                  4              360                     16          1440
20                  8              160                     64          1280
30                  2              60                      4           120
Total               Σq = 20        Σpq = 1000              Σq² = 120   Σpq² = 5360

Let's now complicate things by assuming that your school/college/faculty decides that as a gesture to student poverty, all items bought at the canteen mid-morning by your year group will be free. If the total number of items bought is assumed to be calculated by squaring the q-values in the table, then the total number of items provided free is 120, as shown in column 4. Note that calculating Σq² is not the same as (Σq)², which would equal 20 x 20 = 400.

To find out how much this act of generosity would affect school/college/faculty finances, we must calculate Σpq² as shown in column 5. The pq²-values can be found either by multiplying the p-values by the q²-values, or by multiplying the q-values by the pq-values. Remember, again, that it would be incorrect to calculate Σpq² by squaring the total in column 3. This would give (Σpq)² = 1000 x 1000 = 1 000 000. You may agree that whilst the school/college/faculty could cope with spending £53.60 as a gesture to students, expecting them to manage £10 000 may be stretching things too far!
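
The order of operations in these summations can be checked with a few lines of code. The following is a minimal sketch in Python, using the prices and quantities from the canteen example; the variable names are chosen here purely for illustration.

```python
# Canteen example: prices in pence (p) and quantities bought (q).
p = [70, 90, 20, 30]
q = [6, 4, 8, 2]

sum_p = sum(p)                                        # sum of the p-values
sum_q = sum(q)                                        # sum of the q-values
sum_pq = sum(pi * qi for pi, qi in zip(p, q))         # multiply first, then sum
sum_q2 = sum(qi ** 2 for qi in q)                     # square first, then sum
sum_pq2 = sum(pi * qi ** 2 for pi, qi in zip(p, q))   # p times q squared, then sum

print(sum_q, sum_pq, sum_q2, sum_pq2)   # 20 1000 120 5360
print(sum_p * sum_q)                    # 4200, which is not the same as sum_pq
print(sum_q ** 2)                       # 400, which is not the same as sum_q2
```

In each case the multiplying or squaring is done item by item before the results are summed, which is why the totals differ from those obtained by summing first.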

More on Summations.



 

Summarising Data:
Remember that a separate TimeWeb section exists that covers the skills in Excel that are necessary to carry out many of the 'processing' tasks here. Learners who may not already be confident with using Excel are advised to visit this reference area of the site, whilst working within the current section.

Whilst we can use tables and charts to make sense of a collection of data, we often need to use mathematical summaries, if we are to carry out detailed statistical analysis. The aim of these summarising processes is to allow us to find one or two numbers that sum up the main characteristics of large collections of data. This section contains the following concepts: central tendency (averages), dispersion (spread), and skewness (bunching).

  • Central tendency: any measure of central tendency is an average. In practice, three different types of average exist:

    The 'mode' is the most frequently occurring value in a set of data.

    The 'median' is the middle value in a set of data, when the data is arranged in ascending order.

    The 'mean' is the measure of central tendency that takes into account all of the values in a set of data. There are different versions of the mean, but the most commonly used is the 'arithmetic' mean, which is calculated by summing the values in a dataset and dividing the result by the number of values that the dataset contains. It is the most easily understood 'average' and is relatively simple to calculate. It is a very useful statistic for comparing countries, time periods and so on. It is perhaps at its weakest when the dataset contains a few outliers at one end of the range of data. These 'pull' the mean towards them, making the mean unrepresentative of the dataset as a whole.

    The formula used to calculate the mean is:

    $\bar{X} = \dfrac{\sum X_i}{n}$
    where $\bar{X}$ = the mean of the observations
    $X_i$ = each individual observation
    $n$ = the number of observations
    $\sum$ = the sum of
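
As a quick check of the three averages, here is a minimal sketch in Python using the standard `statistics` module. The test marks are invented for illustration and include one outlier at the top end.

```python
import statistics

# Hypothetical test marks, with one outlier (110) at the top end.
marks = [50, 60, 60, 65, 70, 75, 110]

mode = statistics.mode(marks)       # most frequently occurring value
median = statistics.median(marks)   # middle value once the data is sorted
mean = sum(marks) / len(marks)      # sum of the observations divided by n

print(mode, median, mean)           # 60 65 70.0
```

Notice that the single high mark pulls the mean above both the median and the mode, which is the weakness of the mean described above.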

    The mean average is often used to help interpret data if it is grouped. This use of the so-called 'weighted average' is illustrated elsewhere.

    There is an illustration of central tendency available in the 'Digging' section.

    The formula for calculating an index is:

    Index = value / base value x 100

    Notice that all indices are constructed using a base year. This is the starting point for any index, because it provides the foundation for comparing what is happening now, with what happened in the base year. This base will change from time to time and part of the task in the worksheet on indices will require you to carry out the re-basing of an index.
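
The index formula and the idea of re-basing can be sketched in Python as follows. The prices and years are hypothetical, and `to_index` is a helper name invented for this illustration rather than anything taken from the worksheet.

```python
def to_index(values, base_period):
    """Express each value as an index number, with the base period equal to 100."""
    base_value = values[base_period]
    # Index = value / base value x 100 (the multiplication is done first here,
    # which gives the same result and keeps the arithmetic tidy).
    return {period: value * 100 / base_value for period, value in values.items()}

# Hypothetical average prices (in pence) for four years.
prices = {2000: 80, 2001: 88, 2002: 100, 2003: 110}

index_2000 = to_index(prices, base_period=2000)
# Re-basing: apply the same formula to the index numbers, using the new base year.
index_2002 = to_index(index_2000, base_period=2002)

print(index_2000)
print(index_2002)
```

Re-basing does not change the relative movements in the series; it only changes which year is set to 100.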

    An illustration of working with indices

  • The 'dispersion' of a set of values is the spread of the data. One measure of dispersion is the 'range', which is simply the difference between the highest and lowest values in the dataset, but this only takes account of the two extremes of the dataset. Sometimes the highest and lowest figures are stated; alternatively, the difference between the two is quoted.

    Another measure of dispersion is the 'quartile range' (also known as the quartile deviation or semi-interquartile range), which is half the range of the middle 50% of values. The quartile range is unaffected by extremes in values in the dataset and is useful when values are 'skewed'. However, like the range, the quartile range does not take account of all values in a dataset.

    The 'mean absolute deviation': when looking at how items within a dataset differ from the mean of that dataset, some observations will be below and some above the arithmetic mean. These differences must sum to zero. The mean absolute deviation takes account of these differences by ignoring their sign: it is the average of the absolute deviations from the mean over all the observations.
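
These three measures can be sketched in Python as follows. The observations are invented, and the quartiles rely on the default method of `statistics.quantiles`; several conventions for locating quartiles exist, so a spreadsheet or another package may give slightly different figures.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]   # hypothetical observations

# Range: the difference between the highest and lowest values.
value_range = max(data) - min(data)

# Quartiles, using the module's default method (an assumption; conventions vary).
q1, q2, q3 = statistics.quantiles(data, n=4)
quartile_range = (q3 - q1) / 2   # half the range of the middle 50% of values

# Mean absolute deviation: average the deviations from the mean, ignoring sign.
mean = sum(data) / len(data)
deviations = [x - mean for x in data]   # these sum to zero
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)

print(value_range, quartile_range, round(mean_abs_dev, 2))
```

The signed deviations cancel out exactly, which is why their absolute values are used.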

    Examples of these measures of dispersion.

    Finally, the Standard Deviation. This measure of dispersion avoids the disadvantage of the range and the quartile range in that it takes account of all the values in the dataset. The negative and positive differences from the mean are taken account of by squaring the differences. The variance is the average of the squared deviations from the mean, and the standard deviation is the square root of this. The size of the standard deviation relative to the mean tells us how dispersed the items in the dataset are around their average. Although it is complicated to work out, and it may seem hard to visualise what it means, the standard deviation (S.D.) is most valuable to us when we are working with sample data. See the Illustration section for examples of these difficult concepts.
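
The calculation can be followed step by step in the short Python sketch below. The data are invented, and the division by n (rather than n - 1) is an assumption made to match the 'average squared deviation' wording used here; the sample version, which divides by n - 1, is shown for comparison.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

variance = sum(squared_deviations) / len(data)   # average squared deviation
std_dev = math.sqrt(variance)                    # square root of the variance

# For comparison: the sample standard deviation divides by n - 1 instead of n.
sample_std_dev = statistics.stdev(data)

print(mean, variance, std_dev)   # 5.0 4.0 2.0
print(round(sample_std_dev, 3))
```

With only a few observations the two versions differ noticeably; with large datasets the difference becomes negligible.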

    There is also an example of standard deviation and the variance.

  • The 'Coefficient of variation' gives us a measure of the amount of variability present in a dataset. It is worked out by expressing the standard deviation as a percentage of the mean.

    Coefficient of variation = S.D./mean x 100
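
A minimal sketch of this calculation in Python is given below. The two datasets are invented, `coefficient_of_variation` is a helper name chosen for illustration, and the use of the population standard deviation is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)   # population S.D. (an assumption)
    return sd / mean * 100

steady = [48, 50, 52]     # hypothetical values clustered near their mean
volatile = [5, 10, 15]    # hypothetical values spread widely around their mean

print(round(coefficient_of_variation(steady), 1))
print(round(coefficient_of_variation(volatile), 1))
```

Because the result is a percentage, it allows the variability of datasets with very different means to be compared directly.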

  • The 'Coefficient of Skewness' shows the tendency of the dataset values to 'bunch' at one end of its distribution, with the values at the other end being relatively dispersed. The mode is the measure indicating the value where most bunching happens. Skewness is measured by working out the extent to which the mode departs from the mean. If the mode is towards the lower values in the dataset, then the skewness is said to be positive; if it occurs towards the higher values, the skewness is negative.

    Sk = (mean - mode) / S.D.

    You will find that there are often difficulties involved in calculating the mode. Because of this, an alternative formula for the coefficient of skewness is often used. (This is based on the knowledge that the difference between the mean and the mode is generally about three times the difference between the mean and the median):

    Sk = 3 (mean - median) / S.D.
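
Both versions of the coefficient can be tried out with the Python sketch below. The marks are invented, with one high outlier so that the values bunch towards the lower end, and the population standard deviation is used as an assumption.

```python
import statistics

# Hypothetical marks bunched at the lower end, with one high outlier.
marks = [50, 60, 60, 65, 70, 75, 110]

mean = statistics.fmean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)
sd = statistics.pstdev(marks)   # population S.D. (an assumption)

sk_mode = (mean - mode) / sd            # Sk = (mean - mode) / S.D.
sk_median = 3 * (mean - median) / sd    # Sk = 3 (mean - median) / S.D.

print(round(sk_mode, 3), round(sk_median, 3))
```

Both coefficients come out positive here, as expected when the mode lies towards the lower values. They need not be equal, because the 'three times' relationship between the two differences is only a rule of thumb.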

  • Moving Averages
    A moving average removes from data the short-run fluctuations that often occur in time series statistics, leaving a smooth pattern which helps analyse the general long-run trend of the time series.
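
One simple way to see the smoothing effect is a three-period moving average, sketched below in Python. The sales figures are hypothetical, and `moving_average` is a helper written for this illustration.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` periods."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical quarterly sales that zig-zag around a rising trend.
sales = [10, 14, 12, 16, 14, 18]
trend = moving_average(sales, window=3)

print(trend)   # [12.0, 14.0, 14.0, 16.0]
```

The averaged series is shorter than the original, because the first complete average needs a full window of values, but the up-and-down movement has largely disappeared.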

    There is an illustration available of moving averages.



 

The Normal Distribution Curve
One of the most important concepts in statistics is the normal distribution curve. Once you have gained a good understanding of the properties of the normal distribution curve you are well-equipped to carry out tests and experiments on data that you gather during your studies and, perhaps, at work.

  • Properties of the Normal Distribution Curve

    We have already looked at the concepts of the mean and the standard deviation. These are vital to the use we can make of the normal distribution. The mean and the standard deviation define the shape of the normal distribution curve.

    For different standard deviations there are different shapes, but all shapes of normal distributions have in common the fact that the curve of the distribution is symmetrical about the vertical axis. This means that there are as many items in a population that deviate from the mean to the right, as there are to the left of the axis.

    As we have already seen, the standard deviation is worked out by finding the square root of the average of the squared deviations of each item in a population from the mean of the population. The size of the standard deviation relative to the mean tells us how dispersed the items in the population are around the average. The larger the standard deviation relative to the mean, the wider the dispersion is of the values of the items in the population.

    The mean of a normal distribution is given the symbol μ (the Greek letter 'mu') and the standard deviation is expressed as σ (the lower-case Greek letter 'sigma'). The points on a normal distribution curve at the mean plus or minus one standard deviation are known as the 'points of inflexion'.

    There is an illustration available of the normal curve

    So, if the items in a population are symmetrically distributed above and below the mean value, then 50% of the population have values greater than the mean and 50% less than the mean.

    Ask yourself the question: 'What are the chances of an item being above the mean?' The answer is clearly 50:50. This is the starting point for our examination of the power of the normal distribution curve. In any measurement of a normally distributed characteristic, 50% of the population will be below the arithmetic mean and 50% above it. The work of de Moivre was the original attempt to state the properties of the normal distribution curve. Because of this work we can state exactly what proportion of the population will have values between one standard deviation above and one standard deviation below the mean. In fact we can do this for any number of standard deviations above or below the mean.

  • 34% of a normally distributed population can be found within the range of the mean plus one standard deviation.

  • Another 34% can be found within the range of the mean minus one standard deviation.

  • It follows that 68% of a population lies within plus or minus one standard deviation of the mean.

    So what?

    Normal distributions are important for a number of reasons. One of the main ones is that many of the important characteristics that you will want to study are normally distributed. Characteristics that are 'heritable' include things such as height, intelligence, weight and many others. If we gather a large sample of a particular measurement, like height, and construct a frequency distribution, we can predict many things about that data.

    If we are asked what the chances are that a member of the population is no more or less than the mean value plus or minus the standard deviation of the population, we should be able to answer precisely. This means that when we are looking at a whole mass of data, we can organise it so that it helps us to predict the probability that particular values of the data can be found within certain ranges of the mean value of the population.

    This knowledge can be rolled out so that we can take distances, measured in standard deviations from the mean and predict the proportion of the population that would fall within the mean value and the multiple of the standard deviation. So for instance:

  • for a range of two standard deviations above and below the mean, approximately 95% of the items in the population are contained within these points.

  • a range of three standard deviations above and below the mean, covers approximately 99% of the population.

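    Note that the figures of 95% and 99% quoted for two and three standard deviations are approximations. To cover exactly 95% of the population, the range is 1.96 standard deviations either side of the mean, and to cover exactly 99% it is 2.58 standard deviations. Two and three standard deviations actually cover about 95.4% and 99.7% of the population respectively.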
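
These proportions can be checked with the `NormalDist` class in Python's standard `statistics` module (available from Python 3.8). This is a minimal sketch; `proportion_within` is a helper name chosen for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 1.96, 2.58):
    print(k, round(proportion_within(k), 4))
```

The results for 1.96 and 2.58 standard deviations come out at almost exactly 95% and 99%, while the whole-number distances give figures close to, but not exactly, the rounded percentages.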

    The next chart illustrates the 68/95/99% probabilities

    [Figure: Normal distribution curve illustrating the 68/95/99% probabilities]

    For most of your work in economics and the social sciences, you will be interested in the range of a population that lies within plus or minus two standard deviations of the mean value - in other words, the area covering 95% of the population. As you will see, we can make all kinds of judgements about the data we gather, assuming that it is distributed normally.

  • Are all characteristics normally distributed?
    No. But what makes the concept of normal distribution so useful is that so many things in the world are, at least approximately: population heights, mortality rates, stock market movements, annual average temperatures, repeated human measurements of a single natural phenomenon, heritable (see earlier) characteristics and so on.

  • Summary of normal distribution
    When we say that a particular population is normally distributed, we mean the following:

    1. The normal frequency curve shows that the highest frequency falls in the centre of the chart, at the mean of the values in the distribution, with an equal and exactly similar curve on either side of that centre. So, the most frequent value in a normal distribution is the average, with half the values falling below the average and half above it.

    2. The normal curve, which is often called a bell curve, is perfectly symmetrical. So the mean (arithmetic average), the mode (most frequent value), and the median (the middle value) all coincide at the centre of the curve - which is the high point of the curve.

    3. The further away any particular value is from the average, the less frequent that value will be.

    4. Because the two halves either side of the centre of the curve are symmetrical, the frequency of values above and below the mean will match exactly, provided that the distances between the values and the mean are identical.

    5. The total frequency of all values in the population will be contained by the area under the curve. In other words, the total area under the curve represents all the possible occurrences of that characteristic.

    6. Certain areas under the curve therefore indicate the percentage of the total frequency. For instance, 50% of the area under the curve lies to the left of the mean, and 50% lies to the right. This means that 50% of all scores lie to the left and 50% to the right. Equal areas under the curve represent equal numbers in the frequency.

    7. 68% of a population lies within plus or minus one standard deviation of the mean.

    8. Approximately 95% of the items in a population are contained within two standard deviations above and below the mean.

    9. Approximately 99% of the population are contained within three standard deviations above and below the mean.

    10. Normal curves may have different shapes. What determines the overall shape of the curve is the value of the mean and the standard deviation in the population. But whatever the shape, these general characteristics remain the same.


  • Confidence Intervals
    The three percentage levels (68, 95 and 99) discussed above are known as confidence levels, and the ranges around the mean that they define are known as 'Confidence Intervals'. So far, we have only used these three levels because they correspond to the ranges defined by 1, 2, and 3 standard deviations away from the mean of any normally distributed population. In practice, we are not limited to just these three levels. We can establish any level of confidence we want. But we will need to know the connection between the confidence level we want to aim for and the precise Z-score at that level.

    There is an illustration of confidence intervals available

     
  • Z-scores and Confidence Intervals
    The Z-score is the standard normal unit of measurement. Tables have been created from which we can read off particular distances from the mean and their corresponding areas under the normal curve. So we can easily determine the level of confidence that we want and find the distance appropriate to it.

z 0.00 0.05
0.0 0.0000 0.0199
0.1 0.0398 0.0596
0.2 0.0793 0.0987
0.3 0.1179 0.1368
0.4 0.1554 0.1736
0.5 0.1915 0.2088
0.6 0.2257 0.2422
0.7 0.2580 0.2734
0.8 0.2881 0.3023
0.9 0.3159 0.3289
1.0 0.3413 0.3531
1.1 0.3643 0.3749
1.2 0.3849 0.3944
1.3 0.4032 0.4115
1.4 0.4192 0.4265
1.5 0.4332 0.4394
1.6 0.4452 0.4505
1.7 0.4554 0.4599
1.8 0.4641 0.4678
1.9 0.4713 0.4744
2.0 0.4772 0.4798
2.1 0.4821 0.4842
2.2 0.4861 0.4878
2.3 0.4893 0.4906
2.4 0.4918 0.4929
2.5 0.4938 0.4946
2.6 0.4953 0.4960
2.7 0.4965 0.4970
2.8 0.4974 0.4978
2.9 0.4981 0.4984
3.0 0.4987 0.4989

This table indicates in the left-hand column the distance from the mean in standard deviation units to one decimal place. This distance is the same as the Z-score. The columns across the top of the table indicate the value of the second decimal place for that Z-score (only the 0.00 and 0.05 columns are shown here). The figure in each cell of the table indicates the area under the curve between the mean and that particular Z-score. Note that this is only for one half of the normal curve.

So in the first line, the area under the curve at a Z-score of 0.00 is 0.0000. This means that when we are exactly on the mean, the area under the curve is 0, because there is no distance between the mean and itself. If we move one column to the right, to a Z-score of 0.05, the corresponding area under the curve between the mean and this distance away from it is 0.0199.

Now check the area figure for a Z-score of 1.00. Notice that it reads .3413. This means that of all the scores under the curve, 34.13% of them will fall between the mean and a Z-score of 1 on one side of the mean. To include all the scores within 1 standard deviation of the mean on both sides, we double that figure to 68.26%. Notice that we have been using the figure 68% as an approximation of that value.

The decimal numbers in the cells of the table also indicate the probability that any value in a normal distribution will fall between the mean and the particular Z-score that corresponds to that value. So if we wanted to know the Z-score that gives us a confidence level of .75 (or 75%), we could find this out easily.

Half of 75 percent is 37.5, which is expressed as a decimal as .375. Looking at the table, we can find the number closest to that value in the cell that corresponds to a Z-score of 1.15. The value in the table is .3749. This means that in a normal distribution, we can be 75% certain that any value will fall within the range of 1.15 standard deviations from the mean.
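
The table look-up can also be done in code. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the two helper functions are named here for illustration only.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

def area_from_mean(z):
    """Area under the curve between the mean and a given Z-score (one side only)."""
    return std_normal.cdf(z) - 0.5

def z_for_confidence(level):
    """Z-score such that `level` of the population lies within +/- z of the mean."""
    # Half of the level lies on each side of the mean, so look up 0.5 + level / 2.
    return std_normal.inv_cdf(0.5 + level / 2)

print(round(area_from_mean(1.00), 4))    # 0.3413, as in the table
print(round(area_from_mean(1.15), 4))    # 0.3749, as in the table
print(round(z_for_confidence(0.75), 2))  # 1.15
```

The second function reverses the table: instead of reading an area from a Z-score, it finds the Z-score that matches a chosen confidence level.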
