Time Series data analysis:
Time series data consists of numerical values recorded at intervals of time. Time series data are often used in conjunction with regression techniques, which are covered here.
In time series analysis the independent variable (x) is given as a period of time. A linear regression equation is used to calculate the trend that the dependent variable (y) adheres to as time passes.
But when time is used as the independent variable there are a number of complications that are introduced to the regression method. These originate from the fact that the dependent variable will usually be subject to a number of influences that, in themselves, are affected by the units that are used to measure time.
For example, if annual data are used, it will be impossible to identify the seasonal factors that may well influence the data. So, if we are looking at data about the consumption of ice cream products, we would probably want to view quarterly figures rather than merely annual data, as we would expect there to be an increase in purchases of these products in the Summer quarter.
So, the objective of time series analysis must be to develop techniques to divide the raw time series data into its component parts. These are: a trend value (t), a seasonal element (s), and a residual element (r).
Firstly, then we want to find the trend value of the time series data that we are analysing. There are a range of techniques designed to discover the trend line. Some of these are based on trends to produce a straight line of best fit. For most purposes a line is drawn by hand or by using averages over periods of time, to smooth out fluctuations and show the general trend.
The most commonly used trend is the moving average, which is a process of repeatedly calculating a series of different average values along a time series in order to produce a trend line.
There is more detail available on moving averages in the 'Digging' section of TimeWeb.
Regression analysis is used as a more advanced method of trend identification. If we assume that the form of model used to identify the component parts of time series data is as follows:
y = t + s + r then we can discover the trend on the basis of a least squares regression equation.
Regression analysis of time series data using Excel.
You should now open a new Excel workbook and enter the following data:
In cell A1: time (x)
In cells A2:A21
In cell B1: y
In cells B2:B21
This data represents a series of quarterly observations over a five year period, where x is the number of the quarter (1 to 20).
Now compute the intercept and the gradient of the trend line and use these to calculate the estimated trend in column C, using the following formulae:
In I1: =INTERCEPT(B2:B21,A2:A21)
In I2: =SLOPE(B2:B21,A2:A21)
In C2: =I$1+I$2*A2 copied into C3:C21
The resulting sheet should look like the following:
Column C gives us the linear trend values predicted by the regression equation of y on x, as follows:
y = 14.57895 + 0.792481x
The next step is to identify the de-trended series. In the form of the equation set up earlier, this will be shown as:
De-trended series = y - t
This means that the entries in Column D should be arrived at by subtracting the values in Column C from those in Column B.
We can do this by entering the following formula into cell D2 (and copying it into D3:D21).
Now we can label the column 'De-Trended Series' in cell D1.
The sheet should now appear as below:
OK. Now we want to see if we can identify any seasonal component contained in the de-trended series. To do this we have to group all the values for each quarter of the year. This means that over the 5 year period, the grouped quarters will be as follows:
1, 5, 9, 13, 17.
2, 6, 10, 14, 18.
3, 7, 11, 15, 19.
4, 8, 12, 16, 20.
With this operation we want to find the difference between the trend value and the actual value for y. So for the first quarter of year 1 the difference between the actual value (20) and the trend value (15.37143) is 4.63.
These equivalent quarter values are placed in the worksheet and are averaged to produce an average value for each season that they represent. The worksheet should now resemble the following:
Notice that the four entries in the Row labelled 'Average' (A31 to E31) is an estimate of the seasonal variation of the series. The next step is to place these values in Column E, alonside their respective quarters. These are then subtracted from the actual time series values. This produces the seasonally adjusted series in Column F.
Do this by following these steps:
In E2, E3, E4 and E5 enter
=B$31 =C$31 =D$31 =E$31
then copy E2:E5 into E6:E21
This transfers the equivalent quarter seasonal variations into Column E and should produce the following result:
Now in F2 enter =B2-E2 and copy this into F3:F21 This will subtract the seasonal variations from the actual series values and will produce the seasonally adjusted series in Column F.
The result should resemble the following:
The final step is to identify the residual elements.
These are found by the following expression: y - s - t.
To produce these enter the following formula into Cell G2:
=B2-E2-C2 and copy it into G3:G21.
Having carried this out and added labels, the worksheet should look like the following:
Summary of Time Series analysis:
You should note that the seasonally adjusted series is one of the most important parts of this analysis. Almost all statistics that you find here in TimeWeb and generally will be seasonally adjusted. What this does is to indicate how the dependent variable would have behaved if it had not been affected by seasonal variation.
In order to see this more clearly, produce a graph from your spreadsheet of the actual series, the trend and the seasonally adjusted series, all plotted on the same axes. You should produce the following type of graph:
As you can see quite clearly, the seasonally adjusted data closely follows the trend line, rather than the actual series. This is the whole reason for carrying out seasonal adjustment.
By looking at this data we can estimate the equation of the straight line that it represents:
We can say that:
- When x = 0, y = 6. So our starting value for y is 6.
- When x = 1, y = 8. So when x increased by 1, y increased by 2.
- When x = 2, y = 10. So y has increased by a further 2 units as a result of x increasing by another 1 unit.
We can see that it appears that a unit increase in x generally causes y to increase by 2 units. Also, remembering that the starting value for y was 6, when the value for x was 0, we can see that the general linear expression for y is:
y = 6 + 2x
Now what we are going to do is to show this graphically by using Excel to produce a chart based on this data:
Open a new Excel workbook and enter in cell A1 the label 'x' and in cell B1 the label 'y'. Next, enter the values for x in A2:A8 and the values for y in B2:B8.
Now select the Chart Wizard function to draw a chart defined by data range A1:B8. Select an XY (Scatter) graph and the option that produces a scattergram with data points connected by smoothed lines. Use the first column for the category x labels and the first row for the legend text. Finally, add the titles of the chart and choose Finish.
You should have produced a straight line as shown in the following graph:
This means that y = a + bx is the general form of the equation of a straight line. Let's imagine an example that may illustrate this form:
Suppose that you want to cook a piece of beef for a special dinner you have planned. The instructions on the joint of meat may say:
Cook for 80 minutes per kilo plus an additional 20 minutes.
If we choose to represent the weight of the beef by x (kilos), then y (the cooking time) for a joint of beef weighing 2.5 kilos can be shown as:
y = 20 + 80 (2.5) = 220 minutes.
In other words, the cooking time for a joint of beef weighing x kilos is:
y = 20 + 80x
As long as we know the weight of the joint of beef (or the value of x), then the cooking time (y) is easy to calculate.
If the cooking instructions are given as follows:
b minutes per kilo plus a minutes over,
then the cooking time, or y, is expressed as below:
y = a + bx
This is the general form of the linear relation, which can be expressed in terms of the cooking instructions referred to earlier, with a = 20 and b = 80.
So, y = a + bx is the same as y = 20 + 80x.
a in the above equation is known as the intercept and b is known as the gradient. Any straight line can be defined by:
y = intercept + gradient (x) = a + bx.
When we put this equation into a graph, with y on the vertical axis and x on the horizontal axis, the following relationships are true:
- When x = 0 the value on the vertical axis (y) is given by the intercept. So:
y (when x = 0) = a
- When x is increased from 0 to 1, the value on the vertical axis (y) increases by b units. So:
y (when x =1) = a + b
- When x is increased from 1 to 2, the value on the vertical axis (y) increases by another b unit. So:
y (when x = 2) = y (when x = 1) + b
In other words, y = a + b + b
Or, y = a + 2b
- When x is increased from 2 to 3, the value on the vertical axis (y) increases by another b unit. So:
y (when x = 3) = y (when x = 2) + b
= a + 2b + b
= a + 3b
Experimenting with a linear model
Let's imagine that we have been asked to determine the extent to which varying feeding levels affects the weight of chicken.
The first step is to identify whatever relationship exists between feed levels and chicken weight. To do this we have to define our model and its variables. The model will be linear and of the form: y = a + bx, where a is the intercept term and b is the gradient term of the presumed line.
Note that it is not necessarily the case that there is a linear relationship here. We shall presume that there is in this case, bearing in mind the limitations of the approach.
What are the x and y terms in this linear model? In this case we can judge that feeding levels would probably influence chicken weight, rather than the other way around. So, chicken weight can be identified as the dependent variable (y) and the amount of feed as the independent variable (x).
This model can be expressed verbally:
'Chicken weight is a linear function of feed levels.'
The aim of linear regression is to work out a precise value to this relationship by finding the values of the intercept and gradient terms.
Now that we have a defined model, we need to collect data on the relationship between chicken weight and the amount of feed given. The standard scientific approach would be to perform a controlled experiment, in which different selected increased amounts of feed are given to a set of chickens and their weight is recorded before and after the experiment period.
There are a number of pitfalls to watch out for:
- The values of x (increased quantity of feed) chosen by the researcher must include zero. This provides the 'control' element to the study, because it tells us what would happen to chicken weight if no increase in quantity of feed is given.
- The non-zero values of x should cover a reasonable range and should avoid large changes from one value to another.
- The increased quantity of feed must be under the researcher's control.
- The conditions under which the study takes place must be fully controlled in terms of other factors, which might otherwise affect the weight of chicken. Only if these other factors can be fully controlled, can we identify the single effect of increasing the quantity of feed.
Within a laboratory environment it might be possible to conform to the above conditions, but in the real world where many variables are affected by other variables that are way beyond the control of a researcher. Our ability to produce a 'scientific' set of conditions is, as we can see, highly constrained.
And this is not all.
We may well be unaware of other factors that may be affecting our research; and if we are unsure of other independent variables that influence our dependent variable, how can we control for them?
For these reasons an investigation into a relationship between variables must reflect a logical reason for expecting there to be a relationship to be discovered.
For the sake of our illustration, though, we will assume that the four criteria above have been satisfied. We might then go on to set up our research as follows:
Percentage increase in feed (x): 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
Now, in an ideal world, we would select a set of identical chickens and allow each of them to be provided with the chosen increased level of feed. So the first chicken would have 0% increase in feed, the second chicken would be provided with a 10% increase, and so on.
But of course, the problem is that it is impossible to ensure that the chicken are identical in terms of their physical characteristics. We must accept that variations in these characteristics are inevitable and that this will place an effect on their body weight and potential size that is unrelated to the amount by which we increase their feed levels.
To deal with this problem we could choose to apply the chosen increases in feed levels to sets of, say, 30 relatively similar chicken and then record the mean weight of the birds at the beginning and the end of the research for each set of chicken. This means that the first set of 30 birds would each receive 0% increase in feed, the second set of 30 would each receive 10% more feed than usual, and so on. The mean weight would be calculated and recorded as the y values corresponding to a 0% increase in feed, 10% increase, and so on.
Let's suppose that after the period of the research, the following results are achieved (there is an Excel version of this available for download - reg1.xls [Excel file 17K]):
With the scatter of points seen on the chart, we must try to fit a straight line through these points.
Note that on the accompanying Excel worksheet [Excel file 17K], cells A1 and A2 contain the terms a and b; and cells B1 and B2 contain values for a (0.3) and b (0.6). These are our estimates for the values of the intercept (a) and gradient (b). We are guessing that the straight line that can be plotted onto the scattergram can be expressed as:
y = 0.3 + 0.6x
Column C of the worksheet is used to compute the values of y that are associated with the estimated equation.
To do this we need to carry out the following steps:
Enter the following formula into cell C4
This gives the value of y, predicted by the linear equation, that we are using for the first value of x.
Now copy C4 into C5:C14
This gives the rest of the predicted values for y, associated with the remaining values of x.
Finally, in C3 add the label:
Let's re-cap on what we've just done:
Column B contains the actual values of y, (the weight change in chickens), that were observed in the research experiment for each of the chosen values of x (increase in feed levels).
Column C gives the values of y that are predicted for these chosen values of x, on the basis of the straight line defined by our guesses of the values of a and b,
(y = 0.3 + 0.6x).
We can now add this predicted line to the scattergram in reg1.xls [Excel file 17K]
Do this by right-clicking on the original chart, selecting 'source data' and redefining the data range as A3:C14.
The result should look like the following worksheet.
Is this a 'good' result?
Well, you ought to be able to see that the line (y = 0.3 + 0.6) that has been fitted to the data points is not very good. What we want is an objective statement about the relationship between the two variables. To do this we have to use an algebraic method known as 'least squares'.
The Least Squares Regression Line
What were we trying to do with our estimate of a linear equation? Simply to try to find the best straight line to fit with our data set. In terms of this analysis, we take 'best' to mean 'involving the least error'. What do we mean by 'error'?
Firstly, let's look at a few definitions that will be useful here:
For each of the chosen values of x (the percentage increase in feed levels) there are two values of y:
So, for a value of x of 10%, there is the observed value of y = 53, and the predicted value of:
(y = 0.3 + 0.6x)
(y = 0.3 + 0.6(10))
(y = 6.3)
y = 6.3
Any error term can be defined as ei:
ei = (yi(observed) - yi(predicted))
In this case of x = 10, the difference between the observed and predicted value of y is 53 - 6.3 = 46.7
Because there are eleven pairs of x,y observations, there are also eleven error terms. The total error is the sum of the individual errors.
So, to return to our original question, the best straight line will be the one that involves the least total error.
The final problem to overcome is that some of the error terms will be positive and some negative. These will tend to cancel one another out, leaving us with the possibility that in certain cases the total error could be calculated as zero, even though there is not a single observation lying close to the line of 'best' fit.
To deal with this problem we do the same as was done with the variance - (see example) - we square each error term and then use the sum of the squared errors as our measure of the total error.
This statistic is known as the Sum of Squared Errors (SSE) and we want to find the values of a and b that define the line that minimises the sum of squared errors. We can do this using our current example in the worksheet reg1.xls [Excel file 17K].
Performing the error calculations in Excel
Firstly, in D3 enter the label: 'Error'
Then in D4 enter the following formula:
Now copy this formula into D5:D14
These are the error terms based on the observed data and the fitted line. As you can see some are positive and some negative and would begin to cancel each other out if we summed them. So what we have to do is compute the squared errors in column E.
Do this by entering the label 'Squared error' in E3
Then in E4 enter the formula =D4^2
This will square the first error term.
Now copy E4 into E5:E14 to produce each of the squared errors for the whole dataset.
Finally in E15 enter the formula =SUM(E4:E14)
This calculates the SSE. It is this sum that we need to make as small as possible.
When you have completed this task, your resulting worksheet should look like reg3.xls, as follows
You should be able to see that the SSE is very large (12104), so we should change the values for a and b so that the value of the SSE and the position of the fitted line will change.
Try to do this yourself by changing the value of a and b in cells B1 and B2 and notice how, as you change the values for a and b, both the fitted line and the SSE changes. Try to reduce the SSE to as small a number as possible.
Using this trial and error approach shows you what you are trying to do and how you are trying to achieve it. You could carry on with changing these a and b values for ages, until you were satisfied that the SSE could go no lower. Even so, you would probably be wrong!
There are though formulae that you can use to calculate the values of a and b that minimise the SSE. Happily, these are built into Excel functions. These are called:
=INTERCEPT(Range of y values,Range of x values)
=SLOPE(Range of y values,Range of x values)
These functions are used as follows:
Open the worksheet reg3.xls [Excel file 18K] and in cell C1 enter the label Intercept (a) and in cell C2 enter the label Gradient (b).
Now in cell D1 enter the following formula:
Notice that a value of 51.5 should be returned in this cell.
And in cell D2 enter the following formula:
Notice that a value of 0.17 should be returned in this cell.
Use the Save As option under File to name this resulting worksheet reg4.xls, which should resemble the following:
These two values for the intercept and gradient of the linear model are the least squares regression coefficients for the data in the 'chicken weight is a linear function of feed levels' experiment. The values that are produced are those that minimise the SSE and the line that is defined by these coefficients is as follows:
y = 51.5 + 0.17x
This is the best line that can be fitted to the given data.
You should confirm this by entering the values calculated in D1 and D2 as the estimated a and b values in B1 and B2.
The resulting worksheet should look like the reg5.xls screenshot that follows:
Notice that the SSE is reduced to 368.1. This is as small as it can be, but it is far from being zero. This tells you why we use the term 'best fit' for the linear equation and the least squares regression line. The values for a and b define the best straight line through the data. We are not able at this stage to analyse whether this line is good or bad. This is an area that we will investigate in the next section of Analysing.
Use Save As to rename this worksheet reg5.xls. Alternatively, there is a downloadable version [Excel file 18K] of reg5.xls.
What is also important to note here is that the equation arrived at from the least squares regression analysis, is defined for any finite value of x, not only for those values of x that we observed. This gives us the power to make predictions about the degree of weight gain in our chickens, resulting from our increasing their feed levels to an extent that was not actually done in the experiment.
If we make these predictions on the basis of the x values that we observed in the research experiment (0 to 100%) we are carrying out what is known as interpolation. If our predictions use a value of x outside the observed range (by cutting the quantity of feed, or by increasing feed by more than twice the normal level), we are engaged in what is called extrapolation.
Calculating the Goodness of Fit
OK. Let's go back to the distinction between 'good' and 'best' in terms of the link between the data we record from research or an experiment, such as in our chicken feed and weight case.
There are different measures that tell us how well a fitted line corresponds to a perfect straight line. The simplest is called the correlation coefficient.
There is an explanation of the correlation coefficient available.
The correlation coefficient (r) has a maximum value of 1 and a minimum value of -1. A value of 1 means that there is a perfect linear relationship between pairs of x,y observations.
Excel can calculate the correlation coefficient by using the =CORREL function.
Another method of assessing the goodness of fit is with the R2. This is known as the coefficient of determination and is calculated simply by squaring the correlation coefficient. Because r always lies between 1 and -1, R2 will always lie between 1 and 0 and will always be less than or equal to r.
So to finish our work with chicken feed and weight, let's complete the regression analysis by opening up the Excel worksheet reg5.xls [Excel file 18K] and calculating the r and R2.
In E1 enter the label Cor. Coeff and in E2 enter the following formula:
A value of r =0.68 should be returned.
This indicates that the linear relation in our research is not especially strong, being quite far from a perfect reading of 1.
In F1 enter the label R Squared and in F2 enter the following formula:
A value of R2 = 0.46 should be returned.
The finished Excel sheet should resemble the following:
This means that of the total variation that takes place in the y variable, 46% is explained by the regression equation that we have fitted. This confirms what we already know: that the linear equation used here is not a very good fit for the data we have.