# Business Statistics (10)

INTRODUCTION TO MULTIPLE REGRESSION

CONTENTS

10.1 Multiple Regression Models
10.2 The First-Order Model: Estimating and Interpreting the β Parameters
10.3 Model Assumptions
10.4 Inferences About the β Parameters
10.5 Checking the Overall Utility of a Model
10.6 Using the Model for Estimation and Prediction
10.7 Residual Analysis: Checking the Regression Assumptions
10.8 Some Pitfalls: Estimability, Multicollinearity, and Extrapolation

STATISTICS IN ACTION

"Wringing" The Bell Curve

In Chapter 9 we demonstrated how to model the relationship between a dependent variable $y$ and an independent variable $x$ using a straight line. We fit the straight line to the data points, used $r$ and $r^2$ to measure the strength of the relationship between $y$ and $x$, and used the resulting prediction equation to estimate the mean value of $y$ or to predict some future value of $y$ for a given value of $x$.

This chapter extends the basic concept of Chapter 9, converting it into a powerful estimation and prediction device by modeling the mean value of $y$ as a function of two or more independent variables. The techniques developed will enable you to build a model for a response, $y$, as a function of two or more variables. As in the case of a simple linear regression, a multiple regression analysis involves fitting the model to a data set, testing the utility of the model, and using it for estimation and prediction.


MULTIPLE REGRESSION MODELS

Most practical applications of regression analysis utilize models that are more complex than the simple straight-line model. For example, a realistic probabilistic model for reaction time to a stimulus would include more than just the amount of a particular drug in the bloodstream. Factors such as age, a measure of visual perception, and sex of the subject are a few of the many variables that might be related to reaction time. Thus, we would want to incorporate these and other potentially important independent variables into the model in order to make accurate predictions.

Probabilistic models that include more than one independent variable are called multiple regression models. The general form of these models is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

The dependent variable $y$ is now written as a function of $k$ independent variables, $x_1, x_2, \ldots, x_k$. The random error term is added to make the model probabilistic rather than deterministic. The value of the coefficient $\beta_i$ determines the contribution of the independent variable $x_i$, and $\beta_0$ is the y-intercept. The coefficients $\beta_0, \beta_1, \ldots, \beta_k$ are usually unknown because they represent population parameters.

At first glance it might appear that the regression model shown above would not allow for anything other than straight-line relationships between $y$ and the independent variables, but this is not true. Actually, $x_1, x_2, \ldots, x_k$ can be functions of variables as long as the functions do not contain unknown parameters. For example, the reaction time, $y$, of a subject to a visual stimulus could be a function of the independent variables

$x_1$ = Age of the subject
$x_2$ = (Age)$^2$ = $x_1^2$
$x_3$ = 1 if male subject, 0 if female subject

The $x_2$ term is called a higher-order term, since it is the value of a quantitative variable ($x_1$) squared (i.e., raised to the second power). The $x_3$ term is a dummy (coded) variable representing a qualitative variable (gender). The multiple regression model is quite versatile and can be made to model many different types of response variables.

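The General Multiple Regression Model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where $y$ is the dependent variable, $x_1, x_2, \ldots, x_k$ are the independent variables, and $\beta_i$ determines the contribution of the independent variable $x_i$.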
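The reaction-time illustration can be sketched in code to show how a higher-order term and a dummy variable are built from the raw measurements before they enter the model. This is a minimal sketch: the function names and the coefficient values are hypothetical and chosen only to make the arithmetic visible; they are not estimates from any data set.

```python
def reaction_time_terms(age, is_male):
    """Return [x1, x2, x3] for one subject."""
    x1 = age                      # quantitative variable
    x2 = age ** 2                 # higher-order term: x1 squared
    x3 = 1 if is_male else 0      # dummy (coded) variable for gender
    return [x1, x2, x3]


def mean_response(betas, terms):
    """Deterministic part of the model: beta0 + beta1*x1 + ... + betak*xk."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], terms))


# Hypothetical coefficients (assumed for illustration only).
betas = [0.50, 0.010, 0.0001, -0.05]
terms = reaction_time_terms(age=30, is_male=True)
print(terms)
print(mean_response(betas, terms))
```

Note that the model remains linear in the β parameters even though one of its terms is the square of age; only the inputs are transformed.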


As shown in the box, the steps used to develop the multiple regression model are similar to those used for the simple linear regression model.

Analyzing a Multiple Regression Model

Step 1 Hypothesize the deterministic component of the model. This component relates the mean, $E(y)$, to the independent variables $x_1, x_2, \ldots, x_k$. This involves the choice of the independent variables to be included in the model.
Step 2 Use the sample data to estimate the unknown model parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ in the model.
Step 3 Specify the probability distribution of the random error term, $\varepsilon$, and estimate the standard deviation of this distribution, $\sigma$.
Step 4 Check that the assumptions on $\varepsilon$ are satisfied, and make model modifications if necessary.
Step 5 Statistically evaluate the usefulness of the model.
Step 6 When satisfied that the model is useful, use it for prediction, estimation, and other purposes.

In this introduction to multiple regression, we lay the foundation of model building (or useful model construction). We consider the most basic multiple regression model, called the first-order model.

THE FIRST-ORDER MODEL: ESTIMATING AND INTERPRETING THE β PARAMETERS
A model that includes only terms for quantitative independent variables, called a first-order model, is described in the box. Note that the first-order model does not include any higher-order terms (such as $x_1^2$).

A First-Order Model in Five Quantitative Independent Variables*

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$

where $x_1, x_2, \ldots, x_5$ are all quantitative variables that are not functions of other independent variables.

Note: $\beta_i$ represents the slope of the line relating $y$ to $x_i$ when all the other $x$'s are held fixed.

The method of fitting first-order models, and multiple regression models in general, is identical to that of fitting the simple straight-line model: the method of least squares. That is, we choose the estimated model

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$$

that minimizes

$$SSE = \sum (y - \hat{y})^2$$

*The terminology "first-order" is derived from the fact that each $x$ in the model is raised to the first power.

As in the case of the simple linear model, the sample estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are obtained as a solution to a set of simultaneous linear equations.* The primary difference between fitting the simple and multiple regression models is computational difficulty. The $(k + 1)$ simultaneous linear equations that must be solved to find the $(k + 1)$ estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are difficult (sometimes nearly impossible) to solve with a calculator. Consequently, we resort to the use of computers. Instead of presenting the tedious hand calculations required to fit the models, we present output from a variety of statistical software packages.
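To make the "set of simultaneous linear equations" concrete, the sketch below forms the (k + 1) normal equations and solves them by Gaussian elimination using only the Python standard library. It is an illustration under stated assumptions, not the routine used by any of the statistical packages mentioned in the text: the function names are hypothetical, and the small data set is made up so that it follows y = 1 + 2x1 + x2 exactly (no random error), which lets the recovered estimates be checked by eye.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("equations are singular; estimates are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def fit_least_squares(x_rows, y):
    """Return (estimates, SSE) for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in x_rows]   # leading 1 for the intercept
    p = len(design[0])                               # p = k + 1 parameters
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y_i for d, y_i in zip(design, y)) for i in range(p)]
    beta_hat = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta_hat, d)) for d in design]
    sse = sum((y_i - f) ** 2 for y_i, f in zip(y, fitted))
    return beta_hat, sse


# Hypothetical data generated exactly from y = 1 + 2*x1 + x2.
x_rows = [(0, 0), (1, 0), (0, 1), (2, 1), (1, 2), (3, 2)]
y = [1 + 2 * x1 + x2 for x1, x2 in x_rows]
beta_hat, sse = fit_least_squares(x_rows, y)
print(beta_hat, sse)
```

With real data the errors are not zero, so SSE is positive and the estimates only approximate the underlying parameters. Solving the normal equations directly is fine for a small illustration, although production software generally uses more numerically stable matrix methods.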

Suppose a property appraiser wants to model the relationship between the sale price of a residential property in a mid-size city and the following three independent variables: (1) appraised land value of the property, (2) appraised value of improvements (i.e., home value) on the property, and (3) area of living space on the property (i.e., home size). Consider the first-order model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

where

$y$ = Sale price (dollars)
$x_1$ = Appraised land value (dollars)
$x_2$ = Appraised improvements (dollars)
$x_3$ = Area (square feet)

To fit the model, the appraiser selected a random sample of $n = 20$ properties from the thousands of properties that were sold in a particular year. The resulting data are given in Table 10.1.

a. Use scattergrams to plot the sample data. Interpret the plots.

b. Use the method of least squares to estimate the unknown parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ in the model.

c. Find the value of SSE that is minimized by the least squares method.

Solution

a. SPSS scatterplots for examining the bivariate relationships between $y$ and $x_1$, $y$ and $x_2$, and $y$ and $x_3$ are shown in Figures 10.1a-c. Of the three variables, appraised improvements ($x_2$) appears to have the strongest linear relationship with sale price ($y$). (See Figure 10.1b.)

*Students who are familiar with calculus should note that $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the solutions to the set of equations $\partial SSE/\partial \beta_0 = 0$, $\partial SSE/\partial \beta_1 = 0$, ..., $\partial SSE/\partial \beta_k = 0$. The solution is usually given in matrix form, but we do not present the details here. See the references for details.


TABLE 10.1 Real Estate Appraisal Data for 20 Properties

Columns: Property # (Obs.); Sale Price, $y$; Land Value, $x_1$; Improvements Value, $x_2$; Area, $x_3$

Source: Alachua County (Florida) Property Appraisers Office.

FIGURE 10.1a SPSS scatterplots for the data of Table 10.1: Plot of Sale Price with Land Value (horizontal axis: LANDVAL)

FIGURE 10.1b Plot of Sale Price with Improvements Value (horizontal axis: IMPROVAL)

FIGURE 10.1c Plot of Sale Price with Area (horizontal axis: AREA)

b. The model hypothesized above is fit to the data of Table 10.1 using SAS. A portion of the SAS printout is reproduced in Figure 10.2. The least squares estimates of the $\beta$ parameters appear (highlighted) in the column labeled Parameter Estimate. You can see that $\hat{\beta}_0 = 1{,}470.275919$, $\hat{\beta}_1 = .814490$, $\hat{\beta}_2 = .820445$, and $\hat{\beta}_3 = 13.528650$. Therefore, the equation that minimizes SSE for this data set (i.e., the least squares prediction equation) is

$$\hat{y} = 1{,}470.28 + .8145x_1 + .8204x_2 + 13.53x_3$$

c. The minimum value of the SSE is highlighted in Figure 10.2 in the Sum of Squares column and the Error row. This value is SSE = 1,003,491,259.4.

FIGURE 10.2 SAS output for sale price model, Example 10.1

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | 3 | 8779676740.6 | 2926558913.5 | 46.662 | 0.0001 |
| Error | 16 | 1003491259.4 | 62718203.714 | | |
| C Total | 19 | 9783168000.0 | | | |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 7919.48254 | R-square | 0.8974 |
| Dep Mean | 56660.00000 | Adj R-sq | 0.8782 |
| C.V. | 13.97720 | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | 1470.275919 | 5746.3245832 | 0.256 | 0.8013 |
| X1 | 1 | 0.814490 | 0.51221871 | 1.590 | 0.1314 |
| X2 | 1 | 0.820445 | 0.21118494 | 3.885 | 0.0013 |
| X3 | 1 | 13.528650 | 6.58568006 | 2.054 | 0.0567 |

After obtaining the least squares prediction equation, the analyst will usually want to make meaningful interpretations of the $\beta$ estimates. Recall that in the straight-line model (Chapter 9)

$$y = \beta_0 + \beta_1 x + \varepsilon$$

$\beta_0$ represents the y-intercept of the line and $\beta_1$ represents the slope of the line. From our discussion in Chapter 9, $\beta_1$ has a practical interpretation: it represents the mean change in $y$ for every 1-unit increase in $x$. When the independent variables are quantitative, the $\beta$ parameters in the first-order model specified in Example 10.1 have similar interpretations. The difference is that when we interpret the $\beta$ that multiplies one of the variables (e.g., $x_1$), we must be certain to hold the values of the remaining independent variables (e.g., $x_2$, $x_3$) fixed.

To see this, suppose that the mean $E(y)$ of a response $y$ is related to two quantitative independent variables, $x_1$ and $x_2$, by the first-order model

$$E(y) = 1 + 2x_1 + x_2$$

In other words, $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 1$. Now, when $x_2 = 0$, the relationship between $E(y)$ and $x_1$ is given by

$$E(y) = 1 + 2x_1 + (0) = 1 + 2x_1$$

A graph of this relationship (a straight line) is shown in Figure 10.3. Similar graphs of the relationship between $E(y)$ and $x_1$ for $x_2 = 1$,

$$E(y) = 1 + 2x_1 + (1) = 2 + 2x_1$$

and for $x_2 = 2$,

$$E(y) = 1 + 2x_1 + (2) = 3 + 2x_1$$

also are shown in Figure 10.3.

FIGURE 10.3 Graphs of $E(y) = 1 + 2x_1 + x_2$ for $x_2 = 0, 1, 2$

Note that the slopes of the three lines are all equal to $\beta_1 = 2$, the coefficient that multiplies $x_1$. Figure 10.3 exhibits a characteristic of all first-order models: If you graph $E(y)$ versus any one variable (say, $x_1$) for fixed values of the other variables, the result will always be a straight line with slope equal to $\beta_1$. If you repeat the process for other values of the fixed independent variables, you will obtain a set of parallel straight lines. This indicates that the effect of the independent variable $x_1$ on $E(y)$ is independent of all the other independent variables in the model, and this effect is measured by the slope $\beta_1$ (see the box on p. 559).

A three-dimensional graph of the model $E(y) = 1 + 2x_1 + x_2$ is shown in Figure 10.4. Note that the model graphs as a plane. If you slice the plane at a particular value of $x_2$ (say, $x_2 = 0$), you obtain a straight line relating $E(y)$ to $x_1$ (e.g., $E(y) = 1 + 2x_1$). Similarly, if you slice the plane at a particular value of $x_1$, you obtain a straight line relating $E(y)$ to $x_2$. Since it is more difficult to visualize three-dimensional and, in general, $k$-dimensional surfaces, we will graph all the models presented in this chapter in two dimensions. The key to obtaining these graphs is to hold fixed all but one of the independent variables in the model.

FIGURE 10.4 The plane $E(y) = 1 + 2x_1 + x_2$
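The parallel-lines property can be checked numerically. The short sketch below evaluates E(y) = 1 + 2x1 + x2 at the three fixed values of x2 used in Figure 10.3 and records the intercept and slope of each resulting line. The function and variable names are hypothetical, introduced only for this illustration.

```python
def mean_response(x1, x2):
    """E(y) = 1 + 2*x1 + x2, the first-order model discussed in the text."""
    return 1 + 2 * x1 + x2


# For each fixed x2, the line in x1 has intercept E(y) at x1 = 0
# and slope equal to the change in E(y) for a 1-unit increase in x1.
intercepts = {x2: mean_response(0, x2) for x2 in (0, 1, 2)}
slopes = {x2: mean_response(1, x2) - mean_response(0, x2) for x2 in (0, 1, 2)}

print(intercepts)   # one intercept per fixed value of x2
print(slopes)       # the same slope for every fixed value of x2
```

Changing x2 shifts the line up or down (the intercept changes) but leaves its slope at 2, which is what "parallel" means here.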

Refer to the first-order model for sale price $y$ considered in Example 10.1. Interpret the estimates of the $\beta$ parameters in the model.


Solution

The least squares prediction equation, as given in Example 10.1, is

$$\hat{y} = 1{,}470.28 + .8145x_1 + .8204x_2 + 13.53x_3$$

We know that with first-order models $\beta_1$ represents the slope of the $y$-$x_1$ line for fixed $x_2$ and $x_3$. That is, $\beta_1$ measures the change in $E(y)$ for every 1-unit increase in $x_1$ when all other independent variables in the model are held fixed. Similar statements can be made about $\beta_2$ and $\beta_3$; e.g., $\beta_2$ measures the change in $E(y)$ for every 1-unit increase in $x_2$ when all other $x$'s in the model are held fixed. Consequently, we obtain the following interpretations:

$\hat{\beta}_1 = .8145$: We estimate the mean sale price of a property, $E(y)$, to increase .8145 dollar for every \$1 increase in appraised land value ($x_1$) when both appraised improvements ($x_2$) and area ($x_3$) are held fixed.

$\hat{\beta}_2 = .8204$: We estimate the mean sale price of a property, $E(y)$, to increase .8204 dollar for every \$1 increase in appraised improvements ($x_2$) when both appraised land value ($x_1$) and area ($x_3$) are held fixed.

$\hat{\beta}_3 = 13.53$: We estimate the mean sale price of a property, $E(y)$, to increase \$13.53 for each additional square foot of living area ($x_3$) when both appraised land value ($x_1$) and appraised improvements ($x_2$) are held fixed.

The value $\hat{\beta}_0 = 1{,}470.28$ does not have a meaningful interpretation in this example. To see this, note that $\hat{y} = \hat{\beta}_0$ when $x_1 = x_2 = x_3 = 0$. Thus, $\hat{\beta}_0 = 1{,}470.28$ represents the estimated mean sale price when the values of all the independent variables are set equal to 0. Since a residential property with these characteristics (appraised land value of \$0, appraised improvements of \$0, and 0 square feet of living area) is not practical, the value of $\hat{\beta}_0$ has no meaningful interpretation. In general, $\hat{\beta}_0$ will not have a practical interpretation unless it makes sense to set the values of the $x$'s simultaneously equal to 0.

The interpretation of the $\beta$ parameters in a multiple regression model will depend on the terms specified in the model. The interpretations above are for a first-order linear model only. In practice, you should be sure that a first-order model is the correct model for $E(y)$ before making these $\beta$ interpretations.
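The "held fixed" interpretation can be illustrated with the rounded prediction equation from Example 10.1. In the sketch below the coefficient values come from the text, while the property characteristics (land value, improvements, area) are hypothetical numbers chosen only for the demonstration; the function name is likewise an assumption.

```python
# Rounded least squares estimates reported for Example 10.1.
B0, B1, B2, B3 = 1470.28, 0.8145, 0.8204, 13.53


def predicted_sale_price(land_value, improvements, area):
    """y-hat = B0 + B1*x1 + B2*x2 + B3*x3 (dollars)."""
    return B0 + B1 * land_value + B2 * improvements + B3 * area


# Hypothetical property, not taken from Table 10.1.
base = predicted_sale_price(10_000, 40_000, 1_500)

# Increase one variable by 1 unit while the other two are held fixed.
change_land = predicted_sale_price(10_001, 40_000, 1_500) - base
change_improvements = predicted_sale_price(10_000, 40_001, 1_500) - base
change_area = predicted_sale_price(10_000, 40_000, 1_501) - base

print(round(base, 2))
print(round(change_land, 4), round(change_improvements, 4), round(change_area, 4))
```

Each difference reproduces the corresponding coefficient, and the result would be the same for any other starting property because the model is first-order.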

MODEL ASSUMPTIONS
We noted in Section 10.1 that the general multiple regression model is of the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where $y$ is the response variable that we wish to predict; $\beta_0, \beta_1, \ldots, \beta_k$ are parameters with unknown values; $x_1, x_2, \ldots, x_k$ are information-contributing variables that are measured without error; and $\varepsilon$ is a random error component. Since $\beta_0, \beta_1, \ldots, \beta_k$ and $x_1, x_2, \ldots, x_k$ are nonrandom, the quantity

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

represents the deterministic portion of the model. Therefore, $y$ is composed of two components, one fixed and one random, and, consequently, $y$ is a random variable.

$$y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}_{\text{Deterministic portion of model}} + \underbrace{\varepsilon}_{\text{Random error}}$$

We will assume (as in Chapter 9) that the random error can be positive or negative and that for any setting of the $x$ values, $x_1, x_2, \ldots, x_k$, the random error $\varepsilon$ has a normal probability distribution with mean equal to 0 and variance equal to $\sigma^2$. Further, we assume that the random errors associated with any (and every) pair of $y$ values are probabilistically independent. That is, the error, $\varepsilon$, associated with any one $y$ value is independent of the error associated with any other $y$ value. These assumptions are summarized in the next box.

Assumptions for Random Error $\varepsilon$

1. For any given set of values of $x_1, x_2, \ldots, x_k$, the random error $\varepsilon$ has a normal probability distribution with mean equal to 0 and variance equal to $\sigma^2$.
2. The random errors are independent (in a probabilistic sense).
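A small simulation can make these assumptions tangible. The sketch below generates responses from the model E(y) = 1 + 2x1 + x2 used earlier in the chapter, adding independent normal errors with mean 0 and standard deviation σ. The value σ = 2, the setting (x1, x2) = (3, 5), the seed, and all names are assumptions made for illustration.

```python
import random


def simulate_y(x_values, betas, sigma, rng):
    """Draw one y = deterministic part + normal random error; return (y, error)."""
    mean = betas[0] + sum(b * x for b, x in zip(betas[1:], x_values))
    error = rng.gauss(0.0, sigma)      # each call is an independent N(0, sigma^2) draw
    return mean + error, error


rng = random.Random(10)                # fixed seed so the run is repeatable
betas = [1.0, 2.0, 1.0]                # beta0, beta1, beta2
sigma = 2.0                            # assumed error standard deviation

draws = [simulate_y((3.0, 5.0), betas, sigma, rng) for _ in range(10_000)]
errors = [e for _, e in draws]

mean_error = sum(errors) / len(errors)
var_error = sum((e - mean_error) ** 2 for e in errors) / (len(errors) - 1)
mean_y = sum(y for y, _ in draws) / len(draws)

print(mean_error, var_error, mean_y)
```

Over many draws the average error should be close to 0, the sample variance of the errors close to σ² = 4, and the average response close to E(y) = 12 for this setting of the x's.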

Note that $\sigma^2$ represents the variance of the random error $\varepsilon$. As such, $\sigma^2$ is an important measure of the usefulness of the model for the estimation of the mean and the prediction of actual values of $y$. If $\sigma^2 = 0$, all the random errors will equal 0 and the predicted values, $\hat{y}$, will be identical to $E(y)$; that is, $E(y)$ will be estimated without error. In contrast, a large value of $\sigma^2$ implies large (absolute) values of $\varepsilon$ and larger deviations between the predicted values, $\hat{y}$, and the mean value, $E(y)$. Consequently, the larger the value of $\sigma^2$, the greater will be the error in estimating the model parameters $\beta_0, \beta_1, \ldots, \beta_k$ and the error in predicting a value of $y$ for a specific set of values of $x_1, x_2, \ldots, x_k$. Thus, $\sigma^2$ plays a major role in making inferences about $\beta_0, \beta_1, \ldots, \beta_k$, in estimating $E(y)$, and in predicting $y$ for specific values of $x_1, x_2, \ldots, x_k$.

Since the variance, $\sigma^2$, of the random error, $\varepsilon$, will rarely be known, we must use the results of the regression analysis to estimate its value. Recall that $\sigma^2$ is the variance of the probability distribution of the random error, $\varepsilon$, for a given set of values for $x_1, x_2, \ldots, x_k$; hence it is the mean value of the squares of the deviations of the $y$ values (for given values of $x_1, x_2, \ldots, x_k$) about the mean value $E(y)$.* Since the predicted value, $\hat{y}$, estimates $E(y)$ for each of the data points, it seems natural to use

$$SSE = \sum (y_i - \hat{y}_i)^2$$

to construct an estimator of $\sigma^2$.

*Since $y = E(y) + \varepsilon$, $\varepsilon$ is equal to the deviation $y - E(y)$. Also, by definition, the variance of a random variable is the expected value of the square of the deviation of the random variable from its mean. According to our model, $E(\varepsilon) = 0$. Therefore, $\sigma^2 = E(\varepsilon^2)$.


For example, in the first-order model of Example 10.2, we found that SSE = 1,003,491,259.4. We now want to use this quantity to estimate the variance of $\varepsilon$. Recall that the estimator for the straight-line model is $s^2 = SSE/(n - 2)$ and note that the denominator is ($n$ - Number of estimated $\beta$ parameters), which is $(n - 2)$ in the straight-line model. Since we must estimate four parameters, $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$, for the first-order model, the estimator of $\sigma^2$ is

$$s^2 = \frac{SSE}{n - 4}$$

The numerical estimate for this example is

$$s^2 = \frac{SSE}{20 - 4} = \frac{1{,}003{,}491{,}259.4}{16} = 62{,}718{,}203.7$$

In many computer printouts and textbooks, $s^2$ is called the mean square for error (MSE). This estimate of $\sigma^2$ is shown in the column titled Mean Square in the SAS printout in Figure 10.2. The units of the estimated variance are squared units of the dependent variable $y$. Since the dependent variable $y$ in this example is sale price in dollars, the units of $s^2$ are (dollars)$^2$. This makes meaningful interpretation of $s^2$ difficult, so we use the standard deviation $s$ to provide a more meaningful measure of variability. In this example,

$$s = \sqrt{62{,}718{,}203.7} = 7{,}919.5$$

which is given on the SAS printout in Figure 10.2 next to Root MSE. One useful interpretation of the estimated standard deviation $s$ is that the interval $\pm 2s$ will provide a rough approximation to the accuracy with which the model will predict future values of $y$ for given values of $x$. Thus, in Example 10.2, we expect the model to provide predictions of sale price to within about $\pm 2s = \pm 2(7{,}919.5) = \pm 15{,}839$ dollars.*

For the general multiple regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

we must estimate the $(k + 1)$ parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$. Thus, the estimator of $\sigma^2$ is SSE divided by the quantity ($n$ - Number of estimated $\beta$ parameters). We will use the estimator of $\sigma^2$ both to check the utility of the model (Sections 10.4 and 10.5) and to provide a measure of reliability of predictions and estimates when the model is used for those purposes (Section 10.6). Thus, you can see that the estimation of $\sigma^2$ plays an important part in the development of a regression model.

Estimator of $\sigma^2$ for a Multiple Regression Model with $k$ Independent Variables

$$s^2 = \frac{SSE}{n - \text{Number of estimated } \beta \text{ parameters}} = \frac{SSE}{n - (k + 1)}$$

*The $\pm 2s$ approximation will improve as the sample size is increased. We will provide more precise methodology for the construction of prediction intervals in Section 10.6.
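The estimator of σ² is simple enough to compute directly. The sketch below applies the formula s² = SSE / [n - (k + 1)] to the SSE reported in Figure 10.2 (n = 20 properties, k = 3 independent variables) and then forms s and the rough ±2s prediction margin. The function name is hypothetical.

```python
import math


def estimate_error_variance(sse, n, k):
    """Return (s2, s) where s2 = SSE / (n - (k + 1))."""
    df = n - (k + 1)                 # n minus the number of estimated beta parameters
    if df <= 0:
        raise ValueError("need more observations than estimated parameters")
    s2 = sse / df
    return s2, math.sqrt(s2)


# SSE, n, and k for the sale price model of Example 10.1.
s2, s = estimate_error_variance(1_003_491_259.4, n=20, k=3)
margin = 2 * s                       # rough +/- 2s accuracy of predictions

print(round(s2, 1), round(s, 2), round(margin))
```

The results agree with the Mean Square for Error and Root MSE entries on the SAS printout, up to rounding.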


INFERENCES ABOUT THE β PARAMETERS

Inferences about the individual P parameters in a model are obtained using either a confidence interval or a test of hypothesis, as outlined in the following two boxes.*

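Test of an Individual Parameter Coefficient in the Multiple Regression Model

One-Tailed Test

$H_0$: $\beta_i = 0$
$H_a$: $\beta_i < 0$ [or $H_a$: $\beta_i > 0$]

Two-Tailed Test

$H_0$: $\beta_i = 0$
$H_a$: $\beta_i \neq 0$

Test statistic: $t = \dfrac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$

Rejection region (one-tailed): $t < -t_\alpha$ [or $t > t_\alpha$ when $H_a$: $\beta_i > 0$]
Rejection region (two-tailed): $|t| > t_{\alpha/2}$

where $t_\alpha$ and $t_{\alpha/2}$ are based on $n - (k + 1)$ degrees of freedom and

$n$ = Number of observations
$k + 1$ = Number of $\beta$ parameters in the model

Assumptions: See Section 10.3 for assumptions about the probability distribution for the random error component $\varepsilon$.

A $100(1 - \alpha)\%$ Confidence Interval for a $\beta$ Parameter

$$\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$$

where $t_{\alpha/2}$ is based on $n - (k + 1)$ degrees of freedom and

$n$ = Number of observations
$k + 1$ = Number of $\beta$ parameters in the model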
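As a quick check of how the test statistic is formed, the sketch below recomputes two of the t values shown in the SAS printout of Figure 10.2 (estimate divided by standard error) and applies a two-tailed rejection rule at α = .05. The critical value 2.120 is the standard t-table entry for α/2 = .025 with n - (k + 1) = 20 - 4 = 16 degrees of freedom; it is typed in as a constant rather than computed. Function and constant names are hypothetical.

```python
def t_statistic(estimate, std_error):
    """t = (sample estimate of beta_i) / (estimated standard error of the estimate)."""
    return estimate / std_error


def rejects_null(t, t_crit, alternative):
    """Apply the rejection region for H0: beta_i = 0."""
    if alternative == "greater":
        return t > t_crit
    if alternative == "less":
        return t < -t_crit
    if alternative == "two-sided":
        return abs(t) > t_crit
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


T_CRIT_025_DF16 = 2.120   # t-table value: alpha/2 = .025, 16 df

# Estimates and standard errors for X1 and X2 from Figure 10.2.
t_x1 = t_statistic(0.814490, 0.51221871)
t_x2 = t_statistic(0.820445, 0.21118494)

print(round(t_x1, 3), rejects_null(t_x1, T_CRIT_025_DF16, "two-sided"))
print(round(t_x2, 3), rejects_null(t_x2, T_CRIT_025_DF16, "two-sided"))
```

The decisions match the p-values on the printout: .1314 for X1 exceeds .05 (do not reject), while .0013 for X2 is below .05 (reject).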

We illustrate these methods with another example.
A collector of antique grandfather clocks knows that the price received for the clocks increases linearly with the age of the clocks. Moreover, the collector hypothesizes that the auction price of the clocks will increase linearly as the number of bidders increases. Thus, the following first-order model is hypothesized:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where

$y$ = Auction price
$x_1$ = Age of clock (years)
$x_2$ = Number of bidders

A sample of 32 auction prices of grandfather clocks, along with their age and the number of bidders, is given in Table 10.2. The model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ is fit to the data, and a portion of the MINITAB printout is shown in Figure 10.5.

a. Test the hypothesis that the mean auction price of a clock increases as the number of bidders increases when age is held constant, that is, test $H_a$: $\beta_2 > 0$. Use $\alpha = .05$.

b. Form a 90% confidence interval for $\beta_1$ and interpret the result.

TABLE 10.2 Auction Price Data

| Age, $x_1$ | Number of Bidders, $x_2$ | Auction Price, $y$ |
|---|---|---|
| 170 | 14 | \$2,131 |
| 182 | 8 | 1,550 |
| 162 | 11 | 1,884 |
| 184 | 10 | 2,041 |
| 143 | 6 | 845 |
| 159 | 9 | 1,483 |
| 108 | 14 | 1,055 |
| 175 | 8 | 1,545 |
| 108 | 6 | 729 |
| 179 | 9 | 1,792 |
| 111 | 15 | 1,175 |
| 187 | 8 | 1,593 |
| 111 | 7 | 785 |
| 115 | 7 | 744 |
| 194 | 5 | 1,356 |
| 168 | 7 | 1,262 |

*The formulas for computing $\hat{\beta}_i$ and its standard error are so complex, the only reasonable way to present them is by using matrix algebra. We do not assume a prerequisite of matrix algebra for this text and, in any case, we think the formulas can be omitted in an introductory course without serious loss. They are programmed into almost all statistical software packages with multiple regression routines and are presented in some of the texts listed in the references.

FIGURE 10.5 MINITAB printout for Example 10.3

The regression equation is
Y = -1339 + 12.7 X1 + 86.0 X2

| Predictor | Coef | StDev | t-ratio | P |
|---|---|---|---|---|
| Constant | -1339.0 | 173.8 | -7.70 | 0.000 |
| X1 | 12.7406 | 0.9047 | 14.08 | 0.000 |
| X2 | 85.953 | 8.729 | 9.85 | 0.000 |

Analysis of Variance

| SOURCE | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 4283063 | 2141532 | 120.19 | 0.000 |
| Error | 29 | 516727 | 17818 | | |
| Total | 31 | 4799789 | | | |

Solution

a. The hypotheses of interest concern the parameter $\beta_2$. Specifically,

$$H_0: \beta_2 = 0 \qquad H_a: \beta_2 > 0$$

The test statistic is a $t$ statistic formed by dividing the sample estimate $\hat{\beta}_2$ of the parameter $\beta_2$ by the estimated standard error of $\hat{\beta}_2$ (denoted $s_{\hat{\beta}_2}$). These estimates as well as the calculated $t$ value are shown on the MINITAB printout in the Coef, StDev, and t-ratio columns, respectively.

$$\text{Test statistic: } t = \frac{\hat{\beta}_2}{s_{\hat{\beta}_2}} = \frac{85.953}{8.729} = 9.85$$

The rejection region for the test is found in exactly the same way as the rejection regions for the $t$-tests in previous chapters. That is, we consult Table VI in Appendix B to obtain an upper-tail value of $t$. This is a value $t_\alpha$ such that $P(t > t_\alpha) = \alpha$. We can then use this value to construct rejection regions for either one-tailed or two-tailed tests. For $\alpha = .05$ and $n - (k + 1) = 32 - (2 + 1) = 29$ df, the critical $t$ value obtained from Table VI is $t_{.05} = 1.699$. Therefore,

Rejection region: $t > 1.699$ (see Figure 10.6)

FIGURE 10.6 Rejection region for $H_0$: $\beta_2 = 0$ vs. $H_a$: $\beta_2 > 0$

Since the test statistic value, $t = 9.85$, falls in the rejection region, we have sufficient evidence to reject $H_0$. Thus, the collector can conclude that the mean auction price of a clock increases as the number of bidders increases, when age is held constant. Note that the observed significance level of the test is also given on the printout. Since the printout reports a $p$-value of 0.000, virtually any choice of $\alpha$ will lead us to reject $H_0$.

b. A 90% confidence interval for $\beta_1$ is (from the box):

$$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1} = \hat{\beta}_1 \pm t_{.05}\, s_{\hat{\beta}_1}$$

Substituting $\hat{\beta}_1 = 12.74$, $s_{\hat{\beta}_1} = .905$ (both obtained from the MINITAB printout, Figure 10.5) and $t_{.05} = 1.699$ (from part a) into the equation, we obtain

$$12.74 \pm (1.699)(.905)$$

or (11.21, 14.27). Thus, we are 90% confident that $\beta_1$ falls between 11.21 and 14.27. Since $\beta_1$ is the slope of the line relating auction price ($y$) to age of the clock ($x_1$), we conclude that price increases between \$11.21 and \$14.27 for every 1-year increase in age, holding number of bidders ($x_2$) constant.
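The interval arithmetic can be reproduced in a few lines. The sketch below uses the unrounded Coef and StDev values for X1 from the MINITAB printout in Figure 10.5 together with the table value 1.699; the function name is hypothetical. Because it starts from the unrounded printout values, its endpoints may differ from the interval quoted in the text in the second decimal place.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# X1 row of Figure 10.5; 1.699 is the t-table value for .05 in the upper tail, 29 df.
lower, upper = confidence_interval(12.7406, 0.9047, 1.699)

print(round(lower, 2), round(upper, 2))
```

Either way, the whole interval lies well above 0, which agrees with the very small p-value reported for X1 on the printout.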


If you fail to reject $H_0$: $\beta_i = 0$, several conclusions are possible:

1. There is no relationship between $y$ and $x_i$.
2. A straight-line relationship between $y$ and $x_i$ exists (holding the other $x$'s in the model fixed), but a Type II error occurred.
3. A relationship between $y$ and $x_i$ (holding the other $x$'s in the model fixed) exists, but is more complex than a straight-line relationship (e.g., a curvilinear relationship may be appropriate).

The most you can say about a $\beta$ parameter test is that there is either sufficient (if you reject $H_0$: $\beta_i = 0$) or insufficient (if you do not reject $H_0$: $\beta_i = 0$) evidence of a linear (straight-line) relationship between $y$ and $x_i$.

The models presented so far utilized quantitative independent variables (e.g., home size, age of a clock, and number of bidders). Multiple regression models can include qualitative independent variables also. For example, suppose we want to develop a model for the mean operating cost per mile, $E(y)$, of cars as a function of the car manufacturer's country of origin. Further suppose that we are interested only in classifying the manufacturer's origin as "domestic" or "foreign." Then the manufacturer's origin is a single qualitative independent variable with two levels: domestic and foreign. Recall that with a qualitative variable, we cannot attach a meaningful quantitative measure to a given level. Consequently, we utilize a system of coding described below.

To simplify our notation, let $\mu_D$ be the mean cost per mile for cars manufactured domestically, and let $\mu_F$ be the corresponding mean cost per mile for those foreign-manufactured cars. Our objective is to write a single equation that will give the mean value of $y$ (cost per mile) for both domestic and foreign-made cars. This can be done as follows:

$$E(y) = \beta_0 + \beta_1 x$$

where

$x$ = 1 if the car is manufactured domestically
$x$ = 0 if the car is not manufactured domestically

The variable $x$ is not a meaningful independent variable as in the case of models with quantitative independent variables. Instead, it is a dummy (or indicator) variable that makes the model work. To see how, let $x = 0$. This condition will apply when we are seeking the mean cost of foreign-made cars. (If the car is not domestically produced, it must be foreign-made.) Then the mean cost per mile, $E(y)$, is

$$E(y) = \beta_0 + \beta_1(0) = \beta_0$$

This tells us that the mean cost per mile for foreign cars is $\beta_0$. Or, using our notation, it means that $\mu_F = \beta_0$. Now suppose we want to represent the mean cost per mile, $E(y)$, for cars manufactured domestically. Checking the dummy variable definition, we see that we should let $x = 1$:

$$E(y) = \beta_0 + \beta_1(1) = \beta_0 + \beta_1$$

or, since $\beta_0 = \mu_F$,

$$\mu_D = \mu_F + \beta_1$$

Then it follows that the interpretation of $\beta_1$ is

$$\beta_1 = \mu_D - \mu_F$$

which is the difference between the mean costs per mile for domestic and foreign cars. Consequently, a $t$-test of the null hypothesis $H_0$: $\beta_1 = 0$ is equivalent to testing $H_0$: $\mu_D - \mu_F = 0$. Rejecting $H_0$, then, implies that the mean costs per mile for domestic and foreign cars are different.

It is important to note that $\beta_0$ and $\beta_1$ in the dummy variable model above do not represent the y-intercept and slope, respectively, as in the simple linear regression model of Chapter 9. In general, when using the 1-0 system of coding* for a dummy variable, $\beta_0$ will represent the mean value of $y$ for the level of the qualitative variable assigned a value of 0 (called the base level) and $\beta_1$ will represent a difference between the mean values of $y$ for the two levels (with the mean of the base level always subtracted).†
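The dummy-variable interpretation can be verified with a small numerical sketch. Below, a straight line is fit by least squares to a 1-0 coded variable using the Chapter 9 formulas (slope = SSxy/SSxx, intercept = mean of y minus slope times mean of x). The cost-per-mile figures are hypothetical values invented for this illustration, as are the function and variable names.

```python
def fit_simple_line(x, y):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical operating costs (cents per mile), for illustration only.
domestic = [42.0, 45.0, 39.0, 44.0]
foreign = [47.0, 50.0, 46.0]

x = [1] * len(domestic) + [0] * len(foreign)   # 1 = domestic, 0 = foreign (base level)
y = domestic + foreign

b0, b1 = fit_simple_line(x, y)
mean_domestic = sum(domestic) / len(domestic)
mean_foreign = sum(foreign) / len(foreign)

print(b0, mean_foreign)                   # intercept vs. base-level sample mean
print(b1, mean_domestic - mean_foreign)   # coefficient vs. difference in sample means
```

The fitted intercept equals the sample mean of the base level (foreign), and the fitted coefficient of the dummy variable equals the difference between the two sample means, mirroring the population relationships derived above.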

Learning the Mechanics
10.1 Write a first-order model relating E(y) to: a. two quantitative independent variables b. four quantitative independent variables c. /fi-ve quantitative independent variables 10.2 SAS was used to fit the model E ( y ) = P, P,x, P,x, to n = 20 data points and the printout shown on page 573 was obtained. a. What are the sample estimates of P,,, P,, and P,? b. What is the least squares prediction equation? c. Find SSE, MSE, and s. Interpret the standard deviation in the context of the problem. d. Test H,,: PI = 0 against Ha: p, # 0. Use a = .05. e. Use a 95% confidence interval to estimate P,. 10.3 Suppose you fit the multiple regression model

c. The null hypothesis H,: P2 = 0 is not rejected. In
contrast, the null hypothesis H,,: P, = 0 is rejected. Explain how this can happen even though > 10.4 Suppose you fit the first-order multiple regression model

pz p3

+

+

to n

=

25 data points and obtain the prediction equamil
=

6.4

+ 3 . 1 + .92x, ~ ~

Y
to n
=

=

Po + Ptx1 + P 2 ~ + P3x3 + E 2
3.4 - 4 . 6 ~ ~ . 7 + ~ + 2 ~ .93x3

30 data points and obtain the following result:
=

jj

The estimated standard errors of and are 1.86 and .29, respectively. a. Test the null hypothesis H,: p, = 0 against the alternative hypothesis Hi,: P, # 0. Use a = .05. b. Test the null hypothesis H,,: P, = 0 against the alternative hypothesis Ha:/3, # 0. Use a = .05.

p,

3,

The estimated stand_ard deviations of the sampling distributions of Dl and p, are 2.3 and .27, respectively. a. Test H,,: p, = 0 against Ha:p, > 0. Use a = .05. b. Test H,,: p, = 0 against Ha:p, # 0. Use a = .05. c. Find a 90% confidence interval for PI. Interpret the interval. d. Find a 99% confidence interval for P,. Interpret the interval. 10.5 How is the number of degrees of freedom available for estimating u2(the variance of E) related to the number of independent variables in a regression model? 10.6 Consider the first-order model equation in three quantitative independent variables

a. Graph the relationship between y and x, for x2 = 1 and x, = 3.

*You do not have to use a 1-0 system of codmg for the dummy variables.Any two-value system wdl work, but the mterpretat~on glven to the model parameters w~ll depend on the code U m g the 1-0 system makes the model parameters easy to mterpret ?The system of codmg for a qualitative var~able more than two, say k , levels requlres that you at create k-1 dummy var~ables, one for each level except the base level. The mterpretation of P, is

PL = &vrl

r -

level

SAS Output for Exercise 10.2

Dep Variable: Y

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | 2 | 128329.27624 | 64164.63812 | 7.223 | 0.0054 |
| Error | 17 | 151015.72376 | 8883.27787 | | |
| C Total | 19 | 279345.00000 | | | |

| Root MSE | 94.25114 | R-Square | 0.4594 |
|---|---|---|---|
| Dep Mean | 360.50000 | Adj R-Sq | 0.3958 |
| C.V. | 26.14456 | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | 506.346067 | 45.16942487 | 11.210 | 0.0001 |
| X1 | 1 | -941.900226 | 275.08555975 | -3.424 | 0.0032 |
| X2 | 1 | -429.060418 | 379.82566485 | -1.130 | 0.2743 |

b. Repeat part a for $x_2 = -1$ and $x_3 = 1$.
c. How do the graphed lines in parts a and b relate to each other? What is the slope of each line?
d. If a linear model is first-order in three independent variables, what type of geometric relationship will you obtain when $E(y)$ is graphed as a function of one of the independent variables for various combinations of values of the other independent variables?

Applying the Concepts
10.7 Detailed interviews were conducted with over 1,000 street vendors in the city of Puebla, Mexico, in order to study the factors influencing vendors' incomes (World Development, Feb. 1998). Vendors were defined as individuals working in the street, and included vendors with carts and stands on wheels and excluded beggars, drug dealers, and prostitutes. The researchers collected data on gender, age, hours worked per day, annual earnings, and education level. A subset of these data appear in the table on page 574.
a. Write a first-order model for mean annual earnings, $E(y)$, as a function of age ($x_1$) and hours worked ($x_2$).
b. The model was fit to the data using STATISTIX. Find the least squares prediction equation on the printout shown below.
c. Interpret the estimated $\beta$ coefficients in your model.
d. Is age $x_1$ a statistically useful predictor of annual earnings? Test using $\alpha = .01$.

STATISTIX Output for Exercise 10.7

UNWEIGHTED LEAST SQUARES LINEAR REGRESSION OF EARNINGS

PREDICTOR VARIABLES, COEFFICIENT, STD ERROR, STUDENT'S T, P

| R-SQUARED | 0.5823 | RESID. MEAN SQUARE (MSE) | 300016 |
|---|---|---|---|
| ADJUSTED R-SQUARED | 0.5126 | STANDARD DEVIATION | 547.737 |

| SOURCE | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| REGRESSION | 2 | 5018232 | 2509116 | 8.36 | 0.0053 |
| RESIDUAL | 12 | 3600196 | 300016 | | |
| TOTAL | 14 | 8618428 | | | |

CASES INCLUDED 15    MISSING CASES 0


e. Construct a 99% confidence interval for $\beta_2$. Interpret the interval in the words of the problem.

| Vendor Number | Annual Earnings, $y$ | Age, $x_1$ | Hours Worked per Day, $x_2$ |
|---|---|---|---|

Source: Adapted from Smith, Paula A., and Metzger, Michael R., "The Return to Education: Street Vendors in Mexico." World Development, Vol. 26, No. 2, Feb. 1998, pp. 289-296.

10.8 Refer to the Chief Executive (Sept. 1999) study of chief executive officers (CEOs) from a variety of industries, Exercise 9.35 (p. 500). Recall that a CEO's pay ($y$) was modeled as a function of company performance ($x_1$), where performance was measured as a three-year annualized total return to shareholders assuming dividends are reinvested. For this exercise, consider a second independent variable, company sales ($x_2$). The data for all three variables are listed in the table below.
a. Construct a scattergram of total pay versus company sales. Does your scattergram suggest that company sales will help explain the variation in CEO pay? Explain.
b. The first-order model, $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, was fit to the data using EXCEL. Locate the least squares estimates of the $\beta$ coefficients on the printout on p. 575 and interpret their values.
c. Test $H_0: \beta_2 = 0$ versus $H_a: \beta_2 < 0$ using $\alpha = .05$. Report your findings in the words of the problem.
d. Locate a 95% confidence interval for $\beta_1$ on the printout and interpret it in the words of the problem.

10.9 Many variables influence the price of a company's common stock, including company-specific internal variables such as product quality and financial performance, and external market variables such as interest rates and stock market performance. The table on page 576 contains quarterly data on three such external variables ($x_1$, $x_2$, $x_3$) and the price $y$ of Ford Motor Company's common stock (adjusted for a stock split). The Japanese Yen Exchange Rate (the value of a U.S. dollar expressed in yen), $x_1$, measures the strength of the yen versus the U.S. dollar. The higher the rate, the cheaper are Japanese imports (such as the automobiles of Toyota, Nissan, Honda, and Subaru) to U.S. consumers. Similarly, the higher the deutsche mark exchange rate, $x_2$, the less expensive are BMWs and Mercedes-Benzes to U.S. consumers. The S&P 500 Index, $x_3$, is a general measure of the performance of the market for stocks in U.S. firms.
a. Fit the first-order model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ to the data. Report the least squares prediction equation.

CEO17.DAT

| Company | CEO | Total Pay, $y$ (\$ thousands) | Company Performance, $x_1$ | Company Sales, $x_2$ (\$ millions) |
|---|---|---|---|---|
| Cummins Engine | James A. Henderson | 4,338 | .8 | 6,266 |
| Bank of New York | Thomas A. Renyi | 7,121 | 52.5 | 63,579 |
| SunTrust Banks | L. Phillip Humann | 3,882 | 33.0 | 93,170 |
| Bear Stearns | James E. Cayne | 25,002 | 46.1 | 7,980 |
| Charles Schwab | Charles R. Schwab | 16,506 | 85.5 | 3,388 |
| Coca-Cola | M. Douglas Ivester | 12,712 | 22.9 | 18,813 |
| Time Warner | Gerald M. Levin | 25,136 | 49.6 | 14,582 |
| Humana | Gregory H. Wolf | 4,516 | -13.3 | 9,781 |
| Engelhard | Orin R. Smith | 6,189 | -1.8 | 4,172 |
| Chubb | Dean R. O'Hare | 4,052 | 12.3 | 6,337 |
| American Home Products | John R. Stafford | 8,046 | 35.5 | 13,463 |
| Merck | Raymond V. Gilmartin | 7,178 | 33.4 | 26,898 |
| Schering-Plough | Richard J. Kogan | 6,818 | 61.8 | 8,077 |
| Home Depot | Arthur M. Blank | 2,900 | 58.6 | 30,219 |
| Dell Computer | Michael S. Dell | 115,797 | 222.4 | 13,663 |
| BellSouth | F. Duane Ackerman | 18,134 | 35.8 | 23,123 |
| Delta Air Lines | Leo F. Mullin | 17,085 | 20.8 | 14,138 |

Source: Chief Executive, Sept. 1999, pp. 45-59.

EXCEL Output for Exercise 10.8

SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.9037971 |
| R Square | 0.8168492 |
| Adjusted R Square | 0.7906848 |
| Standard Error | 12136.8165 |
| Observations | 17 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 9197518606 | 4598759303 | 31.21987138 | 6.91292E-06 |
| Residual | 14 | 2062232399 | 147302314.2 | | |
| Total | 16 | 11259751005 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 397.819904 | 4758.206523 | 0.083607112 | 0.934552536 | -9807.527182 | 10603.167 |
| X Variable 1 | 451.741037 | 57.9979736 | 7.788910701 | 1.86693E-06 | 327.3476448 | 576.13443 |
| X Variable 2 | -0.175652631 | 0.129106576 | -1.360524294 | 0.195167091 | -0.452558944 | 0.10125368 |


FORDSTOCK.DAT (Data for Exercise 10.9)

| Date | Ford Motor Co. Common Stock, $y$ | Yen Exchange Rate, $x_1$ | Deutsche Mark Exchange Rate, $x_2$ | S&P 500, $x_3$ |
|---|---|---|---|---|

Sources: 1. International Financial Statistics, International Monetary Fund, Washington, D.C., 1998; 2. YahooFinance (www.yahoo.com).

b. Find the standard deviation of the regression model and interpret its value in the context of this problem.
c. Do the data provide sufficient evidence to conclude that the price of Ford stock decreases as the yen rate increases? Report the observed significance level and reach a conclusion using $\alpha = .05$.
d. Interpret the value of $\hat{\beta}_2$ in terms of these data. Remember that your interpretation must recognize the presence of the other variables in the model.

10.10 A disabled person's acceptance of a disability is critical to the rehabilitation process. The Journal of Rehabilitation (Sept. 1989) published a study that investigated the relationship between assertive behavior level and acceptance of disability in 160 disabled adults. The dependent variable, assertiveness ($y$), was measured using the Adult Self Expression Scale (ASES). Scores on the ASES range from 0 (no assertiveness) to 192 (extreme assertiveness). The model analyzed was $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, where

$x_1$ = Acceptance of disability (AD) score
$x_2$ = Age (years)
$x_3$ = Length of disability (years)

The regression results are shown in the table.

| Independent Variable | $t$ | Two-Tailed p-Value |
|---|---|---|
| AD score ($x_1$) | 5.96 | .0001 |
| Age ($x_2$) | 0.01 | .9620 |
| Length ($x_3$) | 1.91 | .0576 |

a. Is there sufficient evidence to indicate that AD score is positively linearly related to assertiveness level, once age and length of disability are accounted for? Test using $\alpha = .05$.
b. Test the hypothesis $H_0: \beta_2 = 0$ against $H_a: \beta_2 \ne 0$. Use $\alpha = .05$. Give the conclusion in the words of the problem.


| Feeding Trial | Diet | Weight Change (%) | Digestion Efficiency (%) | Acid-Detergent Fiber (%) |
|---|---|---|---|---|

Diet column: Plants (33 trials), Duck Chow (9 trials).

Source: Gadallah, F. L. and Jefferies, R. L. "Forage quality in brood rearing areas of the lesser snow goose and the growth of captive goslings." Journal of Applied Biology, Vol. 32, No. 2, 1995, pp. 281-282 (adapted from Figures 2 and 3).

c. Test the hypothesis $H_0: \beta_3 = 0$ against $H_a: \beta_3 > 0$. Use $\alpha = .05$. Give the conclusion in the words of the problem.

10.11 Refer to the Journal of Applied Ecology (Vol. 32, 1995) study of the feeding habits of baby snow geese, Exercise 9.56 (p. 514). The data on gosling weight change, digestion efficiency, acid-detergent fiber (all measured as percentages) and diet (plants or duck chow) for 42 feeding trials are reproduced in the table above. The botanists were interested in predicting weight change ($y$) as a function of the other variables. The first-order model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where $x_1$ is digestion efficiency and $x_2$ is acid-detergent fiber, was fit to the data. The MINITAB printout is provided on page 578.
a. Find the least squares prediction equation for weight change, $y$.
b. Interpret the $\beta$-estimates in the equation, part a.


MINITAB Output for Exercise 10.11

The regression equation is
wtchnge = 12.2 - 0.0265 digest - 0.458 acid

| Predictor | Coef | StDev | T | P |
|---|---|---|---|---|
| Constant | 12.180 | 4.402 | 2.77 | 0.009 |
| digest | -0.02654 | 0.05349 | -0.50 | 0.623 |
| acid | -0.4578 | 0.1283 | -3.57 | 0.001 |

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 542.03 | 271.02 | 21.88 | 0.000 |
| Error | 39 | 483.08 | 12.39 | | |
| Total | 41 | 1025.12 | | | |

c. Conduct a test to determine if digestion efficiency, $x_1$, is a useful linear predictor of weight change. Use $\alpha = .01$.
d. Form a 99% confidence interval for $\beta_2$. Interpret the result.
e. Explain how to include the qualitative variable diet into the model.

10.12 Empirical research was conducted to investigate the variables that impact the size distribution of manufacturing firms in international markets (World Development, Vol. 20, 1992). Data collected on $n = 54$ countries were used to model the country's size distribution $y$, measured as the share of manufacturing firms in the country with 100 or more workers. The model studied was $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$, where

$x_1$ = natural logarithm of Gross National Product (LGNP)
$x_2$ = geographic area per capita (in thousands of square meters) (AREAC)
$x_3$ = share of heavy industry in manufacturing value added (SVA)
$x_4$ = ratio of credit claims on the private sector to Gross Domestic Product (CREDIT)
$x_5$ = ratio of stock equity shares to Gross Domestic Product (STOCK)

a. The researchers hypothesized that the higher the credit ratio of a country, the smaller the size distribution of manufacturing firms. Explain how to test this hypothesis.
b. The researchers hypothesized that the higher the stock ratio of a country, the larger the size distribution of manufacturing firms. Explain how to test this hypothesis.

10.13 Location is one of the most important decisions for hotel chains and lodging firms. A hotel chain that can select good sites more accurately and quickly than its competition has a distinct competitive advantage. Researchers S. E. Kimes (Cornell University) and J. A. Fitzsimmons (University of Texas) studied the site selection process of La Quinta Motor Inns, a moderately priced hotel chain (Interfaces, Mar.-Apr. 1990). Using data collected on 57 mature inns owned by La Quinta, the researchers built a regression model designed to predict the profitability for sites under construction. The least squares model is given below:

where

$y$ = operating margin (measured as a percentage) = (profit + interest expenses + depreciation) / total revenue
$x_1$ = state population (in thousands) divided by the total number of inns in the state
$x_2$ = room rate (\$) for the inn
$x_3$ = square root of the median income of the area (in \$ thousands)
$x_4$ = number of college students within four miles of the inn

All variables were "standardized" to have a mean of 0 and a standard deviation of 1. Interpret the $\beta$ estimates of the model. Comment on the effect of each independent variable on operating margin, $y$. [Note: A profitable inn is defined as one with an operating margin of over 50%.]

| Experiment Number | Voltage, $y$ (kw/cm) | Disperse Phase Volume, $x_1$ (%) | Salinity, $x_2$ (%) | Temperature, $x_3$ (°C) | Time Delay, $x_4$ (hours) | Surfactant Concentration, $x_5$ (%) | Span:Triton, $x_6$ | Solid Particles, $x_7$ (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | .64 | 40 | 1 | 4 | .25 | 2 | .25 | .5 |
| 2 | 30 | 80 | 1 | 4 | .25 | 4 | .25 | 2 |
| 3 | 3.20 | 40 | 4 | 4 | .25 | 4 | .75 | .5 |
| 4 | .48 | 80 | 4 | 4 | .25 | 2 | .75 | 2 |
| 5 | 1.72 | 40 | 1 | 23 | .25 | 4 | .75 | 2 |
| 6 | .32 | 80 | 1 | 23 | .25 | 2 | .75 | .5 |
| 7 | .64 | 40 | 4 | 23 | .25 | 2 | .25 | 2 |
| 8 | .68 | 80 | 4 | 23 | .25 | 4 | .25 | .5 |
| 9 | .12 | 40 | 1 | 4 | 24 | 2 | .75 | 2 |
| 10 | .88 | 80 | 1 | 4 | 24 | 4 | .75 | .5 |
| 11 | 2.32 | 40 | 4 | 4 | 24 | 4 | .25 | 2 |
| 12 | .40 | 80 | 4 | 4 | 24 | 2 | .25 | .5 |
| 13 | 1.04 | 40 | 1 | 23 | 24 | 4 | .25 | .5 |
| 14 | .12 | 80 | 1 | 23 | 24 | 2 | .25 | 2 |
| 15 | 1.28 | 40 | 4 | 23 | 24 | 2 | .75 | .5 |
| 16 | .72 | 80 | 4 | 23 | 24 | 4 | .75 | 2 |
| 17 | 1.08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | 1.08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 1.04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Source: Førdedal, H., et al. "A multivariate analysis of W/O emulsions in high external electric fields as studied by means of dielectric time domain spectroscopy." Journal of Colloid and Interface Science, Vol. 173, No. 2, Aug. 1995, p. 398 (Table 2).

10.14 In the oil industry, water that mixes with crude oil during production and transportation must be

removed. Chemists have found that the oil can be extracted from the water/oil mix electrically. Researchers at the University of Bergen (Norway) conducted a series of experiments to study the factors that influence the voltage ($y$) required to separate the water from the oil (Journal of Colloid and Interface Science, Aug. 1995). The seven independent variables investigated in the study are listed in the table above. (Each variable was measured at two levels: a "low" level and a "high" level.) Sixteen water/oil mixtures were prepared using different combinations of the independent variables; then each emulsion was exposed to a high electric field. In addition, three mixtures were tested when all independent variables were set to 0. The data for all 19 experiments are also given in the table.
a. Propose a first-order model for $y$ as a function of all seven independent variables.
b. Use a statistical software package to fit the model to the data in the table.
c. Fully interpret the $\beta$ estimates.

10.15 The owner of an apartment building in Minneapolis believed that her property tax bill was too high because of an overassessment of the property's value by the city tax assessor. The owner hired an independent real estate appraiser to investigate the appropriateness of the city's assessment. The appraiser used regression analysis to explore the relationship between the sale prices of apartment buildings sold in Minneapolis and various characteristics of the properties. Twenty-five apartment buildings were randomly sampled from all apartment buildings that were sold during a recent year. The table on page 580 lists the data collected by the appraiser. The real estate appraiser hypothesized that the sale price (that is, market value) of an apartment building is related to the other variables in the table according to the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon$.
a. Fit the real estate appraiser's model to the data in the table. Report the least squares prediction equation.
b. Find the standard deviation of the regression model and interpret its value in the context of this problem.
c. Do the data provide sufficient evidence to conclude that value increases with the number of units in an apartment building? Report the observed significance level and reach a conclusion using $\alpha = .05$.
d. Interpret the value of $\hat{\beta}_1$ in terms of these data. Remember that your interpretation must recognize the presence of the other variables in the model.
e. Construct a scattergram of sale price versus age. What does your scattergram suggest about the relationship between these variables?
f. Test $H_0: \beta_2 = 0$ against $H_a: \beta_2 < 0$ using $\alpha = .01$. Interpret the result in the context of the problem. Does the result agree with your observation in part e? Why is it reasonable to conduct a one-tailed rather than a two-tailed test of this null hypothesis?
g. What is the observed significance level of the hypothesis test of part f?


MNSALES.DAT (Data for Exercise 10.15)

| Code No. | Sale Price, $y$ (\$) | No. of Apartments, $x_1$ | Age of Structure, $x_2$ (years) | Lot Size, $x_3$ (sq. ft) | No. of On-Site Parking Spaces, $x_4$ | Gross Building Area, $x_5$ (sq. ft) |
|---|---|---|---|---|---|---|
| 0229 | 90,300 | 4 | 82 | 4,635 | 0 | 4,266 |
| 0094 | 384,000 | 20 | 13 | 17,798 | 0 | 14,391 |
| 0043 | 157,500 | 5 | 66 | 5,913 | 0 | 6,615 |
| 0079 | 676,200 | 26 | 64 | 7,750 | 6 | 34,144 |
| 0134 | 165,000 | 5 | 55 | 5,150 | 0 | 6,120 |
| 0179 | 300,000 | 10 | 65 | 12,506 | 0 | 14,552 |
| 0087 | 108,750 | 4 | 82 | 7,160 | 0 | 3,040 |
| 0120 | 276,538 | 11 | 23 | 5,120 | 0 | 7,881 |
| 0246 | 420,000 | 20 | 18 | 11,745 | 20 | 12,600 |
| 0025 | 950,000 | 62 | 71 | 21,000 | 3 | 39,448 |
| 0015 | 560,000 | 26 | 74 | 11,221 | 0 | 30,000 |
| 0131 | 268,000 | 13 | 56 | 7,818 | 13 | 8,088 |
| 0172 | 290,000 | 9 | 76 | 4,900 | 0 | 11,315 |
| 0095 | 173,200 | 6 | 21 | 5,424 | 6 | 4,461 |
| 0121 | 323,650 | 11 | 24 | 11,834 | 8 | 9,000 |
| 0077 | 162,500 | 5 | 19 | 5,246 | 5 | 3,828 |
| 0060 | 353,500 | 20 | 62 | 11,223 | 2 | 13,680 |
| 0174 | 134,400 | 4 | 70 | 5,834 | 0 | 4,680 |
| 0084 | 187,000 | 8 | 19 | 9,075 | 0 | 7,392 |
| 0031 | 155,700 | 4 | 57 | 5,280 | 0 | 6,030 |
| 0019 | 93,600 | 4 | 82 | 6,864 | 0 | 3,840 |
| 0074 | 110,000 | 4 | 50 | 4,510 | 0 | 3,092 |
| 0057 | 573,200 | 14 | 10 | 11,192 | 0 | 23,704 |
| 0104 | 79,300 | 4 | 82 | 7,425 | 0 | 3,876 |
| 0024 | 272,000 | 5 | 82 | 7,500 | 0 | 9,542 |

Source: Robinson Appraisal Co., Inc., Mankato, Minnesota

CHECKING THE OVERALL UTILITY OF A MODEL
Conducting t-tests on each $\beta$ parameter in a model is not the best way to determine whether the overall model is contributing information for the prediction of $y$. If we were to conduct a series of t-tests to determine whether the independent variables are contributing to the predictive relationship, we would be very likely to make one or more errors in deciding which terms to retain in the model and which to exclude. For example, suppose you fit a first-order model in 10 quantitative $x$ variables and decide to conduct t-tests on all 10 of the individual $\beta$'s in the model, each at $\alpha = .05$. Even if all the $\beta$ parameters (except $\beta_0$) are equal to 0, approximately 40% of the time you will incorrectly reject the null hypothesis at least once and conclude that some $\beta$ parameter differs from 0.* Thus, in multiple regression models for which a large number of independent variables are being considered, conducting a series of t-tests may include a large number of insignificant variables and exclude some useful ones. To test the utility of a multiple regression model, we need a global test (one that encompasses all the $\beta$ parameters). We would also like to find some statistical quantity that measures how well the model fits the data.
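*The proof of this result proceeds as follows:

$$P(\text{Reject } H_0 \text{ at least once} \mid \beta_1 = \beta_2 = \cdots = \beta_{10} = 0) = 1 - P(\text{Reject } H_0 \text{ no times} \mid \beta_1 = \beta_2 = \cdots = \beta_{10} = 0)$$

If the 10 tests are treated as independent, each with $\alpha = .05$, the right-hand side equals $1 - (.95)^{10} \approx .401$.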
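The 40% figure can be checked numerically. The short Python sketch below is an illustration only: it assumes the 10 t-tests are independent (in practice the tests in a regression are usually correlated, so the figure is approximate), and the function name is hypothetical.

```python
def prob_at_least_one_false_rejection(alpha: float, num_tests: int) -> float:
    """Chance of at least one Type I error across num_tests tests.

    Assumes every null hypothesis is true and the tests are independent
    (an assumption made here for illustration).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Ten t-tests, each conducted at alpha = .05, as in the text.
rate = prob_at_least_one_false_rejection(alpha=0.05, num_tests=10)
print(f"P(at least one false rejection) = {rate:.3f}")
```

Changing `num_tests` shows how quickly the overall error rate grows as more individual tests are conducted.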


We commence with the easier problem: finding a measure of how well a linear model fits a set of data. For this we use the multiple regression equivalent of $r^2$, the coefficient of determination for the straight-line model (Chapter 9), as shown in the box.

DEFINITION 10 The multiple coefficient of determination, $R^2$, is defined as

$$R^2 = 1 - \frac{SSE}{SS_{yy}} = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{\text{Explained variability}}{\text{Total variability}}$$

Just as for the simple linear model, $R^2$ represents the fraction of the sample variation of the $y$ values (measured by $SS_{yy}$) that is explained by the least squares prediction equation. Thus, $R^2 = 0$ implies a complete lack of fit of the model to the data and $R^2 = 1$ implies a perfect fit with the model passing through every data point. In general, the larger the value of $R^2$, the better the model fits the data. To illustrate, the value $R^2 = .8974$ for the sale price model of Example 10.1 is indicated in Figure 10.7. This high value of $R^2$ implies that using the independent variables land value, appraised improvements, and home size in a first-order model explains 89.7% of the total sample variation (measured by $SS_{yy}$) of sale price $y$. Thus, $R^2$ is a sample statistic that tells how well the model fits the data and thereby represents a measure of the usefulness of the entire model.
FIGURE 10.7 SAS printout for sale price model

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | 3 | 8779676740.6 | 2926558913.5 | 46.662 | 0.0001 |
| Error | 16 | 1003491259.4 | 62718203.714 | | |
| C Total | 19 | 9783168000.0 | | | |

| Root MSE | 7919.48254 | R-Square | 0.8974 |
|---|---|---|---|
| Dep Mean | 56660.00000 | Adj R-Sq | 0.8782 |
| C.V. | 13.97720 | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | 1470.275919 | 5746.3245832 | 0.256 | 0.8013 |

A large value of $R^2$ computed from the sample data does not necessarily mean that the model provides a good fit to all of the data points in the population. For example, a first-order linear model that contains three parameters will provide a perfect fit to a sample of three data points and $R^2$ will equal 1. Likewise, you will always obtain a perfect fit ($R^2 = 1$) to a set of $n$ data points if the model contains exactly $n$ parameters. Consequently, if you want to use the value of $R^2$ as a measure of how useful the model will be for predicting $y$, it should be based on a sample that contains substantially more data points than the number of parameters in the model.
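The perfect-fit phenomenon is easy to reproduce. The following Python sketch uses three made-up data points (not from the text) and a first-order model with three parameters; with as many parameters as points, the least squares fit coincides with the exact solution of a 3-by-3 linear system, solved here by Cramer's rule. All names and values are illustrative assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a * x = b by Cramer's rule."""
    d = det3(a)
    solution = []
    for j in range(3):
        aj = [row[:] for row in a]
        for i in range(3):
            aj[i][j] = b[i]  # replace column j with the right-hand side
        solution.append(det3(aj) / d)
    return solution


# Three hypothetical observations on (x1, x2, y).
x1 = [1.0, 2.0, 4.0]
x2 = [3.0, 1.0, 2.0]
y = [10.0, 7.0, 15.0]

# Model: y = b0 + b1*x1 + b2*x2 (three parameters, three data points).
design = [[1.0, u, v] for u, v in zip(x1, x2)]
b0, b1, b2 = solve3(design, y)

fitted = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]
y_bar = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - sse / ss_yy
print(f"SSE = {sse:.6f}, R^2 = {r_squared:.6f}")
```

Any three points whose design matrix is nonsingular behave the same way, which is why $R^2$ from such a small sample says nothing about predictive usefulness.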


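As an alternative to using $R^2$ as a measure of model adequacy, the adjusted multiple coefficient of determination, denoted $R_a^2$, is often reported. The formula for $R_a^2$ is shown in the box.

The adjusted multiple coefficient of determination is given by

$$R_a^2 = 1 - \left[\frac{n - 1}{n - (k + 1)}\right]\left(\frac{SSE}{SS_{yy}}\right) = 1 - \left[\frac{n - 1}{n - (k + 1)}\right](1 - R^2)$$

Note: $R_a^2 \le R^2$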
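As a quick numerical check, the sketch below recomputes both measures from the sums of squares on the SAS printout in Figure 10.7. It assumes the usual definition $R_a^2 = 1 - \frac{n-1}{n-(k+1)}(1 - R^2)$; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k independent variables fit to n points."""
    return 1.0 - (n - 1) / (n - (k + 1)) * (1.0 - r_squared)


# Sums of squares read from Figure 10.7 (sale price model, n = 20, k = 3).
ss_model = 8779676740.6
ss_total = 9783168000.0

r2 = ss_model / ss_total
r2_adj = adjusted_r_squared(r2, n=20, k=3)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

The two values agree with the R-Square and Adj R-Sq entries reported on the printout.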

$R^2$ and $R_a^2$ have similar interpretations. However, unlike $R^2$, $R_a^2$ takes into account ("adjusts" for) both the sample size $n$ and the number of $\beta$ parameters in the model. $R_a^2$ is never larger than $R^2$ and, more importantly, cannot be "forced" to 1 by simply adding more and more independent variables to the model. Consequently, analysts prefer the more conservative $R_a^2$ when choosing a measure of model adequacy. In Figure 10.7, $R_a^2$ is shown directly below the value of $R^2$. Note that $R_a^2 = .8782$, a value only slightly smaller than $R^2$.

Despite their utility, $R^2$ and $R_a^2$ are only sample statistics. Therefore, it is dangerous to judge the global usefulness of the model based solely on these values. A better method is to conduct a test of hypothesis involving all the $\beta$ parameters (except $\beta_0$) in a model. In particular, for the sale price model (Example 10.1), we would test

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: At least one of the coefficients is nonzero

The test statistic used to test this hypothesis is an F statistic, and several equivalent versions of the formula can be used (although we will usually rely on the computer to calculate the F statistic):

$$\text{Test statistic: } F = \frac{(SS_{yy} - SSE)/k}{SSE/[n - (k + 1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k + 1)]}$$

Both these formulas indicate that the F statistic is the ratio of the explained variability divided by the model degrees of freedom to the unexplained variability divided by the error degrees of freedom. Thus, the larger the proportion of the total variability accounted for by the model, the larger the F statistic.
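To see that the two versions of the formula agree, the following Python sketch evaluates both using the sums of squares from the sale price printout (Figure 10.7). The function names are illustrative, not part of any statistical package.

```python
def f_from_sums_of_squares(ss_yy: float, sse: float, n: int, k: int) -> float:
    """F = [(SS_yy - SSE)/k] / [SSE/(n - (k + 1))]."""
    return ((ss_yy - sse) / k) / (sse / (n - (k + 1)))


def f_from_r_squared(r_squared: float, n: int, k: int) -> float:
    """F = [R^2/k] / [(1 - R^2)/(n - (k + 1))]."""
    return (r_squared / k) / ((1.0 - r_squared) / (n - (k + 1)))


# Values read from Figure 10.7: C Total and Error sums of squares.
ss_yy = 9783168000.0
sse = 1003491259.4
n, k = 20, 3

f1 = f_from_sums_of_squares(ss_yy, sse, n, k)
f2 = f_from_r_squared(1.0 - sse / ss_yy, n, k)
print(f"F (sums of squares) = {f1:.3f}, F (R^2) = {f2:.3f}")
```

Both computations reproduce the F Value shown on the printout, up to rounding.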


To determine when the ratio becomes large enough that we can confidently reject the null hypothesis and conclude that the model is more useful than no model at all for predicting $y$, we compare the calculated F statistic to a tabulated F value with $k$ df in the numerator and $[n - (k + 1)]$ df in the denominator. Recall that tabulations of the F-distribution for various values of $\alpha$ are given in Tables VIII, IX, X, and XI of Appendix B.

Rejection region: $F > F_\alpha$, where $F$ is based on $k$ numerator and $n - (k + 1)$ denominator degrees of freedom.

For the sale price example [$n = 20$, $k = 3$, $n - (k + 1) = 16$, and $\alpha = .05$], we will reject $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ if

$$F > F_{.05} = 3.24$$

From the SAS printout (Figure 10.7), we find that the computed F value is 46.66. Since this value greatly exceeds the tabulated value of 3.24, we conclude that at least one of the model coefficients $\beta_1$, $\beta_2$, and $\beta_3$ is nonzero. Therefore, this global F-test indicates that the first-order model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ is useful for predicting sale price.

Like SAS, most other regression packages give the F value in a portion of the printout called the "Analysis of Variance." This is an appropriate descriptive term, since the F statistic relates the explained and unexplained portions of the total variance of $y$. For example, the elements of the SAS printout in Figure 10.7 that lead to the calculation of the F value are:

$$\text{F Value} = \frac{\text{Sum of Squares (Model)}/\text{df (Model)}}{\text{Sum of Squares (Error)}/\text{df (Error)}} = \frac{\text{Mean Square (Model)}}{\text{Mean Square (Error)}}$$

Note, too, that the observed significance level for the F statistic is given under the heading Prob > F as .0001, which means that we would reject the null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ at any $\alpha$ value greater than .0001. The analysis of variance F-test for testing the usefulness of the model is summarized in the next box.

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ (All model terms are unimportant for predicting $y$)
$H_a$: At least one $\beta_i \ne 0$ (At least one model term is useful for predicting $y$)

$$\text{Test statistic: } F = \frac{(SS_{yy} - SSE)/k}{SSE/[n - (k + 1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k + 1)]}$$


Assumptions: The standard regression assumptions about the random error component.

A rejection of the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ leads to the conclusion [with $100(1 - \alpha)\%$ confidence] that the model is statistically useful. However, statistically "useful" does not necessarily mean "best." Another model may prove even more useful in terms of providing more reliable estimates and predictions. This global F-test is usually regarded as a test that the model must pass to merit further consideration.


Refer to Example 10.3, in which an antique collector modeled the auction price $y$ of grandfather clocks as a function of the age of the clock, $x_1$, and the number of bidders, $x_2$. The hypothesized first-order model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

A sample of 32 observations is obtained, with the results summarized in the MINITAB printout repeated in Figure 10.8.

FIGURE 10.8 MINITAB printout for Example 10.4

The regression equation is
Y = -1339 + 12.7 X1 + 86.0 X2

| Predictor | Coef | StDev | t-ratio | P |
|---|---|---|---|---|
| Constant | -1339.0 | 173.8 | -7.70 | 0.000 |
| X1 | 12.7406 | 0.9047 | 14.08 | 0.000 |
| X2 | 85.953 | 8.729 | 9.85 | 0.000 |

Analysis of Variance

| SOURCE | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 4283063 | 2141532 | 120.19 | 0.000 |
| Error | 29 | 516727 | 17818 | | |
| Total | 31 | 4799789 | | | |

a. Find and interpret the adjusted coefficient of determination $R_a^2$ for this example.
b. Conduct the global F-test of model usefulness at the $\alpha = .05$ level of significance.

Solution

a. The $R_a^2$ value (highlighted in Figure 10.8) is .885. This implies that the least squares model has explained about 88.5% of the total sample variation in $y$ values (auction prices), after adjusting for sample size and number of independent variables in the model.
b. The elements of the global test of the model follow:

$H_0: \beta_1 = \beta_2 = 0$ (Note: $k = 2$)
$H_a$: At least one of the two model coefficients is nonzero
Test statistic: $F = 120.19$ (highlighted in Figure 10.8)
p-value: .000
Conclusion: Since $\alpha = .05$ exceeds the observed significance level, $p = .000$, the data provide strong evidence that at least one of the model coefficients is nonzero. The overall model appears to be statistically useful for predicting auction prices.
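The quantities used in this solution can be rebuilt from the two sums of squares on the printout. The Python sketch below is one way to do so; the helper `anova_summary` is a hypothetical name, and the p-value is not computed because that would require an F-distribution routine from a statistics library.

```python
def anova_summary(ss_model: float, ss_error: float, n: int, k: int) -> dict:
    """Assemble the analysis of variance quantities for a k-variable model."""
    df_model = k
    df_error = n - (k + 1)
    ss_total = ss_model + ss_error
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    r2 = ss_model / ss_total
    return {
        "df_model": df_model,
        "df_error": df_error,
        "ms_model": ms_model,
        "ms_error": ms_error,
        "F": ms_model / ms_error,
        "R2": r2,
        "R2_adj": 1.0 - (n - 1) / df_error * (1.0 - r2),
    }


# Regression and Error sums of squares from Figure 10.8 (n = 32, k = 2).
summary = anova_summary(ss_model=4283063.0, ss_error=516727.0, n=32, k=2)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The computed F and adjusted $R^2$ match the printout values of 120.19 and .885 to the precision shown there.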

Can we be sure that the best prediction model has been found if the global F-test indicates that a model is useful? Unfortunately, we cannot. The addition of other independent variables may improve the usefulness of the model. (See the box, pp. 583-584.)

To summarize the discussion in this section, both $R^2$ and $R_a^2$ are indicators of how well the prediction equation fits the data. Intuitive evaluations of the contribution of the model based on $R^2$ must be examined with care. Unlike $R_a^2$, the value of $R^2$ increases as more and more variables are added to the model. Consequently, you could force $R^2$ to take a value very close to 1 even though the model contributes no information for the prediction of $y$. In fact, $R^2$ equals 1 when the number of terms in the model (including $\beta_0$) equals the number of data points. Therefore, you should not rely solely on the value of $R^2$ (or even $R_a^2$) to tell you whether the model is useful for predicting $y$. Use the F-test for testing the global utility of the model.

After we have determined that the overall model is useful for predicting $y$ using the F-test, we may elect to conduct one or more t-tests on the individual $\beta$ parameters (see Section 10.4). However, the test (or tests) to be conducted should be decided a priori, that is, prior to fitting the model. Also, we should limit the number of t-tests conducted to avoid the potential problem of making too many Type I errors. Generally, the regression analyst will conduct t-tests only on the "most important" $\beta$'s.

Recommendation for Checking the Utility of a Multiple Regression Model

1. First, conduct a test of overall model adequacy using the F-test, that is, test $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$. If the model is deemed adequate (that is, if you reject $H_0$), then proceed to step 2. Otherwise, you should hypothesize and fit another model. The new model may include more independent variables or higher-order terms.
2. Conduct t-tests on those $\beta$ parameters in which you are particularly interested (that is, the "most important" $\beta$'s). These usually involve only the $\beta$'s associated with higher-order terms ($x_1^2$, $x_1 x_2$, etc.). However, it is a safe practice to limit the number of $\beta$'s that are tested. Conducting a series of t-tests leads to a high overall Type I error rate.


Learning the Mechanics

10.16 Suppose you fit the first-order model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \varepsilon$$

to $n = 30$ data points and obtain

SSE = .33, $R^2$ = .92

a. Do the values of SSE and $R^2$ suggest that the model provides a good fit to the data? Explain.
b. Is the model of any use in predicting $y$? Test the null hypothesis $H_0$: $\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ against the alternative hypothesis $H_a$: At least one of the parameters $\beta_1, \beta_2, \ldots, \beta_5$ is nonzero. Use $\alpha = .05$.

10.17 The first-order model $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon$ was fit to $n = 19$ data points with the results shown in the SAS printout provided below.
a. Find $R^2$ and interpret its value.
b. Find $R_a^2$ and interpret its value.
c. Test the null hypothesis that $\beta_1 = \beta_2 = 0$ against the alternative hypothesis that at least one of $\beta_1$ and $\beta_2$ is nonzero. Calculate the test statistic using the two formulas given in this section, and compare your results to each other and to that given on the printout. Use $\alpha = .05$ and interpret the result of your test.
d. Find the observed significance level for this test on the printout and interpret it.

SAS Output for Exercise 10.17

Dep Variable: Y

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | 2 | ... | ... | 65.478 | 0.0001 |
| Error | 16 | ... | ... | | |
| C Total | 18 | ... | | | |

Root MSE 0.43008, Dep Mean 3.56053, C.V. 12.07921

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | 0.734606 | 0.29313351 | 2.506 | 0.0234 |
| X1 | 1 | 0.765179 | 0.08754136 | 8.741 | 0.0001 |
| X2 | 1 | -0.030810 | 0.00452890 | -6.803 | 0.0001 |

10.18 If the analysis of variance F-test leads to the conclusion that at least one of the model parameters is nonzero, can you conclude that the model is the best predictor for the dependent variable $y$? Can you conclude that all of the terms in the model are important for predicting $y$? What is the appropriate conclusion?

10.19 Suppose you fit the first-order model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon$$

to $n = 20$ data points and obtain [values missing in the source].

a. Construct an analysis of variance table for this regression analysis, using the same format as the printout in Exercise 10.17. Be sure to include the sources of variability, the degrees of freedom, the sums of squares, the mean squares, and the F statistic. Calculate $R^2$ and $R_a^2$ for the regression analysis.
b. Test the null hypothesis that $\beta_1 = \beta_2 = 0$ against the alternative hypothesis that at least one of the parameters differs from 0. Calculate the test statistic in two different ways and compare the results. Use $\alpha = .05$ to reach a conclusion about whether the model contributes information for the prediction of $y$.

Applying the Concepts

10.20 Refer to the World Development (Feb. 1998) study of street vendors in the city of Puebla, Mexico, Exercise 10.7 (p. 573). Recall that the vendors' mean annual earnings $E(y)$ was modeled as a first-order function of age $x_1$ and hours worked $x_2$. Refer to the STATISTIX printout on p. 573 and answer the following:
a. Interpret the value of $R^2$.
b. Interpret the value of $R_a^2$. Explain the relationship between $R^2$ and $R_a^2$.


c. Conduct a test of the global utility of the model at $\alpha = .01$. Interpret the result.

10.21 Refer to the Chief Executive (Sept. 1999) study of CEOs, Exercise 10.8 (p. 574). Recall that a CEO's pay $y$ was modeled as a function of company performance $x_1$ and company sales $x_2$ using a first-order model. Refer to the EXCEL printout on p. 575 and answer the following:
a. Find and interpret the value of the multiple coefficient of determination.
b. Give the null and alternative hypotheses for testing whether the overall model is statistically useful for predicting a CEO's pay.
c. Give the value of the test statistic and corresponding p-value for the test of part b.
d. Conduct the test of part b using $\alpha = .05$. What is your conclusion?

10.22 The Journal of Quantitative Criminology (Vol. 8, 1992) published a paper on the determinants of area property crime levels in the United Kingdom. Several multiple regression models for property crime prevalence, $y$, measured as the percentage of residents in a geographical area who were victims of at least one property crime, were examined. The results for one of the models, based on a sample of $n = 313$ responses collected for the British Crime Survey, are shown in the table below. [Note: All variables except Density are expressed as a percentage of the base area.]
a. Test the hypothesis that the density ($x_1$) of a region is positively linearly related to crime prevalence ($y$), holding the other independent variables constant.
b. Do you advise conducting t-tests on each of the 18 independent variables in the model to determine which variables are important predictors of crime prevalence? Explain.
c. The model yielded $R^2 = .411$. Use this information to conduct a test of the global utility of the model. Use $\alpha = .05$.

Results for Exercise 10.22

| Variable | $\hat{\beta}$ | t | p-value |
|---|---|---|---|
| $x_1$ = Density (population per hectare) | .331 | 3.88 | p < .01 |
| $x_2$ = Unemployed male population | -.121 | -1.17 | p > .10 |
| $x_3$ = Professional population | -.187 | -1.90 | .01 < p < .10 |
| $x_4$ = Population aged less than 5 | -.151 | -1.51 | p > .10 |
| $x_5$ = Population aged between 5 and 15 | .353 | 3.42 | p < .01 |
| $x_6$ = Female population | .095 | 1.31 | p > .10 |
| $x_7$ = 10-year change in population | .130 | 1.40 | p > .10 |
| $x_8$ = Minority population | -.122 | -1.51 | p > .10 |
| $x_9$ = Young adult population | .163 | 5.62 | p < .01 |
| $x_{10}$ = 1 if North region, 0 if not | .369 | 1.72 | .01 < p < .10 |
| $x_{11}$ = 1 if Yorkshire region, 0 if not | -.210 | -1.39 | p > .10 |
| $x_{12}$ = 1 if East Midlands region, 0 if not | -.192 | -0.78 | p > .10 |
| $x_{13}$ = 1 if East Anglia region, 0 if not | -.548 | -2.22 | .01 < p < .10 |
| $x_{14}$ = 1 if South East region, 0 if not | .152 | 1.37 | p > .10 |
| $x_{15}$ = 1 if South West region, 0 if not | -.151 | -0.88 | p > .10 |
| $x_{16}$ = 1 if West Midlands region, 0 if not | -.308 | -1.93 | .01 < p < .10 |
| $x_{17}$ = 1 if North West region, 0 if not | .311 | 2.13 | .01 < p < .10 |
| $x_{18}$ = 1 if Wales region, 0 if not | -.019 | -0.08 | p > .10 |

Source: Osborn, D. R., Tickett, A., and Elder, R. "Area characteristics and regional variates as determinants of area property crime." Journal of Quantitative Criminology, Vol. 8, No. 3, 1992, Plenum Publishing Corp.

10.23 External auditors are hired to review and analyze the financial and other records of an organization and to attest to the integrity of the organization's financial statements. In recent years, the fees charged by auditors have come under increasing scrutiny. S. Butterworth and K. A. Houghton, two University of Melbourne (Australia) researchers, investigated the effects of several variables on the fee charged by auditors. The variables are:

$y$ = Logarithm of audit fee charged to auditee (FEE)
$x_1$ = 1 if auditee changed auditors after one year (CHANGE), 0 if not
$x_2$ = Logarithm of auditee's total assets (SIZE)
$x_3$ = Number of subsidiaries of auditee (COMPLEX)
$x_4$ = 1 if auditee receives an audit qualification (RISK), 0 if not
$x_5$ = 1 if auditee in mining industry (INDUSTRY), 0 if not
$x_6$ = 1 if auditee is a member of a "Big 8" firm (BIG8), 0 if not
$x_7$ = Logarithm of dollar-value of non-audit services provided by auditor (NAS)

The multiple regression model $E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_7x_7$ was fit to data collected for $n = 268$ companies. The results are summarized in the table below.
a. Write the least squares prediction equation.
b. Assess the overall fit of the model.
c. Interpret the estimate of $\beta$ [subscript illegible in the source].
d. The researchers hypothesized the direction of the effect of each independent variable on audit fees. These hypotheses are given in the "Expected Sign of $\beta$" column in the table. (For example, if the expected sign is negative, the alternative hypothesis is $H_a$: $\beta_i < 0$.) Interpret the results of the hypothesis test for $\beta_4$. Use $\alpha = .05$.
e. The main objective of the analysis was to determine whether new auditors charge less than incumbent auditors in a given year. If this hypothesis is true, then the true value of $\beta_1$ is negative. Is there evidence to support this hypothesis? Explain.

Results for Exercise 10.23

| Independent Variable | Expected Sign of $\beta$ | $\hat{\beta}$ Estimate | t Value | Level of Significance (p-Value) |
|---|---|---|---|---|
| Constant | ... | ... | ... | .001 (two-tailed) |
| CHANGE | ... | ... | ... | .961 (one-tailed) |
| SIZE | ... | ... | ... | .000 (one-tailed) |
| COMPLEX | ... | ... | ... | .000 (one-tailed) |
| RISK | ... | ... | ... | .079 (one-tailed) |
| INDUSTRY | ... | ... | ... | .000 (one-tailed) |
| BIG8 | ... | ... | ... | .030 (one-tailed) |
| NAS | ... | ... | ... | .000 (two-tailed) |

$R^2$ = .712, F = 111.1

Source: Butterworth, S., and Houghton, K. A. "Auditor switching: The pricing of audit services." Journal of Business Finance and Accounting, Vol. 22, No. 3, April 1995, p. 334 (Table 4).

10.24 An important goal in occupational safety is "active caring." Employees demonstrate active caring (AC) about the safety of their co-workers when they identify environmental hazards and unsafe work practices and then implement appropriate corrective actions for these unsafe conditions or behaviors. Three factors hypothesized to increase the propensity for an employee to actively care for safety are (1) high self-esteem, (2) optimism, and (3) group cohesiveness. Applied & Preventive Psychology (Winter 1995) attempted to establish empirical support for the AC hypothesis by fitting the model $E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3$, where

$y$ = AC score (measuring active caring on a 15-point scale)
$x_1$ = Self-esteem score
$x_2$ = Optimism score
$x_3$ = Group cohesion score

The regression analysis, based on data collected for $n = 31$ hourly workers at a large fiber-manufacturing plant, yielded a multiple coefficient of determination of $R^2 = .362$.
a. Interpret the value of $R^2$.
b. Use the $R^2$ value to test the global utility of the model. Use $\alpha = .05$.

10.25 Refer to the Interfaces (Mar.-Apr. 1990) study of La Quinta Motor Inns, Exercise 10.13 (p. 578). The researchers used state population per inn ($x_1$), inn room rate ($x_2$), median income of the area ($x_3$), and college enrollment ($x_4$) to build a first-order model for operating margin ($y$) of a La Quinta Inn. Based on a sample of $n = 57$ inns, the model yielded $R^2 = .51$.
a. Give a descriptive measure of model adequacy.
b. Make an inference about model adequacy by conducting the appropriate test. Use $\alpha = .05$.

10.26 Regression analysis was employed to investigate the determinants of survival size of nonprofit hospitals (Applied Economics, Vol. 18, 1986). For a given sample of hospitals, survival size, $y$, is defined as the largest size hospital (in terms of number of beds) exhibiting growth in market share over a specific time interval. Suppose 10 states are randomly selected and the survival size for all nonprofit hospitals in each state is determined for two time periods five years apart, yielding two observations per state. The 20 survival sizes are listed in the table below, along with the following data for each state, for the second year in each time interval:

$x_1$ = Percentage of beds that are in for-profit hospitals
$x_2$ = Ratio of the number of persons enrolled in health maintenance organizations (HMOs) to the number of persons covered by hospital insurance
$x_3$ = State population (in thousands)
$x_4$ = Percent of state that is urban

The article hypothesized that the following model characterizes the relationship between survival size and the four variables just listed:

$$E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4$$

a. The model was fit to the data in the table using SAS, with the results given in the printout below. Report the least squares prediction equation.
b. Find the regression standard deviation $s$ and interpret its value in the context of the problem.
c. Use an F-test to investigate the usefulness of the hypothesized model. Report the observed significance level, and use $\alpha = .025$ to reach your conclusion.
d. Prior to collecting the data it was hypothesized that increases in the number of for-profit hospital beds would decrease the survival size of nonprofit hospitals. Do the data support this hypothesis? Test using $\alpha = .05$.

| State | Time period | Survival size, $y$ | $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|---|---|---|
| 1 | 1 | 370 | .13 | .09 | 5,800 | 89 |
| 1 | 2 | 390 | .15 | .09 | 5,955 | 87 |
| 2 | 1 | 455 | .08 | .11 | 17,648 | 87 |
| 2 | 2 | 450 | .10 | .16 | 17,895 | 85 |
| 3 | 1 | 500 | .03 | .04 | 7,332 | 79 |
| 3 | 2 | 480 | .07 | .05 | 7,610 | 78 |
| 4 | 1 | 550 | .06 | .005 | 11,731 | 80 |
| 4 | 2 | 600 | .10 | .005 | 11,790 | 81 |
| 5 | 1 | 205 | .30 | .12 | 2,932 | 44 |
| 5 | 2 | 230 | .25 | .13 | 3,100 | 45 |
| 6 | 1 | 425 | .04 | .01 | 4,148 | 36 |
| 6 | 2 | 445 | .07 | .02 | 4,205 | 38 |
| 7 | 1 | 245 | .20 | .01 | 1,574 | 25 |
| 7 | 2 | 200 | .30 | .01 | 1,560 | 28 |
| 8 | 1 | 250 | .07 | .08 | 2,471 | 38 |
| 8 | 2 | 275 | .08 | .10 | 2,511 | 38 |
| 9 | 1 | 300 | .09 | .12 | 4,060 | 52 |
| 9 | 2 | 290 | .12 | .20 | 4,175 | 54 |
| 10 | 1 | 280 | .10 | .02 | 2,902 | 37 |
| 10 | 2 | 270 | .11 | .05 | 2,925 | 38 |

Source: Adapted from Bays, C. W. "The determinants of hospital size: A survivor analysis." Applied Economics, 1986, Vol. 18, pp. 359-377.

SAS Output for Exercise 10.26

Dep Variable: Y

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | 4 | 246537.05939 | 61634.26485 | 28.180 | 0.0001 |
| Error | 15 | 32807.94061 | 2187.19604 | | |
| C Total | 19 | 279345.00000 | | | |

Root MSE 46.76747, Dep Mean 360.50000, C.V. 12.97295

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | 295.327091 | 40.17888737 | 7.350 | 0.0001 |
| X1 | 1 | -480.837576 | 150.39050364 | -3.197 | 0.0060 |
| X2 | 1 | -829.464955 | 196.47303539 | -4.222 | 0.0007 |
| X3 | 1 | 0.007934 | 0.00355335 | 2.233 | 0.0412 |
| X4 | 1 | 2.360769 | 0.76150774 | 3.100 | 0.0073 |

10.27 Because the coefficient of determination $R^2$ always increases when a new independent variable is added to the model, it is tempting to include many variables in a model to force $R^2$ to be near 1. However, doing so reduces the degrees of freedom available for estimating $\sigma^2$, which adversely affects our ability to make reliable inferences. Suppose you want to use 18 economic indicators to predict next year's Gross Domestic Product (GDP). You fit the model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_{17}x_{17} + \beta_{18}x_{18} + \varepsilon$$

where $y$ = GDP and $x_1, x_2, \ldots, x_{18}$ are the economic indicators. Only 20 years of data ($n = 20$) are used to fit the model, and you obtain $R^2 = .95$. Test to see whether this impressive-looking $R^2$ is large enough for you to infer that the model is useful, that is, that at least one term in the model is important for predicting GDP. Use $\alpha = .05$.

10.28 Much research, and much litigation, has been conducted on the disparity between the salary levels of men and women. Research reported in Work and Occupations (Nov. 1992) analyzes the salaries for a sample of 191 Illinois managers using a regression analysis with the following independent variables:

$x_1$ = Gender of manager [coding missing in the source]
$x_2$ = Race of manager = 1 if white, 0 if not
$x_3$ = Education level (in years)
$x_4$ = Tenure with firm (in years)
$x_5$ = Number of hours worked per week

The regression results are shown in the table below as they were reported in the article.

| Variable | $\hat{\beta}$ | p-Value |
|---|---|---|
| $x_1$ | 12.774 | < .05 |
| $x_2$ | .713 | > .10 |
| $x_3$ | 1.519 | < .05 |
| $x_4$ | .320 | < .05 |
| $x_5$ | .205 | < .05 |
| Constant | 15.491 | - |

a. Write the hypothesized model that was used, and interpret each of the $\beta$ parameters in the model.
b. Write the least squares equation that estimates the model in part a, and interpret each of the $\hat{\beta}$ estimates.
c. Interpret the value of $R^2$. Test to determine whether the model is useful for predicting annual salary. Test using $\alpha = .05$.
d. Test to determine whether the gender variable indicates that male managers are paid more than female managers, even after adjusting for and holding constant the other four factors in the model. Test using $\alpha = .05$. [Note: The p-values given in the table are two-tailed.]
e. Why would one want to adjust for these other factors before conducting a test for salary discrimination?

10.29 Refer to the Journal of Applied Ecology study of the feeding habits of baby snow geese, Exercise 10.11 (p. 577). The MINITAB printout for the model relating weight change ($y$) to digestion efficiency ($x_1$) and acid-detergent fiber ($x_2$) is reproduced below.
a. Locate $R^2$ and $R_a^2$ on the MINITAB printout. Interpret these values. Which statistic is the preferred measure of model fit? Explain.
b. Locate the global F value for testing the overall model on the MINITAB printout. Use the statistic to test the null hypothesis $H_0$: $\beta_1 = \beta_2 = 0$.

MINITAB Output for Exercise 10.29

The regression equation is
wtchnge = 12.2 - 0.0265 digest - 0.458 acid

| Predictor | Coef | StDev | T | P |
|---|---|---|---|---|
| Constant | 12.180 | 4.402 | 2.77 | 0.009 |
| digest | -0.02654 | 0.05349 | -0.50 | 0.623 |
| acid | -0.4578 | 0.1283 | -3.57 | 0.001 |

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 542.03 | 271.02 | 21.88 | 0.000 |
| Error | 39 | 483.08 | 12.39 | | |
| Total | 41 | 1025.12 | | | |

10.30 Multiple regression is used by accountants in cost analysis to shed light on the factors that cause costs to be incurred and the magnitudes of their effects. The independent variables of such a regression model are the factors believed to be related to cost, the dependent variable. In some instances, however, it is desirable to use physical units instead of cost as the dependent variable in a cost analysis. This would be the case if most of the cost associated with the activity of interest is a function of some physical unit, such as hours of labor. The advantage of this approach is that the regression model will provide estimates of the number of labor hours required under different circumstances and these hours can then be costed at the current labor rate (Horngren, Foster, and Datar, 1994). The sample data shown in the table below have been collected from a firm's accounting and production records to provide cost information about the firm's shipping department. The EXCEL computer printout for fitting the model $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \varepsilon$ is provided below.
a. Find the least squares prediction equation.
b. Use an F-test to investigate the usefulness of the model specified in part a. Use $\alpha = .01$, and state your conclusion in the context of the problem.
c. Test $H_0$: $\beta_2 = 0$ versus $H_a$: $\beta_2 \neq 0$ using $\alpha = .05$. What do the results of your test suggest about the magnitude of the effects of $x_2$ on labor costs?
d. Find $R^2$, and interpret its value in the context of the problem.
e. If shipping department employees are paid \$7.50 per hour, how much less, on average, will it cost the company per week if the average number of pounds per shipment increases from a level of 20 to 21? Assume that $x_1$ and $x_2$ remain unchanged. Your answer is an estimate of what is known in economics as the expected marginal cost associated with a one-pound increase in $x_3$.
f. With what approximate precision can this model be used to predict the hours of labor? [Note: The precision of multiple regression predictions is discussed in Section 10.6.]
g. Can regression analysis alone indicate what factors cause costs to increase? Explain.

| Week | Labor, $y$ (hrs.) | Pounds Shipped, $x_1$ (1,000s) | Percentage of Units Shipped by Truck, $x_2$ | Average Shipment Weight, $x_3$ (lbs.) |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

EXCEL Output for Exercise 10.30

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 131.9242521 | 25.69321439 | 5.134595076 | 9.98597E-05 | 77.45708304 | 186.3914211 |
| Ship (x1) | 2.72608977 | 2.275004884 | 1.198278645 | 0.24825743 | -2.096704051 | 7.548883591 |
| Truck (x2) | 0.047218412 | 0.093348559 | 0.505829045 | 0.6198742 | -0.150671647 | 0.245108472 |
| Weight (x3) | -2.587443905 | 0.642818185 | -4.025156669 | 0.000978875 | -3.950157275 | -1.224730536 |

10.6 USING THE MODEL FOR ESTIMATION AND PREDICTION

In Section 9.8 we discussed the use of the least squares line for estimating the mean value of $y$, $E(y)$, for some particular value of $x$, say $x = x_p$. We also showed how to use the same fitted model to predict, when $x = x_p$, some new value of $y$ to be observed in the future. Recall that the least squares line yielded the same value for both the estimate of $E(y)$ and the prediction of some future value of $y$. That is, both are the result of substituting $x_p$ into the prediction equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$ and calculating $\hat{y}_p$. There the equivalence ends. The confidence interval for the mean $E(y)$ is narrower than the prediction interval for $y$ because of the additional uncertainty attributable to the random error $\varepsilon$ when predicting some future value of $y$.

These same concepts carry over to the multiple regression model. Consider, again, the first-order model relating sale price of a residential property to land value ($x_1$), improvements ($x_2$), and home size ($x_3$). Suppose we want to estimate the mean sale price for a given property with $x_1$ = \$15,000, $x_2$ = \$50,000, and $x_3$ = 1,800 square feet. Assuming that the first-order model represents the true relationship between sale price and the three independent variables, we want to estimate

$$E(y) = \beta_0 + \beta_1(15{,}000) + \beta_2(50{,}000) + \beta_3(1{,}800)$$

Substituting into the least squares prediction equation, we find the estimate of $E(y)$ to be

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(15{,}000) + \hat{\beta}_2(50{,}000) + \hat{\beta}_3(1{,}800) = 79{,}061.4$$
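The substitution step is the same for any fitted first-order model, and a short Python sketch may help. Because the sale price coefficients are not repeated in this section, the sketch borrows, purely for illustration, the fitted snow goose equation from the MINITAB printout of Exercise 10.29 (wtchnge = 12.180 - 0.02654 digest - 0.4578 acid) and evaluates it at digest = 5 and acid = 30. The function name is hypothetical.

```python
def predict(coefficients, x_values):
    """Return b0 + b1*x1 + ... + bk*xk for a fitted first-order model."""
    intercept, *slopes = coefficients
    if len(slopes) != len(x_values):
        raise ValueError("need exactly one x value per slope coefficient")
    return intercept + sum(b * x for b, x in zip(slopes, x_values))


# Coefficients copied from the MINITAB printout for the snow goose model.
goose_coefficients = [12.180, -0.02654, -0.4578]

# Point estimate of E(y), and also the prediction of a new y, at x1 = 5, x2 = 30.
y_hat = predict(goose_coefficients, [5, 30])
print(round(y_hat, 3))
```

The same number serves as both the estimate of $E(y)$ and the prediction of an individual $y$; only the intervals placed around it differ.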

To form a confidence interval for the mean, we need to know the standard deviation of the sampling distribution for the estimator $\hat{y}$. For multiple regression models, the form of this standard deviation is rather complex. However, the regression routines of statistical computer software packages allow us to obtain the confidence intervals for mean values of $y$ for any given combination of values of the independent variables. A portion of the SAS output for the sale price example is shown in Figure 10.9a.
FIGURE 10.9a SAS printout for estimated mean sale price value and corresponding confidence interval (column headings: Obs, X1, X2, X3, Y, Predict Value, Residual, Lower95% Mean, Upper95% Mean; the data row is not legible in the source)

The estimated mean value and corresponding 95% confidence interval for the selected $x$ values are shown in the columns labeled Predict Value, Lower95% Mean, and Upper95% Mean, respectively. We observe that $\hat{y}$ = 79,061.4, which agrees with our calculation. The corresponding 95% confidence interval for the true mean of $y$, highlighted on the printout, is (73,380.7, 84,742.1). Thus, with 95% confidence, we conclude that the mean sale price for all properties with $x_1$ = \$15,000, $x_2$ = \$50,000, and $x_3$ = 1,800 square feet will fall between \$73,380.70 and \$84,742.10.

If we were interested in predicting the sale price for a particular (single) property with $x_1$ = \$15,000, $x_2$ = \$50,000, and $x_3$ = 1,800 square feet, we would use $\hat{y}$ = \$79,061.41 as the predicted value. However, the prediction interval for a new value of $y$ is wider than the confidence interval for the mean value. This is reflected by the SAS printout shown in Figure 10.9b, which gives the predicted value of $y$ and corresponding 95% prediction interval under the columns Predict Value, Lower95% Predict, and Upper95% Predict, respectively. Note that the prediction interval is (61,337.9, 96,785). Thus, with 95% confidence, we conclude that the sale price for an individual property with the characteristics $x_1$ = \$15,000, $x_2$ = \$50,000, and $x_3$ = 1,800 square feet will fall between \$61,337.90 and \$96,785.

| Obs | X1 | X2 | X3 | Y | Predict Value | Residual | Lower95% Predict | Upper95% Predict |
|---|---|---|---|---|---|---|---|---|
| 21 | 15000 | 50000 | 1800 | . | 79061.4 | . | 61337.9 | 96785 |

FIGURE 10.9b SAS printout for predicted sale price value and corresponding prediction interval
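The gap between the two intervals can be quantified from the printed bounds alone. The sketch below assumes the usual forms of the two intervals, $\hat{y} \pm t_{\alpha/2}\, s_{\hat{y}}$ for the mean and $\hat{y} \pm t_{\alpha/2}\sqrt{s^2 + s_{\hat{y}}^2}$ for a new observation, so that the squared half-widths differ by $(t_{\alpha/2}\, s)^2$. The function names are hypothetical; the interval endpoints are those reported for the sale price example.

```python
import math


def half_width(interval):
    """Half the distance between the lower and upper bounds of an interval."""
    lower, upper = interval
    return (upper - lower) / 2.0


def implied_error_margin(ci_mean, pi_new):
    """Recover t * s, the part of the prediction margin due to random error."""
    ci_half = half_width(ci_mean)
    pi_half = half_width(pi_new)
    if pi_half < ci_half:
        raise ValueError("a prediction interval cannot be narrower than the CI")
    return math.sqrt(pi_half ** 2 - ci_half ** 2)


ci = (73380.7, 84742.1)   # 95% confidence interval for E(y)
pi = (61337.9, 96785.0)   # 95% prediction interval for a new y

print(round(half_width(ci), 2), round(half_width(pi), 2))
print(round(implied_error_margin(ci, pi), 1))
```

Most of the prediction margin here comes from the random error term rather than from uncertainty in the fitted mean, which is why the prediction interval is roughly three times as wide as the confidence interval.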

Applying the Concepts

10.31 Refer to the World Development (Feb. 1998) study of street vendors' earnings, $y$, Exercises 10.7 and 10.20 (pp. 573, 586). The STATISTIX printout below shows both a 95% prediction interval for $y$ (left side) and a 95% confidence interval for $E(y)$ (right side) for a 45-year-old vendor who works 10 hours a day (i.e., for $x_1 = 45$ and $x_2 = 10$).
a. Interpret the 95% prediction interval for $y$ in the words of the problem.
b. Interpret the 95% confidence interval for $E(y)$ in the words of the problem.
c. Note that the interval of part a is wider than the interval of part b. Will this always be true? Explain.

STATISTIX Output for Exercise 10.31

PREDICTED/FITTED VALUES OF EARNINGS

| Quantity | Value | Quantity | Value |
|---|---|---|---|
| LOWER PREDICTED BOUND | 1759.7 | LOWER FITTED BOUND | 2620.3 |
| PREDICTED VALUE | 3017.6 | FITTED VALUE | 3017.6 |
| UPPER PREDICTED BOUND | 4275.4 | UPPER FITTED BOUND | 3414.9 |
| SE (PREDICTED VALUE) | 577.29 | SE (FITTED VALUE) | 182.35 |
| UNUSUALNESS (LEVERAGE) | 0.1108 | | |
| PERCENT COVERAGE | 95.0 | | |
| CORRESPONDING T | 2.18 | | |

PREDICTOR VALUES: AGE = 45.000, HOURS = 10.000

10.32 Refer to the Journal of Applied Ecology study of the feeding habits of baby snow geese, Exercises 10.11 and 10.29 (pages 577, 591). The MINITAB printout for the first-order model relating gosling weight change $y$ to digestion efficiency $x_1$ and acid-detergent fiber $x_2$ is reproduced below. Both a confidence interval for $E(y)$ and a prediction interval for $y$ when $x_1$ = 5% and $x_2$ = 30% are shown at the bottom of the printout.
a. Interpret the confidence interval for $E(y)$.
b. Interpret the prediction interval for $y$.

MINITAB Output for Exercise 10.32

The regression equation is
wtchnge = 12.2 - 0.0265 digest - 0.458 acid

| Predictor | Coef | StDev | T | P |
|---|---|---|---|---|
| Constant | 12.180 | 4.402 | 2.77 | 0.009 |
| digest | -0.02654 | 0.05349 | -0.50 | 0.623 |
| acid | -0.4578 | 0.1283 | -3.57 | 0.001 |

s = 3.519, R-Sq = 52.9%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 542.03 | 271.02 | 21.88 | 0.000 |
| Error | 39 | 483.08 | 12.39 | | |
| Total | 41 | 1025.12 | | | |

| Fit | StDev Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|
| -1.687 | 0.866 | (-3.440, 0.065) | (-9.020, 5.646) |

10.33 Refer to Exercise 10.14 (p. 578). The researchers concluded that "in order to break a water-oil mixture with the lowest possible voltage, the volume fraction of the disperse phase $x_1$ should be high, while the salinity $x_2$ and the amount of surfactant $x_5$ should be low." Use this information and the first-order model of Exercise 10.14 to find a 95% prediction interval for this "low" voltage $y$. Interpret the interval.

10.34 An article published in Geography (July 1980) used multiple regression to predict annual rainfall levels in California. Data on the average annual precipitation ($y$), altitude ($x_1$), latitude ($x_2$), and distance from the Pacific coast ($x_3$) for 30 meteorological stations scattered throughout California are listed in the table below (CALIRAIN.DAT). Initially, the first-order model $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \varepsilon$ was fit to the data. The SAS printout of the analysis is provided below.
a. Is there evidence that the first-order model is useful for predicting annual precipitation $y$? Test using $\alpha = .05$.
b. Ninety-five percent prediction intervals for $y$ are shown at the bottom of the printout. Locate and interpret the interval for the Giant Forest meteorological station (station #9).

10.35 In a production facility, an accurate estimate of man-hours needed to complete a task is crucial to management in making such decisions as the proper number of workers to hire, an accurate deadline to quote a client, or cost-analysis decisions regarding budgets. A manufacturer of boiler drums wants to use regression to predict the number of man-hours needed to erect the drums in future projects. To accomplish this, data for 35 boilers were collected. In addition to man-hours ($y$), the variables measured were boiler capacity ($x_1$ = lb/hr), boiler design pressure ($x_2$ = pounds per square inch or psi), boiler type ($x_3$ = 1 if industry field erected, 0 if utility field erected), and drum type ($x_4$ = 1 if steam, 0 if mud). The data are provided in the table below (BOILERS.DAT). A MINITAB printout for the model $E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4$ is shown below.
a. Conduct a test for the global utility of the model. Use $\alpha = .01$.
b. Both a 95% confidence interval for $E(y)$ and a 95% prediction interval for $y$ when $x_1$ = 150,000, $x_2$ = 500, $x_3$ = 1, and $x_4$ = 0 are shown at the bottom of the MINITAB printout. Interpret both of these intervals.

SAS Output for Exercise 10.34

Model: MODEL1
Dependent Variable: PRECIP

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Prob>F |
|---|---|---|---|---|---|
| Model | ... | ... | ... | ... | ... |
| Error | ... | ... | ... | | |
| C Total | ... | ... | | | |

Root MSE 11.09799, R-square 0.6003
Dep Mean 19.80733, Adj R-sq 0.5542
C.V. 56.02968

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Parameter=0 | Prob > \|T\| |
|---|---|---|---|---|---|
| INTERCEP | 1 | ... | ... | ... | ... |
| ALTITUDE | 1 | ... | ... | ... | ... |
| LONGTUDE | 1 | ... | ... | ... | ... |
| COAST | 1 | ... | ... | ... | ... |

| Obs | STATION | Dep Var PRECIP | Predict Value | Std Err Predict | Lower95% Predict | Upper95% Predict | Residual |
|---|---|---|---|---|---|---|---|
| 1 | Eureka | 39.5700 | 38.4797 | 4.568 | 13.8110 | 63.1483 | 1.0903 |
| 2 | RedBluff | 23.2700 | 23.9136 | 3.795 | -0.1953 | 48.0226 | -0.6436 |
| 3 | Thermal | 18.2000 | 21.2729 | 5.132 | -3.8600 | 46.4057 | -3.0729 |
| 4 | FortBrag | 37.4800 | 33.7750 | 3.815 | 9.6526 | 57.8973 | 3.7050 |
| 5 | SodaSpri | 49.2600 | 39.4605 | 5.799 | 13.7221 | 65.1990 | 9.7995 |
| 6 | SanFranc | 21.8200 | 27.5918 | 3.144 | 3.8818 | 51.3018 | -5.7718 |
| 7 | Sacramen | 18.0700 | 19.1828 | 2.988 | -4.4416 | 42.8072 | -1.1128 |
| 8 | SanJose | 14.1700 | 23.1015 | 2.647 | -0.3504 | 46.5535 | -8.9315 |
| 9 | GiantFor | 42.6300 | 29.2534 | 5.596 | 3.7056 | 54.8012 | 13.3766 |
| 10 | Salinas | 13.8500 | 22.8856 | 2.842 | -0.6628 | 46.4340 | -9.0356 |
| 11 | Fresno | 9.4400 | 9.3654 | 3.029 | -14.2809 | 33.0116 | 0.0746 |
| 12 | PtPiedra | 19.3300 | 20.9364 | 3.147 | -2.7752 | 44.6480 | -1.6064 |
| 13 | PasaRobl | 15.6700 | 19.4445 | 2.629 | -3.9987 | 42.8877 | -3.7745 |
| 14 | Bakersfi | 6.0000 | 11.0967 | 2.512 | -12.2922 | 34.4857 | -5.0967 |
| 15 | Bishop | 5.7300 | 14.8859 | 4.193 | -9.5000 | 39.2717 | -9.1559 |
| 16 | Mineral | 47.8200 | 36.6194 | 4.392 | 12.0861 | 61.1527 | 11.2006 |
| 17 | SantaBar | 17.9500 | 16.7077 | 3.526 | -7.2277 | 40.6432 | 1.2423 |
| 18 | Susanvil | 18.2000 | 25.4191 | 4.571 | 0.7482 | 50.0899 | -7.2191 |
| 19 | TuleLake | 10.0300 | 38.7521 | 4.646 | 14.0218 | 63.4823 | -28.7221 |
| 20 | Needles | 4.6300 | -5.9539 | 5.224 | -31.1665 | 19.2587 | 10.5839 |
| 21 | Burbank | 14.7400 | 11.8145 | 3.046 | -11.8412 | 35.4701 | 2.9255 |
| 22 | LosAngel | 15.0200 | 14.3149 | 3.416 | -9.5536 | 38.1834 | 0.7051 |
| 23 | LongBeac | 12.3600 | 12.7793 | 3.595 | -11.1997 | 36.7583 | -0.4193 |
| 24 | LosBanos | 8.2600 | 18.0332 | 2.621 | -5.4064 | 41.4729 | -9.7732 |
| 25 | Blythe | 4.0500 | -7.4478 | 4.950 | -32.4260 | 17.5303 | 11.4978 |
| 26 | SanDiego | 9.9400 | 9.8563 | 4.275 | -14.5899 | 34.3025 | 0.0837 |
| 27 | Daggett | 4.2500 | 11.7920 | 3.298 | -12.0062 | 35.5902 | -7.5420 |
| 28 | DeathVal | 1.6600 | -4.8355 | 5.843 | -30.6159 | 20.9448 | 6.4955 |
| 29 | Crescent | 74.8700 | 41.5529 | 5.121 | 16.4295 | 66.6764 | 33.3171 |
| 30 | Colusa | 15.9500 | 20.1703 | 3.399 | -3.6875 | 44.0280 | -4.2203 |

CALIRAIN.DAT (Data for Exercise 10.34)

| Station | Precipitation, $y$ (inches) | Altitude, $x_1$ (feet) | Latitude, $x_2$ (degrees) | Distance, $x_3$ (miles) |
|---|---|---|---|---|
| 1. Eureka | 39.57 | 43 | 40.8 | 1 |
| 2. Red Bluff | 23.27 | 341 | 40.2 | 97 |
| 3. Thermal | 18.20 | 4152 | 33.8 | 70 |
| 4. Fort Bragg | 37.48 | 74 | 39.4 | 1 |
| 5. Soda Springs | 49.26 | 6752 | 39.3 | 150 |
| 6. San Francisco | 21.82 | 52 | 37.8 | 5 |
| 7. Sacramento | 18.07 | 25 | 38.5 | 80 |
| 8. San Jose | 14.17 | 95 | 37.4 | 28 |
| 9. Giant Forest | 42.63 | 6360 | 36.6 | 145 |
| 10. Salinas | 13.85 | 74 | 36.7 | 12 |
| 11. Fresno | 9.44 | 331 | 36.7 | 114 |
| 12. Pt. Piedras | 19.33 | 57 | 35.7 | 1 |
| 13. Pasa Robles | 15.67 | 740 | 35.7 | 31 |
| 14. Bakersfield | 6.00 | 489 | 35.4 | 75 |
| 15. Bishop | 5.73 | 4108 | 37.3 | 198 |
| 16. Mineral | 47.82 | 4850 | 40.4 | 142 |
| 17. Santa Barbara | 17.95 | 120 | 34.4 | 1 |
| 18. Susanville | 18.20 | 4152 | 40.3 | 198 |
| 19. Tule Lake | 10.03 | 4036 | 41.9 | 140 |
| 20. Needles | 4.63 | 913 | 34.8 | 192 |
| 21. Burbank | 14.74 | 699 | 34.2 | 47 |
| 22. Los Angeles | 15.02 | 312 | 34.1 | 16 |
| 23. Long Beach | 12.36 | 50 | 33.8 | 12 |
| 24. Los Banos | 8.26 | 125 | 37.8 | 74 |
| 25. Blythe | 4.05 | 268 | 33.6 | 155 |
| 26. San Diego | 9.94 | 19 | 32.7 | 5 |
| 27. Daggett | 4.25 | 2105 | 34.1 | 85 |
| 28. Death Valley | 1.66 | -178 | 36.5 | 194 |
| 29. Crescent City | 74.87 | 35 | 41.7 | 1 |
| 30. Colusa | 15.95 | 60 | 39.2 | 91 |

Source: Taylor, P. J. "A pedagogic application of multiple regression analysis." Geography, July 1980, Vol. 65, pp. 203-212.

MINITAB Output for Exercise 10.35

The regression equation is
Y = -3783 + 0.00875 X1 + 1.93 X2 + 3444 X3 + 2093 X4

| Predictor | Coef | StDev | t-ratio | P |
|---|---|---|---|---|
| Constant | -3783 | 1205 | -3.14 | 0.004 |
| X1 | 0.0087490 | 0.0009035 | 9.68 | 0.000 |
| X2 | 1.9265 | 0.6489 | 2.97 | 0.006 |
| X3 | 3444.3 | 911.7 | 3.78 | 0.001 |
| X4 | 2093.4 | 305.6 | 6.85 | 0.000 |

Analysis of Variance

| SOURCE | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 4 | 230854848 | 57713712 | 72.11 | 0.000 |
| Error | 31 | 24809760 | 800315 | | |
| Total | 35 | 255664608 | | | |

R denotes an obs. with a large st. resid.

| Fit | StDev.Fit | 95% C.I. | 95% P.I. |
|---|---|---|---|
| 1936 | 239 | (1449, 2424) | (47, 3825) |


BOILERS.DAT (Data for Exercise 10.35)

Man-Hours, y    Boiler Capacity, $x_1$    Design Pressure, $x_2$    Boiler Type, $x_3$    Drum Type, $x_4$

Source: Dr. Kelly Uscategui, University of Connecticut.

RESIDUAL ANALYSIS: CHECKING THE REGRESSION ASSUMPTIONS
When we apply regression analysis to a set of data, we never know for certain whether the assumptions of Section 10.3 are satisfied. How far can we deviate from the assumptions and still expect regression analysis to yield results that will have the reliability stated in this chapter? How can we detect departures (if they exist) from the assumptions, and what can we do about them? We provide some answers to these questions in this section.

Recall from Section 10.3 that for any given set of values of $x_1, x_2, \ldots, x_k$ we assume that the random error term $\varepsilon$ has a normal probability distribution with mean equal to 0 and variance equal to $\sigma^2$. Also, we assume that the random errors are probabilistically independent. It is unlikely that these assumptions are ever satisfied exactly in a practical application of regression analysis. Fortunately, experience has shown that least squares regression analysis produces reliable statistical tests, confidence intervals, and prediction intervals as long as the departures from the assumptions are not too great. In this section we present some methods for determining whether the data indicate significant departures from the assumptions.

Because the assumptions all concern the random error component, $\varepsilon$, of the model, the first step is to estimate the random error. Since the actual random error associated with a particular value of y is the difference between the actual y value and its unknown mean, we estimate the error by the difference between the actual y value and the estimated mean. This estimated error is called the regression residual, or simply the residual, and is denoted by $\hat{\varepsilon}$. The actual error $\varepsilon$ and residual $\hat{\varepsilon}$ are shown in Figure 10.10.

FIGURE 10.10 Actual random error $\varepsilon$ and regression residual $\hat{\varepsilon}$

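A regression residual, $\hat{\varepsilon}$, is defined as the difference between an observed y value and its corresponding predicted value:

$$\hat{\varepsilon} = y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k)$$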

Since the true mean of y (that is, the true regression model) is not known, the actual random error cannot be calculated. However, because the residual is based on the estimated mean (the least squares regression model), it can be calculated and used to estimate the random error and to check the regression assumptions. Such checks are generally referred to as residual analyses. Two useful properties of residuals are given in the next box.

The following examples show how a graphical analysis of regression residuals can be used to verify the assumptions associated with the model and to support improvements to the model when the assumptions do not appear to be satisfied. Although the residuals can be calculated and plotted by hand, we rely on the statistical software for these tasks in the examples and exercises.

First, we demonstrate how a residual plot can detect a model in which the hypothesized relationship between E(y) and an independent variable x is misspecified. The assumption of mean error of 0 is violated in these types of models.*

1. The mean of the residuals is equal to 0. This property follows from the fact that the sum of the differences between the observed y values and their least squares predicted $\hat{y}$ values is equal to 0.

$$\sum(\text{Residuals}) = \sum(y - \hat{y}) = 0$$

2. The standard deviation of the residuals is equal to the standard deviation of the fitted regression model, s. This property follows from the fact that the sum of the squared residuals is equal to SSE, which when divided by the error degrees of freedom is equal to the variance of the fitted regression model, $s^2$. The square root of the variance is both the standard deviation of the residuals and the standard deviation of the regression model.

$$\sum(\text{Residuals})^2 = \sum(y - \hat{y})^2 = \text{SSE}$$
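The two properties can be checked numerically. The sketch below is only an illustration: it takes the observed y values and the Predict Value column printed in the SAS output for the electrical usage example (Figure 10.11), recomputes the residuals, and compares their sum, SSE, and s with the printout. The variable names are arbitrary, and because the printed predictions are rounded to one decimal place the results agree with the printout only approximately.

```python
import math

# Observed y and predicted values as printed in the SAS output (Figure 10.11).
y = [1182.0, 1172.0, 1264.0, 1493.0, 1571.0,
     1711.0, 1804.0, 1840.0, 1956.0, 1954.0]
y_hat = [1275.9, 1308.3, 1373.2, 1443.4, 1502.8,
         1573.1, 1648.7, 1783.8, 1875.7, 2162.0]

# Residual = observed value minus predicted value.
residuals = [obs - pred for obs, pred in zip(y, y_hat)]

# Property 1: the residuals sum to (approximately) 0.
sum_resid = sum(residuals)

# Property 2: the squared residuals sum to SSE, and
# s = sqrt(SSE / error df), with error df = n - 2 for a straight-line model.
sse = sum(r ** 2 for r in residuals)
error_df = len(y) - 2
s = math.sqrt(sse / error_df)

print(f"Sum of residuals: {sum_resid:.1f}")
print(f"SSE:              {sse:.1f}")
print(f"s (Root MSE):     {s:.2f}")
```

The printout reports a sum of residuals of 0, a sum of squared residuals of 142444.9166, and a Root MSE of 133.43766; the small differences come from the rounding of the predicted values.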

consumers, builders, and energy conservationists. Suppose we wish to investigate the monthly electrical usage, y, in all-electric homes and its relationship to the size, x, of the home. Data were collected for n = 10 homes during a particular month and are shown in Table 10.3. A SAS printout for a straight-line model, $E(y) = \beta_0 + \beta_1 x$, fit to the data is shown in Figure 10.11. The residuals from this model are highlighted in the printout. The residuals are then plotted on the vertical axis against the variable x, size of home, on the horizontal axis in Figure 10.12.

TABLE 10.3 Home Size-Electrical Usage Data
Home Size, x (sq. ft.)    Monthly Usage, y (kilowatt-hours)

a. Verify that each residual is equal to the difference between the observed y value and the estimated mean value, $\hat{y}$.
b. Analyze the residual plot.

*For a misspecified model, the hypothesized mean of y, denoted by $E_h(y)$, will not equal the true mean of y, $E(y)$. Since $y = E_h(y) + \varepsilon$, then $\varepsilon = y - E_h(y)$ and $E(\varepsilon) = E[y - E_h(y)] = E(y) - E_h(y) \neq 0$.

FIGURE 10.11 SAS printout for electrical usage example: Straight-line model

Dep Variable: Y

Analysis of Variance

                      Sum of           Mean
Source     DF        Squares         Square    F Value    Prob>F
Model       1   703957.18342   703957.18342     39.536    0.0002
Error       8   142444.91658    17805.61457
C Total     9   846402.10000

Root MSE    133.43766    R-Square
Dep Mean   1594.70000    Adj R-Sq
C.V.          8.36757

Parameter Estimates

                  Parameter        Standard    T for H0:
Variable   DF      Estimate           Error    Parameter=0    Prob > |T|
INTERCEP    1    578.927752    166.96805715
X           1      0.540304      0.08592981

                Predict
Obs        Y      Value
  1   1182.0     1275.9
  2   1172.0     1308.3
  3   1264.0     1373.2
  4   1493.0     1443.4
  5   1571.0     1502.8
  6   1711.0     1573.1
  7   1804.0     1648.7
  8   1840.0     1783.8
  9   1956.0     1875.7
 10   1954.0     2162.0

Sum of Residuals                      0
Sum of Squared Residuals    142444.9166

Solution

a. For the straight-line model the residual is calculated for the first y value as follows:

$$\hat{\varepsilon} = y - \hat{y} = 1182.0 - 1275.9 = -93.9$$

where $\hat{y}$ is the first number in the column labeled Predict Value on the SAS printout in Figure 10.11. Similarly, the residual for the second y value is

$$\hat{\varepsilon} = y - \hat{y} = 1172.0 - 1308.3 = -136.3$$

Both residuals agree (after rounding) with the values given in the column labeled Residual in Figure 10.11. Similar calculations produce the remaining residuals.

b. The plot of the residuals for the straight-line model (Figure 10.12) reveals a nonrandom pattern. The residuals exhibit a curved shape, with the residuals for the small values of x below the horizontal 0 (mean of the residuals) line, the residuals corresponding to the middle values of x above the 0 line, and the residual for the largest value of x again below the 0 line. The indication is

FIGURE 10.12 Residual plot for electrical usage example: Straight-line model (SAS plot of RESIDUAL*X; legend: A = 1 obs, B = 2 obs, etc.)

FIGURE 10.13 SAS printout for electrical usage example: Quadratic model

Dep Variable: Y

Analysis of Variance

                      Sum of           Mean
Source     DF        Squares         Square    F Value    Prob>F
Model       2   831069.54637   415534.77319    189.710    0.0001
Error       7    15332.55363     2190.36480
C Total     9   846402.10000

Root MSE     46.80133    R-Square
Dep Mean   1594.70000    Adj R-Sq
C.V.          2.93480

Parameter Estimates

                   Parameter        Standard    T for H0:
Variable   DF       Estimate           Error    Parameter=0    Prob > |T|
INTERCEP    1   -1216.143887    242.80636850         -5.009        0.0016
X           1       2.398930      0.24583560          9.758        0.0001
XSQ         1      -0.000450      0.00005908         -7.618        0.0001

                Predict
Obs        Y      Value
  1   1182.0     1129.6
  2   1172.0     1202.2
  3   1264.0     1337.8
  4   1493.0     1470.0
  5   1571.0     1570.1
  6   1711.0     1674.2
  7   1804.0     1769.4
  8   1840.0     1895.5
  9   1956.0     1949.1
 10   1954.0     1949.2

Sum of Residuals            -2.27374E-12
Sum of Squared Residuals      15332.5536

A residual that is larger than 3s (in absolute value) is considered to be an outlier.
... in which we modeled the auction price, y, of a clock as a function of age, $x_1$, and number of bidders, $x_2$. The data for this example are repeated in Table 10.4, with one important difference: The auction price of the clock at the top of the second column has been changed from \$2,131 to \$1,131 (highlighted in Table 10.4). The first-order model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

is again fit to these (modified) data, with the MINITAB printout shown in Figure 10.15. The residuals are shown highlighted in the printout and then plotted against the number of bidders, $x_2$, in Figure 10.16. Analyze the residual plot.

FIGURE 10.14 Residual plot for electrical usage example: Quadratic model (SAS plot of RESIDUAL*X; vertical axis: RESIDUAL, from -100 to 100; horizontal axis: X, from 1200 to 3000; legend: A = 1 obs, B = 2 obs, etc.)

TABLE 10.4 Altered Auction Price Data

Age, $x_1$   Number of Bidders, $x_2$   Auction Price, y     Age, $x_1$   Number of Bidders, $x_2$   Auction Price, y
   127              13                   \$1,235                170              14                   \$1,131
   115              12                    1,080                 182               8                    1,550
   127               7                      845                 162              11                    1,884
   150               9                    1,522                 184              10                    2,041
   156               6                    1,047                 143               6                      845
   182              11                    1,979                 159               9                    1,483
   156              12                    1,822                 108              14                    1,055
   132              10                    1,253                 175               8                    1,545
   137               9                    1,297                 108               6                      729
   113               9                      946                 179               9                    1,792
   137              15                    1,713                 111              15                    1,175
   117              11                    1,024                 187               8                    1,593
   137               8                    1,147                 111               7                      785
   153               6                    1,092                 115               7                      744
   117              13                    1,152                 194               5                    1,356
   126              10                    1,336                 168               7                    1,262

Solution

The residual plot in Figure 10.16 dramatically reveals the one altered measurement. Note that one of the two residuals at $x_2$ = 14 bidders falls more than 3 standard deviations below 0. Note that no other residual falls more than 2 standard deviations from 0. What do we do with outliers once we identify them? First, we try to determine the cause. Were the data entered into the computer incorrectly? Was the

FIGURE 10.15 MINITAB printout for grandfather clock example with altered data

The regression equation is
Price = - 922 + 11.1 Age + 64.0 Bidders

Predictor      Coef   SE Coef       T       P
Constant     -921.5     258.7   -3.56   0.001
Age          11.087     1.347    8.23   0.000
Bidders       64.03     12.99    4.93   0.000

Analysis of Variance

Source            DF        SS        MS       F       P
Regression         2   3015671   1507835   38.20   0.000
Residual Error    29   1144619     39470
Total             31   4160290

Obs   Age    Price      Fit   SE Fit   St Resid
  1   127   1235.0   1318.9     57.4      -0.44
  2   115   1080.0   1121.8     56.8      -0.22
  3   127    845.0    934.7     57.5      -0.47
  4   150   1522.0   1317.7     36.1       1.05
  5   156   1047.0   1192.2     56.7      -0.76
  6   182   1979.0   1800.6     67.6       0.96
  7   156   1822.0   1576.3     52.2       1.28
  8   132   1253.0   1182.2     39.0       0.36
  9   137   1297.0   1173.6     37.9       0.63
 10   113    946.0    907.5     57.3       0.20
 11   137   1713.0   1557.8     77.5       0.85
 12   117   1024.0   1079.9     51.5      -0.29
 13   137   1147.0   1109.6     43.0       0.19
 14   153   1092.0   1158.9     56.6      -0.35
 15   117   1152.0   1208.0     61.8      -0.30
 16   126   1336.0   1115.7     42.9       1.14
 17   170   1131.0   1859.6     82.1      -4.03R
 18   182   1550.0   1608.5     60.1      -0.31
 19   162   1884.0   1578.8     48.5       1.58
 20   184   2041.0   1758.7     64.8       1.50
 21   143    845.0   1048.0     58.4      -1.07
 22   159   1483.0   1417.5     39.7       0.34
 23   108   1055.0   1172.2     74.9      -0.64
 24   175   1545.0   1530.9     53.5       0.07
 25   108    729.0    660.0     83.5       0.38
 26   179   1792.0   1639.2     56.8       0.80
 27   111   1175.0   1269.5     82.0      -0.52
 28   187   1593.0   1663.9     65.3      -0.38
 29   111    785.0    757.3     71.9       0.15
 30   115    744.0    801.7     67.9      -0.31
 31   194   1356.0   1549.4     84.2      -1.08
 32   168   1262.0   1389.2     52.5      -0.66

R denotes an observation with a large standardized residual

FIGURE 10.16 MINITAB residual plot against number of bidders (horizontal axis: Bidders)

observation recorded incorrectly when the data were collected? If so, we correct the observation and rerun the analysis. Another possibility is that the observation is not representative of the conditions we are trying to model. For example, in this case the low price may be attributable to extreme damage to the clock, or to a clock of inferior quality compared to the others. In these cases we probably would exclude the observation from the analysis. In many cases you may not be able to determine the cause of the outlier. Even so, you may want to rerun the regression analysis excluding the outlier in order to assess the effect of that observation on the results of the analysis.

Figure 10.17 shows the printout when the outlier observation is excluded from the grandfather clock analysis, and Figure 10.18 shows the new plot of the residuals against the number of bidders. Now none of the residuals lies beyond 2 standard deviations from 0. Also, the model statistics indicate a much better model without the outlier. Most notably, the standard deviation (s) has decreased from 198.7 to 134.2, indicating a model that will provide more precise estimates and predictions (narrower confidence and prediction intervals) for clocks that are similar to those in the reduced sample. But remember that if the outlier is removed from the analysis when in fact it belongs to the same population as the rest of the sample, the resulting model may provide misleading estimates and predictions.

Outlier analysis is another example of testing the assumption that the expected (mean) value of the random error $\varepsilon$ is 0, since this assumption is in doubt for the error terms corresponding to the outliers. The next example in this section checks the assumption of the normality of the random error component.
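The outlier screen and the refit without the outlier can be sketched in a few lines. This is an illustrative sketch rather than the MINITAB procedure: it assumes NumPy is available, uses the altered data of Table 10.4, fits the first-order model by ordinary least squares, and applies the rule stated earlier in this section (a residual larger than 3s in absolute value is an outlier). The function name `fit_first_order` is made up for the example.

```python
import numpy as np

# Altered grandfather clock data from Table 10.4: (age, bidders, price).
data = np.array([
    (127, 13, 1235), (115, 12, 1080), (127, 7, 845), (150, 9, 1522),
    (156, 6, 1047), (182, 11, 1979), (156, 12, 1822), (132, 10, 1253),
    (137, 9, 1297), (113, 9, 946), (137, 15, 1713), (117, 11, 1024),
    (137, 8, 1147), (153, 6, 1092), (117, 13, 1152), (126, 10, 1336),
    (170, 14, 1131), (182, 8, 1550), (162, 11, 1884), (184, 10, 2041),
    (143, 6, 845), (159, 9, 1483), (108, 14, 1055), (175, 8, 1545),
    (108, 6, 729), (179, 9, 1792), (111, 15, 1175), (187, 8, 1593),
    (111, 7, 785), (115, 7, 744), (194, 5, 1356), (168, 7, 1262),
], dtype=float)


def fit_first_order(rows):
    """Least squares fit of E(y) = b0 + b1*x1 + b2*x2.

    Returns the coefficient estimates, the residuals, and s.
    """
    X = np.column_stack([np.ones(len(rows)), rows[:, 0], rows[:, 1]])
    y = rows[:, 2]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # s = sqrt(SSE / error df), with error df = n - (k + 1).
    s = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return beta, resid, s


# Fit with all 32 observations and flag residuals beyond 3s.
beta_all, resid_all, s_all = fit_first_order(data)
outlier_obs = [i + 1 for i, r in enumerate(resid_all) if abs(r) > 3 * s_all]

# Refit without the flagged observations to see their effect on s.
keep = np.abs(resid_all) <= 3 * s_all
beta_red, resid_red, s_red = fit_first_order(data[keep])

print("Flagged observations:", outlier_obs)
print(f"s with outlier:    {s_all:.1f}")
print(f"s without outlier: {s_red:.1f}")
```

The comparison of the two values of s mirrors the comparison of Figures 10.15 and 10.17. The refit is a way to assess the influence of the observation; whether to drop it remains a judgment about its cause, as discussed above.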
Refer to Example 10.6. Use a stem-and-leaf display (Section 2.2) to plot the frequency distribution of the residuals in the grandfather clock example, both before and after the outlier residual is removed. Analyze the plots and determine whether the assumption of a normally distributed error term is reasonable.

FIGURE 10.17 MINITAB printout for Example 10.6: Outlier deleted

The regression equation is
Price = - 1288 + 12.5 Age + 83.3 Bidders

Predictor       Coef   SE Coef       T       P
Constant     -1288.3     185.3   -6.95   0.000
Age          12.5397    0.9419   13.31   0.000
Bidders       83.290     9.353    8.90   0.000

Analysis of Variance

Source            DF        SS        MS        F       P
Regression         2   3627818   1813909   100.67   0.000
Residual Error    28    504496     18018
Total             30   4132314

Source     DF    Seq SS
Age         1   2199077
Bidders     1   1428741

Obs   Age    Price      Fit   SE Fit   Residual   St Resid
  1   127   1235.0   1387.1     40.4     -152.1      -1.19
  2   115   1080.0   1153.3     38.8      -73.3      -0.57
  3   127    845.0    887.3     39.6      -42.3      -0.33
  4   150   1522.0   1342.3     24.7      179.7       1.36
  5   156   1047.0   1167.7     38.5     -120.7      -0.94
  6   182   1979.0   1910.2     49.2       68.8       0.55
  7   156   1822.0   1667.4     38.4      154.6       1.20
  8   132   1253.0   1199.9     26.5       53.1       0.40
  9   137   1297.0   1179.3     25.6      117.7       0.89
 10   113    946.0    878.3     39.0       67.7       0.53
 11   137   1713.0   1679.0     56.2       34.0       0.28
 12   117   1024.0   1095.1     34.9      -71.1      -0.55
 13   137   1147.0   1096.0     29.2       51.0       0.39
 14   153   1092.0   1130.1     38.5      -38.1      -0.30
 15   117   1152.0   1261.7     42.7     -109.7      -0.86
 16   126   1336.0   1124.7     29.0      211.3       1.61
 17   182   1550.0   1660.3     41.5     -110.3      -0.86
 18   162   1884.0   1659.4     35.4      224.6       1.73
 19   184   2041.0   1852.0     46.5      189.0       1.50
 20   143    845.0   1004.7     40.1     -159.7      -1.25
 21   159   1483.0   1455.2     27.5       27.8       0.21
 22   108   1055.0   1232.1     51.6     -177.1      -1.43
 23   175   1545.0   1572.5     36.8      -27.5      -0.21
 24   108    729.0    565.8     58.6      163.2       1.35
 25   179   1792.0   1706.0     40.0       86.0       0.67
 26   111   1175.0   1353.0     57.1     -178.0      -1.47
 27   187   1593.0   1723.0     45.2     -130.0      -1.03
 28   111    785.0    686.7     50.0       98.3       0.79
 29   115    744.0    736.8     47.2        7.2       0.06
 30   194   1356.0   1560.9     56.9     -204.9      -1.69
 31   168   1262.0   1401.4     35.6     -139.4      -1.08

FIGURE 10.18 MINITAB residual plot for Example 10.6: Outlier deleted (horizontal axis: Bidders)

Solution

The stem-and-leaf displays for the two sets of residuals are constructed using MINITAB and are shown in Figure 10.19.* Note that the outlier appears to skew the frequency distribution in Figure 10.19a, whereas the stem-and-leaf display in Figure 10.19b appears to be more mound-shaped. Although the displays do not provide formal statistical tests of normality, they do provide a descriptive display. Histograms and normal probability plots can also be used to check the normality assumption. In this example the normality assumption appears to be more plausible after the outlier is removed. Consult the chapter references for methods to conduct statistical tests of normality using the residuals.
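A stem-and-leaf display of residuals is simple enough to build directly. The sketch below is an illustration only and is not MINITAB's routine: it uses the Residual column of the outlier-deleted fit (Figure 10.17) with a leaf unit of 10, and the function name `stem_and_leaf` and its layout are assumptions made for the example. It omits the cumulative-count column that MINITAB prints.

```python
from collections import defaultdict

# Residual column of the outlier-deleted fit (Figure 10.17).
residuals = [
    -152.1, -73.3, -42.3, 179.7, -120.7, 68.8, 154.6, 53.1,
    117.7, 67.7, 34.0, -71.1, 51.0, -38.1, -109.7, 211.3,
    -110.3, 224.6, 189.0, -159.7, 27.8, -177.1, -27.5, 163.2,
    86.0, -178.0, -130.0, 98.3, 7.2, -204.9, -139.4,
]


def stem_and_leaf(values, leaf_unit=10):
    """Return [(stem_label, leaves), ...] ordered from lowest stem to highest.

    Digits below the leaf unit are truncated. Negative values get their
    own stems (including "-0") so that -42.3 and 34.0 land in different rows.
    """
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(int(abs(v) / leaf_unit), 10)
        # Sort key: negatives first (most negative stem first), then positives.
        key = (0, -stem) if v < 0 else (1, stem)
        rows[key].append(leaf)

    display = []
    for key in sorted(rows):
        negative = key[0] == 0
        label = ("-" if negative else "") + str(abs(key[1]))
        # Within a negative stem, larger leaves are the smaller values.
        leaves = sorted(rows[key], reverse=negative)
        display.append((label, leaves))
    return display


for label, leaves in stem_and_leaf(residuals):
    print(f"{label:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

Reading down the rows gives the same kind of picture as Figure 10.19b: the counts per stem rise toward the middle and fall off on both sides.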

FIGURE 10.19a Stem-and-leaf display for grandfather clock example: Outlier included (MINITAB stem-and-leaf of Residual; N = 32; Leaf Unit = 10)

*Recall that the left column of the MINITAB printout shows the number of measurements at least as extreme as the stem. In Figure 10.19a, for example, the 6 corresponding to the STEM = -1 means that six measurements are less than or equal to -100. If one of the numbers in the leftmost column is enclosed in parentheses, the number in parentheses is the number of measurements in that row, and the median is contained in that row.

FIGURE 10.19b Stem-and-leaf display for grandfather clock example: Outlier excluded (MINITAB stem-and-leaf of Residual; N = 31)

Of all the assumptions in Section 10.3, the assumption that the random error is normally distributed is the least restrictive when we apply regression analysis in practice. That is, moderate departures from a normal distribution have very little effect on the validity of the statistical tests, confidence intervals, and prediction intervals presented in this chapter. In this case, we say that regression analysis is robust with respect to nonnormal errors. However, great departures from normality cast doubt on any inferences derived from the regression analysis.

Residual plots can also be used to detect violations of the assumption of constant error variance. For example, a plot of the residuals versus the predicted value $\hat{y}$ may display a pattern as shown in Figure 10.20. In this figure, the range in values of the residuals increases as $\hat{y}$ increases, thus indicating that the variance of the random error, $\varepsilon$, becomes larger as the estimate of E(y) increases in value. Since E(y) depends on the x values in the model, this implies that the variance of $\varepsilon$ is not constant for all settings of the x's.

FIGURE 10.20 Residual plot showing changes in the variance of $\varepsilon$

In the final example of this section, we demonstrate how to use this plot to detect a nonconstant variance and suggest a useful remedy.

50 social workers. The first-order model $E(y) = \beta_0 + \beta_1 x$ was fitted to the data using MINITAB. The MINITAB printout is shown in Figure 10.21, followed by a plot of the residuals versus $\hat{y}$ in Figure 10.22. Interpret the results. Make model modifications, if necessary.

TABLE 10.5 Salary Data for Example 10.8

Years of                      Years of                      Years of
Experience, x   Salary, y     Experience, x   Salary, y     Experience, x   Salary, y
      7         \$26,075           21         \$43,628           28         \$99,139
     28          79,370             4          16,105            23          52,624
     23          65,726            24          65,644            17          50,594
     18          41,983            20          63,022            25          53,272
     19          62,308            20          47,780            26          65,343
     15          41,154            15          38,853            19          46,216
     24          53,610            25          66,537            16          54,288
     13          33,697            25          67,447             3          20,844
      2          22,444            28          64,785            12          32,586
      8          32,562            26          61,581            23          71,235
     20          43,076            27          70,678            20          36,530
     21          56,000            20          51,301            19          52,745
     18          58,667            18          39,346            27          67,282
      7          22,210             1          24,833            25          80,931
      2          20,521            26          65,929            12          32,303
     18          49,727            20          41,721            11          38,371

The regression equation is
Y = 11369 + 2141 X

Predictor     Coef   StDev   t-ratio       P
Constant     11369    3160      3.60   0.001
X           2141.3   160.8     13.31   0.000

Analysis of Variance

SOURCE       DF            SS            MS        F       P
Regression    1   13238774784   13238774784   177.25   0.000
Error        48    3585073152      74689024
Total        49   16823847936

Unusual Observations
Obs.      X       Y     Fit   Stdev.Fit   Residual   St.Resid
 31     1.0   24833   13511        3013      11322     1.40 X
 35    28.0   99139   71326        2005      27813     3.31R
 45    20.0   36530   54196        1259     -17666    -2.07R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.

FIGURE 10.21 MINITAB analysis for first-order model, Example 10.8

Solution

The MINITAB printout, Figure 10.21, suggests that the first-order model provides an adequate fit to the data. The $R^2$ value indicates that the model explains 78.7% of the sample variation in salaries. The t value for testing $\beta_1$, 13.31, is highly significant (p-value ≈ 0) and indicates that the model contributes information for the prediction of y. However, an examination of the residuals plotted against $\hat{y}$ (Figure 10.22) reveals a potential problem. Note the "cone" shape of the residual variability; the size of the residuals increases as the estimated mean salary increases, implying that the constant variance assumption is violated.

FIGURE 10.22 MINITAB residual plot for first-order model, Example 10.8

One way to stabilize the variance of $\varepsilon$ is to refit the model using a transformation on the dependent variable y. With economic data (e.g., salaries) a useful variance-stabilizing transformation is the natural logarithm of y.* We fit the model

$$\ln(y) = \beta_0 + \beta_1 x + \varepsilon$$

to the data of Table 10.5. Figure 10.23 shows the regression analysis printout for the n = 50 measurements, while Figure 10.24 shows a plot of the residuals from the log model. You can see that the logarithmic transformation has stabilized the error variances. Note that the cone shape is gone; there is no apparent tendency of the residual variance to increase as mean salary increases. We therefore are confident that inferences using the logarithmic model are more reliable than those using the untransformed model.
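The effect of the log transformation on the "cone" can be illustrated numerically. The sketch below does not use the data of Table 10.5; it simulates salary-like data with multiplicative errors (an assumption made for the example, with coefficients loosely patterned on Figure 10.23), fits a straight line to y and to ln(y) with NumPy, and compares the spread of the residuals for large and small fitted values. The helper `spread_ratio` and the simulation settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for salary data (NOT Table 10.5): the error is
# multiplicative, so the spread of y grows with its mean.
n = 500
x = rng.uniform(1, 28, size=n)                      # years of experience
y = np.exp(9.84 + 0.05 * x + rng.normal(0, 0.15, size=n))


def spread_ratio(x, response):
    """Fit a straight line, then compare the residual standard deviation
    for observations with high fitted values to those with low fitted values.
    A ratio well above 1 corresponds to a cone-shaped residual plot."""
    slope, intercept = np.polyfit(x, response, 1)
    fitted = intercept + slope * x
    resid = response - fitted
    high = fitted > np.median(fitted)
    return resid[high].std(ddof=1) / resid[~high].std(ddof=1)


ratio_raw = spread_ratio(x, y)           # model for y
ratio_log = spread_ratio(x, np.log(y))   # model for ln(y)

print(f"Residual spread ratio, model for y:     {ratio_raw:.2f}")
print(f"Residual spread ratio, model for ln(y): {ratio_log:.2f}")
```

A ratio near 1 for the log model and a clearly larger ratio for the untransformed model is the numerical counterpart of the change from Figure 10.22 to Figure 10.24. A residual plot remains the primary diagnostic; the ratio is only a rough summary.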

Residual analysis is a useful tool for the regression analyst, not only to check the assumptions, but also to provide information about how the model can be improved. A summary of the residual analyses presented in this section to check the assumption that the random error $\varepsilon$ is normally distributed with mean 0 and constant variance is presented in the box on p. 613.

*Other variance-stabilizing transformations that are used successfully in practice are $\sqrt{y}$ and $\sin^{-1}\sqrt{y}$. Consult the chapter references for more details on these transformations.

The regression equation is
LOGY = 9.84 + 0.0500 X

Predictor       Coef      StDev   t-ratio       P
Constant     9.84133    0.05636    174.63   0.000
X           0.049978   0.002868     17.43   0.000

Analysis of Variance

SOURCE       DF       SS       MS        F       P
Regression    1   7.2118   7.2118   303.65   0.000
Error        48   1.1400   0.0238
Total        49   8.3519

Unusual Observations
Obs.      X      LOGY       Fit   Stdev.Fit   Residual   St.Resid
 19     4.0    9.6869   10.0412      0.0460    -0.3544    -2.41R
 31     1.0   10.1199    9.8913      0.0537     0.2286     1.58 X
 45    20.0   10.5059   10.8409      0.0225    -0.3350    -2.20R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.

FIGURE 10.23 MINITAB printout for modified model, Example 10.8

FIGURE 10.24 MINITAB residual plot for modified model, Example 10.8


Using the TI-83 Graphing Calculator

Plotting Residuals on the TI-83

When you compute a regression equation on the TI-83, the residuals are automatically computed and saved to a list called RESID. RESID can be found under the LIST menu (2nd STAT). To make a scatterplot of the residuals,

Step 1 Enter the data in L1 and L2.
Step 2 Compute the regression equation (see Section 9.2).
Step 3 Set up the data plot.
  Press 2nd Y= for STATPLOT.
  Press 1 for Plot1.
  Use the arrow and ENTER keys to set up the screen as shown below.

  Ylist: RESID

  Note: To enter RESID as your Ylist:
  1. Use the arrow keys to move the cursor after Ylist:
  2. Press 2nd STAT for LIST
  3. Highlight the listname RESID and press ENTER
Step 4 View the scatterplot of the residuals.
  Press ZOOM 9 for ZoomStat.

Example The figures below show a table of data entered on the TI-83 and the scatterplot of the residuals obtained using the steps given above.

quantitative independent variables. Analyze each plot, looking for a curvilinear trend. This shape signals the need for a quadratic term in the model. Try a second-order term in the variable against which the residuals are plotted.

2. Examine the residual plots for outliers. Draw lines on the residual plots at 2- and 3-standard-deviation distances below and above the 0 line. Examine ... error in data collection or transcription, or corresponds to a member of a population different from that of the remainder of the sample, or simply ... determine its effect on the analysis.

... distribution of the ... departures from normality exist. Extreme skewness of the ... (Normalizing transformations ... you can find information in ...

... the residuals against the ... pattern that indicates that the variance of $\varepsilon$ is not constant, refit the ...

SOME PITFALLS: ESTIMABILITY, MULTICOLLINEARITY, AND EXTRAPOLATION
You should be aware of several potential problems when constructing a prediction model for some response y. A few of the most important are discussed in this final section.

Problem 1

Parameter Estimability

Suppose you want to fit a model relating annual crop yield y to the total expenditure for fertilizer x. We propose the first-order model

$$E(y) = \beta_0 + \beta_1 x$$

Now suppose we have three years of data and \$1,000 is spent on fertilizer each year. The data are shown in Figure 10.25. You can see the problem: The parameters of the model cannot be estimated when all the data are concentrated at a single x value. Recall that it takes two points (x values) to fit a straight line. Thus, the parameters are not estimable when only one x is observed.

FIGURE 10.25 Yield and fertilizer expenditure data: Three years (horizontal axis: fertilizer expenditure, dollars)

A similar problem would occur if we attempted to fit the quadratic model

$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$$

to a set of data for which only one or two different x values were observed (see Figure 10.26). At least three different x values must be observed before a quadratic model can be fit to a set of data (that is, before all three parameters are estimable).

FIGURE 10.26 Only two x values observed: Quadratic model is not estimable (horizontal axis: fertilizer expenditure, dollars)

In general, the number of levels of observed x values must be one more than the order of the polynomial in x that you want to fit. For controlled experiments, the researcher can select experimental designs that will permit estimation of the model parameters. Even when the values of the independent variables cannot be controlled by the researcher, the independent variables are almost always observed at a sufficient number of levels to permit estimation of the model parameters. When the statistical software you use suddenly refuses to fit a model, however, the problem is probably inestimable parameters.
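The rule in the preceding paragraph is easy to check before fitting. The following sketch is a minimal illustration, and the function name `is_estimable` is hypothetical: it counts the distinct x levels and compares the count with the number of parameters in a polynomial of the given order in a single x.

```python
def is_estimable(x_values, order):
    """True when a polynomial model of the given order in x can be fit.

    A polynomial of order p in one variable has p + 1 parameters, so
    the data must contain at least p + 1 different x levels.
    """
    levels = len(set(x_values))
    return levels >= order + 1


# Three years of data, all at $1,000 of fertilizer: one level only.
print(is_estimable([1000, 1000, 1000], order=1))   # straight line
# Two levels observed: enough for a line, not for a quadratic.
print(is_estimable([1000, 1000, 2000], order=1))
print(is_estimable([1000, 1000, 2000], order=2))
# Three levels observed: the quadratic model can be fit.
print(is_estimable([1000, 2000, 3000], order=2))
```

The check covers only the single-variable polynomial case discussed here; with several independent variables the corresponding condition involves the whole design, not each variable separately.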

Problem 2

Multicollinearity

Often, two or more of the independent variables used in the model for E(y) contribute redundant information. That is, the independent variables are correlated with each other. Suppose we want to construct a model to predict the gas mileage rating of a truck as a function of its load, $x_1$, and the horsepower of its engine, $x_2$. In general, we would expect heavy loads to require greater horsepower and to result in lower mileage ratings. Thus, although both $x_1$ and $x_2$ contribute information for the prediction of mileage rating, some of the information is overlapping because $x_1$ and $x_2$ are correlated. If the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

were fitted to a set of data, we might find that the t values for both $\hat{\beta}_1$ and $\hat{\beta}_2$ (the least squares estimates) are nonsignificant. However, the F-test for $H_0$: $\beta_1 = \beta_2 = 0$ would probably be highly significant. The tests may seem to produce contradictory conclusions, but really they do not. The t-tests indicate that the contribution of one variable, say $x_1$ = Load, is not significant after the effect of $x_2$ = Horsepower has been taken into account (because $x_2$ is also in the model). The significant F-test, on the other hand, tells us that at least one of the two variables is making a contribution to the prediction of y (that is, either $\beta_1$ or $\beta_2$, or both, differ from 0). In fact, both are probably contributing, but the contribution of one overlaps with that of the other.
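One way to see why the individual t values shrink is to look at the standard errors of the estimates. The sketch below is an illustration under stated assumptions: it relies on the standard least squares result, not derived in this section, that the standard error of each $\hat{\beta}$ equals s times the square root of the corresponding diagonal element of $(X'X)^{-1}$. The load and horsepower values are invented for the example, and `se_multipliers` is a hypothetical helper.

```python
import numpy as np


def se_multipliers(x1, x2):
    """sqrt(diag((X'X)^-1)) for the model E(y) = b0 + b1*x1 + b2*x2.

    Each coefficient's standard error is s times the matching entry,
    so larger entries mean smaller t values for the same s.
    """
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    return np.sqrt(np.diag(np.linalg.inv(X.T @ X)))


# Invented values: ten trucks with loads 1, 2, ..., 10.
load = np.arange(1.0, 11.0)
wiggle = np.array([1.0, -1.0] * 5)

hp_correlated = load + 0.1 * wiggle   # horsepower nearly a copy of load
hp_unrelated = wiggle                 # horsepower almost unrelated to load

r_corr = np.corrcoef(load, hp_correlated)[0, 1]
r_unrel = np.corrcoef(load, hp_unrelated)[0, 1]

mult_corr = se_multipliers(load, hp_correlated)
mult_unrel = se_multipliers(load, hp_unrelated)

# Ratio of the standard error multipliers for the load coefficient.
inflation = mult_corr[1] / mult_unrel[1]

print(f"r(load, horsepower), correlated design: {r_corr:.3f}")
print(f"r(load, horsepower), unrelated design:  {r_unrel:.3f}")
print(f"Standard error of the load coefficient is {inflation:.1f} times larger")
```

With the same s, a standard error that is many times larger means a t value that is many times smaller, which is how two useful but correlated variables can both appear nonsignificant while the overall F-test is significant.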


When highly correlated independent variables are present in a regression model, the results are confusing. The researcher may want to include only one of the variables in the final model. One way of deciding which one to include is by using a technique called stepwise regression. In stepwise regression, all possible one-variable models of the form $E(y) = \beta_0 + \beta_1 x_i$ are fit and the "best" $x_i$ is selected based on the t-test for $\beta_1$. Next, two-variable models of the form $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_i$ are fit (where $x_1$ is the variable selected in the first step); the "second best" $x_i$ is selected based on the test for $\beta_2$. The process continues in this fashion until no more "important" x's can be added to the model.

Generally, only one of a set of multicollinear independent variables is included in a stepwise regression model, since at each step every variable is tested in the presence of all the variables already in the model. For example, if at one step the variable Load is included as a significant variable in the prediction of the mileage rating, the variable Horsepower will probably never be added in a future step. Thus, if a set of independent variables is thought to be multicollinear, some screening by stepwise regression may be helpful.

Note that it would be fallacious to conclude that an independent variable $x_1$ is unimportant for predicting y only because it is not chosen by a stepwise regression procedure. The independent variable $x_1$ may be correlated with another one, $x_2$, that the stepwise procedure did select. The implication is that $x_2$ contributes more for predicting y (in the sample being analyzed), but it may still be true that $x_1$ alone contributes information for the prediction of y.

Problem 3

Prediction Outside the Experimental Region

By the late 1960s many research economists had developed highly technical models to relate the state of the economy to various economic indices and other independent variables. Many of these models were multiple regression models, where, for example, the dependent variable y might be next year's Gross Domestic Product (GDP) and the independent variables might include this year's rate of inflation, this year's Consumer Price Index (CPI), etc. In other words, the model might be constructed to predict next year's economy using this year's knowledge.

Unfortunately, these models were almost all unsuccessful in predicting the recession in the early 1970s. What went wrong? One of the problems was that many of the regression models were used to extrapolate, i.e., to predict y for values of the independent variables that were outside the region in which the model was developed. For example, the inflation rate in the late 1960s, when the models were developed, ranged from 6% to 8%. When the double-digit inflation of the early 1970s became a reality, some researchers attempted to use the same models to predict future growth in GDP. As you can see in Figure 10.27, the model may be very accurate for predicting y when x is in the range of experimentation, but the use of the model outside that range is a dangerous practice.
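A simple guard against extrapolation is to compare a new x value with the range of x values used to fit the model. The sketch below is illustrative only: the function name and the sample inflation rates are hypothetical, chosen to lie in the 6% to 8% range mentioned above.

```python
def outside_experimental_region(x_new, x_observed):
    """True when x_new lies outside the range of the x values
    that were used to develop the model."""
    return x_new < min(x_observed) or x_new > max(x_observed)


# Hypothetical inflation rates (%) in the data used to fit the model.
inflation_used_to_fit = [6.0, 6.4, 7.1, 7.5, 8.0]

print(outside_experimental_region(7.0, inflation_used_to_fit))    # within 6%-8%
print(outside_experimental_region(11.0, inflation_used_to_fit))   # double-digit inflation
```

With several independent variables, checking each variable's range separately is necessary but not sufficient, since a combination of values can be unlike anything in the sample even when every individual value is within its range.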


FIGURE 10.27 Using a regression model outside the experimental region (vertical axis: y; horizontal axis: Inflation rate (%), with the experimental region running from 6 to 8)

SECTION 10.8

Some Pitfalls: Estimability, Multicollinearity, and Extrapolation


Problem 4

Correlated Errors

Another problem associated with using a regression model to predict a variable y based on independent variables $x_1, x_2, \ldots, x_k$ arises from the fact that the data are frequently time series. That is, the values of both the dependent and independent variables are observed sequentially over a period of time. The observations tend to be correlated over time, which in turn often causes the prediction errors of the regression model to be correlated. Thus, the assumption of independent errors is violated, and the model tests and prediction intervals are no longer valid. One solution to this problem is to construct a time series model; consult the references for this chapter to learn more about these complex, but powerful, models.
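One common way to screen time-ordered residuals for correlation is the Durbin-Watson statistic. It is not developed in this section, so the sketch below should be read as an aside under stated assumptions: the helper name `durbin_watson` is hypothetical, and the two error series are simulated rather than taken from a fitted model. Values near 2 are consistent with uncorrelated errors; values well below 2 suggest positive correlation between successive errors.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
n = 500

# Simulated independent errors.
independent = rng.normal(size=n)

# Simulated correlated errors: each error carries over 80% of the previous one.
correlated = np.empty(n)
correlated[0] = independent[0]
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print("independent errors:", round(dw_independent, 2))
print("correlated errors: ", round(dw_correlated, 2))
```

In practice the function would be applied to the residuals of the fitted regression, listed in the order in which the observations were collected.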

Learning the Mechanics

10.36 Identify the problem(s) in each of the residual plots shown at the bottom of the page.

10.37 Consider fitting the multiple regression model

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$

A matrix of correlations for all pairs of independent variables is given below. Do you detect a multicollinearity problem? Explain.

Correlation Matrix for Exercise 10.37

        x1      x2      x3      x4      x5
x1      -       .17     .02     -.23    .19
x2              -       .45     .93     .02
x3                      -       .22     -.01
x4                              -       .86
x5                                      -
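When a model has many independent variables, scanning a correlation matrix by eye becomes tedious, and the scan can be automated. The following sketch uses the ten pairwise correlations given for Exercise 10.37; the function name `highly_correlated_pairs` and the cutoff of .8 are assumptions made for illustration, since no single cutoff defines "high" correlation.

```python
import numpy as np

names = ["x1", "x2", "x3", "x4", "x5"]

# Pairwise correlations from the matrix for Exercise 10.37 (0-based indices).
pairs = {
    (0, 1): .17, (0, 2): .02, (0, 3): -.23, (0, 4): .19,
    (1, 2): .45, (1, 3): .93, (1, 4): .02,
    (2, 3): .22, (2, 4): -.01,
    (3, 4): .86,
}

# Build the full symmetric matrix with ones on the diagonal.
r = np.eye(len(names))
for (i, j), value in pairs.items():
    r[i, j] = r[j, i] = value

def highly_correlated_pairs(r, names, cutoff=0.8):
    """Return (name, name, r) for every pair with |r| at or above the cutoff."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= cutoff:
                flagged.append((names[i], names[j], float(r[i, j])))
    return flagged

flagged = highly_correlated_pairs(r, names)
for a, b, value in flagged:
    print(a, b, value)
```

Pairwise correlations detect only two-variable relationships; multicollinearity that involves three or more variables jointly can be present even when no single pair is flagged.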

[Residual plots for Exercise 10.36]

Applying the Concepts

10.38 Chemical engineers at Tokyo Metropolitan University analyzed urban air specimens for the presence of low-molecular-weight dicarboxylic acid (Environmental Science & Engineering, Oct. 1993). The dicarboxylic acid (as a percentage of total carbon) and oxidant concentrations for 19 air specimens collected from urban

URBANAIR.DAT

Dicarboxylic Acid (%)   Oxidant (ppm)   Dicarboxylic Acid (%)   Oxidant (ppm)
 .85                    78               .50                    32
1.45                    80               .38                    28
1.80                    74               .30                    25
1.80                    78               .70                    45
1.60                    60               .80                    40
1.20                    62               .90                    45
1.30                    57              1.22                    41
 .20                    49              1.00                    34
 .22                    34              1.00                    25
 .40                    36

Source: Kawamura, K., and Ikushima, K. "Seasonal changes in the distribution of dicarboxylic acids in the urban atmosphere." Environmental Science & Technology, Vol. 27, No. 10, Oct. 1993, p. 2232 (data extracted from Figure 4).

Tokyo are listed in the table above. SAS printouts for the straight-line model relating dicarboxylic acid percentage (y) to oxidant concentration (x) are also provided on pages 619-621. Conduct a complete residual analysis.

10.39 World Development (Vol. 20, 1992) published a study of the variables impacting the size distribution of manufacturing firms in international markets. Five independent variables, Gross Domestic Product (GDP), area per capita (AREAC), share of heavy industry in value added (SVA), ratio of credit claims to GDP (CREDIT), and ratio of stock equity to GDP (STOCK), were used to model the share, y, of firms with 100 or more workers. The researchers detected a high correlation between pairs of the following independent variables: GDP and SVA, GDP and STOCK, and CREDIT and STOCK. Describe the problems that may arise if these high correlations are ignored in the multiple regression analysis of the model.

10.40 Passive exposure to environmental tobacco smoke has been associated with growth suppression and an increased frequency of respiratory tract infections in normal children. Is this association more pronounced in children with cystic fibrosis? To answer this question, 43 children (18 girls and 25 boys) attending a 2-week summer camp for cystic fibrosis patients were studied (The New England Journal of Medicine, Sept. 20, 1990). Researchers investigated the correlation between a child's weight percentile (y) and the number of cigarettes smoked per day in the child's home (x). The table on page 621 lists the data for the 25 boys. A MINITAB regression printout (with residuals) for the straight-line model relating y to x is also provided on page 622. Examine the residuals. Do you detect any outliers?

10.41 Road construction contracts in the state of Florida are awarded on the basis of competitive, sealed bids; the contractor who submits the lowest bid price wins the contract. During the 1980s, the Office of the Florida Attorney General (FLAG) suspected numerous contractors of practicing bid collusion, i.e., setting the winning bid price above the fair, or competitive, price in order to increase profit margin. FLAG collected data for 279 road construction contracts; the data are available in the file FLAG.DAT. For each contract, the following variables were measured:
1. Price of contract (\$thousands) bid by lowest bidder
2. Department of Transportation (DOT) engineer's estimate of fair contract price (\$thousands)
3. Ratio of low (winning) bid price to DOT engineer's estimate of fair price
4. Status (fixed or competitive) of contract
5. District (1, 2, 3, 4, or 5) in which construction project is located
6. Number of bidders on contract
7. Estimated number of days to complete work
8. Length of road project (miles)
9. Percentage of costs allocated to liquid asphalt
10. Percentage of costs allocated to base material
11. Percentage of costs allocated to excavation
12. Percentage of costs allocated to mobilization
13. Percentage of costs allocated to structures
14. Percentage of costs allocated to traffic control
15. Subcontractor utilization (yes or no)
FLAG wants to model the price (y) of the contract bid by lowest bidder in hopes of preventing price-fixing in the future.
a. Do you detect any multicollinearity in these variables? If so, do you recommend that all of these variables be used to predict low bid price y? If not, which variables do you recommend?
b. Using the variables selected in part a, fit a first-order model for E(y) to the data stored in the file.
c. Conduct a complete residual analysis on the model fit in part b. Do you detect any outliers? Are the standard regression assumptions reasonably satisfied?


SAS Output for Exercise 10.38

Dependent Variable: DICARBOX

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Value   Prob>F
Model       1          2.41362       2.41362    17.080   0.0007
Error      17          2.40234       0.14131
C Total    18          4.81597

Root MSE    0.37592     R-square   0.5012
Dep Mean    0.92737     Adj R-sq   0.4718
C.V.       40.53600

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
INTERCEP    1            -0.023737       0.24576577                  -0.097       0.9242
OXIDANT     1             0.019579       0.00473739                   4.133       0.0007

Obs   OXIDANT   Dep Var DICARBOX   Predict Value   Std Err Predict   Residual   Std Err Residual
  1        78             0.8500          1.5034             0.164    -0.6534              0.338
  2        80             1.4500          1.5425             0.172    -0.0925              0.334
  3        74             1.8000          1.4251             0.148     0.3749              0.346
  4        78             1.8000          1.5034             0.164     0.2966              0.338
  5        60             1.6000          1.1510             0.102     0.4490              0.362
  6        62             1.2000          1.1901             0.107     0.0099              0.360
  7        57             1.3000          1.0922             0.095     0.2078              0.364
  8        49             0.2000          0.9356             0.086    -0.7356              0.366
  9        34             0.2200          0.6419             0.110    -0.4219              0.359
 10        36             0.4000          0.6811             0.105    -0.2811              0.361
 11        32             0.5000          0.6028             0.117    -0.1028              0.357
 12        28             0.3800          0.5245             0.130    -0.1445              0.353
 13        25             0.3000          0.4657             0.141    -0.1657              0.348
 14        45             0.7000          0.8573             0.088    -0.1573              0.365
 15        40             0.8000          0.7594             0.095     0.0406              0.364
 16        45             0.9000          0.8573             0.088     0.0427              0.365
 17        41             1.2200          0.7790             0.093     0.4410              0.364
 18        34             1.0000          0.6419             0.110     0.3581              0.359
 19        25             1.0000          0.4657             0.141     0.5343              0.348

(Continued)
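The parameter estimates, residuals, and standard errors of the residuals in the SAS printout can be reproduced from the 19 URBANAIR observations with a few lines of NumPy. This is a sketch for checking the printout, not the SAS procedure itself; the variable names below are illustrative.

```python
import numpy as np

# Data as listed in the printout (OXIDANT in ppm, DICARBOX in %).
oxidant = np.array([78, 80, 74, 78, 60, 62, 57, 49, 34, 36,
                    32, 28, 25, 45, 40, 45, 41, 34, 25], dtype=float)
dicarbox = np.array([0.85, 1.45, 1.80, 1.80, 1.60, 1.20, 1.30, 0.20, 0.22, 0.40,
                     0.50, 0.38, 0.30, 0.70, 0.80, 0.90, 1.22, 1.00, 1.00])

# Least squares fit of the straight-line model E(y) = beta0 + beta1 * x.
X = np.column_stack([np.ones_like(oxidant), oxidant])
beta, *_ = np.linalg.lstsq(X, dicarbox, rcond=None)

fitted = X @ beta
resid = dicarbox - fitted

n, p = X.shape
mse = resid @ resid / (n - p)   # s^2 = SSE / [n - (k + 1)]
root_mse = np.sqrt(mse)

# Leverages (diagonal of the hat matrix) give the standard error of each residual.
leverage = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)
std_err_resid = np.sqrt(mse * (1 - leverage))

print("intercept, slope:", beta)
print("Root MSE:", root_mse)
print("residual, obs 1:", resid[0], "std err:", std_err_resid[0])
```

Dividing each residual by its standard error gives a standardized residual, which is convenient for spotting observations that lie unusually far from the fitted line.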

10.42 Teaching Sociology (July 1995) developed a model for the professional socialization of graduate students working toward their doctorate. One of the dependent variables modeled was professional confidence, y, measured on a 5-point scale. The model included over 20 independent variables and was fitted to data collected for a sample of 309 graduate students. One concern is whether multicollinearity exists in the data. A matrix of Pearson product moment correlations for ten of the independent variables is shown on page 622. [Note: Each entry in the table is the correlation coefficient r between the variable in the corresponding row and corresponding column.]
a. Examine the correlation matrix and find the independent variables that are moderately or highly correlated.
b. What modeling problems may occur if the variables, part a, are left in the model? Explain.

10.43 The data in the table on p. 624 were collected for a random sample of 26 households in Washington, D.C., during 2000. An economist wants to relate household food consumption, y, to household income, $x_1$, and household size, $x_2$, with the first-order model

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$


SAS Output for Exercise 10.38 (continued)

UNIVARIATE PROCEDURE
Variable=RESID   Residual

[Moments: N, Mean, Std Dev, Skewness, USS, CV, T:Mean=0, Sgn Rank, Num ^= 0, W:Normal, Sum Wgts, Sum, Variance, Kurtosis, CSS, Std Mean, Prob>|T|, Prob>|S|, Prob<W]

[Quantiles: 100% Max, 75% Q3, 50% Med, 25% Q1, 0% Min, Range, Q3-Q1, Mode]

Extremes
Lowest      Obs     Highest     Obs
-0.73561  (   8)    0.358066  (  18)
-0.65339  (   1)    0.374924  (   3)
-0.42193  (   9)    0.441016  (  17)
-0.28109  (  10)    0.449024  (   5)
-0.16573  (  13)    0.534273  (  19)

Stem  Leaf
   4  453
   2  1067
   0  144
  -0  76409
  -2  8
  -4  2
  -6  45

[Boxplot]

(Continued)

The SPSS printout for the model below is followed by several residual plots on pages 623 and 625.
a. Do you detect any signs of multicollinearity in the data? Explain.
b. Is there visual evidence that a second-order model may be more appropriate for predicting household food consumption? Explain.
c. Comment on the assumption of constant error variance. Does it appear to be satisfied?
d. Are there any outliers in the data? If so, identify them.
e. Does the assumption of normal errors appear to be reasonably satisfied? Explain.


SAS Output for Exercise 10.38 (continued)
[Plot of RESID*OXIDANT. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: RESID; horizontal axis: OXIDANT]

CFSMOKE.DAT (Data for Exercise 10.40)

Weight Percentile, y   No. of Cigarettes Smoked per Day, x   Weight Percentile, y   No. of Cigarettes Smoked per Day, x
 6                      0                                    43                      0
 6                     15                                    49                      0
 2                     40                                    50                      0
 8                     23                                    49                     22
11                     20                                    46                     30
17                      7                                    54                      0
24                      3                                    58                      0
25                      0                                    62                      0
17                     25                                    66                      0
25                     20                                    66                     23
25                     15                                    83                      0
31                     23                                    87                     44
35                     10

Source: Rubin, B. K. "Exposure of children with cystic fibrosis to environmental tobacco smoke." The New England Journal of Medicine, Sept. 20, 1990, Vol. 323, No. 12, p. 85 (data extracted from Figure 3).


MINITAB Output for Exercise 10.40

The regression equation is
WTPCTILE = 41.2 - 0.262 SMOKED

Predictor      Coef     Stdev   t-ratio       P
Constant     41.153     6.843      6.01   0.000
SMOKED      -0.2619    0.3702     -0.71   0.486

s = 24.68    R-sq = 2.1%    R-sq(adj) = 0.0%

Analysis of Variance
SOURCE       DF        SS      MS      F       P
Regression    1     304.9   304.9   0.50   0.486
Error        23   14011.1   609.2
Total        24   14316.0

Obs   SMOKED   WTPCTILE
  1      0.0       6.00
  2     15.0       6.00
  3     40.0       2.00
  4     23.0       8.00
  5     20.0      11.00
  6      7.0      17.00
  7      3.0      24.00
  8      0.0      25.00
  9     25.0      17.00
 10     20.0      25.00
 11     15.0      25.00
 12     23.0      31.00
 13     10.0      35.00
 14      0.0      43.00
 15      0.0      49.00
 16      0.0      50.00
 17     22.0      49.00
 18     30.0      46.00
 19      0.0      54.00
 20      0.0      58.00
 21      0.0      62.00
 22      0.0      66.00
 23     23.0      66.00
 24      0.0      83.00
 25     44.0      87.00

[Fit, Stdev.Fit, Residual, and St.Resid columns]

Correlation matrix for Exercise 10.42

Independent Variable               (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)
(1) Father's occupation          1.000   .363   .099  -.110  -.047  -.053  -.111   .178   .078   .049
(2) Mother's education            .363  1.000   .228  -.139  -.216   .084  -.118   .192   .125   .068
(3) Race                          .099   .228  1.000   .036  -.515   .014  -.120   .112   .117   .337
(4) Sex                          -.110  -.139   .036  1.000   .165  -.256   .173  -.106  -.117   .073
(5) Foreign status               -.047  -.216  -.515   .165  1.000  -.041   .159  -.130  -.165  -.171
(6) Undergraduate GPA            -.053   .084   .014  -.256  -.041  1.000   .032   .028  -.034   .092
(7) Year GRE taken               -.111  -.118  -.120   .173   .159   .032  1.000  -.086  -.602   .016
(8) Verbal GRE score              .178   .192   .112  -.106  -.130   .028  -.086  1.000   .132   .087
(9) Years in graduate program     .078   .125   .117  -.117  -.165  -.034  -.602   .132  1.000  -.071
(10) First-year graduate GPA      .049   .068   .337   .073  -.171   .092   .016   .087  -.071  1.000

Source: Keith, B., and Moore, H. A. "Training sociologists: An assessment of professional socialization and the emergence of career aspirations." Teaching Sociology, Vol. 23, No. 3, July 1995, p. 205 (Table 1).


SPSS Output for Exercise 10.43

Equation Number 1    Dependent Variable..   FOOD
Block Number 1.   Method:  Enter    INCOME   HOMESIZE

Multiple R           .74699
R Square             .55800
Adjusted R Square    .51956
Standard Error       .71881

Analysis of Variance
              DF    Sum of Squares
Regression     2          15.00268
Residual      23          11.88386

------------------ Variables in the Equation ------------------
Variable               B        SE B        Beta        T    Sig T
INCOME      -1.63937E-04     .006564    -.003495    -.025    .9803
HOMESIZE         .383485     .071887     .746508    5.335    .0000
(Constant)      2.794380     .436335                6.404    .0000

Residuals Statistics:
              Min       Max      Mean   Std Dev    N
*PRED      3.1664    6.2383    4.1885     .7747   26
*RESID     -.9748    2.7894     .0000     .6895   26
*ZPRED    -1.3194    2.6460     .0000    1.0000   26
*ZRESID   -1.3561    3.8806     .0000     .9592   26

Total Cases = 26

SPSS Plots for Exercise 10.43
[Residual plots; horizontal axes: Income, Homesize, and Regression Standardized Predicted Value]


STATISTICS IN ACTION

"Wringing" The Bell Curve

In Statistics in Action in Chapter 4, we introduced The Bell Curve (Free Press, 1994) by Richard Herrnstein and Charles Murray, a controversial book about race, genes, IQ, and economic mobility. The book heavily employs statistics and statistical methodology in an attempt to support the authors' positions on the relationships among these variables and their social consequences. The main theme of The Bell Curve can be summarized as follows:
1. Measured intelligence (IQ) is largely genetically inherited.
2. IQ is correlated positively with a variety of socioeconomic status success measures, such as prestigious job, high annual income, and high educational attainment.
3. From 1 and 2, it follows that socioeconomic successes are largely genetically caused and therefore resistant to educational and environmental interventions (such as affirmative action).

With the help of a major marketing campaign, the book became a best-seller shortly after its publication in October 1994. The underlying theme of the book (that intelligence is hereditary and tied to race and class) apparently appealed to many readers. However, reviews of The Bell Curve in popular magazines and newspapers were mostly negative. Social critics have described the authors as "un-American" and "pseudo-scientific racists," and their book as "alien and repellant." (On the other hand, there were defenders who labeled the book as "powerfully written" and "overwhelmingly convincing.")

This Statistics in Action is based on two reviews of The Bell Curve that critique the statistical methodology employed by the authors and the inferences derived from the statistics. Both reviews, one published in Chance (Summer 1995) and the other in The Journal of the American Statistical Association (Dec. 1995), were written by Carnegie Mellon University professors Bernie Devlin, Stephen Fienberg, Daniel Resnick, and Kathryn Roeder. (Devlin, Fienberg, and Roeder are all statisticians; Resnick, a historian.) Here, our focus is on the statistical method used repeatedly by Herrnstein and Murray (H&M) to support their conclusions in The Bell Curve: regression analysis. The following are just a few of the problems with H&M's use of regression that are identified by the Carnegie Mellon professors:

Problem 1 H&M consistently use a trio of independent variables (IQ, socioeconomic status, and age) in a series of first-order models designed to predict dependent social outcome variables such as income and unemployment. (Only on a single occasion are interaction terms incorporated.) Consider, for example, the model:

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$

where y = income, $x_1$ = IQ, $x_2$ = socioeconomic status, and $x_3$ = age. H&M utilize t-tests on the individual $\beta$ parameters to assess the importance of the independent variables. As with most of the models considered in The Bell Curve, the

Household   Food Consumption (\$1,000s)   Income (\$1,000s)   Household Size   Household   Food Consumption (\$1,000s)   Income (\$1,000s)   Household Size
1 2 3 4 5 6 7 8 9 10 1 1 12 13 4.2 3.4 4.8 2.9 3.5 4.0 3.6 4.2 5.1 2.7 4.0 2.7 5.5 41.1 30.5 52.3 28.9 36.5 29.8 44.3 38.1 92.0 36.0 76.9 69.9 43.1 4 2 4 1 2 4 3 4 5 1 3 1 7 14 15 16 17 18 19 20 21 22 23 24 25 26 4.1 5.5 4.5 5.0 4.5 2.8 3.9 3.6 4.6 3.8 4.5 4.0 7.5 95.2 45.6 78.5 20.5 31.6 39.9 38.6 30.2 48.7 21.2 24.3 26.9 7.3 2
9

3 5
4 1 3 2 5 3 7

estimate of $\beta_1$ in the income model is positive and statistically significant at $\alpha = .05$, and the associated t value is larger (in absolute value) than the t values associated with the other independent variables. Consequently, H&M claim that IQ is a better predictor of income than the other two independent variables. No attempt was made to determine whether the model was properly specified or whether the model provides an adequate fit to the data.

Problem 2 In an appendix, the authors describe multiple regression as a "mathematical procedure that yields coefficients for each of [the independent variables], indicating how much of a change in [the dependent variable] can be anticipated for a given change in any particular [independent] variable, with all the others held constant." Armed with this information and the fact that the estimate of $\beta_1$ in the model above is positive, H&M infer that a high IQ necessarily implies (or causes) a high income, and a low IQ inevitably leads to a low income. (Cause-and-effect inferences like this are made repeatedly throughout the book.)

Problem 3 The title of the book refers to the normal distribution and its well-known "bell-shaped" curve. There is a misconception among the general public that scores on intelligence tests (IQ) are normally distributed. In fact, most IQ scores have distributions that are decidedly skewed. Traditionally, psychologists and psychometricians have transformed these scores so that the resulting numbers have a precise normal distribution. H&M make a special point to do this. Consequently, the measure of IQ used in all the regression models is normalized (i.e., transformed so that the resulting distribution is normal), despite the fact that regression methodology does not require predictor (independent) variables to be normally distributed.

Problem 4 A variable that is not used as a predictor of social outcome in any of the models in The Bell Curve is level of education. H&M purposely omit education from the models, arguing that IQ causes education, not the other way around. Other researchers who have examined H&M's data report that when education is included as an independent variable in the model, the effect of IQ on the dependent variable (say, income) is diminished.

Focus
a. Comment on each of the problems identified by the Carnegie Mellon University professors in their review of The Bell Curve. Why does each of these problems cast a shadow on the inferences made by the authors?
b. Using the variables specified in the model above, describe how you would conduct the multiple regression analysis. (Propose a more complex model and describe the appropriate model tests, including a residual analysis.)

SPSS Histogram for Exercise 10.43
[Histogram; horizontal axis: Regression Standardized Residual, from -1.50 to 3.50; Std. Dev = .96]


Key Terms
Adjusted multiple coefficient of determination 582 Base level 572 Correlated errors 617 Curvature 602 Dummy variables 558,571 Extrapolation 616 First-order model 559 Global F-test 583 Higher-order term 558

Indicator variable 571 Least squares prediction equation 562 Mean square for error 567 Model building 559 Multicollinearity 615 Multiple coefficient of determination 581 Multiple regression model 558 Outlier 603 Parameter estimability 614 Quadratic model 602

Quadratic term 602 Residual 599 Residual analysis 599 Robust method 609 Second-order model 602 Second-order term 602 Stepwise regression 616 Time series model 617 Variance-stabilizing transformation 611

Key Formulas

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
First-order model with two quantitative independent variables 559

$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
Quadratic model 602

$E(y) = \beta_0 + \beta_1 x$, where $x = 1$ if level A and $x = 0$ if level B
Model with one qualitative variable at 2 levels 571

$s^2 = \mathrm{MSE} = \dfrac{\mathrm{SSE}}{n - (k + 1)}$
Estimator of $\sigma^2$ for a model with k independent variables 567

$t = \dfrac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$
Test statistic for testing $H_0$: $\beta_i = 0$ 568

$\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$, where $t_{\alpha/2}$ depends on $n - (k + 1)$ df
$100(1 - \alpha)\%$ confidence interval for $\beta_i$ 568

$R^2 = 1 - \dfrac{\mathrm{SSE}}{SS_{yy}}$
Multiple coefficient of determination 581

$R_a^2 = 1 - \left[\dfrac{n - 1}{n - (k + 1)}\right](1 - R^2)$
Adjusted multiple coefficient of determination 582

$F = \dfrac{\mathrm{MS(Model)}}{\mathrm{MSE}} = \dfrac{R^2/k}{(1 - R^2)/[n - (k + 1)]}$
Test statistic for testing $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ 583

$\hat{\varepsilon} = y - \hat{y}$
Regression residual 599
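As a quick check on how these formulas fit together, the sketch below recomputes MSE, $R^2$, adjusted $R^2$, and the global F statistic from the sums of squares reported in the SAS printout for Exercise 10.38 (SSE = 2.40234, $SS_{yy}$ = 4.81597, n = 19, k = 1). The function name `regression_summary` is hypothetical.

```python
def regression_summary(sse, ss_yy, n, k):
    """Apply the key formulas to SSE, SS_yy, sample size n, and k independent variables."""
    df_error = n - (k + 1)
    mse = sse / df_error                              # s^2 = SSE / [n - (k + 1)]
    r2 = 1 - sse / ss_yy                              # multiple coefficient of determination
    r2_adj = 1 - (n - 1) / df_error * (1 - r2)        # adjusted R^2
    f_stat = (r2 / k) / ((1 - r2) / df_error)         # global F statistic
    return {"MSE": mse, "R2": r2, "R2_adj": r2_adj, "F": f_stat}

summary = regression_summary(sse=2.40234, ss_yy=4.81597, n=19, k=1)
for name, value in summary.items():
    print(name, round(value, 4))
```

The results agree with the values printed by SAS (Mean Square for Error 0.14131, R-square 0.5012, Adj R-sq 0.4718, F Value 17.080) up to rounding.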

Symbol                  Pronunciation        Description
$x^2$                   x-squared            Quadratic term that allows for curvature in the relationship between y and x
MSE                     M-S-E                Mean square for error (estimates $\sigma^2$)
$\beta_i$               beta-i               Coefficient of $x_i$ in the model
$\hat{\beta}_i$         beta-i-hat           Least squares estimate of $\beta_i$
$s_{\hat{\beta}_i}$     s of beta-i-hat      Estimated standard error of $\hat{\beta}_i$
$R^2$                   R-squared            Multiple coefficient of determination
$R_a^2$                 R-squared adjusted   Adjusted multiple coefficient of determination
F                                            Test statistic for testing global usefulness of model
$\hat{\varepsilon}$     epsilon-hat          Estimated random error, or residual
$\log(y)$               Log of y             Natural logarithm of dependent variable

Supplementary Exercises

Learning the Mechanics

10.44 Suppose you fit the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_1 x_2 + \varepsilon$

to n = 25 data points with the following results:

SSE = .41   $R^2$ = .83

a. Is there sufficient evidence to conclude that at least one of the parameters $\beta_1$, $\beta_2$, $\beta_3$, or $\beta_4$ is nonzero? Test using $\alpha = .05$.
b. Test $H_0$: $\beta_1 = 0$ against $H_a$: $\beta_1 < 0$. Use $\alpha = .05$.
c. Test $H_0$: $\beta_2 = 0$ against $H_a$: $\beta_2 > 0$. Use $\alpha = .05$.
d. Test $H_0$: $\beta_3 = 0$ against $H_a$: $\beta_3 \neq 0$. Use $\alpha = .05$.

10.45 When a multiple regression model is used for estimating the mean of the dependent variable and for predicting a new value of y, which will be narrower: the confidence interval for the mean or the prediction interval for the new y value? Why?

10.46 Suppose you have developed a regression model to explain the relationship between y and $x_1$, $x_2$, and $x_3$. The ranges of the variables you observed were as follows: $10 \le y \le 100$, $5 \le x_1 \le 55$, $.5 \le x_2 \le 1$, and $1{,}000 \le x_3 \le 2{,}000$. Will the error of prediction be smaller when you use the least squares equation to predict y when $x_1 = 30$, $x_2 = .6$, and $x_3 = 1{,}300$, or $x_1 = 60$, $x_2 = .4$, and $x_3 = 900$? Why?

10.47 Suppose you used MINITAB to fit the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

to n = 15 data points and you obtained the printout shown below.
a. What is the least squares prediction equation?
b. Find $R^2$ and interpret its value.
c. Is there sufficient evidence to indicate that the model is useful for predicting y? Conduct an F-test using $\alpha = .05$.
d. Test the null hypothesis $H_0$: $\beta_1 = 0$ against the alternative hypothesis $H_a$: $\beta_1 \neq 0$. Test using $\alpha = .05$. Draw the appropriate conclusions.
e. Find the standard deviation of the regression model and interpret it.

10.48 The first-order model $E(y) = \beta_0 + \beta_1 x_1$ was fit to n = 19 data points. A residual plot for the model is shown on p. 628. Is the need for a quadratic term in the model evident from the residual plot? Explain.

10.49 To model the relationship between y, a dependent variable, and x, an independent variable, a researcher has taken one measurement on y at each of three different x values. Drawing on his mathematical expertise, the researcher realizes that he can fit the second-order model

MINITAB Output for Exercise 10.47

The regression equation is
Y = 90.1 - 1.84 X1 + .285 X2

Predictor     Coef    StDev   t-ratio       P
Constant     90.10    23.10      3.90   0.002
X1          -1.836    0.367     -5.01   0.001
X2           0.285    0.231      1.24   0.465

Analysis of Variance
SOURCE       DF      SS     MS       F       P
Regression    2   14801   7400   64.91   0.001
Error        12    1364    114
Total        14   16165

[Residual Plot for Exercise 10.48]

$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$

and it will pass exactly through all three points, yielding SSE = 0. The researcher, delighted with the "excellent" fit of the model, eagerly sets out to use it to make inferences. What problems will he encounter in attempting to make inferences?

Applying the Concepts

10.50 Best's Review (June 1999) compared the mortgage loan portfolios for a sample of 25 life/health insurance companies. The information in the table on page 629 is extracted from the article. Suppose you want to model the percentage of problem mortgages (y) of a company as a function of total mortgage loans ($x_1$), percentage of invested assets ($x_2$), percentage of commercial mortgages ($x_3$), and percentage of residential mortgages ($x_4$).
a. Write a first-order model for E(y).
b. Fit the model of part a to the data and evaluate its overall usefulness. Use $\alpha = .05$.
c. Interpret the $\beta$ estimates in the fitted model.
d. Construct scattergrams of y versus each of the four independent variables in the model. Which variables warrant inclusion in the model as second-order (i.e., squared) terms?

10.51 Emergency services (EMS) personnel are constantly exposed to traumatic situations. However, few researchers have studied the psychological stress that EMS personnel may experience. The Journal of Consulting and Clinical Psychology (June 1995) reported on a study of EMS rescue workers who responded to the I-880 freeway collapse during the 1989 San Francisco earthquake. The goal of the study was to identify the predictors of symptomatic distress in the EMS workers. One of the distress variables studied was the Global Symptom Index (GSI). Several models for GSI, y, were considered based on the following independent variables:
$x_1$ = Critical Incident Exposure scale (CIE)
$x_2$ = Hogan Personality Inventory-Adjustment scale (HPI-A)
$x_3$ = Years of experience (EXP)
$x_4$ = Locus of Control scale (LOC)
$x_5$ = Social Support scale (SS)
$x_6$ = Dissociative Experiences scale (DES)
$x_7$ = Peritraumatic Dissociation Experiences Questionnaire, self-report (PDEQ-SR)
a. Write a first-order model for E(y) as a function of the first five independent variables, $x_1$ through $x_5$.
b. The model of part a, fitted to data collected for n = 147 EMS workers, yielded the following results: $R^2$ = .469, F = 34.47, p-value < .001. Interpret these results.
c. Write a first-order model for E(y) as a function of all seven independent variables, $x_1$ through $x_7$.
d. The model, part c, yielded $R^2$ = .603. Interpret this result.
e. The t-tests for testing the DES and PDEQ-SR variables both yielded a p-value of .001. Interpret these results.

10.52 Since the Great Depression of the 1930s, the link between the suicide rate and the state of the economy has been the subject of much research. Research exploring this link using regression analysis was reported in the Journal of Socio-Economics (Spring 1992). The researchers collected data from a 45-year period on the following variables:
y = Suicide rate
$x_1$ = Unemployment rate
$x_2$ = Percentage of females in the labor force
$x_3$ = Divorce rate
$x_4$ = Logarithm of Gross National Product (GNP)
$x_5$ = Annual percent change in GNP
One of the models explored by the researchers was a multiple regression model relating y to linear terms in $x_1$ through $x_5$. The least squares model below resulted (the observed significance levels of the $\beta$ estimates are shown in parentheses beneath the estimates):

$\hat{y} = .002 + .0204x_1 - .0231x_2 + .0765x_3 + .2760x_4 + .0018x_5$
                 (.002)      (.02)       (>.10)      (>.10)      (>.10)

$R^2$ = .45

a. Interpret the value of $R^2$. Is there sufficient evidence to indicate that the model is useful for predicting the suicide rate? Use $\alpha = .05$.
b. Interpret each of the coefficients in the model, and each of the corresponding significance levels.
c. Is there sufficient evidence to indicate that the unemployment rate is a useful predictor of the suicide rate? Use $\alpha = .05$.

10.53 To meet the increasing demand for new software products, many systems development experts have adopted a prototyping methodology. The effects of


BESTINS.DAT

Company                        Total Mortgage Loans, x1   % Invested Assets, x2   % Commercial Mortgages, x3   % Residential Mortgages, x4   % Problem Mortgages, y
TIAA Group                     \$18,803,163               20.7                    100.0                         0.0                          11.4
Metropolitan Insurance         18,171,162                 13.9                     77.8                         1.6                           3.8
Prudential of Am Group         16,213,150                 12.9                     87.4                         2.3                           4.1
Principal Mutual IA            11,940,345                 30.3                     98.8                         1.2                          32.6
Northwestern Mutual            10,834,616                 17.8                     99.5                         0.0                           2.2
Cigna Group                    10,181,124                 25.1                     99.8                         0.2                          11.1
John Hancock Group             8,229,523                  20.4                     82.0                         0.1                          12.2
Aegon USA Inc.                 7,695,198                  17.7                     73.0                        24.7                           6.4
New York Life                  7,088,003                   9.4                     92.2                         7.8                           2.4
Nationwide                     5,328,142                  26.3                    100.0                         0.0                           7.5
Massachusetts Mutual           4,965,287                  12.2                     78.6                        21.4                           6.3
Equitable Group                4,905,123                  12.7                     63.6                         0.0                          27.0
Aetna US Healthcare Group      3,974,881                  10.5                     94.1                         5.4                           8.7
American Express Financial     3,655,292                  13.9                    100.0                         0.0                           2.1
ING Group                      3,505,206                  16.2                     99.8                         0.2                           0.7
American General               3,359,650                   6.4                     99.8                         0.2                           2.1
Lincoln National               3,264,860                  11.5                     99.9                         0.1                           2.2
SunAmerica Inc.                3,909,177                  15.7                    100.0                         0.0                           2.6
Allstate                       2,987,144                  10.9                    100.0                         0.0                           2.1
Travelers Insurance Group      2,978,628                  10.3                     74.9                         0.1                           3.2
GE Capital Corp. Group         2,733,981                   7.5                     99.7                         0.3                           0.7
ReliaStar Financial Corp.      2,342,992                  16.2                     69.9                        30.0                           6.4
General American Life          2,107,592                  15.2                     99.8                         0.2                           1.3
State Farm Group               2,027,648                   8.6                     97.6                         2.4                           0.1
Pacific Mutual Life            1,945,392                   9.7                     96.4                         3.6                           6.1

Source: Best's Review (Life/Health), June 1999, p. 35.

prototyping on the system development life cycle (SDLC) was i n v e s t i g a t e d i n t h e J o u r n a l o f Computer Information Systems (Spring 1993). A survey of 500 randomly selected corporate level MIS managers was conducted. Three potential independent variables were: (1) importance of prototyping to each p h a s e of t h e S D L C ; (2) d e g r e e of support prototyping provides for the SDLC; and (3)

degree to which prototyping replaces each phase of the SDLC. The table on the next page gives the pairwise correlations of the three variables in the survey data for one particular phase of the SDLC. Use this information to assess the degree of multicollinearity in the survey data. Would you recommend using all three independent variables in a regression analysis? Explain.
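One way to put a number on the multicollinearity question is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below is only an illustration, not the method the exercise prescribes: it uses the three pairwise correlations reported in the results table for Exercise 10.53 (.2682, .6991, -.0531) and the closed form for R_j^2 that holds when there are exactly three predictors. The function and variable names are made up for this example.

```python
def vif_from_pairwise(r_ja, r_jb, r_ab):
    """VIF for predictor j, given its correlations with predictors a and b
    (r_ja, r_jb) and the correlation between a and b (r_ab)."""
    # R^2 from regressing j on a and b, written in terms of correlations.
    r_squared_j = (r_ja**2 + r_jb**2 - 2 * r_ja * r_jb * r_ab) / (1 - r_ab**2)
    return 1.0 / (1.0 - r_squared_j)


# Pairwise correlations from the results table for Exercise 10.53.
r_imp_rep = 0.2682   # Importance-Replace
r_imp_sup = 0.6991   # Importance-Support
r_rep_sup = -0.0531  # Replace-Support

vifs = {
    "importance": vif_from_pairwise(r_imp_sup, r_imp_rep, r_rep_sup),
    "support": vif_from_pairwise(r_imp_sup, r_rep_sup, r_imp_rep),
    "replace": vif_from_pairwise(r_imp_rep, r_rep_sup, r_imp_sup),
}

for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

A VIF close to 1 means a predictor shares little information with the others. A frequently quoted rule of thumb treats values above 10 as a sign of severe multicollinearity, though that cutoff is a convention rather than a test.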

Results for Exercise 10.53

| Variable Pairs | Correlation Coefficient, r |
|---|---|
| Importance-Replace | .2682 |
| Importance-Support | .6991 |
| Replace-Support | -.0531 |

Source: Hardgrave, B. C., Doke, E. R., and Swanson, N. E. "Prototyping effects of the system development life cycle: An empirical study." Journal of Computer Information Systems, Vol. 33, No. 3, Spring 1993, p. 16 (Table 1).

10.54 Traffic forecasters at the Minnesota Department of Transportation (MDOT) use regression analysis to estimate weekday peak-hour traffic volumes on existing and proposed roadways. In particular, they model y, the peak-hour volume (typically, the volume between 7 and 8 A.M.), as a function of $x_1$, the road's total volume for the day. For one project involving the redesign of a section of Interstate 494, the forecasters collected n = 72 observations of peak-hour traffic volume and 24-hour weekday traffic volume using electronic sensors that count vehicles. The data are provided in the table below.
a. Construct a scattergram for the data, plotting peak-hour volume y against 24-hour volume $x_1$. Note the isolated group of observations at the top of the scattergram. Investigators discovered that all of these data points were collected at the intersection of Interstate 35W and 46th Street. (These are observations 55-72 in the table.) While all other locations in the sample were three-lane highways, this location was unique in that the highway widens to four lanes just north of the electronic sensor. Consequently, the forecasters decided to include a dummy variable to account for a difference between the I-35W location and all other locations.
b. Propose a first-order model for E(y) as a function of 24-hour volume $x_1$ and the dummy variable for location.
c. Using an available statistical software package, fit the model of part b to the data. Interpret the results.
d. Conduct a residual analysis of the model, part b. Evaluate the assumptions of normality and constant error variance, and determine whether any outliers exist.

[Table for Exercise 10.54: Observation Number, Peak-Hour Volume, 24-Hour Volume, and I-35 indicator for each of the 72 observations; the data values are missing from the source.]

Source: John Sem, Director; Allan E. Pint, State Traffic Forecast Engineer; and James Page Sr., Transportation Planner, Traffic and Commodities Studies Section, Minnesota Department of Transportation, St. Paul, Minnesota.

10.55 The audience for a product's advertising can be divided into two segments according to the degree of exposure received as a result of the advertising. These segments are groups of consumers who receive high (H) or low (L) exposure to the advertising. A company is interested in exploring whether its advertising effort affects its product's market share. Accordingly, the company

identifies 24 sample groups of consumers who have been exposed to its advertising, twelve groups at each exposure level. Then, the company determines its product's market share within each group.
a. Write a regression model that expresses the company's market share as a function of advertising exposure level. Define all terms in your model, and list any assumptions you make about them.
b. The data in the table below were obtained by the company. Fit the model you constructed in part a to the data.

[Table for Exercise 10.55: Market Share Within Group, Exposure Level; the data values are missing from the source.]

c. Is there evidence to suggest that the firm's expected market share differs for the two levels of advertising exposure? Test using $\alpha = .05$.

10.56 To determine whether extra personnel are needed for the day, the owners of a water adventure park would like to find a model that would allow them to predict the day's attendance each morning before opening based on the day of the week and weather conditions. The model is of the form

[model equation missing from the source]

where
y = Daily attendance
$x_1$ = 1 if weekend, 0 otherwise (dummy variable)
$x_2$ = 1 if sunny, 0 if overcast (dummy variable)
$x_3$ = Predicted daily high temperature (°F)

These data were recorded for a random sample of 30 days, and a regression model was fitted to the data. The least squares analysis produced the following results:

[least squares results missing from the source]

a. Interpret the estimated model coefficients.
b. Is there sufficient evidence to conclude that this model is useful for the prediction of daily attendance? Use $\alpha = .05$.
c. Is there sufficient evidence to conclude that the mean attendance increases on weekends? Use $\alpha = .10$.
d. Use the model to predict the attendance on a sunny weekday with a predicted high temperature of 95°F.
e. Suppose the 90% prediction interval for part d is (645, 1,245). Interpret this interval.

10.57 Many colleges and universities develop regression models for predicting the GPA of incoming freshmen. This predicted GPA can then be used to make admission decisions. Although most models use many independent variables to predict GPA, we will illustrate by choosing two variables:
$x_1$ = Verbal score on college entrance examination (percentile)
$x_2$ = Mathematics score on college entrance examination (percentile)

The data in the table below are obtained for a random sample of 40 freshmen at one college. The SPSS printout corresponding to the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ is shown below the data.
a. Interpret the least squares estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ in the context of this application.
b. Interpret the standard deviation and the adjusted coefficient of determination of the regression model in the context of this application.
c. Is this model useful for predicting GPA? Conduct a statistical test to justify your answer.
d. Sketch the relationship between predicted GPA, $\hat{y}$, and verbal score, $x_1$, for the following mathematics scores: $x_2$ = 60, 75, and 90.
e. The residuals from the first-order model are plotted against $x_1$ and $x_2$ and shown below. Analyze the two plots, and determine whether visual evidence exists that curvature (a quadratic term) for either $x_1$ or $x_2$ should be added to the model.


COLLGPA.DAT

[Table for Exercise 10.57: Verbal, $x_1$; Mathematics, $x_2$; GPA, y for each of the 40 freshmen; the data values are missing from the source.]

SPSS Output for Exercise 10.57

| Statistic | Value |
|---|---|
| Multiple R | .82527 |
| R Square | .68106 |
| Adjusted R Square | .66382 |
| Standard Error | .40228 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square |
|---|---|---|---|
| Regression | 2 | 12.78595 | 6.39297 |
| Residual | 37 | 5.98755 | .16183 |

Variables in the Equation

| Variable | B | SE B | Beta | T | Sig T |
|---|---|---|---|---|---|
| X1 | .02573 | 4.02357E-03 | .59719 | 6.395 | .0000 |
| X2 | .03361 | 4.92751E-03 | .63702 | 6.822 | .0000 |
| (Constant) | -1.57054 | .49375 | | -3.181 | .0030 |
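The summary statistics on the printout can be recomputed from the analysis of variance table, which is a useful check when reading regression output. The sketch below uses only numbers transcribed from the SPSS output for Exercise 10.57. The global F statistic (mean square for regression divided by mean square for error) is not printed in the excerpt, so it is computed here. The variable names and the `predict_gpa` helper are illustrative choices, not part of the printout.

```python
# Values transcribed from the SPSS output for Exercise 10.57.
n, k = 40, 2                      # sample size, number of predictors
ss_regression = 12.78595
ss_residual = 5.98755
b0, b1, b2 = -1.57054, 0.02573, 0.03361

df_residual = n - k - 1           # 37, matching the printout
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

ms_regression = ss_regression / k
ms_residual = ss_residual / df_residual
std_error = ms_residual ** 0.5    # estimated standard deviation s
f_statistic = ms_regression / ms_residual


def predict_gpa(verbal, math):
    """Predicted GPA from the fitted first-order model."""
    return b0 + b1 * verbal + b2 * math


print(f"R^2 = {r_squared:.5f}, adjusted R^2 = {adj_r_squared:.5f}")
print(f"s = {std_error:.5f}, F = {f_statistic:.2f}")

# Part d: for a fixed mathematics score the model is a straight line in the
# verbal score, with the same slope b1 and an intercept that shifts with x2.
for math_score in (60, 75, 90):
    intercept = predict_gpa(0, math_score)
    print(f"x2 = {math_score}: predicted GPA = {intercept:.5f} + {b1} * x1")
```

Because the model has no interaction term, the three lines for part d are parallel; only their intercepts differ.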

SPSS Plots for Exercise 10.57

[Figure: two residual plots for the first-order model, one against $x_1$ and one against $x_2$; 40 cases plotted in each.]


Real-World Case: The Condo Sales Case

(A Case Covering Chapters 9 and 10)

This case involves an investigation of the factors that affect the sale price of oceanside condominium units. It represents an extension of an analysis of the same data by Herman Kelting (1979). Although condo sale prices have increased dramatically over the past 20 years, the relationship between these factors and sale price remains about the same. Consequently, the data provide valuable insight into today's condominium sales market.

The sales data were obtained for a new oceanside condominium complex consisting of two adjacent and connected eight-floor buildings. The complex contains 200 units of equal size (approximately 500 square feet each). The locations of the buildings relative to the ocean, the swimming pool, the parking lot, etc., are shown in the accompanying figure.

FIGURE C5.1 Layout of condominium complex [Figure: buildings 1 and 2, the pool, the office, the elevator and distance from it, the parking area, traffic flow, and an adjacent three-story motel.]

There are several features of the complex that you should note:

1. The units facing south, called ocean-view, face the beach and ocean. In addition, units in building 1 have a good view of the pool. Units to the rear of the building, called bay-view, face the parking lot and an area of land that ultimately borders a bay. The view from the upper floors of these units is primarily of wooded, sandy terrain. The bay is very distant and barely visible.

2. The only elevator in the complex is located at the east end of building 1, as are the office and the game room. People moving to or from the higher floor units in building 2 would likely use the elevator and move through the passages to their units. Thus, units on the higher floors and at a greater distance from the elevator would be less convenient; they would require greater effort in moving baggage, groceries, etc., and would be farther away from the game room, the office, and the swimming pool. These units also possess an advantage: there would be the least amount of traffic through the hallways in the area and hence they are the most private.

3. Lower-floor oceanside units are most suited to active people; they open onto the beach, ocean, and pool. They are within easy reach of the game room and they are easily reached from the parking area.

4. Checking the layout of the condominium complex, you discover that some of the units in the center of the complex, units ending in numbers 11 and 14, have part of their view blocked.

5. The condominium complex was completed at the time of the 1975 recession; sales were slow and the developer was forced to sell most of the units at auction approximately 18 months after opening. Consequently, the auction data are completely buyer-specified and hence consumer-oriented, in contrast to most other real estate sales data, which are, to a high degree, seller and broker specified.

6. Many unsold units in the complex were furnished by the developer and rented prior to the auction. Consequently, some of the units bid on and sold at auction had furniture, others did not.

This condominium complex is obviously unique. For example, the single elevator located at one end of the complex produces a remarkably high level of both inconvenience and privacy for the people occupying units on the top floors in building 2. Consequently, the developer is unsure of how the height of the unit (floor number), distance of the unit from the elevator, presence or absence of an ocean view, etc., affect the prices of the units sold at auction. To investigate these relationships, the following data were recorded for each of the 106 units sold at the auction:

1. Sale price. Measured in hundreds of dollars (adjusted for inflation).

2. Floor height. The floor location of the unit; the variable levels are 1, 2, ..., 8.

3. Distance from elevator. This distance, measured along the length of the complex, is expressed in number of condominium units. An additional two units of distance were added to the units in building 2 to account for the walking distance in the connecting area between the two buildings. Thus, the distance of unit 105 from the elevator would be 3, and the distance between unit 113 and the elevator would be 9. The variable levels are 1, 2, ..., 15.

4. View of ocean. The presence or absence of an ocean view is recorded for each unit and specified with a dummy variable (1 if the unit possessed an ocean view and 0 if not). Note that units not possessing an ocean view would face the parking lot.

5. End unit. We expect the partial reduction of view of end units on the ocean side (numbers ending in 11) to reduce their sale price. The ocean view of these end units is partially blocked by building 2. This qualitative variable is also specified with a dummy variable (1 if the unit has a unit number ending in 11 and 0 if not).

6. Furniture. The presence or absence of furniture is recorded for each unit, and represented with a single dummy variable (1 if the unit was furnished and 0 if not).

Your objective for this case is to build a regression model that accurately predicts the sale price of a condominium unit sold at auction. Prepare a professional document that presents the results of your analysis. Include graphs that demonstrate how each of the independent variables in your model affects auction price. A layout of the data file is described below.

CONDO.DAT (Number of Observations: 106)

| Variable | Column(s) | Type |
|---|---|---|
| PRICE | 1-3 | QN |
| FLOOR | 5 | QN |
| DISTANCE | 7-8 | QN |
| VIEW | 10 | QL |
| ENDUNIT | 12 | QL |
| FURNISH | 14 | QL |
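The column layout above describes a fixed-width file, so each record can be read by slicing on character positions. The sketch below is a minimal illustration under stated assumptions: the sample record is invented for this example (it is not a row of CONDO.DAT), numeric fields are assumed to be right-aligned within their columns, and `parse_record` is a hypothetical helper name.

```python
from typing import Dict

# 1-based, inclusive column positions taken from the CONDO.DAT layout.
LAYOUT = {
    "PRICE": (1, 3),
    "FLOOR": (5, 5),
    "DISTANCE": (7, 8),
    "VIEW": (10, 10),
    "ENDUNIT": (12, 12),
    "FURNISH": (14, 14),
}


def parse_record(line: str) -> Dict[str, int]:
    """Split one fixed-width record into integer fields by column position."""
    # Convert each 1-based inclusive range to a Python slice; int() ignores
    # the padding spaces around a right-aligned number.
    return {
        name: int(line[start - 1:end])
        for name, (start, end) in LAYOUT.items()
    }


# Hypothetical record, invented for illustration only.
sample = "195 3  9 1 0 1"
record = parse_record(sample)
print(record)
```

PRICE is recorded in hundreds of dollars, so a parsed value should be multiplied by 100 to express it in dollars.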

[Important Note: You may want to consider cross-product terms of the form $x_1 x_2$ in your model. These terms, called interaction terms, allow the relationship between y and one of the independent variables, say $x_1$, to change as the value of the other independent variable, $x_2$, changes.]
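To see what an interaction term does, consider a model of the form E(y) = b0 + b1*x1 + b2*x2 + b3*x1*x2. The change in E(y) for a one-unit increase in x1 is b1 + b3*x2, so it depends on the value of x2. The sketch below illustrates this with x1 as floor height and x2 as the ocean-view dummy variable. The coefficient values are invented for illustration and are not estimates from the condo data.

```python
def mean_price(x1, x2, b0, b1, b2, b3):
    """E(y) for a model with an x1*x2 interaction term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def slope_in_x1(x2, b1, b3):
    """Change in E(y) per one-unit increase in x1, holding x2 fixed."""
    return b1 + b3 * x2


# Hypothetical coefficients, chosen only to make the arithmetic visible.
coef = dict(b0=200.0, b1=-4.0, b2=30.0, b3=3.0)

for view in (0, 1):
    slope = slope_in_x1(view, coef["b1"], coef["b3"])
    print(f"view = {view}: slope of price in floor height = {slope}")
```

With these made-up numbers the price falls by 4 per floor for units without an ocean view and by only 1 per floor for units with one. Without the interaction term, the two slopes would be forced to be equal.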
