# R para principiantes # PUCE, Quito, 5-8 enero 2010 # Simon Queenborough # JUEVES: clase 13 ######### LINEAR REGRESSION ################## # The maximum heart rate of a person is often said to be related to age by the equation # Max = 220 - Age Age <- c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37) Max.Rate <- c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178) plot(x=Age,y=Max.Rate) # make a plot abline(lm(Max.Rate ~ Age)) # plot the regression line lm(Max.Rate ~ Age) # the basic values of the regression analysis summary(lm(Max.Rate ~ Age)) model <- lm(Max.Rate ~ Age) summary(model) str(model) # The result of the lm function is of class lm and so the plot and summary commands adapt themselves to that. # The variable lm.result contains the result. We used summary to view the entire thing. To view parts of it, you can # call it like it is a list or better still use the following methods: resid for residuals, coef for coefficients and predict # for prediction. Here are a few examples, the former giving the coefficients b0 and b1, the latter returning the residuals # which are then summarized. coef(model) coef(model)[1] summary(resid(model)) # Once we know the model is appropriate for the data, we will begin to identify the meaning of the numbers. ###################### Testing the assumptions of the model #################### # Errors are INDEPENDENT AND IDENTICALLY DISTRIBUTED (i.i.d.) NORMAL RANDOM variables # We can test for normality with: histograms, boxplots and normal plots. # We can test for correlations by looking if there are trends in the data. # This can be done with plots of the residuals vs. time and order. # We can test the assumption that the errors have the same variance with plots of residuals vs. time order and fitted values. # The plot command will do these tests for us if we give it the result of the regression par(mfrow=c(2,2)) plot(model) # output is different from plot(y ~ x) !!!! # Residuals vs. fitted This plots the fitted y.hat values against the residuals. Look for spread around the line y = 0 # and no obvious trend. # Normal qqplot The residuals are normal if this graph falls close to a straight line. # Scale-Location This plot shows the square root of the standardized residuals. The tallest points, are the largest # residuals. # Cook's distance This plot identifies points which have a lot of influence in the regression line. ####################### Statistical inference ################################ # If you are satisfied that the model fits the data, then statistical inferences can be made. There are 3 parameters # in the model: sigma, beta0 and beta1. # Sigma # # standard deviation of the error terms. If we had the exact regression line, then the error terms # and the residuals would be the same, so we expect the residuals to tell us about the value of sigma. # Estimator beta1 # # The estimator b1 for beta1, the slope of the regression line, is also an unbiased estimator. # R will automatically do a hypothesis test for the assumption beta1 = 0 which means no slope. See how the p-value # is included in the output of the summary command in the column Pr(>|t|) summary(model)$coef # beta0 # # As well, a statistical test for b0 can be made (and is). Again, R includes the test for beta0 = 0 which tests to see if # the line goes through the origin. To do other tests, requires a familiarity with the details. ##################### Predicted values ##################### # To get the predicted values We can use the predict function to get predicted values, but it is a little clunky to # call. We need a data frame with column names matching the predictor or explanatory variable. In this example # this is 'Age' so we can do the following to get a prediction for a 50 and 60 year old we have predict(model,data.frame(Age= c(50,60))) #################### Confidence Intervals ############################### # Visually, one is interested in confidence intervals. The regression line is used to predict the value of y for a given # x, or the average value of y for a given x and one would like to know how accurate this prediction is. This is the job # of a confidence interval. # The confidence bands would be a chore to compute by hand. Generally speaking, for each value of x we want a point to plot. # This is done as before with a data frame containing all the x values we want. In addition, we need to ask for # the interval. There are two types: confidence, or prediction. The confidence will be for the mean, and the # prediction for the individual. Let's see the output, and then go from there. This is for a 90% confidence level predict(model,data.frame(Age=sort(Age)), # as before level=.9, interval="confidence") # what is new # We see we get 3 numbers back for each value of x. (note we sorted x first to get the proper order for plotting.) # To plot the lower band, we just need the second column which is accessed with [,2]. So the following will # plot just the lower. Notice, we make a scatterplot with the plot command, but add the confidence band with # points. plot(Age,Max.Rate) abline(model) ci.lwr = predict(model,data.frame(Age=sort(Age)), level=.9,interval="confidence")[,2] points(sort(Age), ci.lwr, type="l", lty=3) # or use lines() ci.upr = predict(model,data.frame(Age=sort(Age)), level=.9,interval="confidence")[,3] points(sort(Age), ci.upr, type="l", lty=3) # or use lines() ###################################################################################################### # # EJERCICIOS # ###################################################################################################### # 13.1 # Make another graph identical to the one above, using lines() instead of points() # 13.2 # The cost of a home depends on the number of bedrooms in the house. Suppose the following data is recorded # for homes in a given town. Make a scatterplot, and fit the data with a regression line. # Add confidence limits price <- c(80, 151.4, 310, 295, 339, 337.5, 228.7, 245, 339, 43, 279, 599, 119, 289, 249, 178, 159, 289, 488, 376, 249, 275, 275, 459, 219, 359, 379, 189, 173), sale = c(117.7, 151, 300, 275, 340, 337.5, 215, 239, 345, 48, 262.5, 613, 119, 305, 249, 170, 153, 291, 450, 370, 245, 275, 272.5, 459, 230, 360, 370, 185, 185), bedrooms <- c(3, 4, 4, 4, 3, 4, 3, 3, 3, 1, 3, 4, 3, 3, 2, 3, 2, 3, 3, 3, 3, 4, 2, 5, 3, 3, 4, 4, 3), # 13.3 # It is well known that the more beer you drink, the more your blood alcohol level rises. Suppose we have the # following data on student beer consumption. Make a scatterplot and fit the data with a regression line. # Add confidence limits Student <- c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) Beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5) BAL <- c(0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06, 0.02, 0.05) # 13.4 # The lapse rate is the rate at which temperature drops as you increase elevation. Some hardy students were # interested in checking empirically if the lapse rate of 9.8 degrees C/km was accurate for their hiking. To # investigate, they grabbed their thermometers and their Suunto wrist altimeters and found the following data # on their hike. Import the following data. # elevation (ft) 600 1000 1250 1600 1800 2100 2500 2900 # temperature (F) 56 54 56 50 47 49 47 45 # Draw a scatter plot with regression line, and investigate if the lapse rate is 9.8C/km. (First, it helps to convert # to the rate of change in Fahrenheit per feet with is 5.34 degrees per 1000 feet.) Test the hypothesis that the # lapse rate is 5.34 degrees per 1000 feet against the alternative that it is less than this.