In OLS regression, we use linear combinations of predictor (independent) variables to compute expected values of the response (dependent) variable.
The matrix formulation for OLS regression looks like this
Stata Program using Matrix Arithmetic
program define matreg2, eclass version 6.0 syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)] marksample touse /* mark cases in the sample */ tokenize "`varlist'" quietly matrix accum sscp = `varlist' if `touse' local nobs = r(N) local df = `nobs' - (rowsof(sscp) - 1) /* df residual */ matrix XX = sscp[2...,2...] /* X'X */ matrix Xy = sscp[1,2...] /* X'y */ matrix b = Xy * syminv(XX) /* (X'X)-1X'y */ local k = colsof(b) /* number of coefs */ matrix hat = Xy * b' matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df' estimates post b V, dof(`df') obs(`nobs') depname(`1') /* */ esample(`touse') est local depvar "`1'" est local cmd "matreg" display estimates display, level(`level') matrix drop sscp XX Xy hat end
Example using matreg2
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear regress write read female matreg2 write read female
Assumptions in OLS Regression
Linearity - The expected value of y is linearly related to the x's through the β parameters. Specification errors result when there is a nonlinear relationship.
Independence - The independence of the x's and ε is necessary in order to identify the unknown β parameters, that is, in order to be able to solve for the β's
ε are i.i.d. - The assumption is that the ε's are independent and identically distributed which implies there should be no heterogeneity of variance and no autocorrelation among the residuals.
All relevant variables are in the model - A specification error can occur when the model does not contain all of the relevant variables. As a corollary, a specification error can occur when irrelevant variables are included in the model.
x's are measured without error - The independent variables are measured without error.
Normality* - If we wish to draw statistical inferences we need to add the further assumption that the ε are normally distributed.
Example
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear describe Contains data from http://www.gseis.ucla.edu/courses/data/hsb2.dta obs: 200 highschool and beyond (200 cases) vars: 11 21 Jun 2000 08:54 size: 9,600 (99.8% of memory free) ------------------------------------------------------------------------------- 1. id float %9.0g 2. female float %9.0g fl 3. race float %12.0g rl 4. ses float %9.0g sl 5. schtyp float %9.0g scl type of school 6. prog float %9.0g sel type of program 7. read float %9.0g reading score 8. write float %9.0g writing score 9. math float %9.0g math score 10. science float %9.0g science score 11. socst float %9.0g social studies score ------------------------------------------------------------------------------- summarize Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- id | 200 100.5 57.87918 1 200 female | 200 .545 .4992205 0 1 race | 200 3.43 1.039472 1 4 ses | 200 2.055 .7242914 1 3 schtyp | 200 1.16 .367526 1 2 prog | 200 2.025 .6904772 1 3 read | 200 52.23 10.25294 28 76 write | 200 52.775 9.478586 31 67 math | 200 52.645 9.368448 33 75 science | 200 51.85 9.900891 26 74 socst | 200 52.405 10.73579 26 71 corr write read math science socst female (obs=200) | write read math science socst female -------------+------------------------------------------------------ write | 1.0000 read | 0.5968 1.0000 math | 0.6174 0.6623 1.0000 science | 0.5704 0.6302 0.6307 1.0000 socst | 0.6048 0.6215 0.5445 0.4651 1.0000 female | 0.2565 -0.0531 -0.0293 -0.1277 0.0524 1.0000 pcorr write read math science socst female (obs=200) Partial correlation of write with Variable | Corr. Sig. -------------+------------------ read | 0.1373 0.055 math | 0.2468 0.000 science | 0.2751 0.000 socst | 0.2974 0.000 female | 0.4107 0.000 kdensity write, normal graph read math science socst female write, matrix half tab1 female prog -> tabulation of female female | Freq. Percent Cum. ------------+----------------------------------- male | 91 45.50 45.50 female | 109 54.50 100.00 ------------+----------------------------------- Total | 200 100.00 -> tabulation of prog type of | program | Freq. Percent Cum. ------------+----------------------------------- general | 45 22.50 22.50 academic | 105 52.50 75.00 vocation | 50 25.00 100.00 ------------+----------------------------------- Total | 200 100.00 regress write read math female i.prog Source | SS df MS Number of obs = 200 -------------+------------------------------ F( 5, 194) = 45.01 Model | 9602.28627 5 1920.45725 Prob > F = 0.0000 Residual | 8276.58873 194 42.6628285 R-squared = 0.5371 -------------+------------------------------ Adj R-squared = 0.5251 Total | 17878.875 199 89.843593 Root MSE = 6.5317 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- read | .3069424 .0611262 5.02 0.000 .1863852 .4274996 math | .3603705 .0690064 5.22 0.000 .2242715 .4964695 female | 5.384982 .929572 5.79 0.000 3.551617 7.218346 | prog | 2 | .436372 1.230379 0.35 0.723 -1.990265 2.863009 3 | -2.219748 1.359353 -1.63 0.104 -4.900756 .4612603 | _cons | 15.16272 3.225088 4.70 0.000 8.801985 21.52346 ------------------------------------------------------------------------------ test 2.prog 3.prog ( 1) Iprog_2 = 0.0 ( 2) Iprog_3 = 0.0 F( 2, 194) = 2.31 Prob > F = 0.1022 regress write read math female Source | SS df MS Number of obs = 200 ---------+------------------------------ F( 3, 196) = 72.52 Model | 9405.34864 3 3135.11621 Prob > F = 0.0000 Residual | 8473.52636 196 43.2322773 R-squared = 0.5261 ---------+------------------------------ Adj R-squared = 0.5188 Total | 17878.875 199 89.843593 Root MSE = 6.5751 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .3252389 .0607348 5.355 0.000 .2054613 .4450166 math | .3974826 .0664037 5.986 0.000 .266525 .5284401 female | 5.44337 .9349987 5.822 0.000 3.59942 7.287319 _cons | 11.89566 2.862845 4.155 0.000 6.249728 17.5416 ------------------------------------------------------------------------------ listcoef /* from Long & Freese - findit spostado */ regress (N=200): Unstandardized and Standardized Estimates Observed SD: 9.478586 SD of Error: 6.5751257 --------------------------------------------------------------------------- write | b t P>|t| bStdX bStdY bStdXY SDofX ---------+----------------------------------------------------------------- read | 0.32524 5.355 0.000 3.3347 0.0343 0.3518 10.2529 math | 0.39748 5.986 0.000 3.7238 0.0419 0.3929 9.3684 female | 5.44337 5.822 0.000 2.7174 0.5743 0.2867 0.4992 --------------------------------------------------------------------------- linktest Source | SS df MS Number of obs = 200 ---------+------------------------------ F( 2, 197) = 116.16 Model | 9674.70222 2 4837.35111 Prob > F = 0.0000 Residual | 8204.17278 197 41.6455471 R-squared = 0.5411 ---------+------------------------------ Adj R-squared = 0.5365 Total | 17878.875 199 89.843593 Root MSE = 6.4533 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- _hat | 3.306865 .9095168 3.636 0.000 1.513226 5.100504 _hatsq | -.0215942 .008491 -2.543 0.012 -.0383392 -.0048492 _cons | -60.58511 24.08436 -2.516 0.013 -108.0814 -13.08885 ------------------------------------------------------------------------------ ovtest Ramsey RESET test using powers of the fitted values of write Ho: model has no omitted variables F(3, 193) = 3.06 Prob > F = 0.0295 whitetst /* downloaded via the Internet - findit whitetst */ White's general test statistic : 15.17126 Chi-sq( 8) P-value = .0559 regress write read math female science socst Source | SS df MS Number of obs = 200 ---------+------------------------------ F( 5, 194) = 58.60 Model | 10756.9244 5 2151.38488 Prob > F = 0.0000 Residual | 7121.9506 194 36.7110855 R-squared = 0.6017 ---------+------------------------------ Adj R-squared = 0.5914 Total | 17878.875 199 89.843593 Root MSE = 6.059 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .1254123 .0649598 1.931 0.055 -.0027059 .2535304 math | .2380748 .0671266 3.547 0.000 .1056832 .3704665 female | 5.492502 .8754227 6.274 0.000 3.765935 7.21907 science | .2419382 .0606997 3.986 0.000 .1222221 .3616542 socst | .2292644 .0528361 4.339 0.000 .1250575 .3334713 _cons | 6.138759 2.808423 2.186 0.030 .599798 11.67772 ------------------------------------------------------------------------------ linktest Source | SS df MS Number of obs = 200 ---------+------------------------------ F( 2, 197) = 155.20 Model | 10937.2369 2 5468.61843 Prob > F = 0.0000 Residual | 6941.63813 197 35.2367418 R-squared = 0.6117 ---------+------------------------------ Adj R-squared = 0.6078 Total | 17878.875 199 89.843593 Root MSE = 5.9361 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- _hat | 2.577803 .6998344 3.683 0.000 1.197674 3.957931 _hatsq | -.0150213 .0066404 -2.262 0.025 -.0281166 -.0019259 _cons | -40.62334 18.21521 -2.230 0.027 -76.54518 -4.701504 ------------------------------------------------------------------------------ ovtest Ramsey RESET test using powers of the fitted values of write Ho: model has no omitted variables F(3, 191) = 2.03 Prob > F = 0.1117 whitetst /* downloaded via the Internet - findit whitetst */ White's general test statistic : 23.69338 Chi-sq(19) P-value = .2082 rvfplot, yline(0) jitter(2) rvfplot2, rsta rscale(sqrt(abs(X))) jitter(2) /* downloaded via the Internet - findit rvfplot2 */ rvpplot read, yline(0) jitter(2) rvpplot math, yline(0) jitter(2) lvr2plot, mlabel(id) acprplot read, lowess jitter(2) acprplot math, lowess jitter(2) predict e, resid predict rstu, rstu predict h, hat predict d, cooksd dfbeta kdensity e, normal list id rstu h d if abs(rstu)>2 id rstu h d 6. 126 -2.911042 .0306996 .0430727 21. 86 -2.159355 .0471286 .0377245 42. 187 -2.801592 .012483 .0159722 71. 52 -2.102624 .0243733 .0180889 119. 38 2.225437 .0551778 .0472426 127. 104 2.12688 .0164202 .0123619 137. 30 2.077298 .0271825 .0197581 156. 44 2.543062 .0270288 .0291219 160. 83 2.393317 .0556316 .0549 list id rstu h d if h>(2*5+2)/200 id rstu h d 38. 167 .2202322 .1129052 .0010339 74. 198 1.536736 .0626366 .0261174 140. 170 .615899 .0624444 .0042243 174. 165 .4212941 .0611342 .0019344 190. 150 -1.481232 .096819 .0389597 list id rstu h d if d>4/200 id rstu h d 6. 126 -2.911042 .0306996 .0430727 21. 86 -2.159355 .0471286 .0377245 48. 24 1.995236 .0473758 .0324974 57. 3 1.937731 .0350953 .0224428 74. 198 1.536736 .0626366 .0261174 119. 38 2.225437 .0551778 .0472426 144. 81 -1.843365 .0358387 .020794 156. 44 2.543062 .0270288 .0291219 160. 83 2.393317 .0556316 .0549 190. 150 -1.481232 .096819 .0389597 196. 89 -1.771696 .0422013 .022799 global a = 2/sqrt(200) list id DFread DFmath DFfemale DFscienc DFsocst if abs(DFread)>$a | abs(DFmat > h)>$a | abs(DFfemale)>$a | abs(DFscienc)>$a | abs(DFsocst)>$a id DFread DFmath DFfemale DFscienc DFsocst 3. 51 -.0515056 -.014398 -.0626913 .1468613 .0604603 6. 126 .3206731 -.3131511 .2656341 .1226514 -.0953587 9. 175 -.1430553 -.0606901 .1151235 .1475843 -.0274637 21. 86 .0980499 -.1335611 .1154265 -.1851264 .3323804 24. 62 .1088972 -.2401099 -.1083746 .1068583 .1222108 33. 50 -.0349264 -.2120363 -.1395483 .0761597 .1938863 42. 187 -.0829706 -.0537039 -.1999146 -.0223069 .1177308 48. 24 -.0347862 .3662628 -.1715901 -.2201185 -.1457788 55. 60 -.0166963 -.1514111 -.1222548 .1529634 .1088403 57. 3 .1622479 -.2598199 -.1198602 .1595348 .0001202 71. 52 .1520838 .0367512 -.1116373 -.0425734 -.2523727 74. 198 -.0183564 -.0090244 .1612648 .2468743 -.2791837 76. 186 .0420188 .0984112 .0757014 -.0094653 -.1424532 81. 103 -.2083498 .0167522 .0938213 .0255195 .0635244 102. 189 -.1159227 .2095263 -.1127493 -.0281062 -.0706801 109. 41 .0333516 .1437377 .1196449 -.0881504 -.1007374 110. 185 .1353671 -.012621 -.0607331 .0121166 -.1576565 113. 46 .081996 .0152051 .0863532 -.173799 -.0803784 119. 38 -.0400404 .1510486 -.2598652 -.4282386 .2007007 127. 104 .0250837 .0949613 -.147284 .0018263 -.1392599 134. 159 .0034894 .0056749 -.1422074 -.0914222 .1120808 137. 30 -.0398192 -.033048 .0875061 -.1889646 .1237977 139. 200 .0064968 -.2036626 .1237578 .0268331 -.0128401 144. 81 -.1431379 -.0059714 .0933692 -.1201475 .2423838 154. 133 -.1600672 .0340288 .0858443 .1343556 .1441613 155. 98 .1430875 -.0204046 .1143515 .0250712 -.2119574 156. 44 .1320848 .0129432 .1237565 -.3070052 -.0404682 160. 83 .2011815 -.2430923 .2346933 .2569073 -.3997001 162. 18 -.1112641 -.0427497 .1235689 .1017029 .191583 166. 117 -.143197 -.0619351 -.10833 .0033026 .1643665 169. 153 .0994511 .0801062 .1757641 .0828039 -.1559673 186. 16 -.110932 -.0121689 .1209114 .1604108 .1236713 190. 150 .1878171 -.0798048 .0538377 -.3412711 .2439169 192. 142 .0095487 -.086411 -.0723012 .1508893 -.0155141 196. 89 .1260335 .0847179 -.1482973 -.2127656 .1433555 diag, id(id) /* downloaded via the Internet - findit diag */ Summary statistics for Leverage/Residuals (Panel 1) and dfbetas (Panel 2) Signals lists the obs that warrant attention (criteria: see online help) Variable | Obs Mean Std. Dev. Min Max %Signals ---------+-------------------------------------------------------------------- _hat | 200 .03 .0137395 .0105296 .1129052 0.0250 _rstu | 200 5.13e-06 1.00769 -2.911042 2.543062 0.0800 _dfits | 200 .0013422 .1790669 -.5180668 .5808853 0.0350 _cooksd | 200 .0052647 .0085635 9.85e-08 .0549 0.0550 _welsch | 200 .0194215 2.574987 -7.423061 8.432303 0.0100 _covrati | 200 1.031876 .0446067 .8223104 1.161025 0.0250 ---------+-------------------------------------------------------------------- read | 200 .0001225 .0663172 -.2083498 .3206731 0.0550 math | 200 -.0001245 .0732815 -.3131511 .3662628 0.0550 female | 200 .0000807 .0748938 -.2598652 .2656341 0.0500 science | 200 -.000107 .0791052 -.4282386 .2569073 0.0800 socst | 200 .0000599 .0793047 -.3997001 .3323804 0.0850 ---------+-------------------------------------------------------------------- Frequency distribution of #signals _Signals | Freq. Percent Cum. ------------+----------------------------------- 0 | 159 79.50 79.50 1 | 17 8.50 88.00 2 | 9 4.50 92.50 3 | 4 2.00 94.50 4 | 3 1.50 96.00 5 | 2 1.00 97.00 6 | 3 1.50 98.50 7 | 1 0.50 99.00 8 | 1 0.50 99.50 9 | 1 0.50 100.00 ------------+----------------------------------- Total | 200 100.00 Observations with #signals >= 1 ------------------------------------------------------------------------------- _Signals #signals for the observation _diag signals for _HAT _RSTU _DFITS _COOKSD _WELSCH _COVRATIO _dfbeta signals for dbetas of read math female _Iprog_2 _Iprog_3 +-----------------------------------+ | id _Signals _diag _dfbeta | |-----------------------------------| | 111 1 000000 00001 | | 8 1 000000 10000 | | 200 1 000000 01000 | | 128 1 000000 00010 | | 121 1 000001 00000 | |-----------------------------------| | 142 1 000000 00001 | | 108 1 000000 00001 | | 138 1 000000 00001 | | 30 1 000000 00010 | | 103 1 000000 10000 | |-----------------------------------| | 63 1 000000 00010 | | 170 2 000000 11000 | | 92 2 000000 00011 | | 105 2 000000 01010 | | 81 2 010000 00100 | |-----------------------------------| | 109 2 000000 00011 | | 143 2 100001 00000 | | 175 2 000000 00011 | | 167 2 100001 00000 | | 60 2 010000 00100 | |-----------------------------------| | 44 2 010000 00001 | | 178 2 000000 01001 | | 144 2 000000 00011 | | 21 2 000000 01010 | | 133 3 010000 01001 | |-----------------------------------| | 18 3 010000 00101 | | 16 3 010000 00101 | | 43 4 010100 01010 | | 51 4 010100 00011 | | 32 5 101100 01001 | |-----------------------------------| | 85 5 010100 00111 | | 117 5 011100 00101 | | 187 6 010101 00111 | | 3 6 011100 11100 | | 62 7 011100 11011 | |-----------------------------------| | 50 7 011100 01111 | | 83 7 011100 11101 | | 126 8 010101 11111 | | 86 8 010101 11111 | +-----------------------------------+
Categorical Data Analysis Course
Phil Ender