Linear Statistical Models: Regression

Polynomial Regression

Updated for Stata 11


Polynomial regression can be used to fit a regression curve to a curved set of points. Contrary to how it sounds, curvilinear regression uses a linear model to fit a curved line to data points. Curvilinear regression makes use of various transformations of variables to achieve its fit. An example of a curvilinear model is

y = b0 + b1X1 + b2X2 + e

where X2 = X1^2.
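Because such a model is linear in its coefficients, it can be estimated with ordinary least squares on a design matrix whose columns include the transformed term. A minimal sketch of the idea in Python with NumPy, using made-up data from a known quadratic:

```python
import numpy as np

# Hypothetical data generated from a known quadratic: y = 1 + 2x - 0.5x^2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 1 + 2 * x - 0.5 * x**2

# Design matrix: intercept column, x, and the transformed term x^2.
# The model is nonlinear in x but linear in the coefficients b0, b1, b2,
# so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # recovers approximately [1.0, 2.0, -0.5]
```

The transformation x -> (x, x^2) is what makes the curved fit possible within a linear model.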

Curvilinear regression should not be confused with nonlinear regression. Nonlinear regression fits models that are nonlinear in the parameters, for example, an exponential model such as

y = b0*e^(b1*x) + e

which cannot be estimated with ordinary least squares.
Example 1

From Pedhazur (1997), a study looks at practice time (x) in minutes and the number of correct responses (y).

Stata Curvilinear Regression Program

use http://www.philender.com/courses/data/curve, clear

scatter y x

Remarks

Inspection of the y vs x plot reveals a degree of curvilinearity.

Based upon the scatterplot we will try three models:
model 1 -- y = b0 + b1x + e -- linear
model 2 -- y = b0 + b1x + b2x^2 + e -- quadratic
model 3 -- y = b0 + b1x + b2x^2 + b3x^3 + e -- cubic

regress y x   /* linear */

  Source |       SS       df       MS                  Number of obs =      18
---------+------------------------------               F(  1,    16) =   32.72
   Model |  380.112798     1  380.112798               Prob > F      =  0.0000
Residual |  185.887202    16  11.6179501               R-squared     =  0.6716
---------+------------------------------               Adj R-squared =  0.6511
   Total |      566.00    17  33.2941176               Root MSE      =  3.4085

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   1.284165   .2245067      5.720   0.000       .8082319    1.760098
   _cons |    4.89154    1.73176      2.825   0.012       1.220372    8.562708
------------------------------------------------------------------------------

regress y c.x##c.x   /* linear and quadratic */

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  2,    15) =   31.90
       Model |  458.245766     2  229.122883           Prob > F      =  0.0000
    Residual |  107.754234    15  7.18361562           R-squared     =  0.8096
-------------+------------------------------           Adj R-squared =  0.7842
       Total |         566    17  33.2941176           Root MSE      =  2.6802

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   4.151667   .8872181     4.68   0.000     2.260607    6.042728
             |
     c.x#c.x |   -.209529   .0635329    -3.30   0.005    -.3449462   -.0741119
             |
       _cons |  -2.236083    2.55445    -0.88   0.395    -7.680764    3.208598
------------------------------------------------------------------------------

regress y c.x##c.x##c.x   /* linear, quadratic and cubic */

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  3,    14) =   20.30
       Model |  460.224174     3  153.408058           Prob > F      =  0.0000
    Residual |  105.775826    14  7.55541616           R-squared     =  0.8131
-------------+------------------------------           Adj R-squared =  0.7731
       Total |         566    17  33.2941176           Root MSE      =  2.7487

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   2.267499   3.792818     0.60   0.559    -5.867288    10.40229
             |
     c.x#c.x |   .0975798   .6036817     0.16   0.874    -1.197189    1.392348
             |
 c.x#c.x#c.x |  -.0144026   .0281457    -0.51   0.617    -.0747692     .045964
             |
       _cons |   .7460164     6.3894     0.12   0.909    -12.95788    14.44992
------------------------------------------------------------------------------

test x c.x#c.x

 ( 1)  x = 0
 ( 2)  c.x#c.x = 0

       F(  2,    14) =   15.21
            Prob > F =    0.0003
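Whether the cubic term is worth keeping can also be checked with an incremental F test computed directly from the sums of squares reported above: the gain in model SS from model 2 to model 3, divided by its one degree of freedom, over the residual MS of model 3.

```python
# Incremental F test for the cubic term, using the sums of squares
# reported in the Stata output above.
ss_model_quad = 458.245766   # model SS, quadratic model
ss_model_cube = 460.224174   # model SS, cubic model
ms_resid_cube = 7.55541616   # residual MS, cubic model (df = 14)

# One extra parameter, so the numerator df is 1.
f_inc = (ss_model_cube - ss_model_quad) / 1 / ms_resid_cube
print(round(f_inc, 2))  # 0.26
```

This matches t^2 for the cubic coefficient ((-0.51)^2 ≈ 0.26), confirming that the cubic term adds essentially nothing over the quadratic model.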
            
/* rerun regression with linear and quadratic */

regress y c.x##c.x

[output omitted]
            
predict p

scatter y p x, msym(o i) con(. l)

Remarks

From the above analysis, model 2 appears to be our best bet. The fitted quadratic model is
y = -2.236083 + 4.151667x - 0.209529x^2. A plot of y vs x with the predicted scores connected by a curved line is displayed above.
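As a quick check of the fitted equation, a predicted score can be computed by hand from the model 2 coefficients; for example, at x = 5 minutes of practice:

```python
# Coefficients from the quadratic model (model 2) above.
b0, b1, b2 = -2.236083, 4.151667, -0.209529

# Predicted number of correct responses at x = 5 minutes of practice.
x = 5
y_hat = b0 + b1 * x + b2 * x**2
print(round(y_hat, 2))  # 13.28
```

This is the same arithmetic that Stata's predict command performs for every observation.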

Example 2

Here is another artificial example. This time we are looking at the relationship between test performance and anxiety.

input anxiety perform 
1  11  
1  13  
2  24  
2  20  
3  42  
3  36  
4  48  
4  42  
5  46  
5  38  
6  23  
6  19  
7   9  
7  11  
end

These data graph into an inverted-U shape. Let's run a second-degree polynomial regression.

scatter perform anxiety



regress perform c.anxiety##c.anxiety

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  2,    11) =   44.51
       Model |  2334.38095     2  1167.19048           Prob > F      =  0.0000
    Residual |   288.47619    11  26.2251082           R-squared     =  0.8900
-------------+------------------------------           Adj R-squared =  0.8700
       Total |  2622.85714    13  201.758242           Root MSE      =   5.121

------------------------------------------------------------------------------
     perform |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     anxiety |   29.63095    3.23401     9.16   0.000     22.51294    36.74896
             |
   c.anxiety#|
   c.anxiety |   -3.72619   .3950972    -9.43   0.000    -4.595794   -2.856587
             |
       _cons |  -16.71429   5.643117    -2.96   0.013     -29.1347   -4.293868
------------------------------------------------------------------------------
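Since the data for this example are listed in full, the quadratic fit can be cross-checked outside Stata. A sketch using NumPy least squares reproduces the coefficients:

```python
import numpy as np

# The anxiety/performance data entered above (each anxiety level twice).
anxiety = np.repeat(np.arange(1, 8), 2).astype(float)
perform = np.array([11, 13, 24, 20, 42, 36, 48, 42, 46, 38, 23, 19, 9, 11],
                   dtype=float)

# Design matrix: intercept, anxiety, anxiety squared.
X = np.column_stack([np.ones_like(anxiety), anxiety, anxiety**2])
b, *_ = np.linalg.lstsq(X, perform, rcond=None)

# Matches the Stata coefficients: _cons, anxiety, c.anxiety#c.anxiety
print(np.round(b, 5))  # [-16.71429  29.63095  -3.72619]
```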

predict p

scatter perform p anxiety, msym(o i) con(. l)


In social psychology, this inverted-U curve is called the Yerkes-Dodson curve.
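The peak of the inverted-U follows directly from the fitted coefficients: a quadratic y = b0 + b1x + b2x^2 with b2 < 0 is maximized where its derivative is zero, at x = -b1/(2*b2).

```python
# Coefficients from the quadratic fit above.
b1, b2 = 29.63095, -3.72619

# A downward-opening parabola peaks where the derivative b1 + 2*b2*x is zero.
x_peak = -b1 / (2 * b2)
print(round(x_peak, 2))  # 3.98
```

So predicted performance is highest at an anxiety level of about 4, consistent with the Yerkes-Dodson idea that moderate arousal is optimal.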

Example 3

Let's try this using the hsb2 dataset.


Linear Statistical Models Course

Phil Ender, 21Jun99