_rmcoll question: Stata and Python OLS regression comparison

Cyrus Levy

Join Date: Nov 2014

Posts: 99
#1

_rmcoll question: Stata and Python OLS regression comparison

29 Apr 2024, 10:38

I am trying to replicate a simple Stata OLS regression generated with Stata's regress command in Python. This is both for learning purposes and double-checking the Stata results. I own a Stata license and I trust Stata's statistical abilities , but I would like to understand what it does with the _rmcoll routine before the regression runs.

Stata regress command uses _rmcoll to omit collinear variables. _rmcoll is built-in so it does not have an ado file to view what it does. I would like to replicate the same regression in Python but statsmodels package does not have a tool similar to _rmcoll. So I programmed the code below. It does a good job with smaller datasets: the same variables are dropped in Stata and Python, and the regression results in Stata and Python are identical. However, with larger datasets, the Python code starts dropping more variables and the regression results of Stata and Python are different. How can I improve the code below so that Python can generate identical regression results even with larger datasets? Note that just changing the tolerance value does not fix the issue.

Code:

# Convert the data to a NumPy array X = reg_data.values # Perform QR decomposition Q, R = np.linalg.qr(X.T) # Determine an appropriate tolerance value based on the machine epsilon tol = np.finfo(X.dtype).eps * max(X.shape) * 1e-6 # Identify and remove perfectly predicted variables keep_cols = np.where(np.abs(np.diag(R)) > tol)[0] reg_data = reg_data.iloc[:, keep_cols]
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10190

29 Apr 2024, 13:52

Originally posted by Cyrus Levy View Post

I am trying to replicate a simple Stata OLS regression generated with Stata's regress command in Python. This is both for learning purposes and double-checking the Stata results. I own a Stata license and I trust Stata's statistical abilities

, but I would like to understand what it does with the _rmcoll routine before the regression runs.

Stata regress command uses _rmcoll to omit collinear variables. _rmcoll is built-in so it does not have an ado file to view what it does. I would like to replicate the same regression in Python but statsmodels package does not have a tool similar to _rmcoll. So I programmed the code below. It does a good job with smaller datasets: the same variables are dropped in Stata and Python, and the regression results in Stata and Python are identical. However, with larger datasets, the Python code starts dropping more variables and the regression results of Stata and Python are different. How can I improve the code below so that Python can generate identical regression results even with larger datasets? Note that just changing the tolerance value does not fix the issue.

Code:

# Convert the data to a NumPy array
X = reg_data.values

# Perform QR decomposition
Q, R = np.linalg.qr(X.T)

# Determine an appropriate tolerance value based on the machine epsilon
tol = np.finfo(X.dtype).eps * max(X.shape) * 1e-6

# Identify and remove perfectly predicted variables
keep_cols = np.where(np.abs(np.diag(R)) > tol)[0]
reg_data = reg_data.iloc[:, keep_cols]

Your post is hardly legible with a white font on a light blue background. You can take a look at Ben Jann's Mata function

Code:

help mata mm_ls()

that is part of

Code:

ssc install moremata

which behaves more like regress compared to other "hand-coded" least squares estimators. Here is an example where the function will not drop higher-order terms like regress whereas other commands that implement least squares or maximum likelihood estimators will. Then you can see if you can implement something similar in Python.

Code:

ssc install moremata, replace
clear
set seed 999
set obs 400
gen y = rnormal()
gen x1 = ln(_n + 1000)
gen x2 = x1^2
gen x3 = x1^3

regress y x1-x3
glm y x1-x3
*MM_LSFIT
putmata y=y X=(x1 x2 x3), replace
mata
b = mm_lsfit(y, X)
b
end

Res.:

Code:

. regress y x1-x3

      Source |       SS           df       MS      Number of obs   =       400
-------------+----------------------------------   F(3, 396)       =      1.44
       Model |  4.67519272         3  1.55839757   Prob > F        =    0.2315
    Residual |  429.449514       396  1.08446847   R-squared       =    0.0108
-------------+----------------------------------   Adj R-squared   =    0.0033
       Total |  434.124707       399  1.08803185   Root MSE        =    1.0414

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |  -1298.321   10929.67    -0.12   0.906    -22785.75     20189.1
          x2 |    179.992   1544.121     0.12   0.907    -2855.707    3215.691
          x3 |  -8.320985   72.70972    -0.11   0.909    -151.2663    134.6243
       _cons |   3122.863   25785.13     0.12   0.904    -47569.99    53815.72
------------------------------------------------------------------------------

.
. glm y x1-x3
note: x3 omitted because of collinearity.

Iteration 0:  Log likelihood = -581.78996  

Generalized linear models                         Number of obs   =        400
Optimization     : ML                             Residual df     =        397
                                                  Scale parameter =   1.081773
Deviance         =  429.4637175                   (1/df) Deviance =   1.081773
Pearson          =  429.4637175                   (1/df) Pearson  =   1.081773

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = u                       [Identity]

                                                  AIC             =    2.92395
Log likelihood   = -581.7899562                   BIC             =  -1949.148

------------------------------------------------------------------------------
             |                 OIM
           y | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |  -47.55762   87.55635    -0.54   0.587    -219.1649    124.0497
          x2 |   3.282378   6.183074     0.53   0.596    -8.836224    15.40098
          x3 |          0  (omitted)
       _cons |   172.1966   309.9173     0.56   0.578      -435.23    779.6233
------------------------------------------------------------------------------

.
. *MM_LSFIT

.
. putmata y=y X=(x1 x2 x3), replace
(1 vector, 1 matrix posted)

.
. mata
------------------------------------------------- mata (type end to exit) ----------------------------------------------
:
: b = mm_lsfit(y, X)

:
: b
                  1
    +----------------+
  1 |  -1298.323697  |
  2 |   179.9923073  |
  3 |  -8.321001467  |
  4 |   3122.868871  |
    +----------------+

:
: end
------------------------------------------------------------------------------------------------------------------------

Last edited by Andrew Musau; 29 Apr 2024, 13:57.

Announcement

_rmcoll question: Stata and Python OLS regression comparison

Comment