  • _rmcoll question: Stata and Python OLS regression comparison

    I am trying to replicate a simple Stata OLS regression, generated with Stata's regress command, in Python. This is both for learning purposes and for double-checking the Stata results. I own a Stata license and I trust Stata's statistical abilities, but I would like to understand what it does with the _rmcoll routine before the regression runs.

    Stata's regress command uses _rmcoll to omit collinear variables. _rmcoll is built in, so there is no ado-file to view to see what it does. I would like to replicate the same regression in Python, but the statsmodels package does not have a tool similar to _rmcoll, so I programmed the code below. It does a good job with smaller datasets: the same variables are dropped in Stata and Python, and the regression results are identical. With larger datasets, however, the Python code starts dropping more variables than Stata does, and the regression results differ. How can I improve the code below so that Python generates identical regression results even with larger datasets? Note that just changing the tolerance value does not fix the issue.

    Code:
    import numpy as np
    
    # reg_data is a pandas DataFrame holding the regressors
    
    # Convert the data to a NumPy array
    X = reg_data.values
    
    # Perform QR decomposition on the transposed design matrix
    Q, R = np.linalg.qr(X.T)
    
    # Determine a tolerance value based on the machine epsilon
    tol = np.finfo(X.dtype).eps * max(X.shape) * 1e-6
    
    # Identify the columns to keep and drop the perfectly collinear ones
    keep_cols = np.where(np.abs(np.diag(R)) > tol)[0]
    reg_data = reg_data.iloc[:, keep_cols]

  • #2
    Originally posted by Cyrus Levy View Post
    I am trying to replicate a simple Stata OLS regression, generated with Stata's regress command, in Python. [...]
    Your post is hardly legible with a white font on a light blue background. You can take a look at Ben Jann's Mata function

    Code:
    help mata mm_ls()
    that is part of

    Code:
    ssc install moremata
    which behaves more like regress than other "hand-coded" least squares estimators do. Here is an example where the function, like regress, does not drop higher-order terms, whereas other commands that implement least squares or maximum likelihood estimators will. You can then see whether you can implement something similar in Python; a sketch follows the output below.

    Code:
    ssc install moremata, replace
    clear
    set seed 999
    set obs 400
    gen y = rnormal()
    gen x1 = ln(_n + 1000)
    gen x2 = x1^2
    gen x3 = x1^3
    
    regress y x1-x3
    glm y x1-x3
    *MM_LSFIT
    putmata y=y X=(x1 x2 x3), replace
    mata
    b = mm_lsfit(y, X)
    b
    end
    Res.:

    Code:
    . regress y x1-x3
    
          Source |       SS           df       MS      Number of obs   =       400
    -------------+----------------------------------   F(3, 396)       =      1.44
           Model |  4.67519272         3  1.55839757   Prob > F        =    0.2315
        Residual |  429.449514       396  1.08446847   R-squared       =    0.0108
    -------------+----------------------------------   Adj R-squared   =    0.0033
           Total |  434.124707       399  1.08803185   Root MSE        =    1.0414
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x1 |  -1298.321   10929.67    -0.12   0.906    -22785.75     20189.1
              x2 |    179.992   1544.121     0.12   0.907    -2855.707    3215.691
              x3 |  -8.320985   72.70972    -0.11   0.909    -151.2663    134.6243
           _cons |   3122.863   25785.13     0.12   0.904    -47569.99    53815.72
    ------------------------------------------------------------------------------
    
    .
    . glm y x1-x3
    note: x3 omitted because of collinearity.
    
    Iteration 0:  Log likelihood = -581.78996  
    
    Generalized linear models                         Number of obs   =        400
    Optimization     : ML                             Residual df     =        397
                                                      Scale parameter =   1.081773
    Deviance         =  429.4637175                   (1/df) Deviance =   1.081773
    Pearson          =  429.4637175                   (1/df) Pearson  =   1.081773
    
    Variance function: V(u) = 1                       [Gaussian]
    Link function    : g(u) = u                       [Identity]
    
                                                      AIC             =    2.92395
    Log likelihood   = -581.7899562                   BIC             =  -1949.148
    
    ------------------------------------------------------------------------------
                 |                 OIM
               y | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x1 |  -47.55762   87.55635    -0.54   0.587    -219.1649    124.0497
              x2 |   3.282378   6.183074     0.53   0.596    -8.836224    15.40098
              x3 |          0  (omitted)
           _cons |   172.1966   309.9173     0.56   0.578      -435.23    779.6233
    ------------------------------------------------------------------------------
    
    .
    . *MM_LSFIT
    
    .
    . putmata y=y X=(x1 x2 x3), replace
    (1 vector, 1 matrix posted)
    
    .
    . mata
    ------------------------------------------------- mata (type end to exit) ----------------------------------------------
    :
    : b = mm_lsfit(y, X)
    
    :
    : b
                      1
        +----------------+
      1 |  -1298.323697  |
      2 |   179.9923073  |
      3 |  -8.321001467  |
      4 |   3122.868871  |
        +----------------+
    
    :
    : end
    ------------------------------------------------------------------------------------------------------------------------
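
    For the Python side, here is a minimal sketch of one way to get _rmcoll-like dropping (this is an assumption about a reasonable approach; Stata's actual routine is undocumented, so it is not a claim about its internals). It runs QR with column pivoting on the design matrix itself, rather than its transpose, and thresholds the diagonal of R with a tolerance that scales with the data instead of the fixed cutoff in #1. The helper name drop_collinear is made up for illustration, and reg_data and y are the objects from #1.

    Code:
    import numpy as np
    import pandas as pd
    import scipy.linalg
    import statsmodels.api as sm
    
    def drop_collinear(df: pd.DataFrame) -> pd.DataFrame:
        """Drop (near-)collinear columns from a DataFrame of regressors."""
        X = df.to_numpy(dtype=np.float64)
        # QR with column pivoting sorts the diagonal of R into decreasing
        # magnitude, which makes thresholding it a valid numerical rank
        # test; plain unpivoted QR offers no such guarantee.
        _, R, piv = scipy.linalg.qr(X, mode="economic", pivoting=True)
        diag = np.abs(np.diag(R))
        # Relative tolerance scaled by the largest diagonal entry, in the
        # spirit of numpy.linalg.matrix_rank's default; an absolute cutoff
        # tends to misbehave as the dataset grows.
        tol = diag[0] * max(X.shape) * np.finfo(X.dtype).eps
        rank = int(np.sum(diag > tol))
        keep = np.sort(piv[:rank])  # restore the original column order
        return df.iloc[:, keep]
    
    # Hypothetical usage with the objects from #1:
    # reg_data = drop_collinear(reg_data)
    # res = sm.OLS(y, sm.add_constant(reg_data)).fit()
    # print(res.summary())
    One caveat: when a group of variables is collinear, pivoting keeps the best-conditioned members, whereas regress keeps the ones listed first, so the two may retain different (but equivalent) columns. The fitted values and R-squared will still agree even if the surviving variable names do not.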
    Last edited by Andrew Musau; 29 Apr 2024, 13:57.
