I am trying to replicate a simple Stata OLS regression generated with Stata's regress command in Python. This is both for learning purposes and double-checking the Stata results. I own a Stata license and I trust Stata's statistical abilities
, but I would like to understand what it does with the _rmcoll routine before the regression runs.
Stata regress command uses _rmcoll to omit collinear variables. _rmcoll is built-in so it does not have an ado file to view what it does. I would like to replicate the same regression in Python but statsmodels package does not have a tool similar to _rmcoll. So I programmed the code below. It does a good job with smaller datasets: the same variables are dropped in Stata and Python, and the regression results in Stata and Python are identical. However, with larger datasets, the Python code starts dropping more variables and the regression results of Stata and Python are different. How can I improve the code below so that Python can generate identical regression results even with larger datasets? Note that just changing the tolerance value does not fix the issue.

Stata regress command uses _rmcoll to omit collinear variables. _rmcoll is built-in so it does not have an ado file to view what it does. I would like to replicate the same regression in Python but statsmodels package does not have a tool similar to _rmcoll. So I programmed the code below. It does a good job with smaller datasets: the same variables are dropped in Stata and Python, and the regression results in Stata and Python are identical. However, with larger datasets, the Python code starts dropping more variables and the regression results of Stata and Python are different. How can I improve the code below so that Python can generate identical regression results even with larger datasets? Note that just changing the tolerance value does not fix the issue.
Code:
# Convert the data to a NumPy array X = reg_data.values # Perform QR decomposition Q, R = np.linalg.qr(X.T) # Determine an appropriate tolerance value based on the machine epsilon tol = np.finfo(X.dtype).eps * max(X.shape) * 1e-6 # Identify and remove perfectly predicted variables keep_cols = np.where(np.abs(np.diag(R)) > tol)[0] reg_data = reg_data.iloc[:, keep_cols]
Comment