How many observations within a category are sufficient for multiple linear regression?

Matthew Berg

Join Date: Jun 2023
Posts: 53

08 Feb 2024, 06:31

Dear Statalist,

My control variables are the device used, the operating system, gender, and whether the instructions were clicked.

After consulting my colleagues and searching online, I am still unsure how many observations I need within each category of these variables. I have 238 participants and roughly 850 observations after data cleaning (each participant takes part in four rounds of decision-making).

Code:

. tab  system

   DE01_PRV |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        825       97.40       97.40
    Android |         12        1.42       98.82
      Apple |         10        1.18      100.00
------------+-----------------------------------
      Total |        847      100.00

. tab device 

   DE01_FmF |      Freq.     Percent        Cum.
------------+-----------------------------------
   Computer |        776       91.62       91.62
     Tablet |         10        1.18       92.80
 Smartphone |         61        7.20      100.00
------------+-----------------------------------
      Total |        847      100.00

. tab instruclick

instruclick |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        618       73.92       73.92
          1 |        218       26.08      100.00
------------+-----------------------------------
      Total |        836      100.00


. tab gender

       GE05 |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        714       84.30       84.30
     female |        121       14.29       98.58
 not stated |         12        1.42      100.00
------------+-----------------------------------
      Total |        847      100.00

Question 1: Can I keep Tablet (N=10 observations) in the regression below and simply not comment on its significance (because I have fewer than 30 observations), or do I have to drop it from the regression because I do not interpret the result? The same applies to gender: I have 12 observations with no stated gender, and compared to men they are 43 percentage points more likely to show the behavior measured by my dependent variable. However, I am more interested in the effect for women versus men (-13 pp, not significant).

Do I need to recode female and not stated into "not male", or is it fine to leave the variable as is?
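
If the two categories were to be collapsed, a minimal sketch might look like the following. The numeric codes (1 = male, 2 = female, 3 = not stated) and the variable name notmale are assumptions for illustration, not taken from the post; the first command shows the actual codes. Whether collapsing is sensible remains a substantive choice.

```stata
* Assumption: gender is coded 1 = male, 2 = female, 3 = not stated.
* Check the actual numeric codes before recoding:
tabulate gender, nolabel

* Hypothetical collapsed indicator: 0 = male, 1 = female or not stated
recode gender (1 = 0 "male") (2 3 = 1 "not male"), generate(notmale)

* Verify the mapping against the original variable
tabulate gender notmale, missing
```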

Code:

. regress behavior i.T_C i.exp i.indep i.year i.confidence i.round i.gender i.device i.system i.instr
> uclick, vce(cluster ID)
note: 2.system omitted because of collinearity.

Linear regression                               Number of obs     =        586
                                                F(27, 230)        =      11.86
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1511
                                                Root MSE          =     .47169

                                  (Std. err. adjusted for 231 clusters in ID)
-------------------------------------------------------------------------------
              |               Robust
 behavior     | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
          T_C |
           2  |  -.0002946   .0590163    -0.00   0.996    -.1165764    .1159872
              |
          exp |
           2  |   .2807108    .200137     1.40   0.162    -.1136255    .6750471
           3  |   .2531873   .1738806     1.46   0.147    -.0894152    .5957898
           4  |    .291963   .1566806     1.86   0.064    -.0167498    .6006758
              |
      2.indep |  -.0663562   .0961313    -0.69   0.491    -.2557668    .1230543
              |
         year |
        2021  |  -.0300914     .07163    -0.42   0.675    -.1712263    .1110436
        2020  |   -.104248   .0978701    -1.07   0.288    -.2970845    .0885885
        2019  |   .0638098   .0715169     0.89   0.373    -.0771022    .2047218
        2018  |  -.0986539   .0727007    -1.36   0.176    -.2418984    .0445906
        2017  |  -.0526477   .0759911    -0.69   0.489    -.2023755      .09708
        2016  |   -.370949   .0932669    -3.98   0.000    -.5547158   -.1871821
        2015  |  -.1115331   .0842104    -1.32   0.187    -.2774556    .0543894
        2014  |   .0579834   .1116252     0.52   0.604    -.1619552    .2779221
        2013  |  -.0026666   .0713856    -0.04   0.970    -.1433199    .1379867
              |
   confidence |
           2  |  -.0342457   .0417133    -0.82   0.413    -.1164348    .0479433
           3  |   .0463232   .0512147     0.90   0.367    -.0545867    .1472332
           4  |  -.1103279   .0489368    -2.25   0.025    -.2067498   -.0139061
              |
        round |
           2  |   .0050149   .0445307     0.11   0.910    -.0827254    .0927553
           3  |  -.0003592   .0496134    -0.01   0.994     -.098114    .0973957
           4  |   .0531307   .0476901     1.11   0.266    -.0408345     .147096
              |
       gender |
      female  |  -.1317427   .0815155    -1.62   0.107    -.2923552    .0288698
  not stated  |   .4370247   .0508592     8.59   0.000     .3368151    .5372343
              |
       device |
      Tablet  |   .5374121   .0783172     6.86   0.000     .3831012     .691723
  Smartphone  |  -.2212038   .1358939    -1.63   0.105    -.4889598    .0465522
              |
       system |
     Android  |  -.0551672   .2526156    -0.22   0.827    -.5529037    .4425693
       Apple  |          0  (omitted)
              |
1.instruclick |     .15917   .0563434     2.82   0.005     .0481548    .2701853
        _cons |   .2354242   .2071774     1.14   0.257     -.172784    .6436324
-------------------------------------------------------------------------------

Question 2: The Apple level (versus an unknown operating system) is omitted because of collinearity. Does this mean I need to drop system as a control variable?

My rationale for including controls was: 1) perhaps the device and operating system change the participant's behavior, as some literature finds an effect of increased risk-taking (i.e., changed behavior) when retail investors trade on smartphones compared to desktop computers. 2) gender because I want to analyze whether there is a difference in their behavior, and 3) instructions: because perhaps more diligent people (as defined by reading the instructions during the experiment) behave differently than those who do not click on the instructions.
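
The note "2.system omitted because of collinearity" indicates that, within the estimation sample, the Apple indicator is an exact linear combination of other regressors. One possible explanation (an assumption, not something the output confirms) is that the Apple observations coincide with a single device level. A cross-tabulation restricted to the estimation sample is a simple way to check; the variable name used is hypothetical.

```stata
* Run immediately after the -regress- command above, while e(sample) is set.
* "used" is a hypothetical name for the estimation-sample marker.
generate byte used = e(sample)

* Device by operating system within the estimation sample
tabulate device system if used, missing
```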

Question 3: Regressions without controls (including only T_C, exp, indep, year, confidence, and round) sometimes have a different N and number of clusters than regressions with controls (as some variables get dropped, as with Apple above). Can I still compare them? That is, without controls these factors have a statistically significant influence; including the controls changes the coefficients slightly (2-3 pp, with significance maintained), and consulting the instructions is now highly significant.
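
One common way to make the two specifications comparable is to fit both on the same observations. The sketch below assumes it is run from a do-file (because of the /// line continuations); the names full, base, and insample are hypothetical.

```stata
* Full model first, then mark its estimation sample
regress behavior i.T_C i.exp i.indep i.year i.confidence i.round ///
    i.gender i.device i.system i.instruclick, vce(cluster ID)
estimates store full
generate byte insample = e(sample)

* Baseline model restricted to the same observations
regress behavior i.T_C i.exp i.indep i.year i.confidence i.round ///
    if insample, vce(cluster ID)
estimates store base

* Side-by-side coefficients, standard errors, and N
estimates table base full, b(%9.4f) se stats(N)
```

With both models restricted to insample, differences in the coefficients reflect the added regressors rather than a change in which observations are used.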

Thank you very much in advance!

I very much appreciate your help and comments!

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17608
#2

08 Feb 2024, 08:11

Matthew:
1) your sample size is large enough for a linear regression;
2) you seem to have a panel dataset: therefore, your first choice should be -xtreg-;
3) controls (actually, you do not control, as you're not running an experiment, setting causal inference aside for a while, but adjust for the remaining independent variables) can be omitted if redundant, not because they are collinear;
4) I do not understand your point #1 about -Tablet-: what does N=10 mean in that context?
5) Statistical significance is not, per se, a scientific criterion to inform decisions about dependent variables. You should give a fair and true view of the data generating process you're investigating.

Kind regards,
Carlo
(StataNow 18.5)
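
The -xtreg- suggestion in point 2 could be sketched as follows. This assumes ID is a numeric participant identifier (as implied by vce(cluster ID) in the first post) and is meant for a do-file. The random-effects estimator is shown only as an illustration: a fixed-effects specification would drop regressors that do not vary within participant, such as gender.

```stata
* Declare the panel structure; ID identifies participants
xtset ID

* Random-effects model with the same right-hand side and cluster-robust SEs
xtreg behavior i.T_C i.exp i.indep i.year i.confidence i.round ///
    i.gender i.device i.system i.instruclick, re vce(cluster ID)
```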
Matthew Berg

Join Date: Jun 2023

Posts: 53
#3

08 Feb 2024, 08:40

Carlo, Thank you very much!
Can you please elaborate on your point 3? I have a basic regression (including only T_C, exp, indep, year, confidence, and round) and one regression where I include all previously mentioned variables and, in addition, device, system, gender, and whether the instructions were clicked. Are you saying that this is not really "controlling" for variables? Previous experimental papers listed their "control" variables, added them to the regressions, and compared the results with and without controls. I therefore decided to follow this approach.

Regarding 4) N=10 means that I have 10 observations in which people completed the study on a tablet, compared to 776 observations on a desktop computer. My concern is that I have far too few tablet observations to analyze tablet users' behavior relative to desktop users, so the result (completing the study on a tablet is associated with a 53 percentage point increase in the dependent variable) is not really meaningful.

May I kindly ask if you have any answers regarding my questions 2 and 3?
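
Because each participant contributes up to four rounds, the 10 tablet observations may come from only a handful of participants, and with cluster-robust standard errors the number of participants per category matters as well. A minimal sketch for counting them, assuming ID identifies participants (first_obs is a hypothetical variable name):

```stata
* Tag one observation per participant-device combination
egen byte first_obs = tag(ID device)

* Number of distinct participants per device
tabulate device if first_obs
```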
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17608
#4

08 Feb 2024, 08:55

Matthew:
2) I'd not drop -Apple- because of collinearity;
3) you can add the so-called controls, but you do not actually control (since what you're studying has already happened), but simply adjust. That said, you should consider what is the best specification for the right-hand side of your regression equation to give a fair and true view of the data generating process you're investigating;
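4) Stata restricts -e(sample)- to the observations with non-missing values on every variable included in the model. Tablet is a level of -device- rather than a separate variable, so including it does not by itself drop your sample to 10 observations; the Tablet coefficient, however, rests on at most those 10 observations.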

Kind regards,
Carlo
(StataNow 18.5)
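
To see how many observations, and how many Tablet observations, actually enter the estimation sample, e(sample) can be inspected directly after the -regress- command from the first post. The sketch below is meant for a do-file and uses the variable names from that command.

```stata
* Run right after the -regress- command, while e(sample) is still set
count if e(sample)              // compare with "Number of obs" in the output
tabulate device if e(sample)    // observations per device level actually used

* Which model variables have missing values?
misstable summarize behavior T_C exp indep year confidence round ///
    gender device system instruclick
```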
Matthew Berg

Join Date: Jun 2023

Posts: 53
#5

09 Feb 2024, 05:44

Thank you so very much!