
  • How many observations within a category are sufficient for multiple linear regression?

    Dear Statalist,

    As control variables, I use the device, the operating system, gender, and whether the instructions were clicked.

    After consulting my colleagues and the internet, I am still unsure how many observations I need within each category of these variables. I have 238 participants and roughly 850 observations after data cleaning (each participant takes part in four rounds of decision-making).

    Code:
    . tab  system
    
       DE01_PRV |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        825       97.40       97.40
        Android |         12        1.42       98.82
          Apple |         10        1.18      100.00
    ------------+-----------------------------------
          Total |        847      100.00
    
    . tab device 
    
       DE01_FmF |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       Computer |        776       91.62       91.62
         Tablet |         10        1.18       92.80
     Smartphone |         61        7.20      100.00
    ------------+-----------------------------------
          Total |        847      100.00
    
    . tab instruclick
    
    instruclick |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        618       73.92       73.92
              1 |        218       26.08      100.00
    ------------+-----------------------------------
          Total |        836      100.00
    
    
    . tab gender
    
           GE05 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           male |        714       84.30       84.30
         female |        121       14.29       98.58
     not stated |         12        1.42      100.00
    ------------+-----------------------------------
          Total |        847      100.00

    Question 1: Can I keep Tablet (N=10 observations) in the regression below and simply not comment on its significance (because I have fewer than 30 observations), or do I have to drop it from the regression because I don't interpret the result? The same applies to gender: 12 observations have no stated gender and, compared to men, they are 43 percentage points more likely to show the behavior measured by my dependent variable. However, I am more interested in the effect for women versus men (-13 pp, not significant).

    Do I need to recode female and not stated into not male or is it fine to leave as is?
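
    To make the two recoding options concrete, here is a minimal sketch on simulated data. The numeric codes (1 = male, 2 = female, 3 = not stated) and the names notmale and female are assumptions for illustration; only the cell counts are taken from the tabulation above.

    ```stata
    * Simulated stand-in for the gender variable; codes are assumed, not the real GE05 coding
    clear
    set obs 847
    generate byte gender = 1                      // 1 = male (714 obs)
    replace gender = 2 in 715/835                 // 2 = female (121 obs)
    replace gender = 3 in 836/847                 // 3 = not stated (12 obs)
    label define genderlbl 1 "male" 2 "female" 3 "not stated"
    label values gender genderlbl

    * Option A: pool female and not stated into a single "not male" indicator
    generate byte notmale = (gender != 1) if !missing(gender)

    * Option B: keep female vs. male and set "not stated" to missing
    generate byte female = (gender == 2) if inlist(gender, 1, 2)

    tabulate gender notmale
    tabulate gender female, missing
    ```

    Option A mixes the 12 "not stated" observations into the female comparison, while Option B removes them from any regression that includes female.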


    Code:
    . regress behavior i.T_C i.exp i.indep i.year i.confidence i.round i.gender i.device i.system i.instr
    > uclick, vce(cluster ID)
    note: 2.system omitted because of collinearity.
    
    Linear regression                               Number of obs     =        586
                                                    F(27, 230)        =      11.86
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1511
                                                    Root MSE          =     .47169
    
                                      (Std. err. adjusted for 231 clusters in ID)
    -------------------------------------------------------------------------------
                  |               Robust
     behavior     | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    --------------+----------------------------------------------------------------
              T_C |
               2  |  -.0002946   .0590163    -0.00   0.996    -.1165764    .1159872
                  |
              exp |
               2  |   .2807108    .200137     1.40   0.162    -.1136255    .6750471
               3  |   .2531873   .1738806     1.46   0.147    -.0894152    .5957898
               4  |    .291963   .1566806     1.86   0.064    -.0167498    .6006758
                  |
          2.indep |  -.0663562   .0961313    -0.69   0.491    -.2557668    .1230543
                  |
             year |
            2021  |  -.0300914     .07163    -0.42   0.675    -.1712263    .1110436
            2020  |   -.104248   .0978701    -1.07   0.288    -.2970845    .0885885
            2019  |   .0638098   .0715169     0.89   0.373    -.0771022    .2047218
            2018  |  -.0986539   .0727007    -1.36   0.176    -.2418984    .0445906
            2017  |  -.0526477   .0759911    -0.69   0.489    -.2023755      .09708
            2016  |   -.370949   .0932669    -3.98   0.000    -.5547158   -.1871821
            2015  |  -.1115331   .0842104    -1.32   0.187    -.2774556    .0543894
            2014  |   .0579834   .1116252     0.52   0.604    -.1619552    .2779221
            2013  |  -.0026666   .0713856    -0.04   0.970    -.1433199    .1379867
                  |
       confidence |
               2  |  -.0342457   .0417133    -0.82   0.413    -.1164348    .0479433
               3  |   .0463232   .0512147     0.90   0.367    -.0545867    .1472332
               4  |  -.1103279   .0489368    -2.25   0.025    -.2067498   -.0139061
                  |
            round |
               2  |   .0050149   .0445307     0.11   0.910    -.0827254    .0927553
               3  |  -.0003592   .0496134    -0.01   0.994     -.098114    .0973957
               4  |   .0531307   .0476901     1.11   0.266    -.0408345     .147096
                  |
           gender |
          female  |  -.1317427   .0815155    -1.62   0.107    -.2923552    .0288698
      not stated  |   .4370247   .0508592     8.59   0.000     .3368151    .5372343
                  |
           device |
          Tablet  |   .5374121   .0783172     6.86   0.000     .3831012     .691723
      Smartphone  |  -.2212038   .1358939    -1.63   0.105    -.4889598    .0465522
                  |
         system   |
         Android  |  -.0551672   .2526156    -0.22   0.827    -.5529037    .4425693
           Apple  |          0  (omitted)
                  |
    1.instruclick |     .15917   .0563434     2.82   0.005     .0481548    .2701853
            _cons |   .2354242   .2071774     1.14   0.257     -.172784    .6436324
    -------------------------------------------------------------------------------

    Question 2: Apple versus an unknown operating system is omitted because of collinearity. Does this mean I need to drop system as a control variable?

    My rationale for including controls was: 1) perhaps the device and operating system change the participant's behavior, as some literature finds an effect of increased risk-taking (i.e., changed behavior) when retail investors trade on smartphones compared to desktop computers. 2) gender because I want to analyze whether there is a difference in their behavior, and 3) instructions: because perhaps more diligent people (as defined by reading the instructions during the experiment) behave differently than those who do not click on the instructions.

    Question 3: Regressions without controls (including only T_C, exp, indep, year, confidence and round) sometimes have a different N and number of clusters than regressions with controls (because some variables get dropped, as with Apple above). Can I still compare them? That is, without controls these factors have a statistically significant influence; including the controls changes the coefficients slightly (2-3 pp while maintaining significance), and consulting the instructions is now highly significant.

    Thank you very much in advance!

    I very much appreciate your help and comments!
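
    For Question 2, a cross-tabulation is one way to see why a level is omitted. The sketch below uses simulated data and assumes, purely for illustration, that every Apple observation is also a Tablet observation; the numeric codes and the outcome y are hypothetical.

    ```stata
    * Simulated device/system data; the Apple = Tablet overlap is an assumption
    clear
    set seed 2024
    set obs 847
    generate byte device = 1                      // 1 = Computer (776 obs)
    replace device = 2 in 777/786                 // 2 = Tablet (10 obs)
    replace device = 3 in 787/847                 // 3 = Smartphone (61 obs)
    generate byte system = 0                      // 0 = unknown
    replace system = 2 if device == 2             // 2 = Apple, assumed to coincide with Tablet
    replace system = 1 in 787/798                 // 1 = Android (12 smartphone obs)

    * If one column is fully contained in one row, the two indicators are identical
    tabulate device system

    * Placeholder outcome, only to let regress report which term it omits
    generate y = rnormal()
    regress y i.device i.system
    ```

    With the real data, running tabulate device system if e(sample) directly after the regression would show whether the same overlap holds in the estimation sample.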

  • #2
    Matthew:
    1) your sample size is large enough for a linear regression;
    2) you seem to have a panel dataset: therefore, your first choice should be -xtreg-;
    3) controls (actually, you do not control, as you're not running an experiment - setting causal inference aside for a while - but adjust for the remaining independent variables) can be omitted if redundant, not because collinear;
    4) I do not understand your point #1 about -Tablet-: what does N=10 mean in that context?
    5) Statistical significance is not, per se, a scientific criterion to inform decisions about dependent variables. You should give a fair and true view of the data generating process you're investigating.
    Kind regards,
    Carlo
    (StataNow 18.5)
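
    A minimal sketch of what the -xtreg- suggestion in point 2 could look like, on simulated data. The random-effects option (re), the panel declaration with round as the time variable, and the data generating process are assumptions for illustration, not a recommended specification.

    ```stata
    * Simulated panel: 238 participants, 4 rounds each (all values are made up)
    clear
    set seed 2024
    set obs 238
    generate int ID = _n
    generate u = rnormal()                        // participant-level effect
    generate byte T_C = 1 + (runiform() < 0.5)    // treatment arm, constant within participant
    expand 4
    bysort ID: generate byte round = _n           // rounds 1 to 4
    generate behavior = 0.3 + 0.1*(T_C == 2) + 0.2*u + rnormal(0, 0.4)

    * Declare the panel structure, then fit a random-effects model
    xtset ID round
    xtreg behavior i.T_C i.round, re vce(cluster ID)
    ```

    Because T_C does not vary within a participant in this sketch, a fixed-effects model (fe) could not estimate its coefficient, which is why re is used here.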



    • #3
      Carlo, Thank you very much!
      Can you please elaborate on your point 3? I have a basic regression (including only T_C, exp, indep, year, confidence and round) and one regression in which I include all previously mentioned variables and, in addition, device, system, gender, and whether the instructions were clicked. Are you saying that this is not really "controlling" for variables? Previous experimental papers listed their "control" variables, added them to the regressions, and compared the results (with and without controls). Therefore, I decided to copy this approach.

      Regarding 4): N=10 means that I have 10 observations in which people completed the study on a tablet, compared to 776 observations on a desktop computer. My concern is that I have far too few tablet observations to analyze their behavior relative to desktop users, so the result (completing the study on a tablet is associated with a 53 percentage point increase in the dependent variable) is not really meaningful.

      May I kindly ask if you have any answers regarding my questions 2 and 3?



      • #4
        Matthew:
        2) I'd not drop -Apple- because of collinearity;
        3) you can add the so-called controls, but you do not actually control (since what you're studying has already happened), but simply adjust. That said, you should consider what is the best specification for the right-hand side of your regression equation to give a fair and true view of the data generating process you're investigating;
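        4) Stata restricts -e(sample)- to the observations that have non-missing values on every variable included in the model (listwise deletion). Therefore, including -device- does not shrink your sample to the 10 tablet observations (the regression above still uses 586 observations); those 10 observations are simply all that identify the -Tablet- coefficient.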
        Kind regards,
        Carlo
        (StataNow 18.5)
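
        One way to check which observations enter the estimation sample, and to compare a model with and without adjustment variables on the same observations, is sketched below. It uses Stata's shipped auto dataset as a stand-in; the variable insample is a hypothetical name.

        ```stata
        * Stand-in data: rep78 has missing values, so it changes the estimation sample
        sysuse auto, clear

        * Full model; observations with missing rep78 are excluded
        regress price mpg weight i.rep78

        * Flag the observations that were actually used
        generate byte insample = e(sample)
        count if insample

        * How many observations each category contributes to the estimation sample
        tabulate rep78 if insample

        * Reduced model, restricted to the same observations as the full model
        regress price mpg weight if insample
        ```

        Restricting the reduced model with if insample keeps N identical across the two regressions, so differences in coefficients are not driven by a change in sample.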



        • #5
          Thank you so very much!
