  • Ttest

    Dear all,

    I hope you are all doing great.

    I am trying to run a t-test of x (total dollar placement) by y (1 = inside the US, 0 = outside the US) over the period 1995 to 2000.
    I tried the following code:

    Code:
    ttest x, by(y)
    This gives me a result. However, I need the mean of x for each value of y (1 and 0) in the period up to 1998 and in the period after 1998, scaled by the mean of x over the whole period. That is, for each y: ((mean of x up to 1998) - (mean of x after 1998)) / (mean of x over the whole period). So we would get 4 period means in total.

    Code:
    egen mean_1 = mean(x) if year <= 1998 & y == 1
    fillmissing mean_1, with(any)
    egen mean_2 = mean(x) if year > 1998 & y == 1
    fillmissing mean_2, with(any)
    egen mean_3 = mean(x)
    gen x_1 = ((mean_1-mean_2)/mean_3)
    
    egen mean_11 = mean(x) if year <= 1998 & y == 0
    fillmissing mean_11, with(any)
    egen mean_12 = mean(x) if year > 1998 & y == 0
    fillmissing mean_12, with(any)
    gen x_2 = ((mean_11-mean_12)/mean_3)
    
    gen Pre_Post = .
    replace Pre_Post = x_1 if y == 1
    replace Pre_Post = x_2 if y == 0
    
    ttest Pre_Post, by(y)
    I did the above, but the ttest at the end does not work (t-statistic = .) and I find the code too long. Is there any way to fix the ttest and to shorten the code?

    Thanks in advance,
    Eugene

  • #2
    Given your setup, you should probably change the t-test command from ttest to ttesti, the immediate version of the command. The immediate version takes, for each of the two groups, the number of observations, the mean and the standard deviation.
    The code for this version would be a bit shorter than what you posted but not by much.
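    For reference, a minimal sketch of the two-sample ttesti syntax. The numbers are made up for illustration and are not taken from this thread; the arguments are the number of observations, the mean and the standard deviation for group 1, then the same for group 2.

    ```stata
    * hypothetical summary statistics, for illustration only
    * ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
    ttesti 120 35.2 8.1 95 32.7 9.4, unequal
    ```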
    Please also note that fillmissing is a community-contributed program by Attaullah Shah, which you should have noted in your post because people without this command cannot replicate your code.
    The command is also probably not needed for the task at hand.
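    A possible way to get the same constant columns without fillmissing, as an untested sketch: leave off the if qualifier and let egen average an expression that is missing outside the cell of interest, within groups of y. The names post, mean_pre, mean_post and mean_all are placeholders, and a variable year is assumed to exist.

    ```stata
    * 1 if year > 1998, 0 if year <= 1998, missing if year is missing
    gen byte post = year > 1998 if !missing(year)
    * cond() returns x inside the period and missing elsewhere;
    * by(y) spreads each group's mean over all rows of that group
    egen mean_pre  = mean(cond(post == 0, x, .)), by(y)
    egen mean_post = mean(cond(post == 1, x, .)), by(y)
    egen mean_all  = mean(x)
    gen Pre_Post = (mean_pre - mean_post) / mean_all
    ```

    Pre_Post is still constant within each group of y, so this only shortens the code.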
    You should post an example of your dataset generated by running the dataex command so that other people can test their solution attempts and understand your problem better.
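    As a sketch, assuming the variables are called x, y and year, an excerpt of the first 20 observations could be produced like this. In older installations dataex may need to be installed from SSC first.

    ```stata
    * ssc install dataex   // only if dataex is not already installed
    dataex x y year in 1/20
    ```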



    • #3
      Dear Sven,

      Thank you very much for your answer.
      You can see below an example of the data produced by the code in my first post:

      Code:
      input float(x y mean_1 mean_2 mean_3 mean_11 mean_12 x_1 n_2 Pre_Post)
      10.026775 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       206.7967 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       6.487872 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
      137.74028 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       4.877652 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
          32.33 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       9.869688 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       55.75826 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
      146.87474 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       38.81283 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
        175.912 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       151.4229 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       48.86275 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
      126.48425 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
      242.65695 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
       42.28203 0 368.2543 370.1607 296.10037 289.71872 292.70178 -.006438469 -.010074498 -.010074498
      I used fillmissing because the respective mean columns (mean_1, mean_2, mean_11 & mean_12) were almost empty (see below), since each one is only defined where the condition on y and on the period holds.

      In addition, I need the difference between the means for y = 0 (and for y = 1) when the year is <= 1998 and when it is > 1998, scaled by the mean of the full period 1995-2000. I could not figure out how to do that without fillmissing.

      Code:
      input float(mean_1 mean_2 mean_11 mean_12)
      . . 289.71872         .
      . . 289.71872         .
      . .         . 292.70178
      . . 289.71872         .
      . .         . 292.70178
      . .         . 292.70178
      . . 289.71872         .
      . . 289.71872         .
      . . 289.71872         .
      . .         . 292.70178
      . .         . 292.70178
      . . 289.71872         .
      . . 289.71872         .
      . . 289.71872         .
      . .         . 292.70178
      . . 289.71872         .
      Let me know if it is better this time !
      Eugene



      • #4
        Your data example has no observations for y==1. Besides that, you probably need to reformulate what you want to achieve. At the moment, the t-test cannot work: the standard deviation of your two groups is 0 because you create one constant value per group.
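        A quick way to see this, assuming Pre_Post and y are defined as in post #1, is to tabulate the standard deviation by group:

        ```stata
        * each group should show a standard deviation of 0
        tabstat Pre_Post, by(y) statistics(n mean sd)
        ```

        With a standard deviation of 0 in both groups, the standard error of the difference is 0 and the t-statistic is undefined.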
        If I understand your code correctly then you try to compare the differences in the mean before 1998 and after 1998 for the two groups.

        You could create a new dummy variable for the time period, regress on the interaction of the two dummies and then test if the difference is significant. Something like the code below could work, but I could not test it with your data example.
        Code:
        gen t = (year > 1998) if !missing(year)
        regress x i.y#i.t
        * coefficients are differences from the base cell (y==0, t==0)
        test (_b[0.y#1.t] = _b[1.y#1.t] - _b[1.y#0.t])
        The constant in this regression is equal to the mean of x when t==0 and y==0.
        I ignore the scaling with the mean of x because it does not matter for the test statistic.
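        An alternative formulation, also untested on the data in this thread, uses the full factorial ## so that the coefficient on the interaction term is itself the difference between the two groups in the change from the first to the second period. It assumes t was generated as above.

        ```stata
        * assumes t already exists (1 if year > 1998, 0 otherwise)
        regress x i.y##i.t
        * 1.y#1.t = (post - pre change for y==1) - (post - pre change for y==0)
        test 1.y#1.t
        ```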



        • #5
          Hello Sven,

          Thanks for your answer.
          Actually, since the dataset is pretty long, y = 1 does not show up at the beginning. I have now added some lines so that you can better understand what is going on.

          Code:
           
          input float(x y mean_1 mean_2 mean_11 mean_12 mean_3 x_1 x_2 Pre_Post)
            44456 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            41720 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            31892 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            36592 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            40499 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            39237 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            40407 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            28013 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            25607 1 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .08318283
            18086 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            10192 1 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .08318283
            25279 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            11128 1 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .08318283
            29601 0 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .04644477
            13351 1 363218.3 339503.25 284954.94 271713.72 285096.03 .08318283 .04644477 .08318283
          I will try to summarize step by step: I want to compare the differences in the mean before 1998 and after 1998 for two groups (y = 0 & y = 1).

          To calculate mean_1, I did the following:

          Code:
           egen mean_1 = mean(x) if year <= 1998 & y == 1
          Since mean_1 is missing in every row where the condition does not hold (for example, rows where y == 0 or year > 1998), I used fillmissing; otherwise I would not have been able to combine the means, since they are not filled in on the same rows.

          I followed the same reasoning for mean_2, mean_11 & mean_12.

          Code:
          egen mean_2 = mean(x) if year > 1998 & y == 1
          fillmissing mean_2, with(any)
          egen mean_11 = mean(x) if year <= 1998 & y == 0
          fillmissing mean_11, with(any)
          egen mean_12 = mean(x) if year > 1998 & y == 0
          fillmissing mean_12, with(any)
          Now, I would like to divide the differences between the mean before 1998 and after 1998 by the mean of the total period.
          Code:
           egen mean_3 = mean(x)
          I create a new variable called x_1 for the differences between the means for y ==1 before 1998 and after 1998 which is then divided by mean_3:
          Code:
          gen x_1 = ((mean_1-mean_2)/mean_3)
          I do the same for y == 0 with x_2: the difference between the means before 1998 and after 1998 divided by mean_3:
          Code:
             
           gen x_2 = ((mean_11-mean_12)/mean_3)
          With x_1 and x_2 I am trying to do a t-test and I only came up with this code where I gather together the result I got for x_1 & x_2:
          Code:
            
           gen Pre_Post = .
           replace Pre_Post = x_1 if y == 1
           replace Pre_Post = x_2 if y == 0
           ttest Pre_Post, by(y)
          What I want to show is the mean of y == 1 & y == 0 (as per the calculation above) and the difference between the two means with the t-statistics.
          Unfortunately I got this as results:

          Code:
          Two-sample t test with equal variances
          ------------------------------------------------------------------------------
             Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
          ---------+--------------------------------------------------------------------
                 0 |  12,477    .0472881           0           0    .0472881    .0472881
                 1 |     911    .0668536           0           0    .0668536    .0668536
          ---------+--------------------------------------------------------------------
          combined |  13,388    .0486195    .0000426    .0049273     .048536    .0487029
          ---------+--------------------------------------------------------------------
              diff |           -.0195655           0               -.0195655   -.0195655
          ------------------------------------------------------------------------------
              diff = mean(0) - mean(1)                                      t =        .
          Ho: diff = 0                                     degrees of freedom =    13386

          I hope this time it is better !!



          • #6
            Like I said before, I understand what your code does. Your data example also has no year variable, so I still cannot run your code.
            I suggested a way which should give you the expected outcome without using your code. Unfortunately, I cannot modify your code to get the desired result with the ttest-command.
            To make your code work, you need to calculate the standard deviation of the variables x_1 and x_2. Only then can you run the immediate version of the t-test, which is the ttesti command. However, x_1 and x_2 are by construction constants and therefore they have no standard deviation other than 0.
            The output of the t-test command shows this issue: the standard errors are 0, so the t-statistic cannot be computed and is displayed as missing.


            Instead of using the egen-command to calculate the means you can calculate them with the summarize-command.
            See the example code below
            Code:
            forvalues i=0/1{
                summarize x if year <= 1998 & y == `i', meanonly
                scalar mean_1`i' = r(mean)
                scalar var_1`i' = r(Var)
                summarize x if year > 1998 & y == `i',meanonly
                scalar mean_2`i' = r(mean)
                scalar var_2`i' = r(Var)
            }
            summarize x ,meanonly
            scalar mean_3 = r(mean)
            scalar x_1 = (mean_10-mean_20)/mean_3
            scalar x_1sd = (1/mean_3)^2*( (var_10 + var_20)^0.5)
            scalar x_2 = (mean_11-mean_21)/mean_3
            scalar x_2sd=(1/mean_3)^2*( (var_11 + var_21)^0.5)
            count if y==0
            scalar n0 = r(N)
            count if y==1
            scalar n1 = r(N)
            ttesti n0 x_1 x_1sd n1 x_2 x_2sd, unequal
            The code is not tested, but it hopefully uses the correct formula to calculate the standard deviations for your problem. So the code hopefully points in the right direction.



            • #7
              Hello Sven,

              Thank you very much for the explanations: now I understand why the ttest could not be displayed correctly.

              I just tried your code and had the following error:

              Code:
              . ttesti `=n0' `=x_1' `=x_1sd' `=n1'  `=x_2' `=x_2sd', unequal
              mean of first sample is missing
              r(416)
              Do you mind explaining what it means? I thought we defined the mean in the forvalues loop.

              Code:
              input float(x y year)
                44456 0 1999
                31892 0 1999
                41720 1 1999
                36592 1 1995
                40499 1 1995
                40499 1 1996
                39237 0 1997
                40407 0 1996
                28013 0 1995
                25607 1 1999
                10192 1 2000
              Thank you very much for explaining everything again.



              • #8
                I do not know why the mean of the first sample is missing. You can check if the scalar exists by running
                Code:
                scalar dir
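                One thing that may be worth checking, although this is only a guess: if variables such as x_1, mean_3 or mean_11 from post #1 are still in the dataset, Stata resolves a name that is both a variable and a scalar to the variable. The scalar() function forces the scalar:

                ```stata
                * shows the scalar x_1 even if a variable named x_1 exists
                display scalar(x_1)
                * the same idea inside the immediate command
                ttesti `=scalar(n0)' `=scalar(x_1)' `=scalar(x_1sd)' `=scalar(n1)' `=scalar(x_2)' `=scalar(x_2sd)', unequal
                ```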
                My previously posted code has some errors. Below is the corrected version which worked with your last posted data example.
                Code:
                * cell means and variances of x by period and group (i = value of y)
                forvalues i=0/1{
                    summarize x if year <= 1998 & y == `i'
                    scalar mean_1`i' = r(mean)
                    scalar var_1`i' = r(Var)
                    summarize x if year > 1998 & y == `i'
                    scalar mean_2`i' = r(mean)
                    scalar var_2`i' = r(Var)
                }
                * overall mean of x, used for scaling
                summarize x ,meanonly
                scalar mean_3 = r(mean)
                * scaled difference and its standard deviation: x_1 for y==0, x_2 for y==1
                scalar x_1 = (mean_10-mean_20)/mean_3
                scalar x_1sd = (1/mean_3)^2*( (var_10 + var_20)^0.5)
                scalar x_2 = (mean_11-mean_21)/mean_3
                scalar x_2sd=(1/mean_3)^2*( (var_11 + var_21)^0.5)
                * group sizes
                count if y==0
                scalar n0 = r(N)
                count if y==1
                scalar n1 = r(N)
                * `=name' evaluates each scalar so that ttesti receives numbers
                ttesti `=n0' `=x_1' `=x_1sd' `=n1' `=x_2' `=x_2sd', unequal
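                One of the changes is that the meanonly option was dropped inside the loop. With meanonly, summarize does not store the variance, which can be checked with a small sketch like this (x is the variable from the data example):

                ```stata
                quietly summarize x, meanonly
                display r(Var)    // r(Var) is not stored with meanonly, so this shows .
                quietly summarize x
                display r(Var)    // now the variance of x
                ```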



                • #9
                  Dear Sven,

                  I would like to thank you again for your help.
                  The code is working !

                  Thanks again for your helpful explanations !
                  Eugene
