  • Creating weights for a disproportional stratified sample

    Dear All,

    I am using Stata 14 on Windows 10 and I am dealing with a weighting problem.

    I have a population with 12000 observations from which I took a disproportional stratified sample with the following commands:

    . sort Typ

    . sample 25 if Typ == 11

    . sample 15 if Typ == 12

    . sample 30 if Typ == 21

    . sample 40 if Typ == 22

    The numeric codes (11, 12, 21, 22) identify the different strata.

    I want to try several different weights to see which one describes the population best on the basis of the sample.

    The weights I want to use for the extrapolation are:
    1. {the proportion of firms in a stratum among all firms in the population} divided by {the proportion of firms in that stratum among all firms in the sample}
    2. {the proportion of employees („y_besch“) in a stratum among all employees in the population} divided by {the proportion of employees in that stratum among all employees in the sample}
    3. {the proportion of revenue („Gewinn“) in a stratum in the total revenue of the population} divided by {the proportion of revenue in that stratum in the total revenue of the sample}
    4. the number of employees in a stratum in the sample divided by the number of employees in the same stratum in the population
    I know this is very specific, but these are the four weights I would like to create in Stata. Is that even possible?

    If not, does anyone know another way to do this?

    Thanks a lot!
    Aileen

  • #2
    There is a linked question: http://www.statalist.org/forums/foru...or-each-strata

    None of the weights you describe is a sampling weight (pweight). You are attempting to match the sample estimates to figures known for the population in each stratum. This is a form of post-stratification. No single weight will be best for all purposes. Luckily, you may be able to match on all four characteristics at once with John D'Souza's calibrate command (SSC). Then use his calibest, also from SSC, to estimate means.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      Hello Steve,

      I tried post-stratification, but I have the feeling that I made a mistake.

      Here are the commands I entered in Stata:




      //creating a sample of 1000 observations\\
      sample 6.25 if Typ==11
      sample 4 if Typ==12
      sample 50 if Typ ==21
      sample 42.5 if Typ==22




      //creating the weights, as I computed them before, in the form of a new variable\\

      gen weight = 1.33333 if Typ==11
      replace weight = 2.08333 if Typ==12
      replace weight = 0.16667 if Typ==21
      replace weight = 0.19608 if Typ==22


      //using poststratification\\

      svyset _n, poststrata(Typ) postweight(weight) vce(linearized) singleunit(missing)

      svy: mean Gewinn


      I thought that creating this weighting variable would make post-stratification easier, because I then only need to change the weighting factor for each stratum to switch between my different types of weights.

      The second weighting type, which I used for comparison with the first, is a Horvitz-Thompson estimator, which is basically the inverse of the selection probability. For this I used the following commands:

      gen weight = 16 if Typ==11
      replace weight = 25 if Typ==12
      replace weight = 2 if Typ==21
      replace weight = 2.35294 if Typ==22


      To compare the two, I will repeat this procedure more than a hundred times to see which one gives the best result for the mean of Gewinn on average.

      Here is one example of a result for my first weighting type:

      svy: mean Gewinn
      (running mean on estimation sample)

      Survey: Mean estimation

      Number of strata =       1        Number of obs   =      1,000
      Number of PSUs   =   1,000        Population size = 3.77940995
      N. of poststrata =       4        Design df       =        999

      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
            Gewinn |    118.435   2.401422      113.7226    123.1475
      --------------------------------------------------------------




      And here is the result for the Horvitz-Thompson weighting type:

      svy: mean Gewinn
      (running mean on estimation sample)

      Survey: Mean estimation

      Number of strata =       1        Number of obs   =      1,000
      Number of PSUs   =   1,000        Population size = 45.3529401
      N. of poststrata =       4        Design df       =        999

      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
            Gewinn |   117.1767   2.469403      112.3309    122.0225
      --------------------------------------------------------------


      As you can see, the means are very close to each other, which is fine, because both are supposed to estimate the same population quantity.

      But some things confuse me:

      1) The Horvitz-Thompson estimator is supposed to have a much lower standard error than the other weighting type, because one of its characteristics is that it scales the sample up to the size of the population, and a larger size should lead to a lower standard error. But here the two standard errors are nearly the same, and this does not change when I repeat the procedure. So I am convinced that I have done something wrong, but I have no idea what.

      2) I am confused by the term "Population size". What does it tell me? As far as I can see, it is the sum of the four weighting values, isn't it?


      I apologise for my bad English; as a non-native speaker I am not used to describing statistical problems in English.

      Thank you very much in advance!
      Greetings,
      Aileen



      • #4
        None of your specifications is correct.

        1. You have sampling weights and post-stratification weights. They are different and both must appear in the svyset statement:
        Code:
        svyset _n [pwt = sampwt], strata(Typ) poststrata(Typ) postweight(postwt)
        2. Neither weight is correct. Post-stratification weights should be known (post)stratum totals (adding to the population size 12,000). If you omit the post-stratification options in svyset, the total of sampling weights should be about the population size, 12,000.
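
        As a quick arithmetic check (an editorial sketch, not something run in the thread; the tolerance is arbitrary): the "Population size" values reported in #3 match the sums of the four values supplied through postweight() in each run, which is what one would expect if those values are read as poststratum totals. Neither sum is anywhere near 12,000.

        ```stata
        * Sum of the four values used as "weight" in the first run of #3
        assert abs((1.33333 + 2.08333 + 0.16667 + 0.19608) - 3.77941) < 1e-6

        * Sum of the four inverse sampling fractions used in the second run of #3
        assert abs((16 + 25 + 2 + 2.35294) - 45.35294) < 1e-6

        * Show both sums for comparison with the reported "Population size"
        display 1.33333 + 2.08333 + 0.16667 + 0.19608
        display 16 + 25 + 2 + 2.35294
        ```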

        By alternating responses between two threads, you have confused this discussion. I've closed the other thread and referred people here. Also, in the future, enter all commands, results, and data listings between CODE delimiters, described in FAQ 12.
        Last edited by Steve Samuels; 30 Mar 2016, 16:22.
        Steve Samuels


        • #5
          Thank you for closing the other discussion. I did not mean to confuse you!

          1) I followed your instructions regarding the svyset command:

          Code:
           svyset _n [pweight=weight], strata(Typ) poststrata(Typ) postweight(weight) vce(linearized) singleunit(missing)
          Here are the results:

          Code:
                pweight: weight
                    VCE: linearized
             Poststrata: Typ
             Postweight: weight
            Single unit: missing
               Strata 1: Typ
                   SU 1: <observations>
                  FPC 1: <zero>
          
          . svy: mean Gewinn
          (running mean on estimation sample)
          
          Survey: Mean estimation
          
          Number of strata =       4        Number of obs   =      1,000
          Number of PSUs   =   1,000        Population size = 3.77940995
          N. of poststrata =       4        Design df       =        996
          
          --------------------------------------------------------------
                       |             Linearized
                       |       Mean   Std. Err.     [95% Conf. Interval]
          -------------+------------------------------------------------
                Gewinn |   114.5318   2.350008      109.9203    119.1433
          --------------------------------------------------------------
          Since I created my sample without weighting, by sampling a fixed share of each stratum, I am not sure whether I still need a sampling weight. Do I?

          Here are my sampling commands again:

          Code:
          sample 6.25 if Typ==11
          sample 4 if Typ==12
          sample 50 if Typ ==21
          sample 42.5 if Typ==22
          This gives me the number of sampled observations I want in each stratum.

          As you can see in the results, the population size did not change, although it should be 12000 (because that kind of weight leaves the size of the population unchanged). Furthermore, the standard error is still about the same.



          • #6
            Thanks for using code delimiters. Let's step back. You have four strata and know the sampling fraction in each. If \(N_i\) is the known number of firms in stratum \(i\) and \(f_i\) is the sampling fraction, then the number of sampled firms (assuming 100% response) is \( n_i = f_i N_i\).
            The sampling weight in stratum \(i\) is
            \[
            w_i = \frac{1}{f_i} = \frac{N_i}{n_i}
            \]
            and the sum of the weights in the stratum is \(n_i \times w_i = N_i\), the population total for the stratum.
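
            As a worked illustration (added here as a sketch), plugging in the sampling fractions from the `sample` commands in #3 and #5 gives the values entered as "Horvitz-Thompson" weights in #3. The second line assumes \(n = 1000\) sampled firms and \(N = 12000\) firms in total, as stated in the thread, and shows that the first weight defined in the original question is the same sampling weight multiplied by the constant \(n/N\).

            ```latex
            \[
            w_{11} = \frac{1}{0.0625} = 16, \quad
            w_{12} = \frac{1}{0.04} = 25, \quad
            w_{21} = \frac{1}{0.50} = 2, \quad
            w_{22} = \frac{1}{0.425} \approx 2.35294
            \]
            \[
            \frac{N_i / N}{n_i / n} = \frac{N_i}{n_i} \cdot \frac{n}{N}
            = w_i \cdot \frac{1000}{12000} = \frac{w_i}{12},
            \qquad \text{e.g. } \frac{16}{12} \approx 1.33333
            \]
            ```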

            Thus, with sampling weights alone, the sample correctly represents the stratum counts and relative proportions of firms. This is the first goal you had for weighting. So try the following code:
            Code:
            svyset _n [pw = sampwt], strata(Typ)
            After you do this, svy: mean should report a population size equal to 12,000 and
            Code:
            total sampwt, over(Typ)
            should reproduce the known \(N_i\). No extra post-stratification weight is needed here. To be persuaded that you've done the right thing, we'd need to see the estimated counts from svy: mean and from total. Please show us.
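
            The steps above can be sketched end to end on simulated data. This is only an illustration under stated assumptions: the stratum sizes (4000, 7000, 600, 400) and the simulated Gewinn values are hypothetical and were chosen only so that the strata add up to 12,000 firms; the variable name N_i is also made up. The sampling fractions and the name sampwt come from the thread.

            ```stata
            * Hypothetical population: 12,000 firms in four strata (sizes are assumptions)
            clear
            set seed 12345
            set obs 12000
            gen Typ = cond(_n <= 4000, 11, cond(_n <= 11000, 12, cond(_n <= 11600, 21, 22)))
            gen Gewinn = rnormal(120, 75)        // simulated values, illustration only

            * Store the stratum population size N_i before any observations are dropped
            bysort Typ: gen double N_i = _N

            * Disproportional stratified sample with the fractions used in the thread
            sample 6.25 if Typ == 11
            sample 4    if Typ == 12
            sample 50   if Typ == 21
            sample 42.5 if Typ == 22

            * Sampling weight w_i = N_i / n_i, with n_i the realized stratum sample size
            bysort Typ: gen double sampwt = N_i / _N

            * The weights should add up to the population size
            quietly summarize sampwt
            assert abs(r(sum) - 12000) < 0.01

            svyset _n [pw = sampwt], strata(Typ)
            svy: mean Gewinn                     // "Population size" should read 12,000
            total sampwt, over(Typ)              // should reproduce the stratum sizes N_i
            ```

            Computing the weight from the realized stratum counts, rather than typing in 1/f_i by hand, keeps the weights summing to N_i even if `sample` has to round the number of observations it keeps.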

            (Aside: you can also add a finite population correction (fpc) to the svyset statement, but that's irrelevant to the discussion of weights.)

            Your other three extrapolation goals are that the sample should estimate known stratum totals (or proportions) of revenue and of numbers of employees. (Your goals 2 and 4 look identical to me.) The only proof that you've accomplished one of these goals would be a demonstration that the sample estimates the known totals or proportions. Unfortunately, single-factor post-stratification is harmful, because it distorts the distributions of the other factors. That is why I recommended calibrate.
            Last edited by Steve Samuels; 31 Mar 2016, 08:28.
            Steve Samuels


            • #7
              Hello Steve! Thank you so much for taking the time, and for your patience, in solving my problem.

              Code:
              svyset _n [pw = weight], strata(Typ)
              
                    pweight: weight
                        VCE: linearized
                Single unit: missing
                   Strata 1: Typ
                       SU 1: <observations>
                      FPC 1: <zero>
              
              .
              . svy: mean Gewinn
              (running mean on estimation sample)
              
              Survey: Mean estimation
              
              Number of strata =       4        Number of obs   =      1,000
              Number of PSUs   =   1,000        Population size =     12,000
                                                Design df       =        996
              
              --------------------------------------------------------------
                           |             Linearized
                           |       Mean   Std. Err.     [95% Conf. Interval]
              -------------+------------------------------------------------
                    Gewinn |   118.2277   2.453058       113.414    123.0415
              --------------------------------------------------------------
              With these instructions, I could finally solve the problem!

              Thanks a lot again,

