  • Finding Median for weighted survey data

    Good afternoon,

    I am using survey data, and because I need to account for the sampling design and pweights, I am using the svy prefix for my analysis. I am wondering how to obtain the median correctly. Currently, I am using epctile (sample code below), which lets me apply the pweights, but the median value I obtain is outside the 95% CI I get for the weighted mean from the svy command. This seems odd, though one option I considered is to report the 95% CIs of the mean and the median separately. Any insights would be incredibly welcome!


    . svy, subpop(if analytical_pop==1 & first_cancer==wave): mean percent_asset_change1
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =  51            Number of obs   =      15,736
    Number of PSUs   = 102            Population size =  71,582,508
                                      Subpop. no. obs =         263
                                      Subpop. size    =     892,926
                                      Design df       =          51

    -----------------------------------------------------------------------
                          |             Linearized
                          |       Mean   Std. Err.     [95% Conf. Interval]
    ----------------------+------------------------------------------------
    percent_asset_change1 |   .0203422   .5888416     -1.161807    1.202491
    -----------------------------------------------------------------------
    Note: 20 strata omitted because they contain no subpopulation members.





    . epctile percent_asset_change1 if analytical_pop==1 & first_cancer==wave [pweight=pre_rwtresp], p(50)



    Percentile estimation
    ------------------------------------------------------------------------------
    percent_as~1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             p50 |  -.0451306   .0387276    -1.17   0.244    -.1210354    .0307741
    ------------------------------------------------------------------------------



  • #2
    First, as specified in Statalist FAQ 12.1, "If you are using community-contributed (also known as user-written) commands, explain that and say where they came from." epctile was created by Stas Kolenikov and can be found with findit epctile.

    Second, I think you want to use the svy option with epctile (see below).

    Third, your output will be more readable if you use the code tags in your post (see below).

    Finally, there's no a priori reason to expect the standard errors of the median and the mean of a distribution to be the same, weighted or not. As a rule of thumb, the SE of a median is roughly 25% to 30% larger than the SE of a mean for approximately normal data under simple random sampling (the large-sample ratio is sqrt(pi/2), about 1.25), but that is only an approximation. Below is an example where the SE of the median is 65% larger than the SE of the mean. (Also, the mean value differs from the median value, but that is also to be expected in most cases.)

    In any case, if appropriate, I encourage you to report the SEs of both the mean and median.
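
    The 65% figure can be checked directly from the two standard errors reported in the nhanes2 output in this post (0.5 for the median, 0.3026691 for the mean). The second line is only a reference point: it is the large-sample ratio for normally distributed data under simple random sampling, which a complex survey design need not follow.

    ```stata
    * Ratio of SE(median) to SE(mean), using the values printed in this post
    display .5 / .3026691

    * Reference value sqrt(pi/2) for normal data under simple random sampling
    display sqrt(_pi/2)
    ```

    The first ratio is about 1.65 and the second about 1.25.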

    Code:
    . webuse nhanes2
    
    . svyset
    
    Sampling weights: finalwgt
                 VCE: linearized
         Single unit: missing
            Strata 1: strata
     Sampling unit 1: psu
               FPC 1: <zero>
    
    . svy: mean age
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata = 31            Number of obs   =      10,351
    Number of PSUs   = 62            Population size = 117,157,513
                                     Design df       =          31
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
             age |   42.25264   .3026691      41.63534    42.86994
    --------------------------------------------------------------
    
    
    . epctile age, p(50) svy
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata = 31            Number of obs   =      10,351
    Number of PSUs   = 62            Population size = 117,157,513
                                     Design df       =          31
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
        __000006 |  -.0152412   .0089141     -.0334216    .0029393
    --------------------------------------------------------------
    
    Percentile estimation
    ------------------------------------------------------------------------------
                 |             Linearized
             age | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             p50 |         40         .5    80.00   0.000     39.02002    40.97998
    ------------------------------------------------------------------------------
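
    Applied to the data in the first post, the same approach might look like the sketch below. It assumes the design has already been declared with svyset and that epctile accepts a subpop(varname) option alongside svy; please confirm the option names with help epctile before relying on it. The variable insub is a hypothetical 0/1 indicator introduced only for illustration.

    ```stata
    * Sketch only: assumes svyset has been run and epctile is installed.
    * insub is a hypothetical indicator for the subpopulation of interest.
    * Note: this comparison is also true when both variables are missing.
    generate byte insub = (analytical_pop == 1 & first_cancer == wave)

    * Assumption: epctile supports svy and subpop(); check -help epctile-.
    epctile percent_asset_change1, p(50) svy subpop(insub)
    ```

    Using a subpopulation indicator rather than an if qualifier keeps the full design in the variance calculation, which matches how the svy, subpop(): mean command in the first post was run.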
    David Radwin
    Senior Researcher, California Competes
    californiacompetes.org
    Pronouns: He/Him



    • #3
      This was very helpful! Really appreciate the thoughtful response. I'll be sure to report the CI for both.



      • #4
        Hi David,
        Thanks for weighing in on the above question, which set me on the right path as well.
        Is there a way to modify the code to get the median, interquartile range (IQR), and statistical significance instead of the median and confidence intervals?

        I am currently using the CDC’s NHANES, which has a complex survey design, to compare several variables across quintiles of an exposure variable, e.g., to compare the distribution of age across quintiles of sleep duration. However, none of the variables is normally distributed, so I am more inclined to report median (interquartile range) instead of mean ± standard deviation. How can I get the median and also check for statistical significance?

        I used Stas Kolenikov's epctile command across various categories. Please see my output below and kindly advise how to get the IQR with p-values instead. Thanks for your help!


        [Image: Statalist Pic.png]



        • #5
          You can get the lower and upper values of the IQR, and their corresponding p-values, by using p(25) and p(75) in addition to p(50). p(50) is the 50th percentile, which is the same as the median value. Is that what you mean?
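
          As a quick cross-check of the point estimates only, the built-in _pctile command can return weighted quartiles in one call. This is a minimal sketch using the nhanes2 example data from earlier in the thread; it applies the sampling weights but ignores strata and PSUs, so it gives no design-based standard errors or p-values.

          ```stata
          * Point estimates only: weights are applied, the survey design is not.
          webuse nhanes2, clear
          _pctile age [pweight = finalwgt], percentiles(25 50 75)

          * _pctile stores the requested percentiles in r(r1), r(r2), r(r3)
          display "p25 = " r(r1) "  median = " r(r2) "  p75 = " r(r3)
          display "IQR = " r(r3) - r(r1)
          ```

          For standard errors of the individual quartiles, a design-aware command such as epctile with the svy option is still needed.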

          I don't know if there's a way to get a standard error or p-value of the IQR itself. I have never seen that reported, but that doesn't mean it's not possible. If that's what you seek, I recommend starting a new thread as recommended by FAQ extra 1.5.
          David Radwin
          Senior Researcher, California Competes
          californiacompetes.org
          Pronouns: He/Him



          • #6
            Hi David. First of all, THANK YOU so much for your quick response and the tip on getting the IQR. It was very helpful but I had to run the p(25) and p(75) as separate commands from the p(50) in order to get the IQR.
            And you are right. I have not seen p-values reported for IQR either. (I had used mean values with their associated p-values but one member of the research team thought median and IQR would be better)

            I will start a new thread right away and see if anyone has a different thought. Thanks once again.



            [Image: Statalist Pic2.png]
