
  • psmatch2 and fweight option of regress

    Dear Statalist users,

    I have two related questions:

    Question 1
    ---------------

I need some clarification regarding what exactly the "fweight" option does; there is nothing about it in the Stata help, and I couldn't find anything online either. I know that it stands for frequency weight, but how exactly does Stata's regress command take it into account?

    Here is what I am trying to do:
    I want to run a regression on a matched sample.
I use psmatch2 to create the match between two groups of my observations. psmatch2 also creates the _weight variable, which gives a weight to each observation based on the match.
    I then ran a regression on my dataset as follows:

    Code:
    regress y x1 ... xn [fweight=_weight]
    as was suggested here

    Question 2 (if someone can help with that)
    ---------------
    If I use more than one nearest neighbor, the _weight variable is no longer an integer. This is strange, as it should be the number of potential matches and thus an integer. And because it is not an integer, I cannot use it as an fweight value.

    thank you in advance

  • #2
    Hi Constantin,

    I will try to work through addressing your two questions. First, frequency weights just indicate how many observations a single observation should count for. If you type --help weight-- Stata will provide a clear definition of how frequency weights are handled:

    fweights, or frequency weights, are weights that indicate the number of duplicated observations.

    This may be a little abstract to think about so I will demonstrate it with a very simple example:
    Code:
    input age weight
    40 10
    20 5
    end
    
    mean age
    
    mean age [fweight = weight]
    Running the above code will show you that if you treat your data as just two observations, for individuals aged 40 and 20, the mean age is 30, right in the middle like we might expect.

    However, we have frequency weights. We want to treat our data as if we have 10 forty-year-olds and 5 twenty-year-olds. This changes how we calculate the mean: not surprisingly, after running the mean command including the weights, we see the mean is now 33.3 and the output indicates we now have 15 observations instead of 2.

    So, simply put, frequency weights just tell Stata how many observations we want each row to count for. This concept extends to commands like --regress-- and others.
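    If it helps to see the same arithmetic outside Stata, here is a hand-rolled Python sketch of the frequency-weighted mean (my own illustration, not Stata output):

    ```python
    # Frequency weights: each row stands in for `weight` identical observations.
    ages = [40, 20]
    weights = [10, 5]

    # Treating the data as just 2 rows:
    unweighted_mean = sum(ages) / len(ages)

    # Treating the data as 10 forty-year-olds and 5 twenty-year-olds:
    weighted_mean = sum(a * w for a, w in zip(ages, weights)) / sum(weights)
    n_effective = sum(weights)

    print(unweighted_mean)          # 30.0
    print(round(weighted_mean, 1))  # 33.3
    print(n_effective)              # 15 observations instead of 2
    ```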

    For your second question, I was able to reproduce your issue with the following code:
    Code:
    webuse nhanes2 , clear
    psmatch2 diabetes age sex , out(heartatk) logit neighbor(2)
    logit heartatk age sex [fweight = _weight]
    The logit command will not execute as the _weight variable is non-integer, as you said. This is because each treated individual can have up to 2 neighbor matches in my example; thus, each untreated individual matched to a treated individual counts for only half an observation (1 out of 2 matches). So for any untreated individual: if they are matched to ONLY a single treated individual, they will have a weight of 0.5; if they match to two treated individuals, their weight will be 1; and so on. So yes, you will get fractional weights, and this is expected. You will notice that the variable _nn (the number of matches) is an integer, but the weight will not necessarily be.

    You will have to analyze the data another way: either with a different type of matching, or you can use the propensity score directly, for inverse-probability of treatment weighting (IPTW) or simply by including the PS as a covariate in the outcome model. There are many ways to perform the analysis, but I hope this helps clarify your issues.
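    To spell out where the fractions come from, here is a small Python sketch of the weighting rule (my own illustration, not psmatch2 output):

    ```python
    # With neighbor(k) matching, each treated unit spreads a total weight of 1
    # across its k matched controls, so each match contributes 1/k to a control.
    def control_weight(times_matched, k):
        """Weight a control accumulates after being matched `times_matched` times."""
        return times_matched * (1.0 / k)

    k = 2  # neighbor(2), as in the example above
    print(control_weight(1, k))  # 0.5 -> fractional, so fweight rejects it
    print(control_weight(2, k))  # 1.0
    print(control_weight(3, k))  # 1.5
    ```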
    Last edited by Matt Warkentin; 23 May 2018, 08:49.



    • #3
Thanks Matt, your explanation was very clear and supported what I suspected was happening. It seems that you are also well familiar with matching, so if you don't mind, I will ask you a few more questions on the same topic.

      1. What I am trying to address by running a regression on a matched sample is an endogeneity issue. Let me first describe how I do it now.
      In my case I do not have treatment and control groups as such; instead, I simulate them using an interaction variable, i.e. my original model is
      Code:
      reg y x1 x2 moderator moderator#x2 moderator#x2#x2 x3 x4 x5
      I then define the treatment variable as trt = moderator > 75th percentile of moderator values,
      and I run
      Code:
      psmatch2 (trt x1 x2), out(y) logit ate
      * moderator is replaced by trt
      reg y x1 x2 trt trt#x2 trt#x2#x2 x3 x4 x5 [fweight=_weight], robust
      As we just established, this won't work with neighbor != 1, as the weights are not integers. You then suggested using a different type of matching. Can you elaborate more on how I do that (or refer me to an explanation somewhere), and more importantly, how I run the regression on the matched sample afterwards?

      2. I have several dependent variables, and for most of them both psmatch2 and teffects work and give the same results. However, for some of my DVs psmatch2 works fine, but teffects gives me an error message that I couldn't interpret:

      Code:
      there is 1 missing propensity score
      there is 1 propensity score greater than 1 - 1.00e-05
      The treatment overlap assumption has been violated; computations cannot proceed
      r(459);
      Any ideas?

      3. How can I assess the quality of the match, i.e. the numerical balance diagnostics?



      • #4
        Hi Constantin,

        I would be happy to try and offer advice on these issues, though I'm sure others who use these forums are more expert with these techniques than I am.

        For your first question, I am really not clear on what exactly you mean regarding simulating your treatment based on an interaction variable. Then in the next line you seem to define treatment by dichotomizing your variable called 'moderator'. You'll have to provide more insight into this process. There seems to be some code missing after the first regression model you present.

        For the question about alternative matching procedures. psmatch2 supports several matching algorithms but they may not solve the problem you're having. As I mentioned in my last post, you could consider ditching the matching altogether and simply use the propensity score directly, and there are advantages and disadvantages of this approach.

        I can't be certain of the underlying cause of the error messages you've received, but the first one seems to indicate that a participant does not have a propensity score, which is probably due to missing data for one or more of the covariates used in your propensity score model. The second error suggests a propensity score is very close to or exactly 1. You'll need to look into this further, but it is probably related to near-perfect predictors that are giving you probabilities near or equal to 1. Due to this near-perfect prediction you do not have sufficient propensity overlap, and the program gives you the error you see. There needs to be overlap in the propensity scores between treated and untreated to satisfy this assumption.

        For the last question, you can assess the quality of the match in several ways. Some simple ways are to compare the means or proportions when applying the weights. Similarly, you could compare means or proportions in quantiles of the propensity score (e.g. deciles). Using the propensity score, you could compute inverse-probability weights and use Stata's svy suite of methods.
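        One common numerical balance diagnostic is the standardized mean difference between groups. Here is a rough Python sketch of the usual formula, with made-up numbers (an absolute value below about 0.1 is a common rule of thumb for acceptable balance):

        ```python
        import math

        def standardized_difference(x_treated, x_control):
            """(mean_t - mean_c) / pooled SD, using sample variances."""
            def mean(xs):
                return sum(xs) / len(xs)
            def var(xs):
                m = mean(xs)
                return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
            pooled_sd = math.sqrt((var(x_treated) + var(x_control)) / 2)
            return (mean(x_treated) - mean(x_control)) / pooled_sd

        # Hypothetical post-matching ages in the treated and control groups:
        d = standardized_difference([40, 42, 45], [39, 41, 46])
        print(round(d, 3))
        ```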



        • #5
          Thank you Matt,

          1. Simulating: I simply meant that I consider the treated group to be a group exposed to some condition (e.g. environmental). However, this condition is measured using a continuous variable (e.g. temperature), which I dichotomized into a HOT/COLD variable by saying all temperatures above XXX are HOT and all below YYY are COLD. Thus the treated group is the group exposed to HOT temperatures.

          I am interested in your suggestion to simply use propensity scores directly to address potential endogeneity. Can you please elaborate on it?

          2. Yes, I understood it this way. What is confusing is why psmatch2 still works while teffects does not.

          3. "compare the means when applying the weights" - do you mean to calculate the mean of propensity scores in treatment group and compare it to the mean of propensity score in control group:(mean_trt - mean_ctrl)/sd_trt?



          • #6
            Hi Constantin,

            I now understand what you mean for creating your treatment/exposure variable.

            There are several approaches to using the propensity score directly, instead of matching.

            These approaches include:
            1. Creating quantiles based on the PS and doing stratified regression
            2. Including the propensity score in the model as a covariate, potentially with a non-linear effect on the outcome
            3. Since the PS is a probability of getting the treatment, you can compute the inverse-probability of treatment weight (IPTW) which is 1 / PS. There is also a stabilized version of the IPTW which is defined as the marginal propensity / model-based propensity
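            To make the weight formulas in point 3 concrete, here is a quick Python sketch of the plain and stabilized IPTW arithmetic, using hypothetical propensity values:

            ```python
            # IPTW: treated subjects get 1/PS, untreated get 1/(1 - PS).
            # The stabilized version multiplies by the marginal treatment probability.
            def iptw(treated, ps):
                return 1 / ps if treated else 1 / (1 - ps)

            def stabilized_iptw(treated, ps, marginal_p):
                return marginal_p / ps if treated else (1 - marginal_p) / (1 - ps)

            ps = 0.25          # hypothetical model-based propensity for one subject
            marginal_p = 0.40  # hypothetical overall share treated

            print(iptw(True, ps))                                   # 4.0
            print(round(iptw(False, ps), 4))                        # 1.3333
            print(round(stabilized_iptw(True, ps, marginal_p), 2))  # 1.6
            print(round(stabilized_iptw(False, ps, marginal_p), 2)) # 0.8
            ```

            Stabilized weights are usually preferred in practice because they tend to be less extreme when the propensity is near 0 or 1.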
            I will demonstrate each approach below using a toy example of whether sex (male v. female) is related to heart attack. I want to balance covariates for age, height, and weight.

            Code:
            webuse nhanes2 , clear
            recode sex (1 = 0) (2 = 1)
            label define sex 0 "male" 1 "female" , replace
            label val sex sex
            
            psmatch2 sex age height weight , out(heartatk) logit
            logistic sex age height weight , nolog
            predict ps , pr /* you can see this is the same as the psmatch2 _pscore */
            
            * Creating quartiles based on the propensity score
            xtile PSq = _pscore , nq(4)
            tab1 PSq
            
            bysort PSq : logistic heartatk i.sex , nolog base
            
            * Generate IPTW
            gen     iptw = 1 / _pscore if sex==1
            replace iptw = 1 / (1 - _pscore) if sex==0
            logistic heartatk sex [pweight = iptw]
            
            * Generate stabilized IPTW
            logit sex
            predict marg_ps , pr
            
            gen     siptw = marg_ps / _pscore if sex==1
            replace siptw = (1 - marg_ps) / (1 - _pscore) if sex==0
            
            logistic heartatk sex [pweight = siptw]
            
            
            * PS as a covariate
            logistic heartatk sex _pscore
            mfp : logistic heartatk sex _pscore
            Generally, from each modeling approach we can see that being a female reduces your odds of having a heart attack, with an odds ratio of around 0.4.

            As for evaluating covariate balance, here is one example. Let's say that after doing our propensity matching or computing our IPTW we want to check that males and females are balanced on weight. We can first check the difference in body weight by sex without any matching or weighting, and we see men are heavier on average, which is not surprising. We can then compare the men and women after matching or after applying the probability weights. We see matching was superior in this case, but IPTW was not bad; keep in mind the propensity model was far from good and only the single nearest neighbor was matched, so the superior performance is expected.

            Code:
            mean weight , over(sex)
            mean weight [fw = _weight] , over(sex)
            mean weight [pw = siptw] , over(sex)



            • #7
              Thanks, I will test this approach as well!



              • #8
                This post is five years old, so perhaps this isn't useful, but in case anyone else stumbles on it by way of Google, I want to note for posterity that fweights alone do not account for correlated errors among comparison individuals selected multiple times. fweights simply treat the data as though you've run the expand command prior to the regression, as in "expand _weight". To account for correlated standard errors, folks should consider clustering on an individual ID, using pweights instead of fweights, or looking into using teffects instead of psmatch2.
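                The equivalence with expand is easy to verify numerically; here is a small Python sketch (my own illustration):

                ```python
                # A frequency-weighted mean equals the plain mean of the "expanded"
                # data (each row repeated weight times), which is what expand does.
                ys = [3.0, 7.0]
                weights = [4, 2]

                expanded = [y for y, w in zip(ys, weights) for _ in range(w)]
                weighted_mean = sum(y * w for y, w in zip(ys, weights)) / sum(weights)

                print(sum(expanded) / len(expanded) == weighted_mean)  # True
                print(len(expanded))  # 6 rows, not 2 -- this inflated n is what
                                      # makes naive standard errors too small
                ```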



                • #9
                  Outcome: k6cat2v2
                  Treatment: disability_child
                  Covariates: i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n

                  Should I get similar results from the following two pieces of code?
                  1. Using psmatch2
                  Code:
                  psmatch2 disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n, outcome(k6cat2v2) logit neighbor(1) caliper(0.01)
                  logistic k6cat2v2 i.disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n [fweight=_weight]


                  2. Using gmatch
                  Code:
                  logistic disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n
                  predict ps
                  gmatch disability_child ps, maxc(1) set(set1) diff(diff1) cal(0.01)
                  logistic k6cat2v2 i.disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n if set1 < .

                  The results I get are very different.


