
  • psmatch2 and fweight option of regress

    Dear Statalist users,

    I have two related questions:

    Question 1
    ---------------

I need some clarification regarding what exactly the "fweight" option does; there is nothing about it in the Stata help, and I couldn't find anything online either. I know that it stands for frequency weight, but how exactly does Stata's regress command take it into account?

    Here is what I am trying to do:
    I want to run a regression on a matched sample.
I use psmatch2 to create the match between two groups of my observations. psmatch2 also creates the _weight variable, which gives a weight to each observation based on the match.
    I then ran a regression on my dataset as follows:

    Code:
    regress y x1 ... xn [fweight=_weight]
    as was suggested here

    Question 2 (if someone can help with that)
    ---------------
    If I use more than one nearest neighbor, the _weight variable is no longer an integer. This is strange, as it should be the number of potential matches and thus an integer. And because it is not an integer, I cannot use it as an fweight value.

    thank you in advance

  • #2
    Hi Constantin,

    I will try to work through addressing your two questions. First, frequency weights just indicate how many observations a single observation should count for. If you type --help weight-- Stata will provide a clear definition of how frequency weights are handled:

    fweights, or frequency weights, are weights that indicate the number of duplicated observations.

    This may be a little abstract to think about so I will demonstrate it with a very simple example:
    Code:
    input age weight
    40 10
    20 5
    end
    
    mean age
    
    mean age [fweight = weight]
    Running the above code will show you that if you treat your data as just two observations, for individuals aged 40 and 20, the mean age is 30, right in the middle like we might expect.

    However, we have frequency weights. We want to treat our data as if we have 10 forty-year-olds and 5 twenty-year-olds. This changes how we calculate the mean: not surprisingly, after running the mean command including the weights, we see the mean is now 33.3 and the output indicates we now have 15 observations instead of 2.

    So, simply put, frequency weights just tell Stata how many observations we want each row to count for. This concept extends to commands like --regress-- and others.
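    If it helps to see the same arithmetic outside Stata, here is a hand-rolled Python sketch of the frequency-weighted mean (my own illustration, not Stata output):

    ```python
    # Frequency weights: each row stands in for `weight` identical observations.
    ages = [40, 20]
    weights = [10, 5]

    # Treating the data as just 2 rows:
    unweighted_mean = sum(ages) / len(ages)

    # Treating the data as 10 forty-year-olds and 5 twenty-year-olds:
    weighted_mean = sum(a * w for a, w in zip(ages, weights)) / sum(weights)
    n_effective = sum(weights)

    print(unweighted_mean)          # 30.0
    print(round(weighted_mean, 1))  # 33.3
    print(n_effective)              # 15 observations instead of 2
    ```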

    For your second question, I was able to reproduce your issue with the following code:
    Code:
    webuse nhanes2 , clear
    psmatch2 diabetes age sex , out(heartatk) logit neighbor(2)
    logit heartatk age sex [fweight = _weight]
    The logit command will not execute as the _weight variable is non-integer, as you said. This is because each treated individual can have up to 2 neighbor matches in my example; thus, each untreated individual matched to a treated individual counts for only half an observation (1 out of 2 matches). So for any untreated individual: if they are matched to ONLY a single treated individual, they will have a weight of 0.5; if they match to two treated individuals, their weight will be 1; and so on. So yes, you will get fractional weights, and this is expected. You will notice that the variable _nn (the number of matches) is an integer, but the weight will not necessarily be.

    You will have to analyze the data another way: either with a different type of matching, or you can use the propensity score directly, for inverse-probability of treatment weighting (IPTW) or simply by including the PS as a covariate in the outcome model. There are many ways to perform the analysis, but I hope this helps clarify your issues.
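    To spell out where the fractions come from, here is a small Python sketch of the weighting rule (my own illustration, not psmatch2 output):

    ```python
    # With neighbor(k) matching, each treated unit spreads a total weight of 1
    # across its k matched controls, so each match contributes 1/k to a control.
    def control_weight(times_matched, k):
        """Weight a control accumulates after being matched `times_matched` times."""
        return times_matched * (1.0 / k)

    k = 2  # neighbor(2), as in the example above
    print(control_weight(1, k))  # 0.5 -> fractional, so fweight rejects it
    print(control_weight(2, k))  # 1.0
    print(control_weight(3, k))  # 1.5
    ```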
    Last edited by Matt Warkentin; 23 May 2018, 08:49.



    • #3
Thanks Matt, your explanation was very clear and supported what I suspected was happening. It seems that you are also well familiar with matching, so if you don't mind, I will ask you a few more questions on the same topic.

      1. What I am trying to address by running a regression on a matched sample is an endogeneity issue. Let me first describe how I do it now.
      In my case I do not have treatment and control groups as such; instead, I simulate them using an interaction variable, i.e. my original model is
      Code:
      reg y x1 x2 moderator moderator#x2 moderator#x2#x2 x3 x4 x5
      I then define the treatment variable as trt = moderator > 75th percentile of moderator values,
      and I run
      Code:
      psmatch2 (trt x1 x2), out(y) logit ate
      * moderator is replaced by trt
      reg y x1 x2 trt trt#x2 trt#x2#x2 x3 x4 x5 [fweight=_weight], robust
      As we just established, this won't work with neighbor != 1, as the weights are not integers. You then suggested using a different type of matching. Can you elaborate more on how I do that (or refer me to an explanation somewhere), and more importantly, how I run the regression on the matched sample afterwards?

      2. I have several dependent variables, and for most of them both psmatch2 and teffects work and give the same results. However, for some of my DVs psmatch2 works fine, but teffects gives me an error message that I couldn't interpret:

      Code:
      there is 1 missing propensity score
      there is 1 propensity score greater than 1 - 1.00e-05
      The treatment overlap assumption has been violated; computations cannot proceed
      r(459);
      Any ideas?

      3. How can I assess the quality of the match, i.e. the numerical balance diagnostics?



      • #4
        Hi Constantin,

        I would be happy to try and offer advice on these issues, though I'm sure others who use these forums are more expert with these techniques than I am.

        For your first question, I am really not clear on what exactly you mean regarding simulating your treatment based on an interaction variable. Then in the next line you seem to define treatment by dichotomizing your variable called 'moderator'. You'll have to provide more insight into this process. There seems to be some code missing after the first regression model you present.

        For the question about alternative matching procedures. psmatch2 supports several matching algorithms but they may not solve the problem you're having. As I mentioned in my last post, you could consider ditching the matching altogether and simply use the propensity score directly, and there are advantages and disadvantages of this approach.

        I can't be certain of the underlying cause of the error messages you've received, but the first one seems to indicate that a participant does not have a propensity score, which is probably due to missing data for one or more of the covariates used in your propensity score model. The second error suggests a propensity score is very close to or exactly 1. You'll need to look into this further, but it is probably related to near-perfect predictors that are giving you probabilities near or equal to 1. Due to this near-perfect prediction you do not have sufficient propensity overlap, and the program gives you the error you see. There needs to be overlap in the propensity scores between treated and untreated to satisfy this assumption.

        For the last question, you can assess the quality of the match in several ways. Some simple ways are to compare the means or proportions when applying the weights. Similarly, you could compare means or proportions in quantiles of the propensity score (e.g. deciles). Using the propensity score, you could compute inverse-probability weights and use Stata's svy suite of methods.
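        One common numerical balance diagnostic is the standardized mean difference between groups. Here is a rough Python sketch of the usual formula, with made-up numbers (an absolute value below about 0.1 is a common rule of thumb for acceptable balance):

        ```python
        import math

        def standardized_difference(x_treated, x_control):
            """(mean_t - mean_c) / pooled SD, using sample variances."""
            def mean(xs):
                return sum(xs) / len(xs)
            def var(xs):
                m = mean(xs)
                return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
            pooled_sd = math.sqrt((var(x_treated) + var(x_control)) / 2)
            return (mean(x_treated) - mean(x_control)) / pooled_sd

        # Hypothetical post-matching ages in the treated and control groups:
        d = standardized_difference([40, 42, 45], [39, 41, 46])
        print(round(d, 3))
        ```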



        • #5
          Thank you Matt,

          1. Simulating: I simply meant that I consider the treated group to be a group exposed to some condition (e.g. environmental). However, this condition is measured using a continuous variable (e.g. temperature), which I dichotomized into a HOT/COLD variable by saying all temperatures above XXX are HOT and all below YYY are COLD. Thus the treated group is the group exposed to HOT temperatures.

          I am interested in your suggestion to simply use propensity scores directly to address potential endogeneity. Can you please elaborate on it?

          2. Yes, I understood it this way. What is confusing is why psmatch2 still works while teffects does not.

          3. "compare the means when applying the weights" - do you mean to calculate the mean of propensity scores in treatment group and compare it to the mean of propensity score in control group:(mean_trt - mean_ctrl)/sd_trt?



          • #6
            Hi Constantin,

            I now understand what you mean for creating your treatment/exposure variable.

            There are several approaches to using the propensity score directly, instead of matching.

            These approaches include:
            1. Creating quantiles based on the PS and doing stratified regression
            2. Including the propensity score in the model as a covariate, potentially with a non-linear effect on the outcome
            3. Since the PS is a probability of getting the treatment, you can compute the inverse-probability of treatment weight (IPTW) which is 1 / PS. There is also a stabilized version of the IPTW which is defined as the marginal propensity / model-based propensity
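            To make the weight formulas in point 3 concrete, here is a quick Python sketch of the plain and stabilized IPTW arithmetic, using hypothetical propensity values:

            ```python
            # IPTW: treated subjects get 1/PS, untreated get 1/(1 - PS).
            # The stabilized version multiplies by the marginal treatment probability.
            def iptw(treated, ps):
                return 1 / ps if treated else 1 / (1 - ps)

            def stabilized_iptw(treated, ps, marginal_p):
                return marginal_p / ps if treated else (1 - marginal_p) / (1 - ps)

            ps = 0.25          # hypothetical model-based propensity for one subject
            marginal_p = 0.40  # hypothetical overall share treated

            print(iptw(True, ps))                                   # 4.0
            print(round(iptw(False, ps), 4))                        # 1.3333
            print(round(stabilized_iptw(True, ps, marginal_p), 2))  # 1.6
            print(round(stabilized_iptw(False, ps, marginal_p), 2)) # 0.8
            ```

            Stabilized weights are usually preferred in practice because they tend to be less extreme when the propensity is near 0 or 1.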
            I will demonstrate each approach below using a toy example of whether sex (male v. female) is related to heart attack. I want to balance covariates for age, height, and weight.

            Code:
            webuse nhanes2 , clear
            recode sex (1 = 0) (2 = 1)
            label define sex 0 "male" 1 "female" , replace
            label val sex sex
            
            psmatch2 sex age height weight , out(heartatk) logit
            logistic sex age height weight , nolog
            predict ps , pr /* you can see this is the same as the psmatch2 _pscore */
            
            * Creating quartiles based on the propensity score
            xtile PSq = _pscore , nq(4)
            tab1 PSq
            
            bysort PSq : logistic heartatk i.sex , nolog base
            
            * Generate IPTW
            gen     iptw = 1 / _pscore if sex==1
            replace iptw = 1 / (1 - _pscore) if sex==0
            logistic heartatk sex [pweight = iptw]
            
            * Generate stabilized IPTW
            logit sex
            predict marg_ps , pr
            
            gen     siptw = marg_ps / _pscore if sex==1
            replace siptw = (1 - marg_ps) / (1 - _pscore) if sex==0
            
            logistic heartatk sex [pweight = siptw]
            
            
            * PS as a covariate
            logistic heartatk sex _pscore
            mfp : logistic heartatk sex _pscore
            Generally, from each modeling approach we can see that being a female reduces your odds of having a heart attack, with an odds ratio of around 0.4.

            As for evaluating covariate balance, here is one example. Let's say that after doing our propensity matching or computing our IPTW we want to check that males and females are balanced on weight. We can first check the difference in body weight by sex without any matching or weighting, and we see men are heavier on average, which is not surprising. We can then compare the men and women after matching or after applying the probability weights. We see matching was superior in this case, but IPTW was not bad; keep in mind the propensity model was far from good and only the single nearest neighbor was matched, so the superior performance is expected.

            Code:
            mean weight , over(sex)
            mean weight [fw = _weight] , over(sex)
            mean weight [pw = siptw] , over(sex)



            • #7
              Thanks, I will test this approach as well!



              • #8
                This post is five years old, so perhaps this isn't useful, but in case anyone else stumbles on it by way of Google, I want to note for posterity that fweights alone do not account for correlated errors among comparison individuals selected multiple times. fweights simply treat the data as though you've run the expand command prior to the regression, as in "expand _weight". To account for correlated standard errors, folks should consider clustering on an individual ID, using pweights instead of fweights, or looking into using teffects instead of psmatch2.
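                The equivalence with expand is easy to verify numerically; here is a small Python sketch (my own illustration):

                ```python
                # A frequency-weighted mean equals the plain mean of the "expanded"
                # data (each row repeated weight times), which is what expand does.
                ys = [3.0, 7.0]
                weights = [4, 2]

                expanded = [y for y, w in zip(ys, weights) for _ in range(w)]
                weighted_mean = sum(y * w for y, w in zip(ys, weights)) / sum(weights)

                print(sum(expanded) / len(expanded) == weighted_mean)  # True
                print(len(expanded))  # 6 rows, not 2 -- this inflated n is what
                                      # makes naive standard errors too small
                ```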



                • #9
                  Outcome: k6cat2v2
                  Treatment: disability_child
                  Covariates: i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n

                  Should I get similar results from the following two pieces of code?
                  1. Using psmatch2
                  Code:
                  psmatch2 disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n, outcome(k6cat2v2) logit neighbor(1) caliper(0.01)
                  logistic k6cat2v2 i.disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n [fweight=_weight]


                  2. Using gmatch
                  Code:
                  logistic disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n
                  predict ps
                  gmatch disability_child ps, maxc(1) set(set1) diff(diff1) cal(0.01)
                  logistic k6cat2v2 i.disability_child i.healthCond_cat2 i.affectLife i.symptom i.ageCat3 i.workingHourPerWeek_cat4 i.singleFather_n if set1 < .

                  The results I get are very different.


