  • Inverse probability weighted logistic regression

    Dear Statalisters,

    I have a cross-sectional survey dataset (N ~ 500,000) with approximately 50% non-response. I have demographic data on almost all individuals in the original ~ 500,000 sample, whether they responded to the survey or not. I've run a logistic regression to examine which characteristics (e.g., ethnicity) are associated with non-response:

    logistic response i.ethnicity

    As an example, this gives me an odds ratio of 0.83 (i.e., ethnicity x was less likely than ethnicity y to respond to the survey)

    I want to incorporate inverse probability weights to generate a new set of adjusted odds ratios that take into account the increased rate of non-response in certain demographics, following methods described by Hoefler et al., 2005 (https://pubmed.ncbi.nlm.nih.gov/15834780/).

    I've calculated weights as follows:

    logistic response i.ethnicity
    predict ipw

    replace ipw = 1 - ipw if response == 1
    replace ipw = 1 / (1-ipw) if response == 0


    However, when I run the adjusted regression:

    logistic response i.ethnicity [pw = ipw]

    The output returns an odds ratio of 1 for all ethnic groups, which means I've almost certainly misunderstood how to apply the weights correctly. The weights themselves look OK to me (i.e., they are larger for observations from ethnic groups that are less preponderant, and vice versa). What's going on here? I'd expect the OR to be similar to the unweighted estimate.

    Any advice would be greatly appreciated.

  • #2
    You calculated the weights using the wrong outcome variable. What you need here are weights that reflect the inverse probability of having a non-missing response. So:
    Code:
    gen byte responded = !missing(response)
    logistic responded i.ethnicity
    predict ipw
    
    replace ipw = 1/ipw if responded == 1
    replace ipw = 1 / (1-ipw) if responded == 0
    
    logistic response i.ethnicity [pw = ipw]
    Note that the weights are calculated from the new variable responded, not the original response variable. Note also that when responded == 1, the weight should be 1/ipw, not 1-ipw.


    • #3
      Thanks a lot Clyde for your help, but I'm not sure I understand.

      The variable response (my outcome variable) is a binary indicator (1 if they responded, 0 otherwise). It doesn't contain any missing values. I'm sorry that I wasn't clear about this initially. Thus, when I run the code you suggested:

      Code:
      gen byte responded = !missing(response)
      The new responded variable consequently has the value 1 for all observations in the dataset, which means that Stata can't run the logistic regression:

      Code:
       logistic responded i.ethnicity
      As the outcome doesn't vary. Or am I missing something?

      "Note also that when responded == 1, the weight should be 1/ipw, not 1-ipw"
      Thanks, that was a typo on my part. It's correct in my Stata environment, which I access through a VPN that prevents me from copying over lines of code.

      Is there more information that I can provide to add clarity?

      Many thanks again.


      • #4
        Then what did you mean when you said
        "I have demographic data on almost all individuals in the original ~ 500,000 sample, whether they responded to the survey or not."
        For the ones who did not respond to the survey, the response variable would be missing, right?


        • #5
          I've probably complicated things by trying to simplify them! Basically, I have a dataset A, which I've merged with a dataset B (N ~ 500,000). Around 50% of observations from B match records in A. I'm trying to generate a list of predictors for successful matching. Since individuals with certain characteristics might be underrepresented in the matched portion of dataset B, I thought I could correct for that using weights.

          I've generated an equivalent example using simulated data. Perhaps that will clarify where I'm going wrong with this:

          Code:
          clear
          
          set obs 1000
          
          set seed 11
          gen response = runiformint(0,1)
          set seed 24
          gen sex = runiformint(0,1)
          
          * Massaging the data here a bit so that there's an effect of sex
          replace response = 0 in 150/300
          replace sex = 0 in 150/300
          
          label def sexlab 0 "Male" 1 "Female"
          label val sex sexlab
          logistic response i.sex
          predict ipw
          
          replace ipw = 1/ipw if response == 1
          replace ipw = 1 / (1-ipw) if response == 0
          
          logistic response i.sex [pw = ipw]
          This is essentially equivalent to what my data looks like, although it contains some missing data for some of the demographic variables that I want to use as predictors. Perhaps this will clarify where I'm going wrong and where I'm misunderstanding things.


          • #6
            Originally posted by Konrad Heller
            ...
            Code:
            logistic response i.ethnicity
            predict ipw
            
            replace ipw = 1 - ipw if response == 1
            replace ipw = 1 / (1-ipw) if response == 0
            
            logistic response i.ethnicity [pw = ipw]
            The output returns an odds ratio of 1 for all ethnic groups, which means I've almost certainly misunderstood how to apply the weights correctly. The weights themselves look OK to me (i.e., they are larger for observations from ethnic groups that are less preponderant, and vice versa). What's going on here? I'd expect the OR to be similar to the unweighted estimate.

            Any advice would be greatly appreciated.
            Let's take a step back. We might choose IPW to correct for non-response bias when we fit a model on a different outcome.

            Here, you are creating an IPW model using race. You're then applying the race-based weights to a model for the probability of response - which is the same thing you based the weights on. I don't know how to explain it exactly, but intuitively, the weights are based on race, and they're cancelling out the effect of race in the regression.
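
            The cancellation can be written out as a short sketch. It assumes the weight model contains a single categorical predictor entered as a full set of indicators (so the fitted probability in each group equals the observed response proportion), and that the weights are 1/p for responders and 1/(1-p) for non-responders. The symbols n_{g1}, n_{g0}, and W are notation introduced here, not taken from the thread: responders, non-responders, and weighted totals in group g.

            ```latex
            \begin{align*}
            \hat p_g &= \frac{n_{g1}}{n_g}, \qquad n_g = n_{g1} + n_{g0} \\
            W_{g1} &= n_{g1} \cdot \frac{1}{\hat p_g} = n_g, \qquad
            W_{g0} = n_{g0} \cdot \frac{1}{1 - \hat p_g} = n_g \\
            \frac{W_{g1}}{W_{g0}} &= 1 \ \text{for every group } g
            \quad\Longrightarrow\quad \mathrm{OR}_{g \text{ vs } g'} = 1
            \end{align*}
            ```

            Under those assumptions the weighted data contain equally many "responders" and "non-responders" in every group, so every weighted odds ratio is 1 by construction.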

            If you apply those IPWs to a regression for a different outcome, you're fine. It's just that you are applying IPWs for non-response based on race to a regression on the probability of response. This is not an interesting question. If you were interested in the probability of response by race, you should have stopped at the first regression, or built on it.
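
            As a rough sketch of what "a different outcome" could look like in Stata: the variables responded, outcome, and exposure below are hypothetical names used only for illustration, and this is not claimed to reproduce the exact procedure in Hoefler et al.

            ```stata
            * responded: 1 if the person answered the survey, 0 otherwise (known for everyone)
            * outcome, exposure: hypothetical survey items, observed only for responders

            * Step 1: model the probability of responding from data available on everyone
            logistic responded i.ethnicity
            predict double p_resp, pr

            * Step 2: responders are weighted by the inverse of their response probability
            gen double ipw = 1/p_resp if responded == 1

            * Step 3: fit the substantive model among responders, using the weights
            logistic outcome i.exposure [pw = ipw] if responded == 1
            ```

            The point of the design is that the weights come from one model (who responds) and are applied to another (what responders report), so they do not cancel the association being estimated.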

            I have a feeling that if you pick any Stata dataset and you try to recreate this type of example, you'll get similar results.
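
            A minimal way to try this with a shipped dataset, assuming auto.dta and a made-up binary predictor heavy (weight above 3,000 lb) that is used here only for illustration:

            ```stata
            sysuse auto, clear

            * Hypothetical binary predictor, created only for this demonstration
            gen byte heavy = weight > 3000

            * Unweighted model: the outcome (foreign) stands in for "response"
            logistic foreign i.heavy
            predict double p, pr

            * Weights built from the same outcome and the same predictor
            gen double ipw = cond(foreign == 1, 1/p, 1/(1 - p))

            * Weighted cell totals are equal within each level of heavy
            tabulate heavy foreign [iw = ipw]

            * Same model, now weighted: the odds ratio should be 1 up to numerical precision
            logistic foreign i.heavy [pw = ipw]
            ```
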
            Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

            When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
