  • Truncated data with known share of population which is truncated

    Dear all,
    I have GPS-based data on the walking tracks of participants, derived from a representative sample of Jerusalem residents over 24 hours.

    I want to estimate the impact of terrain attributes (height, slope) on the distance walked (the DV).
    I only observe cases where participants decided to walk rather than use motorized means (for an otherwise walkable distance).
    I do, however, know that the share of the population that tends to walk for walkable distances is 12%.
    Can I use this information to improve the estimation using the -truncreg- command?
    Anat

  • #2
    Anat Tchetchik, since your outcome is distance walked, truncation implies that you lost data on a range of the observed distances, e.g., greater than 10 kilometers. So you have data on distances \(\leq\) 10 kilometers but not on distances \(>\) 10 kilometers, even though data were collected for both ranges. If some individuals in the sample always report 0 values for distance walked, and you are confident that the reason is that they have access to other means of transport, then you can argue that you have censored data (left-censored at zero, since zero is the lower limit of the outcome). You do not know what distances they would have walked absent their access to alternative means of transport. Otherwise, I read your description as saying that your sample collected over a 24-hour period is not representative of the population of interest, but I may be missing something.
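
    A minimal sketch of the practical difference, assuming Stata's shipped laborsub example data as a stand-in (its variables whrs, kl6, k618, wa and we have nothing to do with the walking data). The first command discards observations at or below the limit and models the rest as a truncated normal; the second keeps them and treats them as censored at the limit.

    Code:
    * Stand-in example data, not the Jerusalem data
    webuse laborsub, clear
    * Truncated regression: observations with whrs <= 0 are dropped from the likelihood
    truncreg whrs kl6 k618 wa we, ll(0)
    * Tobit: observations with whrs <= 0 are kept and treated as censored at the lower limit 0
    tobit whrs kl6 k618 wa we, ll(0)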
    Last edited by Andrew Musau; 09 Apr 2022, 07:20.


    • #3
      Andrew Musau, you are correct: we have data that are left-censored at zero (not truncated). A GPS device was attached to each member of a sample of 5000 families in Jerusalem for 24 hours. After 24 hours, each participant was asked by which means of transport each movement event was carried out. The municipality was interested only in the motorized tracks and removed all walking tracks from its data, delivering them to us (the maximal walking distance was 3 km). We want to estimate the effect of path attributes on the distance walked, but we do not have access to the trips that could otherwise have been made on foot. So it is definitely a censored dataset. I'm not sure which Stata command is best suited here (cnreg? truncreg? I use Stata 15.1).
      In addition, I know from other surveys done in Jerusalem that the probability of reaching a destination by walking is 6%. Can this piece of knowledge be used in the censored regression analysis?
      I hope this is clearer.


      • #4
        Originally posted by Anat Tchetchik
        A GPS device was attached to each member of a sample of 5000 families in Jerusalem for 24 hours. After 24 hours, each participant was asked by which means of transport each movement event was carried out. The municipality was interested only in the motorized tracks and removed all walking tracks from its data, delivering them to us (the maximal walking distance was 3 km).
        Do you mean that within the motorized tracks, some participants chose to walk while others used motorized transportation? Perhaps the best course of action is to first ask whether the municipality has access to the data from the walking tracks. If that is a dead end, then you can use the data that you have.

        I'm not sure which Stata command is best suited here (cnreg? truncreg? I use Stata 15.1).
        cnsreg is constrained regression and truncreg is for truncated data. Neither of these applies to your situation. Assuming that you have data from individuals who walked and those who used motorized transportation, you can consider your data as being 0/continuous, i.e., 0 kilometers walked for those who used motorized transportation and continuous in the range 0-3 km for those who walked. The censored data estimators are Tobit and Heckman's version of Tobit (Heckit).

        Code:
        help tobit
        The main problem with these estimators is that they suppose that the same equation governs the selection (censoring) process and the continuous process. In general, this will not hold. Therefore, you need a general selection model that distinguishes between the observation and selection processes. Such a model is the Heckman selection model which will specify:

        1. A probit model for what determines the choice between walking and using motorized transportation.
        2. An augmented regression over the sample for which distance walked is positive [i.e., includes the inverse Mills ratios from (1)].

        The issue then is that you need exclusion restrictions. Some variables that are included in the selection equation should not be included in the behavioral equation. See

        Code:
        help heckman
        for examples.
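
        As a rough sketch of the two-equation setup described above, assuming Stata's shipped womenwk example data as a stand-in (wage, educ, age, married and children are that dataset's variables, not the walking data). Here married and children appear only in the selection equation, so they play the role of the exclusion restrictions.

        Code:
        * Stand-in example data, not the Jerusalem data
        webuse womenwk, clear
        * Maximum likelihood: outcome equation for wage, probit selection equation in select()
        heckman wage educ age, select(married children educ age)
        * Two-step version: probit first, then the regression augmented with the inverse Mills ratio
        heckman wage educ age, select(married children educ age) twostep

        Whether comparable exclusion restrictions exist for the choice between walking and motorized transportation is a substantive question that the sketch does not settle.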


        Above that, I know from other surveys done in Jerusalem that the probability to reach a destination by walking is 6%. Can this piece of knowledge be used in the censored regression analysis?
        In case I misunderstood your explanation and you mean that you just have data on movements of the participants but not whether they walked or used motorized means, so you want to try to determine which participants walked based on some probabilities, then I am no expert on this. Perhaps it is possible if you can observe the time it took for them to complete the course, but you need to inquire from people who are familiar with these matters.
        Last edited by Andrew Musau; 10 Apr 2022, 03:46.


        • #5
          Andrew Musau
          Originally posted by Anat Tchetchik
          A GPS device was attached to each member of a sample of 5000 families in Jerusalem for 24 hours. After 24 hours, each participant was asked by which means of transport each movement event was carried out. The municipality was interested only in the motorized tracks and removed all walking tracks from its data, delivering them to us (the maximal walking distance was 3 km).
          Originally posted by Andrew Musau
          Do you mean that within the motorized tracks, some participants chose to walk while others used motorized transportation?
          Yes, it is reasonable to assume that for some tracks within walkable distance (3 km), some people chose a car, bus, or motorcycle. We cannot obtain access to these data; it is a dead end. This is why we cannot fit a sample-selection model.
          I believe that the data are censored at ll=0 and truncated at 3 km.

          Originally posted by Anat Tchetchik
          In addition, I know from other surveys done in Jerusalem that the probability of reaching a destination by walking is 6%. Can this piece of knowledge be used in the censored regression analysis?
          I meant that data from other surveys indicate that the general probability of choosing to walk among the city's residents is 6% (it's a hilly city). So while I cannot estimate the probit model for what determines the choice between walking and using motorized transportation, I already know that this probability is 6%. I thought I could use this knowledge in the estimation.

          However, the bigger question is whether I can estimate a model censored at 0 and truncated at 3 km. The DV is the distance walked, which is within walking distance (larger than zero and up to 3 km). The main IVs are the terrain slope, height, and shadow along the route.


          • #6
            I see cases of either censoring or truncation, but I guess that it is possible to have both, something like below:

            Code:
            webuse mroz87, clear
            set scheme s1mono
            hist whrs75, title("Censured at 0") saving(gr1, replace)
            drop if whrs75>1000
            hist whrs75, scheme(s1mono) title("Censured at 0, truncated at 1000") saving(gr2, replace)
            gr combine gr1.gph gr2.gph
            [Image: Graph.png, the two histograms of whrs75 produced by the code above, combined: the full sample, and the sample after dropping whrs75 > 1000]


            As far as I know (and I may be wrong), there is no official command that can handle both. However, there is an illustration here showing some possibility using cmp from SSC. I have not looked at the example closely, but I would recommend that you send an email to David Roodman, the author of cmp, who may be able to advise you. Eventually, if you are successful in implementing the estimation, I would encourage you to post the code here as this would be some novel application which may be of interest to many people.


            • #7
              Read "Censured" as "Censored" in the graph titles. This is a typo in the first title copied and pasted to the second.
              Last edited by Andrew Musau; 11 Apr 2022, 03:56.


              • #8
                Andrew Musau, thank you very much. I will definitely pursue the direction you offered (cmp) and share the code. I really appreciate your effort and help!


                • #9
                  Andrew Musau, just to update: I have emailed David Roodman, and to the best of my understanding -cmp- allows for models that are both censored and truncated.
                  In my case, where the DV is the distance walked, the data are censored at zero and truncated at 3000 meters, so the code is:
                  Code:
                  cmp (length  = Slope other_vars , truncpoints(. 3000)), ind("cond(length >0, $cmp_cont, $cmp_left)") vce(cluster household ) qui
                  I think that the title of this thread should be changed from "Truncated data with known share of population which is truncated" to "Model with truncated and censored data". I'm not sure if it is possible.
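
                  For readers who want to run the cmp call above, a minimal setup sketch follows. It assumes that cmp and its companion package ghk2 are installed from SSC and that -cmp setup- has been run to define the $cmp_cont and $cmp_left globals; length, Slope, other_vars and household are the names used in this post, with other_vars standing for the remaining covariates.

                  Code:
                  * One-time installation of the community-contributed packages (assumed to come from SSC)
                  ssc install cmp
                  ssc install ghk2
                  * Define the indicator globals ($cmp_cont, $cmp_left, ...) used in indicators()
                  cmp setup
                  * Same model as above, split over lines for readability:
                  * length > 0 is treated as continuous, length == 0 as left-censored,
                  * and the outcome is truncated from above at 3000 meters
                  cmp (length = Slope other_vars, truncpoints(. 3000)), ///
                      indicators("cond(length>0, $cmp_cont, $cmp_left)") ///
                      vce(cluster household) quietly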



                  • #10
                    Thanks for sharing the code Anat Tchetchik. It is not possible to change the title of the thread once it is more than one hour from the initial post, but the thread will still be searchable from its contents.
