  • Create a variable that has only one value per category

    I have data on automobile collisions in a particular region, including the street intersection where each collision took place. One variable, intcount, records how many times that collision's intersection appears in the dataset. For instance, with data on 4 collisions, one at the intersection "Main & Side" and three at the intersection "King & Queen", the data would look like this:

    Collision ID   Intersection   intcount
    1              Main & Side    1
    2              King & Queen   3
    3              King & Queen   3
    4              King & Queen   3

    I want to regress the number of times an intersection appears in the dataset on particular traits of that intersection. So the dependent variable would be intcount and the independent variables would be things like traffic control, speed limit, etc.

    The problem is that, because of the way I have set up intcount, Stata counts each collision as a separate observation. In the table above I really have only two observations: one for Main & Side with an intcount value of 1, and one for King & Queen with an intcount value of 3. But Stata thinks I have four observations, because it sees three instances of intcount=3 and one instance of intcount=1.

    I'm looking for a way to recode the intcount variable so that its value is missing in all collisions except one for each intersection. That way when I run the regression the number of observations will be the number of distinct intersections, not the total number of collisions. Any help on how to do this would be greatly appreciated. Feel free also to let me know if there is an entirely different way to approach this issue that would be better.

  • #2
    There are two easy ways to do this. One of them actually reduces the data set to one observation per intersection:
    Code:
    by Intersection (intcount), sort: assert intcount[1] == intcount[_N]
    collapse (first) intcount, by(Intersection)
    The first line of this code verifies that the value of intcount is the same in all observations referring to a given intersection. (If that's not true, then your problem is ill-posed and requires an additional decision rule for which observation's value of intcount is to be used.)
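
    As a rough illustration, the collapse approach can be tried on the four-collision toy data from the question. This is only a sketch: the variable name id is an assumption standing in for "Collision ID", and the str20 width is arbitrary.

```stata
* Toy data from the original question (id is an assumed variable name)
clear
input id str20 Intersection intcount
1 "Main & Side"  1
2 "King & Queen" 3
3 "King & Queen" 3
4 "King & Queen" 3
end

* Check that intcount is constant within each intersection
by Intersection (intcount), sort: assert intcount[1] == intcount[_N]

* Keep one observation per intersection; any variable not named here
* or in by() is dropped from the collapsed dataset
collapse (first) intcount, by(Intersection)

* Two distinct intersections remain
assert _N == 2
list
```

    Because collapse keeps only the by() variables and the variables it is asked to summarize, the intersection traits intended as regressors would have to be listed too, for example collapse (first) intcount speedlimit, by(Intersection), where speedlimit is a hypothetical name and is assumed to be constant within each intersection.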

    The second way preserves the overall dataset but replaces the value of intcount with a missing value in all but one observation per intersection:
    Code:
    by Intersection (intcount), sort: assert intcount[1] == intcount[_N]
    by Intersection: replace intcount = . if _n > 1
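
    A minimal sketch of this second approach on the same toy data from the question (again, id is an assumed variable name) shows that all four collisions stay in memory while only two carry a nonmissing intcount:

```stata
* Toy data from the original question (id is an assumed variable name)
clear
input id str20 Intersection intcount
1 "Main & Side"  1
2 "King & Queen" 3
3 "King & Queen" 3
4 "King & Queen" 3
end

* Check that intcount is constant within each intersection
by Intersection (intcount), sort: assert intcount[1] == intcount[_N]

* Blank out intcount in every observation after the first per intersection
by Intersection: replace intcount = . if _n > 1

* All four collisions are still in memory ...
assert _N == 4

* ... but only one observation per intersection has a nonmissing intcount
count if !missing(intcount)
assert r(N) == 2
list
```

    Since regress leaves out observations with a missing value in any model variable, a regression with intcount as the dependent variable would then use one observation per intersection.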

    • #3
      Thank you, that's exactly what I was looking for!

      • #4
        See also the tag() function of egen.
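
        A minimal sketch of how tag() might be used here, on the toy data from the question. The names id and pickone are assumptions chosen for illustration.

```stata
* Toy data from the original question (id is an assumed variable name)
clear
input id str20 Intersection intcount
1 "Main & Side"  1
2 "King & Queen" 3
3 "King & Queen" 3
4 "King & Queen" 3
end

* pickone (an assumed name) is 1 in exactly one observation per
* distinct value of Intersection and 0 in all the others
egen byte pickone = tag(Intersection)

* One tagged observation per intersection
count if pickone
assert r(N) == 2
list
```

        This leaves intcount untouched, so the sample can be restricted at estimation time with a qualifier such as regress intcount speedlimit if pickone, where speedlimit is a hypothetical regressor.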
