
  • matching single county with unique congressional district based on county area

    According to US House election data, a single county in a given state can belong to multiple congressional districts. However, using geographical area, I can match each county to a single district in that state.

    In the following data, nhgisnam is the name of the county, district is the congressional district, cnty_area is the total geographical area of the county, and cnty_part_area is the portion of the county's area that lies in that particular congressional district. For example, in the first line, for Autauga county, cnty_part_area is 1564828723. That means the part of Autauga county lying in district 2 has an area of 1564828723, out of a total county area (cnty_area) of 1565529773. Autauga county also belongs to districts 6 and 7 of state 1, but the major portion of its area belongs to district 2, which I can determine from the cnty_part_area variable.

    Can anyone kindly guide me on how to code this so that, for each county in a particular state, I keep only the observation that assigns the county to a single district, namely the district with the highest value of cnty_part_area for that county?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 nhgisnam byte(cd_statefip district) float county double(cnty_area cd_area cnty_part_area)
    "Autauga"      1  2  1  1565529773 27471139257 1564828723
    "Autauga"      1  2  1  1565529773 27471139257  .000155932
    "Autauga"      1  6  1  1565529773 12040322876 272866.6476
    "Autauga"      1  6  1  1565529773 12040322876  .133551671
    "Autauga"      1  7  1  1565529773 22740093504 428181.5623
    "Baldwin"      1  1  3  4232265763 16566047005  4228366412
    "Baldwin"      1  1  3  4232265763 16566047005  .556012851
    "Baldwin"      1  7  3  4232265763 22740093504 149732.2219
    "Barbour"      1  2  5  2342716428 27471139257  .836149027
    "Barbour"      1  2  5  2342716428 27471139257  2341574170
    "Barbour"      1  2  5  2342716428 27471139257  .361295549
    "Barbour"      1  2  5  2342716428 27471139257  .707204586
    "Barbour"      1  3  5  2342716428 20681592749  .553967503
    "Barbour"      1  3  5  2342716428 20681592749 601774.0707
    "Bibb"         1  6  7  1621774445 12040322876   .08641998
    "Bibb"         1  6  7  1621774445 12040322876  .212694208
    "Bibb"         1  6  7  1621774445 12040322876  1621293818
    "Bibb"         1  7  7  1621774445 22740093504 480626.7196
    "El Dorado"    6  3 17  4631169089  8860803999  .049026728
    "El Dorado"    6  3 17  4631169089  8860803999 844075.6488
    "El Dorado"    6  4 17  4631169089 44439160065  4630300584
    "El Dorado"    6  4 17  4631169089 44439160065 1.154827007
    "Fresno"       6 17 19 15585347209 12459983416 939434.4757
    "Fresno"       6 18 19 15585347209  8030525729 104149507.3
    "Fresno"       6 19 19 15585347209 17561371799 921125720.7
    "Fresno"       6 20 19 15585347209 12921313091  6140610563
    "Fresno"       6 21 19 15585347209 20952217690  8417589179
    "Fresno"       6 25 19 15585347209 56000534108 932808.8571
    "Glenn"        6  1 21  3437311730 28853910260 146543.9293
    "Glenn"        6  2 21  3437311730 56920751720  .407955162
    "Glenn"        6  2 21  3437311730 56920751720  .481264625
    "Glenn"        6  2 21  3437311730 56920751720  3437165187
    "Kern"         6 20 29 21138168964 12921313091  3175426923
    "Kern"         6 21 29 21138168964 20952217690 507400.1421
    "Kern"         6 22 29 21138168964 27074266282 1.937903539
    end
    Last edited by Tariq Abdullah; 26 Jan 2023, 05:02.

  • #2
    I'm confused by your data. You have multiple observations for some districts in the same county. For example, there are two different observations for Autauga county and district 2, and three for Glenn county and district 2. But they show different values for county partial area. Those are just a few: much of your example data is like this. What does this signify? Which of these observations is correct?


    • #3
      Mr. Schechter,

      My apologies for the confusion. This is data I found in an American Economic Review data file, in which congressional districts were converted to counties. Here is a snippet with all the variables, of which I showed only a portion in my dataset above:

      Now that you mention it, I see that the district is being repeated. My guess is that this repetition is due to some other variables in the following data, which I haven't figured out yet. This is a publicly available dataset provided by the University of Michigan ( https://www.openicpsr.org/openicpsr/....zip&type=file )

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str7 nhgisnam byte(nhgisst nhgiscty) double cnty_area int congress byte(cd_statefip district) double cd_area float(cnty_part_area cnty_part_pop_m2 cnty_part_pop_m3 cnty_part_pop_m4)
      "Autauga" 10 10 1565529773 107 1 2 26486314732 .000169704         0         0         0
      "Autauga" 10 10 1565529773 107 1 2 26486314732   .3336922         0         0         0
      "Autauga" 10 10 1565529773 107 1 2 26486314732 1564525312  43658.16  43657.57  43656.27
      "Autauga" 10 10 1565529773 107 1 3 22944203868  .13353914         0         0         0
      "Autauga" 10 10 1565529773 107 1 3 22944203868  163081.55         0         0         0
      "Autauga" 10 10 1565529773 107 1 7 22864197387   841329.7 12.841208 13.431708 14.732046
      "Autauga" 10 10 1565529773 107 1 7 22864197387  1.0301834         0         0         0
      "Baldwin" 10 30 4232265763 107 1 1 17800687959 4228519168 139443.42 139442.17 139369.92
      "Barbour" 10 50 2342716428 107 1 2 26486314732    .836149         0         0         0
      "Barbour" 10 50 2342716428 107 1 2 26486314732 2341554432   29029.1     29038     29038
      "Barbour" 10 50 2342716428 107 1 2 26486314732   .7072046         0         0         0
      "Barbour" 10 50 2342716428 107 1 2 26486314732  .36129555         0         0         0
      "Barbour" 10 50 2342716428 107 1 3 22944203868   .5539675         0         0         0
      end


      • #4
        Well, it seems that getting the data set from that link requires establishing an account. And it seems that even with an account I might not be able to get that data set. Let me just say that when you downloaded that data set, I imagine that along with the data itself there were one or more other files that explain how the data were gathered and how they can be used. I think you will have to read those to figure out what is going on here.

        When I look at the example shown in #3, I observe that the only variables that change within a given combination of cd_statefip, nhgiscty and district are cnty_part_area, cnty_part_pop_m2, cnty_part_pop_m3, cnty_part_pop_m4. While I don't get what these last three variables represent, neither their names nor their values suggest that they are identifiers that distinguish subunits of the district or timepoints. (The variable congress, by its values seems to indicate time points--but these are also reproduced multiple times for the same cd_statefip, nhgiscty and district. So this does not help with identifying which observation to use here.)

        So I think you need to study the documentation for this data set. If it was not included in your download, you should go back to the website and search for explanatory information about it. Or, perhaps some Forum member who is familiar with it will join the thread and resolve the mystery.
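
        The pattern described above can be checked by counting observations per combination of cd_statefip, nhgiscty and district. The following is only a minimal sketch: it re-enters a few rows from #3 with just the variables needed, and the variable name n_extra is made up for illustration.

        ```stata
        clear
        input str7 nhgisnam byte(cd_statefip nhgiscty district) float cnty_part_area
        "Autauga" 1 10 2 .000169704
        "Autauga" 1 10 2 .3336922
        "Autauga" 1 10 2 1564525312
        "Autauga" 1 10 3 .13353914
        "Autauga" 1 10 3 163081.55
        "Baldwin" 1 30 1 4228519168
        end

        * Summarize how many observations share the same state-county-district key
        duplicates report cd_statefip nhgiscty district

        * n_extra (hypothetical name) = number of OTHER observations with the same key
        duplicates tag cd_statefip nhgiscty district, generate(n_extra)

        * Show only the repeated combinations, grouped by district
        list if n_extra > 0, sepby(district) noobs
        ```

        If the three key variables uniquely identified observations, n_extra would be 0 everywhere; here it is not, which is the puzzle the documentation would need to explain.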



        • #5
          I'll do as you advised, Mr. Schechter! Honestly, the whole thing is kind of out of my league, and I'm just trying to keep up with it. Nonetheless, you've been very resourceful and kind!


          • #6
            I don't understand the data. cnty_area should be constant in a county and cd_area should be smaller than or equal to that size. Why do you have large numbers for some and tiny numbers for others?

            Geocorr could get you what you want. It will create a variable afact which says how much of the area of a congressional district is in a county.

            HTML Code:
            https://mcdc.missouri.edu/applications/geocorr2022.html


            • #7
              As I look at it, it appears the cnty_area is a constant. The larger numbers from the cd_area are peculiar, while the cd_area_part looks like the sqkm of the cd_area from some of the values (the others I have no idea what they mean). I'd dump any cnty_part_area<1, or keep the maximum value of cd_area. Then, compute area share as cnty_part_area/cnty_area.
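
              A minimal sketch of that suggestion, assuming the Autauga rows from the -dataex- example in #1 (only the variables needed are re-entered) and treating values below 1 as slivers to discard. The name area_share is illustrative, not a variable in the original data.

              ```stata
              clear
              input str12 nhgisnam byte(cd_statefip district) float county double(cnty_area cnty_part_area)
              "Autauga" 1 2 1 1565529773 1564828723
              "Autauga" 1 2 1 1565529773 .000155932
              "Autauga" 1 6 1 1565529773 272866.6476
              "Autauga" 1 6 1 1565529773 .133551671
              "Autauga" 1 7 1 1565529773 428181.5623
              end

              * Discard the tiny pieces (threshold of 1 taken from the suggestion above)
              drop if cnty_part_area < 1

              * area_share (hypothetical name): fraction of the county lying in each district
              generate double area_share = cnty_part_area / cnty_area

              list nhgisnam district area_share, noobs
              ```

              With these rows, one observation per district remains, and district 2 holds nearly all of Autauga's area.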


              • #8
                There may be cases where a district crosses a county boundary. I'd look closely at those cases.


                • #9
                  Originally posted by George Ford
                  I don't understand the data. cnty_area should be constant in a county and cd_area should be smaller than or equal to that size. Why do you have large numbers for some and tiny numbers for others?

                  Geocorr could get you what you want. It will create a variable afact which says how much of the area of a congressional district is in a county.

                  HTML Code:
                  https://mcdc.missouri.edu/applications/geocorr2022.html
                  In the paper ( https://warwick.ac.uk/fac/soc/econom...wp588.2021.pdf) from which I collected the data, the authors explain the whole dataset in this summary:

                  Area-based harmonization procedures entail a simple process of spatial disaggregation and re-aggregation. To construct our county-to-CD crosswalks, this involves intersecting a county map from a particular Census year with a CD map from a particular Congress year. Counties are then disaggregated into a set of sub-county units (“county-parts”), based on the CD in which they are located. We then calculate the areas (in square meters) of all counties, all CDs, and all county-parts, based on a “USA Contiguous Albers Equal Area Conic” projection.


                  Once counties are disaggregated based on CD intersections, county-parts are re-aggregated based on their CD, with the sum of the areas of the county-parts matching the area of the whole CD. How are the various data values of the initial counties (e.g. total population, total number of Blacks) associated with CDs in this process? Under an area-based procedure, each county-part is assigned each of its county’s data values, weighted by the share of the county’s total area that belongs to that county-part. These weights add up to 1 for each county. A given CD’s data values are in turn the aggregates of these weighted values, summed across all counties that have a county-part located in that CD.
                  Last edited by Tariq Abdullah; 26 Jan 2023, 21:39.
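
                  The weighting rule in the quoted passage can be sketched in Stata as follows. This is only an illustration under assumptions: it uses the four Bibb rows from the example in #1, takes cnty_part_area and cnty_area to be in the same units (square meters, per the passage), and introduces the names area_wt and wt_sum, which are not in the original data.

                  ```stata
                  clear
                  input str12 nhgisnam byte(cd_statefip district) float county double(cnty_area cnty_part_area)
                  "Bibb" 1 6 7 1621774445 .08641998
                  "Bibb" 1 6 7 1621774445 .212694208
                  "Bibb" 1 6 7 1621774445 1621293818
                  "Bibb" 1 7 7 1621774445 480626.7196
                  end

                  * Add up all county-parts that fall in the same district
                  collapse (sum) cnty_part_area (first) nhgisnam cnty_area, by(cd_statefip county district)

                  * area_wt (hypothetical name): share of the county's area in each district
                  generate double area_wt = cnty_part_area / cnty_area

                  * wt_sum (hypothetical name): the weights should add up to about 1 per county
                  bysort cd_statefip county: egen double wt_sum = total(area_wt)
                  assert abs(wt_sum - 1) < 1e-6

                  list nhgisnam district area_wt wt_sum, noobs
                  ```

                  For these rows the parts add up to the county area within rounding, and district 6 carries almost all of the weight.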


                  • #10
                    Originally posted by George Ford
                    As I look at it, it appears the cnty_area is a constant. The larger numbers from the cd_area are peculiar, while the cd_area_part looks like the sqkm of the cd_area from some of the values (the others I have no idea what they mean). I'd dump any cnty_part_area<1, or keep the maximum value of cd_area. Then, compute area share as cnty_part_area/cnty_area.
                    This is very helpful advice. I'm going to drop all observations where cnty_part_area<1, since those are essentially meaningless. That also removes the problem of the same district being repeated for the same county in my data.
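
                    A minimal sketch of that plan together with the original goal from #1 (one district per county, chosen by the largest cnty_part_area). It assumes the Fresno and Glenn rows from the example in #1, re-entered with only the variables needed, and that county codes are unique within cd_statefip.

                    ```stata
                    clear
                    input str12 nhgisnam byte(cd_statefip district) float county double cnty_part_area
                    "Fresno" 6 17 19 939434.4757
                    "Fresno" 6 18 19 104149507.3
                    "Fresno" 6 19 19 921125720.7
                    "Fresno" 6 20 19 6140610563
                    "Fresno" 6 21 19 8417589179
                    "Fresno" 6 25 19 932808.8571
                    "Glenn"  6  1 21 146543.9293
                    "Glenn"  6  2 21 .407955162
                    "Glenn"  6  2 21 .481264625
                    "Glenn"  6  2 21 3437165187
                    end

                    * Drop the tiny pieces; also drop missing areas, which would otherwise sort last
                    drop if cnty_part_area < 1 | missing(cnty_part_area)

                    * Within each state-county, sort by area and keep the largest part
                    bysort cd_statefip county (cnty_part_area): keep if _n == _N

                    list nhgisnam cd_statefip county district cnty_part_area, noobs
                    ```

                    With these rows, Fresno is assigned to district 21 and Glenn to district 2. Note that Fresno's largest part is only about 54% of its cnty_area in #1, so a single-district assignment can discard a lot of the county. If two parts were exactly tied, this would keep one of them arbitrarily.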
