
  • Representing a string variable by numerical categories

    year unique_id psu_no hh_no zila zila_name
    2019 11005105 1005 105 1
    2020 11008141 1008 141 1
    2015 11005026 1005 26 1
    2018 11010044 1010 44 1
    2021 11004042 1004 42 1 Bagerhat
    2018 11010106 1010 106 1
    2015 11012102 1012 102 1
    2017 11015109 1015 109 1
    2020 11002154 1002 154 1
    2019 11015144 1015 144 1
    2020 11020008 1020 8 1
    2021 11005121 1005 121 1 Bagerhat



    In my data set, only the year 2021 has a district name, which is a string variable. The other years have only "numerical categories" representing particular districts; for example, in 2020, "1" represents the "Bagerhat" district. Now I want to generate "Bagerhat" if zila==1, and so on. I have shared only a small portion of my data. Please help me solve it. Thank you.
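    For only a handful of districts, a direct approach would be enough; a minimal sketch (only zila == 1 meaning "Bagerhat" is confirmed by the data shown):
    Code:
    replace zila_name = "Bagerhat" if zila == 1
    * ...and so on, one -replace- per district code
    With many districts, though, this quickly becomes tedious.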

    I tried to use the -dataex- command, but it gives this error message:

    Code:
    dataex in 1/5
    input statement exceeds linesize limit. Try specifying fewer variables

  • #2
    I assume that you have a relatively large number of districts. On that assumption, the most convenient way to fill in the missing values of zila_name is to first prepare a data file containing two variables, zila and zila_name, with one observation for each possible value of zila. Then you can use the -merge- command, with zila as the key variable, to fill in the missing names. For this -merge- command, the file you show us above would be the so-called "master" file and this new file would be the "using" file. Your -merge- command would be something like this:
    Code:
    merge m:1 zila using "Your Using File.dta", update
    This simple -merge- presumes that no district name changed over time; if names change over time, the merge key variables would also need to include "year," as would your "using" file.
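    If the 2021 observations are the only source of the names, that "using" file could be built along these lines (a sketch; it assumes zila_name is non-missing only in 2021, and "zila_names.dta" is just an example filename):
    Code:
    preserve
    keep if zila_name != ""      // keep only the observations that carry a district name
    keep zila zila_name
    duplicates drop              // one row per district code
    save "zila_names.dta", replace
    restore
    After that, the -merge- command above, pointed at "zila_names.dta", fills in the missing names.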

    If you have trouble doing this, please post examples of the data from your original file and this new "using" file.

    Regarding your problem using -dataex-: the error message told you what to do, namely to specify a smaller number of variables in your -dataex- command. Because you did not specify a -varlist-, -dataex- defaulted to trying to include all of your variables, which is presumably a large number.

    You can see examples of this kind of -dataex- specification by reading -help dataex-. You might, for example, do something like this:
    Code:
    dataex year unique_id psu_no hh_no zila zila_name in 1/5
    Last edited by Mike Lacy; 10 Feb 2024, 11:12.



    • #3

      Here is the master data.

      Code:
      clear
      input float year long unique_id double(psu_no hh_no zila) str15 zila_name
      2019 11005105 1005 105 1 ""        
      2020 11008141 1008 141 1 ""        
      2015 11005026 1005  26 1 ""        
      2018 11010044 1010  44 1 ""        
      2021 11004042 1004  42 1 "Bagerhat"
      end

      Here is the using data:

      Code:
      clear
      input float year long unique_id double(psu_no hh_no) byte zila str15 zila_name
      2021 10001003 1 3 4 "Barguna"
      2021 10001004 1 4 4 "Barguna"
      2021 10001005 1 5 4 "Barguna"
      2021 10001006 1 6 4 "Barguna"
      2021 10001007 1 7 4 "Barguna"
      end

      Here 1 means "Bagerhat", 4 means "Barguna". In the main dataset, every number represents a district.

      Yes, you are right. There are lots of districts and sub-districts; I have shared only a small portion of my data. If I try to merge them with -merge m:1 zila using "/Users/user/Desktop/Tafsil-2h_2021.dta"-, I get the message "variable zila does not uniquely identify observations in the using data", perhaps because of repeated district numbers. I tried an m:m merge, but it does not produce useful results.



      • #4
        In the using data, do this:
        Code:
        keep zila zila_name
        duplicates drop
        Now save this as a .dta file. Call it crosswalk.dta
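        For instance, the save step, together with a check that zila is now a unique key, might look like this (a sketch):
        Code:
        isid zila                  // errors out if any zila value still appears more than once
        save crosswalk, replace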

        Then
        Code:
        use master_data, clear
        merge m:1 zila using crosswalk, update
        By the way, never use -merge m:m-. Even the Stata documentation says you shouldn't use it. It produces data salad.
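        After the -merge-, it is worth inspecting the match results, for example:
        Code:
        tab _merge
        list zila if _merge == 1   // master observations whose zila found no match in the crosswalk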



        • #5
          Thanks Clyde for your very good suggestion. Thanks Mike.



          • #6
            One more problem related to this. Below is the using data for the merge. Similarly, except for 2021, the master data set has no upazila name, only a code indicator in the upza variable. The codes range from 1 to 99 in both zila and upza, so only the pair of codes identifies a place: for example, zila 1 with upza 14 is one upazila, while zila 3 with upza 14 is another. If I drop the duplicates on zila alone, then one zila is matched to one upza, which is misleading. How can I get the desired result?


            Using data:

            Code:
            clear
            input byte(zila upza) str15 zila_name str23 upz_name
            1 8 "Bagerhat" "Bagerhat Sadar"
            1 14 "Bagerhat" "Chitalmari"
            1 34 "Bagerhat" "Fakirhat"
            1 38 "Bagerhat" "Kachua"
            1 56 "Bagerhat" "Mollahat"
            1 58 "Bagerhat" "Mongla"
            1 60 "Bagerhat" "Morrelganj"
            1 73 "Bagerhat" "Rampal"
            1 77 "Bagerhat" "Sarankhola"
            3 4 "Bandarban" "Alikadam"
            3 14 "Bandarban" "Bandarban Sadar"
            3 51 "Bandarban" "Lama"
            3 73 "Bandarban" "Naikhongchhari"
            3 89 "Bandarban" "Rowangchhari"
            3 91 "Bandarban" "Ruma"
            3 95 "Bandarban" "Thanchi"
            4 9 "Barguna" "Amtali"
            4 19 "Barguna" "Bamna"
            4 28 "Barguna" "Barguna Sadar"
            4 47 "Barguna" "Betagi"
            end



            Master data:

            Code:
            clear
            input float year long unique_id double zila str15 zila_name double upza str23 upz_name
            2019 11005043 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11003106 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11005111 1 "Bagerhat" 8 "Bagerhat Sadar"
            2017 11002116 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11002122 1 "Bagerhat" 8 "Bagerhat Sadar"
            2016 11005044 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11003008 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11005129 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11003146 1 "Bagerhat" 8 "Bagerhat Sadar"
            2015 11003124 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11003129 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11002043 1 "Bagerhat" 8 "Bagerhat Sadar"
            2016 11003186 1 "Bagerhat" 8 "Bagerhat Sadar"
            2021 11005100 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11005132 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11004121 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11003041 1 "Bagerhat" 8 "Bagerhat Sadar"
            2021 11003272 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11005137 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11002095 1 "Bagerhat" 8 "Bagerhat Sadar"
            end





            • #7
              So, the actual data has a more complex structure than you implied in your original post. But the solution is not much different.
              Code:
              keep zila upza zila_name upz_name
              duplicates drop
              and then -merge m:1 zila upza-, again with the -update- option.
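              Spelled out, with "crosswalk2.dta" as a placeholder name for the saved two-key file, the merge step would be:
              Code:
              merge m:1 zila upza using crosswalk2, update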



              • #8
                Yes, I created an ID for each lowest geographical unit for identification. Then I merged. Here is the sample code that I used. Thanks.



                Code:
                gen zila_new  = "1" + string(zila, "%02.0f")   // district code, zero-padded, with a leading 1
                gen upza_new  = string(upza, "%02.0f")         // upazila code, zero-padded
                gen union_new = string(union, "%02.0f")        // union code, zero-padded
                gen mauza_new = string(mauza, "%03.0f")        // mauza code, zero-padded to 3 digits
                gen id = zila_new + upza_new + union_new + mauza_new   // concatenated geographic key
                destring id, gen(unique_mauza)                 // numeric version of the key for merging
                merge m:1 unique_mauza using mauza, update     // fill in missing names from mauza.dta
                save, replace
                Last edited by Sk Moniruzzaman; 14 Feb 2024, 05:40.
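                One check worth running before that -merge m:1- is that the key is unique in the using file (this assumes mauza.dta already contains the same constructed unique_mauza variable):
                Code:
                use mauza, clear
                isid unique_mauza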

