
  • Representing a string variable by numerical categories

    year unique_id psu_no hh_no zila zila_name
    2019 11005105 1005 105 1
    2020 11008141 1008 141 1
    2015 11005026 1005 26 1
    2018 11010044 1010 44 1
    2021 11004042 1004 42 1 Bagerhat
    2018 11010106 1010 106 1
    2015 11012102 1012 102 1
    2017 11015109 1015 109 1
    2020 11002154 1002 154 1
    2019 11015144 1015 144 1
    2020 11020008 1020 8 1
    2021 11005121 1005 121 1 Bagerhat



    In my data set, only the year 2021 has a district name, which is a string variable. The other years have only "numerical categories" representing particular districts; for example, in 2020, "1" represents the "Bagerhat" district. Now I want to generate "Bagerhat" if zila==1, and so on. I have shared only a small portion of my data. Please help me solve it. Thank you.
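    For only a handful of districts, a direct approach would be enough; a minimal sketch (only zila == 1 meaning "Bagerhat" is confirmed by the data shown):
    Code:
    replace zila_name = "Bagerhat" if zila == 1
    * ...and so on, one -replace- per district code
    With many districts, though, this quickly becomes tedious.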

    I tried to use the -dataex- command, but it gives this error message:

    Code:
    dataex in 1/5
    input statement exceeds linesize limit. Try specifying fewer variables

  • #2
    I assume that you have a relatively large number of districts. On that assumption, the most convenient way to fill in the missing values of zila_name is to first prepare a data file containing two variables, zila and zila_name, with one observation for each possible value of zila. Then you can use the -merge- command, with zila as the key variable, to fill in the missing names. For this -merge- command, the file you show us above would be the so-called "master" file and this new file would be the "using" file. Your -merge- command would be something like this:
    Code:
    merge m:1 zila using "Your Using File.dta", update
    This simple -merge- presumes that no district name changed over time; if names change over time, the merge key variables would also need to include "year," as would your "using" file.
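    If the 2021 observations are the only source of the names, that "using" file could be built along these lines (a sketch; it assumes zila_name is non-missing only in 2021, and "zila_names.dta" is just an example filename):
    Code:
    preserve
    keep if zila_name != ""      // keep only the observations that carry a district name
    keep zila zila_name
    duplicates drop              // one row per district code
    save "zila_names.dta", replace
    restore
    After that, the -merge- command above, pointed at "zila_names.dta", fills in the missing names.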

    If you have trouble doing this, please post examples of the data from your original file and this new "using" file.

    Regarding your problem using -dataex-: the error message told you what to do, namely to specify a smaller number of variables in your -dataex- command. Because you did not specify a -varlist-, -dataex- defaulted to trying to include all of your variables, which is presumably a large number.

    You can see examples of this kind of -dataex- specification by reading -help dataex-. You might, for example, do something like this:
    Code:
    dataex year unique_id psu_no hh_no zila zila_name in 1/5
    Last edited by Mike Lacy; 10 Feb 2024, 11:12.



    • #3

      Here is the master data.

      Code:
      clear
      input float year long unique_id double(psu_no hh_no zila) str15 zila_name
      2019 11005105 1005 105 1 ""        
      2020 11008141 1008 141 1 ""        
      2015 11005026 1005  26 1 ""        
      2018 11010044 1010  44 1 ""        
      2021 11004042 1004  42 1 "Bagerhat"
      end

      Here is the using data:

      Code:
      clear
      input float year long unique_id double(psu_no hh_no) byte zila str15 zila_name
      2021 10001003 1 3 4 "Barguna"
      2021 10001004 1 4 4 "Barguna"
      2021 10001005 1 5 4 "Barguna"
      2021 10001006 1 6 4 "Barguna"
      2021 10001007 1 7 4 "Barguna"
      end

      Here 1 means "Bagerhat", 4 means "Barguna". In the main dataset, every number represents a district.

      Yes, you are right. There are lots of districts and sub-districts; I have shared only a small portion of my data. If I try to merge them with -merge m:1 zila using "/Users/user/Desktop/Tafsil-2h_2021.dta"-, I get the message "variable zila does not uniquely identify observations in the using data", perhaps because of repeated district numbers. I tried an m:m merge, but it does not produce useful results.



      • #4
        In the using data, do this:
        Code:
        keep zila zila_name
        duplicates drop
        Now save this as a .dta file. Call it crosswalk.dta
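        For instance, the save step, together with a check that zila is now a unique key, might look like this (a sketch):
        Code:
        isid zila                  // errors out if any zila value still appears more than once
        save crosswalk, replace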

        Then
        Code:
        use master_data, clear
        merge m:1 zila using crosswalk, update
        By the way, never use -merge m:m-. Even the Stata documentation says you shouldn't use it. It produces data salad.
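        After the -merge-, it is worth inspecting the match results, for example:
        Code:
        tab _merge
        list zila if _merge == 1   // master observations whose zila found no match in the crosswalk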



        • #5
          Thanks Clyde for your very good suggestion. Thanks Mike.



          • #6
            One more problem related to this. Below is the using data for the merge. Similarly, except for 2021, the master data set has no upazila name, only a code indicator in the upza variable. The codes range from 1 to 99 in both zila and upza, so only the pair of codes identifies a place: for example, zila 1 with upza 14 is one upazila, while zila 3 with upza 14 is another. If I drop the duplicates on zila alone, then one zila is matched to one upza, which is misleading. How can I get the desired result?


            Using data:

            Code:
            clear
            input byte(zila upza) str15 zila_name str23 upz_name
            1 8 "Bagerhat" "Bagerhat Sadar"
            1 14 "Bagerhat" "Chitalmari"
            1 34 "Bagerhat" "Fakirhat"
            1 38 "Bagerhat" "Kachua"
            1 56 "Bagerhat" "Mollahat"
            1 58 "Bagerhat" "Mongla"
            1 60 "Bagerhat" "Morrelganj"
            1 73 "Bagerhat" "Rampal"
            1 77 "Bagerhat" "Sarankhola"
            3 4 "Bandarban" "Alikadam"
            3 14 "Bandarban" "Bandarban Sadar"
            3 51 "Bandarban" "Lama"
            3 73 "Bandarban" "Naikhongchhari"
            3 89 "Bandarban" "Rowangchhari"
            3 91 "Bandarban" "Ruma"
            3 95 "Bandarban" "Thanchi"
            4 9 "Barguna" "Amtali"
            4 19 "Barguna" "Bamna"
            4 28 "Barguna" "Barguna Sadar"
            4 47 "Barguna" "Betagi"
            end



            Master data:

            Code:
            clear
            input float year long unique_id double zila str15 zila_name double upza str23 upz_name
            2019 11005043 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11003106 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11005111 1 "Bagerhat" 8 "Bagerhat Sadar"
            2017 11002116 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11002122 1 "Bagerhat" 8 "Bagerhat Sadar"
            2016 11005044 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11003008 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11005129 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11003146 1 "Bagerhat" 8 "Bagerhat Sadar"
            2015 11003124 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11003129 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11002043 1 "Bagerhat" 8 "Bagerhat Sadar"
            2016 11003186 1 "Bagerhat" 8 "Bagerhat Sadar"
            2021 11005100 1 "Bagerhat" 8 "Bagerhat Sadar"
            2019 11005132 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11004121 1 "Bagerhat" 8 "Bagerhat Sadar"
            2020 11003041 1 "Bagerhat" 8 "Bagerhat Sadar"
            2021 11003272 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11005137 1 "Bagerhat" 8 "Bagerhat Sadar"
            2018 11002095 1 "Bagerhat" 8 "Bagerhat Sadar"
            end





            • #7
              So, the actual data has a more complex structure than you implied in your original post. But the solution is not much different.
              Code:
              keep zila upza zila_name upz_name
              duplicates drop
              and then -merge m:1 zila upza-, again with the -update- option.
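              Spelled out, with "crosswalk2.dta" as a placeholder name for the saved two-key file, the merge step would be:
              Code:
              merge m:1 zila upza using crosswalk2, update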



              • #8
                Yes, I created an ID for each lowest geographical unit for identification. Then I merged. Here is the sample code that I used. Thanks.



                Code:
                gen zila_new  = "1" + string(zila, "%02.0f")   // district code, zero-padded, with a leading 1
                gen upza_new  = string(upza, "%02.0f")         // upazila code, zero-padded
                gen union_new = string(union, "%02.0f")        // union code, zero-padded
                gen mauza_new = string(mauza, "%03.0f")        // mauza code, zero-padded to 3 digits
                gen id = zila_new + upza_new + union_new + mauza_new   // concatenated geographic key
                destring id, gen(unique_mauza)                 // numeric version of the key for merging
                merge m:1 unique_mauza using mauza, update     // fill in missing names from mauza.dta
                save, replace
                Last edited by Sk Moniruzzaman; 14 Feb 2024, 05:40.
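                One check worth running before that -merge m:1- is that the key is unique in the using file (this assumes mauza.dta already contains the same constructed unique_mauza variable):
                Code:
                use mauza, clear
                isid unique_mauza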

