Need help with sorting

Abhishek Patil

Join Date: Dec 2020

Posts: 28
#1

Need help with sorting

30 Mar 2021, 15:23

Hello,

I am working with a dataset with merged data from 2 survey rounds (2005 & 20011).

The first column/variable 'id' is the unique identification number for the individuals.
The second column/variable "SURVEY" represents whether the observation is from survey 1 or 2
The third column/variable "HHBASE" represents the unique Household Base ID which remains unchanged for the two surveys

As seen in the sample data below, after I have already "sort id", I was expecting that observations from row 11 to row 20 should have been placed before right after the first observation, but it does not seem to be the case. Any suggestions on how this can be fixed?

The idea is that I would like to sort in a way that allows me to see the two observations (from surveys 1 and 2 respectively) for the same id to appear together (if that makes sense).

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str12 id int SURVEY double HHBASE "10102010201" 2 1010201020 "101020102010" 2 1010201020 "101020102010" 1 1010201020 "101020102011" 2 1010201020 "101020102012" 2 1010201020 "101020102013" 2 1010201020 "101020102014" 2 1010201020 "101020102015" 2 1010201020 "101020102016" 2 1010201020 "101020102017" 2 1010201020 "10102010203" 2 1010201020 "10102010204" 2 1010201020 "10102010205" 2 1010201020 "10102010205" 1 1010201020 "10102010206" 1 1010201020 "10102010206" 2 1010201020 "10102010207" 2 1010201020 "10102010207" 1 1010201020 "10102010208" 2 1010201020 "10102010209" 2 1010201020 end label values SURVEY SURVEY label def SURVEY 1 "IHDS1 1", modify label def SURVEY 2 "IHDS2 2", modify
Tags: data, panel, panel data, sample, sort
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

30 Mar 2021, 15:50

Dictionary rules for strings: other things being equal a string ending 17 always sorts before one ending 3. just as "heroic" sorts before "heroine". The last digits aren't treated as if 3 were less than 17 numerically.

Perhaps you should have 12 character identifiers consistently. If you arrange that
"10102010203" is instead
"101020102003" then you may get the sort order you expect. (Who needs 12 digit identifiers any way? There aren't 1 trillion people in your survey.)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

30 Mar 2021, 15:57

Your id variable is a string variable. Stata sorts string variables in alphabetical order: the sort order you see is correct from that perspective. Alphabetical order, however, does not correspond to numeric order, which, I gather, is what you would prefer. Your simplest solution is to create a numeric equivalent of id and then sort on that variable:

Code:

destring id, gen(id_n) format id_n %12.0f sort id_n

You could actually get rid of id and just work with id_n going forward. But if you will need to combine this data set with other data sets from the survey, that could create problems for you. So, unless you are sure you will never need to do that, just keep both variables around.
Comment
Abhishek Patil

Join Date: Dec 2020

Posts: 28
#4

31 Mar 2021, 03:02

Originally posted by Nick Cox View Post

Dictionary rules for strings: other things being equal a string ending 17 always sorts before one ending 3. just as "heroic" sorts before "heroine". The last digits aren't treated as if 3 were less than 17 numerically.

Perhaps you should have 12 character identifiers consistently. If you arrange that
"10102010203" is instead
"101020102003" then you may get the sort order you expect. (Who needs 12 digit identifiers any way? There aren't 1 trillion people in your survey.)

Hi Nick,

Thanks for the reply. I appreciate the explanation for the problem I was facing. It will definitely make my work with the string variables and sorting easier in the future.

I agree that the 12 digits identifiers are a bit excessive and only make things appear more complex. However, the data I am working with is based of an Indian census. The individuals' id is made up of a combination of State id + District id + village/neighbourhood id + Household id + Person id. However, I still do think that it could be done in an easier manner. I am not so experienced with STATA and so am afraid to try and create my own identifiers.

Furthermore, the task is a bit difficult since the survey in round 2 did not use an identification method that is exactly consistent with round 1. While in round 1, the household id was either a 1-digit or 2-digit number, round 2 identifies the same households using 2- and 3-digit household ids.
Comment
Abhishek Patil

Join Date: Dec 2020

Posts: 28
#5

31 Mar 2021, 03:10

Originally posted by Clyde Schechter View Post

Your id variable is a string variable. Stata sorts string variables in alphabetical order: the sort order you see is correct from that perspective. Alphabetical order, however, does not correspond to numeric order, which, I gather, is what you would prefer. Your simplest solution is to create a numeric equivalent of id and then sort on that variable:

Code:

destring id, gen(id_n) format id_n %12.0f sort id_n

You could actually get rid of id and just work with id_n going forward. But if you will need to combine this data set with other data sets from the survey, that could create problems for you. So, unless you are sure you will never need to do that, just keep both variables around.

Hi Clyde,

Thank you for your reply. I understand the intuition behind the sorting of string variables now, thank you for that!

I took your suggestion of destringing the variable and then formatting with the new id_n variable. To a great extent, this does the job.

I was hoping you could help guide me with the further problem I am now facing.

In the dataset now, I have both survey 1 and survey 2 observations. I only wish to keep the survey 2 observations (id's) that have a corresponding matching observation with the same id but for survey 1.

I had previously tried an approach using the string variable where the corresponding observation from the other survey would be placed right under the observation of interest. Then I was using the bysort function to create a new variable that would take value 1 whenever there was a matching id right after the observation, and then dropping the rest. However, now having sorted as numeric, the matching observation id for observation 1 gets pushed down somewhere in the second half of the dataset.

What do you suggest can be done in this case?

Kind regards,
Abhishek
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

31 Mar 2021, 04:54

As already hinted, it could be that something like direct concatenation will get you where you want to be. You can combine string variables by addition.

Code:

gen wanted = id1 + id2

The question is whether you need work on what is given to get id1 and id2 in exactly the desired form.

I can't advise on inconsistent schemes in different survey rounds but you can homogenise 1, 2 and 3 digit identifiers like this. If they are numeric then a format like %03.0f adds leading zeros as needed. These examples all use display

Code:

. di %03.0f 1 001 . di %03.0f 12 012 . di %03.0f 123 123

but

Code:

gen wanted = strofreal(given, "%03.0f")

could be used on a numeric variable.

If 1 12 123 or whatever arrive in a string variable then

Code:

gen wanted = substr("000", 3 - length(given)) + given

adds as many leading zeros as needed. (In practice, make sure that the string variable doesn't include any spaces.)
Comment

Announcement

Need help with sorting

Comment

Comment

Comment

Comment

Comment