Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Need help with sorting

    Hello,

    I am working with a dataset with merged data from 2 survey rounds (2005 & 20011).

    The first column/variable 'id' is the unique identification number for the individuals.
    The second column/variable "SURVEY" represents whether the observation is from survey 1 or 2
    The third column/variable "HHBASE" represents the unique Household Base ID which remains unchanged for the two surveys

    As seen in the sample data below, after I have already "sort id", I was expecting that observations from row 11 to row 20 should have been placed before right after the first observation, but it does not seem to be the case. Any suggestions on how this can be fixed?

    The idea is that I would like to sort in a way that allows me to see the two observations (from surveys 1 and 2 respectively) for the same id to appear together (if that makes sense).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str12 id int SURVEY double HHBASE
    "10102010201"  2 1010201020
    "101020102010" 2 1010201020
    "101020102010" 1 1010201020
    "101020102011" 2 1010201020
    "101020102012" 2 1010201020
    "101020102013" 2 1010201020
    "101020102014" 2 1010201020
    "101020102015" 2 1010201020
    "101020102016" 2 1010201020
    "101020102017" 2 1010201020
    "10102010203"  2 1010201020
    "10102010204"  2 1010201020
    "10102010205"  2 1010201020
    "10102010205"  1 1010201020
    "10102010206"  1 1010201020
    "10102010206"  2 1010201020
    "10102010207"  2 1010201020
    "10102010207"  1 1010201020
    "10102010208"  2 1010201020
    "10102010209"  2 1010201020
    end
    label values SURVEY SURVEY
    label def SURVEY 1 "IHDS1 1", modify
    label def SURVEY 2 "IHDS2 2", modify

  • #2
    Dictionary rules for strings: other things being equal a string ending 17 always sorts before one ending 3. just as "heroic" sorts before "heroine". The last digits aren't treated as if 3 were less than 17 numerically.

    Perhaps you should have 12 character identifiers consistently. If you arrange that
    "10102010203" is instead
    "101020102003" then you may get the sort order you expect. (Who needs 12 digit identifiers any way? There aren't 1 trillion people in your survey.)

    Comment


    • #3
      Your id variable is a string variable. Stata sorts string variables in alphabetical order: the sort order you see is correct from that perspective. Alphabetical order, however, does not correspond to numeric order, which, I gather, is what you would prefer. Your simplest solution is to create a numeric equivalent of id and then sort on that variable:

      Code:
      destring id, gen(id_n)
      format id_n %12.0f
      sort id_n
      You could actually get rid of id and just work with id_n going forward. But if you will need to combine this data set with other data sets from the survey, that could create problems for you. So, unless you are sure you will never need to do that, just keep both variables around.

      Comment


      • #4

        Originally posted by Nick Cox View Post
        Dictionary rules for strings: other things being equal a string ending 17 always sorts before one ending 3. just as "heroic" sorts before "heroine". The last digits aren't treated as if 3 were less than 17 numerically.

        Perhaps you should have 12 character identifiers consistently. If you arrange that
        "10102010203" is instead
        "101020102003" then you may get the sort order you expect. (Who needs 12 digit identifiers any way? There aren't 1 trillion people in your survey.)
        Hi Nick,

        Thanks for the reply. I appreciate the explanation for the problem I was facing. It will definitely make my work with the string variables and sorting easier in the future.

        I agree that the 12 digits identifiers are a bit excessive and only make things appear more complex. However, the data I am working with is based of an Indian census. The individuals' id is made up of a combination of State id + District id + village/neighbourhood id + Household id + Person id. However, I still do think that it could be done in an easier manner. I am not so experienced with STATA and so am afraid to try and create my own identifiers.

        Furthermore, the task is a bit difficult since the survey in round 2 did not use an identification method that is exactly consistent with round 1. While in round 1, the household id was either a 1-digit or 2-digit number, round 2 identifies the same households using 2- and 3-digit household ids.



        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Your id variable is a string variable. Stata sorts string variables in alphabetical order: the sort order you see is correct from that perspective. Alphabetical order, however, does not correspond to numeric order, which, I gather, is what you would prefer. Your simplest solution is to create a numeric equivalent of id and then sort on that variable:

          Code:
          destring id, gen(id_n)
          format id_n %12.0f
          sort id_n
          You could actually get rid of id and just work with id_n going forward. But if you will need to combine this data set with other data sets from the survey, that could create problems for you. So, unless you are sure you will never need to do that, just keep both variables around.
          Hi Clyde,

          Thank you for your reply. I understand the intuition behind the sorting of string variables now, thank you for that!

          I took your suggestion of destringing the variable and then formatting with the new id_n variable. To a great extent, this does the job.

          I was hoping you could help guide me with the further problem I am now facing.

          In the dataset now, I have both survey 1 and survey 2 observations. I only wish to keep the survey 2 observations (id's) that have a corresponding matching observation with the same id but for survey 1.

          I had previously tried an approach using the string variable where the corresponding observation from the other survey would be placed right under the observation of interest. Then I was using the bysort function to create a new variable that would take value 1 whenever there was a matching id right after the observation, and then dropping the rest. However, now having sorted as numeric, the matching observation id for observation 1 gets pushed down somewhere in the second half of the dataset.

          What do you suggest can be done in this case?

          Kind regards,
          Abhishek

          Comment


          • #6
            As already hinted, it could be that something like direct concatenation will get you where you want to be. You can combine string variables by addition.

            Code:
            gen wanted = id1 + id2
            The question is whether you need work on what is given to get id1 and id2 in exactly the desired form.

            I can't advise on inconsistent schemes in different survey rounds but you can homogenise 1, 2 and 3 digit identifiers like this. If they are numeric then a format like %03.0f adds leading zeros as needed. These examples all use display

            Code:
            . di %03.0f 1
            001
            
            . di %03.0f 12
            012
            
            . di %03.0f 123
            123
            but

            Code:
            gen wanted = strofreal(given, "%03.0f")
            could be used on a numeric variable.

            If 1 12 123 or whatever arrive in a string variable then

            Code:
            gen wanted = substr("000", 3 - length(given)) + given
            adds as many leading zeros as needed. (In practice, make sure that the string variable doesn't include any spaces.)


            Comment

            Working...
            X