Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to flag observations that do not have a particular value on a variable?

    Hi statalist,

    Please consider the following sample data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 var1 str6 var2
    "id"  "itemid"
    "111" "1"     
    "111" "2"     
    "111" "3"     
    "112" "2"     
    "113" "1"     
    "113" "2"     
    "114" "1"     
    "114" "2"     
    "114" "3"     
    "115" "3"     
    end
    Here, hhs, are identified by the variable id. HHs can have different items of value, with the most valuable item having itemid=1 and items of lower value would therefore be numbered by 2, 3, etc. As can be seen, some hhs can have only one item of value, some can have more than one. Under these scheme, all hhs should have at least one item with itemid=1 (if it is their only valuable item, it would have itemid=1), and can have other items with higher values of itemid. Thus the number of observations when itemid=1 should equal the number of unique households, given by id.

    However, there is some error in my data where I suspect some hhs do not have any item with itemid=1. For eg, in the sample data, id 112, and id 115 have one item each and should have had itemid=1 against each of these items. But instead, 112 has itemid 2 against it, and 115 has itemid 3 against it.

    I am trying to write a code that would flag these two ids:112 and 115, which do not have 1 on their itemid but have some other number.

    it would be great if anyone could suggest how to go about this problem. Please comment if I have not been able to explain the problem well and you need clarification.

    Thanks,


  • #2
    Code:
    egen flag = min(real(itemid)>1) ,by(id)

    Comment


    • #3
      This is just to flag that the data example is messed up because of a basic mistake: the variables are string because the first observation containing metadata should not have been imported.

      Sometimes it is best to go back and re-do the import. In this case, you can also clean up like this


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str3 var1 str6 var2
      "id"  "itemid"
      "111" "1"     
      "111" "2"     
      "111" "3"     
      "112" "2"     
      "113" "1"     
      "113" "2"     
      "114" "1"     
      "114" "2"     
      "114" "3"     
      "115" "3"     
      end
      
      foreach v of var var* { 
          rename `v' `=`v'[1]'
      }
      
      drop in 1 
      
      destring, replace 
      
      list, sepby(id)
      
      
           +--------------+
           |  id   itemid |
           |--------------|
        1. | 111        1 |
        2. | 111        2 |
        3. | 111        3 |
           |--------------|
        4. | 112        2 |
           |--------------|
        5. | 113        1 |
        6. | 113        2 |
           |--------------|
        7. | 114        1 |
        8. | 114        2 |
        9. | 114        3 |
           |--------------|
       10. | 115        3 |
           +--------------+

      Comment


      • #4
        Originally posted by Nick Cox View Post
        This is just to flag that the data example is messed up because of a basic mistake: the variables are string because the first observation containing metadata should not have been imported.

        Sometimes it is best to go back and re-do the import. In this case, you can also clean up like this


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str3 var1 str6 var2
        "id" "itemid"
        "111" "1"
        "111" "2"
        "111" "3"
        "112" "2"
        "113" "1"
        "113" "2"
        "114" "1"
        "114" "2"
        "114" "3"
        "115" "3"
        end
        
        foreach v of var var* {
        rename `v' `=`v'[1]'
        }
        
        drop in 1
        
        destring, replace
        
        list, sepby(id)
        
        
        +--------------+
        | id itemid |
        |--------------|
        1. | 111 1 |
        2. | 111 2 |
        3. | 111 3 |
        |--------------|
        4. | 112 2 |
        |--------------|
        5. | 113 1 |
        6. | 113 2 |
        |--------------|
        7. | 114 1 |
        8. | 114 2 |
        9. | 114 3 |
        |--------------|
        10. | 115 3 |
        +--------------+
        Hi Nick, could you please clarify what you meant by "first observation containing metadata should not have been imported"? These id variables are strings in my original data as well. Thanks

        Comment


        • #5
          Your first observation

          Code:
           "id" "itemid"
          manifestly should have been read in as variable names, not as data values.

          Many users of spreadsheets and similar applications put metadata in the first row(s) of a worksheet.

          So far, so good, but that approach makes no sense in Stata.

          First of all, names or other text in the first observation condemn a variable to be string, which renders numeric calculations impossible.

          Second, even if a variable really should be (or could be) string, the name of the variable should be ... the name of the variable.

          The problem arises from an
          import in which an entire worksheet has been read in. You should know what happened, and even if the blame lies on whoever supplied the data, the data need cleaning up.

          Comment


          • #6
            Originally posted by Nick Cox View Post
            Your first observation

            Code:
            "id" "itemid"
            manifestly should have been read in as variable names, not as data values.

            Many users of spreadsheets and similar applications put metadata in the first row(s) of a worksheet.

            So far, so good, but that approach makes no sense in Stata.

            First of all, names or other text in the first observation condemn a variable to be string, which renders numeric calculations impossible.

            Second, even if a variable really should be (or could be) string, the name of the variable should be ... the name of the variable.

            The problem arises from an
            import in which an entire worksheet has been read in. You should know what happened, and even if the blame lies on whoever supplied the data, the data need cleaning up.
            oh understood. Thanks a lot for clarifying!

            Comment


            • #7
              Originally posted by Øyvind Snilsberg View Post
              Code:
              egen flag = min(real(itemid)>1) ,by(id)
              Thank you! it worked perfectly

              Comment

              Working...
              X