  • Drop values that are more than 0.1 decimal points

    Dear all,
    I have a simple question but am not sure how to address it. I have the following dataset:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double hhs_id int menu double weight
    474.1  315   20
    474.1 2891  585
    474.1 2773 2501
    474.1 2771  550
    474.1 2781 1080
    474.1 2772 2714
    474.1  303   20
    474.1 2522    3
    474.1 2772   60
    474.1  316   10
    474.1  313 1000
    474.1  303   20
    474.1 2881  117
    474.1 2892 2695
    474.2 2891  486
    474.2  621  250
    474.2  303   20
    474.2 2771  863
    474.2 2892  405
    474.2 2522    8
    474.2 2781 1152
    474.2 2881  706
    474.2 2772 1751
    end
    label values menu x1_05
    label def x1_05 303 "Biscuit", modify
    label def x1_05 313 "Any boiled food", modify
    label def x1_05 315 "Betel Leaf", modify
    label def x1_05 316 "Supari", modify
    label def x1_05 621 "White Sweet Potato", modify
    label def x1_05 2522 "Salt (Iodine)", modify
    label def x1_05 2771 "Rice/Jao", modify
    label def x1_05 2772 "Rice/Jao", modify
    label def x1_05 2773 "Rice/Jao", modify
    label def x1_05 2781 "Panta Bhaat", modify
    label def x1_05 2881 "Bhaji", modify
    label def x1_05 2891 "Jhol curry", modify
    label def x1_05 2892 "Jhol curry", modify
    For the variable hhs_id I want to drop the observations whose decimal part is greater than .1. For instance, in the dataex example above I want to retain values that are exactly 474.1 but drop values that are 474.2. Any advice is welcome.


  • #2
    Shailaja:
    I would recommend flagging, rather than -drop-ping, the observations you're not interested in and ruling them out of statistical analyses via an -if- clause.
    You can start by flagging them this way:
    Code:
    . g wanted=1 if hhs_id>474.1
    (14 missing values generated)
    
    . replace wanted=0 if hhs_id<=474.1
    (14 real changes made)
    
    . list
    
         +-----------------------------------------------+
         | hhs_id                 menu   weight   wanted |
         |-----------------------------------------------|
      1. |  474.1           Betel Leaf       20        0 |
      2. |  474.1           Jhol curry      585        0 |
      3. |  474.1             Rice/Jao     2501        0 |
      4. |  474.1             Rice/Jao      550        0 |
      5. |  474.1          Panta Bhaat     1080        0 |
         |-----------------------------------------------|
      6. |  474.1             Rice/Jao     2714        0 |
      7. |  474.1              Biscuit       20        0 |
      8. |  474.1        Salt (Iodine)        3        0 |
      9. |  474.1             Rice/Jao       60        0 |
     10. |  474.1               Supari       10        0 |
         |-----------------------------------------------|
     11. |  474.1      Any boiled food     1000        0 |
     12. |  474.1              Biscuit       20        0 |
     13. |  474.1                Bhaji      117        0 |
     14. |  474.1           Jhol curry     2695        0 |
     15. |  474.2           Jhol curry      486        1 |
         |-----------------------------------------------|
     16. |  474.2   White Sweet Potato      250        1 |
     17. |  474.2              Biscuit       20        1 |
     18. |  474.2             Rice/Jao      863        1 |
     19. |  474.2           Jhol curry      405        1 |
     20. |  474.2        Salt (Iodine)        8        1 |
         |-----------------------------------------------|
     21. |  474.2          Panta Bhaat     1152        1 |
     22. |  474.2                Bhaji      706        1 |
     23. |  474.2             Rice/Jao     1751        1 |
         +-----------------------------------------------+
    
    .
    or, coded more compactly:
    Code:
    gen wanted2=cond(missing( hhs_id ), ., cond(hhs_id >474.1,1,0))
    
    . list
    
         +---------------------------------------------------------+
         | hhs_id                 menu   weight   wanted   wanted2 |
         |---------------------------------------------------------|
      1. |  474.1           Betel Leaf       20        0         0 |
      2. |  474.1           Jhol curry      585        0         0 |
      3. |  474.1             Rice/Jao     2501        0         0 |
      4. |  474.1             Rice/Jao      550        0         0 |
      5. |  474.1          Panta Bhaat     1080        0         0 |
         |---------------------------------------------------------|
      6. |  474.1             Rice/Jao     2714        0         0 |
      7. |  474.1              Biscuit       20        0         0 |
      8. |  474.1        Salt (Iodine)        3        0         0 |
      9. |  474.1             Rice/Jao       60        0         0 |
     10. |  474.1               Supari       10        0         0 |
         |---------------------------------------------------------|
     11. |  474.1      Any boiled food     1000        0         0 |
     12. |  474.1              Biscuit       20        0         0 |
     13. |  474.1                Bhaji      117        0         0 |
     14. |  474.1           Jhol curry     2695        0         0 |
     15. |  474.2           Jhol curry      486        1         1 |
         |---------------------------------------------------------|
     16. |  474.2   White Sweet Potato      250        1         1 |
     17. |  474.2              Biscuit       20        1         1 |
     18. |  474.2             Rice/Jao      863        1         1 |
     19. |  474.2           Jhol curry      405        1         1 |
     20. |  474.2        Salt (Iodine)        8        1         1 |
         |---------------------------------------------------------|
     21. |  474.2          Panta Bhaat     1152        1         1 |
     22. |  474.2                Bhaji      706        1         1 |
     23. |  474.2             Rice/Jao     1751        1         1 |
         +---------------------------------------------------------+
    
    .
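One difference between the two versions above is how they treat missing values: Stata sorts numeric missing above every number, so a bare `hhs_id > 474.1` is true when hhs_id is missing, whereas the cond() version returns missing. A minimal sketch on a made-up three-row dataset (the missing row and the variable names naive and guarded are assumptions for illustration):

```stata
clear
input double hhs_id
474.1
474.2
.
end

// A bare comparison treats missing as larger than any number
gen byte naive = hhs_id > 474.1

// Guarding with missing() keeps the flag missing when the id is missing
gen byte guarded = cond(missing(hhs_id), ., cond(hhs_id > 474.1, 1, 0))

list
```

If hhs_id is never missing in the real data, the two approaches give the same flag.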
    Last edited by Carlo Lazzaro; 18 Nov 2021, 01:07.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Dear Carlo Lazzaro, thank you very much for your reply. Flagging the values with a decimal part greater than .1 would work very well for me.
      However, my variable hhs_id ranges from 1 to 6503. What I shared above was a small portion of my data, so how can I adapt the code you mention above to it?
      I am sharing a larger portion of the data below.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double hhs_id int(menu weight)
      495.1   12   28
      495.1  272  150
      495.1  293  140
      495.1  311  155
      495.1  315   12
      495.1  316   10
      495.1 2522   10
      495.1 2771 1918
      495.1 2772 3063
      495.1 2881  297
      495.1 2882  321
      495.1 2883  370
      495.1 2901  583
      495.1 2902  940
      495.1 2911  846
      495.2   57    3
      495.2  272  308
      495.2  272  140
      495.2  306   45
      495.2  310   20
      495.2  311  160
      495.2 2522    6
      495.2 2771 1767
      495.2 2772 3285
      495.2 2881  215
      495.2 2891 1093
      495.2 2901  415
      495.2 2902  552
      495.2 2903  955
      495.2 2904  509
      496.1   16  140
      496.1  132  180
      496.1  303  105
      496.1  315   10
      496.1  316    8
      496.2 2771 4571
      496.2 2811  480
      496.2 2881  206
      496.2 2882  506
      496.2 2891  969
      496.2 2901  788
      496.3  272  146
      496.3  284  360
      496.3  303   20
      497.1  315    2
      497.1  316    4
      497.1 2522    3
      497.1 2771 1169
      497.1 2772 2163
      497.1 2881  439
      497.2 2882  436
      497.2 2901  385
      497.2  298   35
      497.2  305   50
      497.2  312   90
      end
      label values menu x1_05
      label def x1_05 12 "Muri/Khoi (puffed rice)", modify
      label def x1_05 16 "Cerelac", modify
      label def x1_05 57 "Green chili", modify
      label def x1_05 132 "Milk", modify
      label def x1_05 272 "Tea ?prepared", modify
      label def x1_05 284 "Paes/firni/cooked firni", modify
      label def x1_05 293 "Sweets", modify
      label def x1_05 298 "Piaju", modify
      label def x1_05 303 "Biscuit", modify
      label def x1_05 305 "Patties", modify
      label def x1_05 306 "Chips", modify
      label def x1_05 310 "Murali", modify
      label def x1_05 311 "Nimki", modify
      label def x1_05 312 "Any fried food", modify
      label def x1_05 315 "Betel Leaf", modify
      label def x1_05 316 "Supari", modify
      label def x1_05 2522 "Salt (Iodine)", modify
      label def x1_05 2771 "Rice/Jao 1", modify
      label def x1_05 2772 "Rice/Jao 2", modify
      label def x1_05 2811 "Ruti/Parota 1", modify
      label def x1_05 2881 "Bhaji 1", modify
      label def x1_05 2882 "Bhaji 2", modify
      label def x1_05 2883 "Bhaji 3", modify
      label def x1_05 2891 "Jhol curry 1", modify
      label def x1_05 2901 "Bhuna curry 1", modify
      label def x1_05 2902 "Bhuna curry 2", modify
      label def x1_05 2903 "Bhuna curry 3", modify
      label def x1_05 2904 "Bhuna curry 4", modify
      label def x1_05 2911 "Daal 1", modify



      • #4
        Carlo's solution only works when all hhid's have the same "stem" of 474. I suspect that that is not the case in Shailaja's dataset. I suspect that hhid contains two bits of information: the stem and some additional info behind the decimal point (maybe wave?).

        This is not the right way to store that information. The problem is that computers store numbers in binary, and .1 in binary is like 1/3 in decimal: it just cannot be stored exactly. The best thing you can do is to first split the hhs_id variable up into its two parts:

        Code:
        gen hhs_id_stem = floor(hhs_id)
        // choose one of the lines below:
        gen hhs_id_rest = round(mod(hhs_id,1)*10) // if you are only interested in the first digit behind the decimal point
        // gen hhs_id_rest = round(mod(hhs_id,1)*100) // if you are interested in the first two digits behind the decimal point
        // gen hhs_id_rest = round(mod(hhs_id,1)*1000) // if you are interested in the first three digits behind the decimal point
        // etc.
        I would use more informative names, but since I don't know your dataset, I don't know what these parts are supposed to mean.

        After that you can use the if condition, something like sum weight if hhs_id_rest == 1

        You can drop that part of the dataset if you really don't want it: drop if hhs_id_rest > 1

        Or you can keep your original data intact, but move your selection to another frame:
        Code:
        frame put hhs_id_stem menu weight if hhs_id_rest == 1, into(hhid1)
        frame change hhid1
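The storage problem described above can be seen directly. The following is only a sketch on a few made-up ids (474.1 and 474.2 appear in the thread, 495.3 is an assumption), using the same round(mod()) idea: the raw remainder is close to, but not exactly, one tenth, so comparing it with the literal 0.1 is fragile, while the rounded integer compares safely.

```stata
clear
input double hhs_id
474.1
474.2
495.3
end

// Show the remainder with many digits: it is near 0.1 but not exactly 0.1
display %21.18f mod(474.1, 1)

// Fragile: exact comparison of the raw remainder with a decimal literal
count if mod(hhs_id, 1) == 0.1
local n_exact = r(N)

// Safer: turn the remainder into an integer first, then compare
gen byte hhs_id_rest = round(mod(hhs_id, 1) * 10)
count if hhs_id_rest == 1
local n_round = r(N)

display "exact matches: `n_exact'   rounded matches: `n_round'"
```

The exact comparison finds no observation here, although one id was typed as 474.1; the rounded version finds it.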
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          For the case of #3, it would be:

          Code:
          drop if (hhs_id - int(hhs_id)) > float(0.1)
          Crossed with #4.
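As a quick cross-check, and only as a sketch, the one-line condition from #5 and the rounding approach from #4 can be compared on a handful of ids to confirm that they mark the same observations. The sample values (including 6503.1 and 6503.2 at the top of the stated id range) and the variable names drop_5 and drop_4 are assumptions for illustration.

```stata
clear
input double hhs_id
495.1
495.2
496.1
496.3
6503.1
6503.2
end

// Condition from #5: fractional part compared with float(0.1)
gen byte drop_5 = (hhs_id - int(hhs_id)) > float(0.1)

// Condition from #4: first decimal digit recovered as an integer
gen byte drop_4 = round(mod(hhs_id, 1) * 10) > 1

// Stops with an error if the two rules ever disagree
assert drop_5 == drop_4
list
```

This assumes that ids are positive and carry only one digit after the decimal point.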



          • #6
            Applying my solutions to your new data:

            Code:
            . gen hhs_id_stem = floor(hhs_id)
            
            . gen hhs_id_rest = round(mod(hhs_id,1)*10)
            
            .
            . // solution 1
            . sum weight if hhs_id_rest == 1
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                  weight |         26    502.5385    777.9512          2       3063
            
            .
            . // solution 3
            . frame put hhs_id_stem menu weight if hhs_id_rest == 1, into(hhs_id_1)
            
            . frame change hhs_id_1
            
            . sum weight
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                  weight |         26    502.5385    777.9512          2       3063
            
            . frame change default
            
            .
            . // solution 2
            . drop if hhs_id_rest > 1
            (29 observations deleted)
            
            . sum weight
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                  weight |         26    502.5385    777.9512          2       3063



            • #7
              Dear Maarten Buis and Fei Wang, thank you very much for the help. The solutions work very well.

              Maarten Buis I will indeed try to use more informative names.
              The hhs_id variable denotes household identification numbers, and the decimal part (such as .1 and .2) is assigned to denote the number of households formed from the original household.

              Thanks again



              • #8
                Shailaja:
                the next step is a loop, then.
                Kind regards,
                Carlo
                (Stata 19.0)
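What the loop in #8 should do is not spelled out in the thread. One possible reading, offered only as a sketch, is to repeat an analysis for each value of the decimal part. The small dataset below is a made-up subset of the data in #3, and the variable names follow #6.

```stata
clear
input double hhs_id int weight
495.1 28
495.1 150
495.2 3
496.1 140
496.3 20
end

gen long hhs_id_stem = floor(hhs_id)
gen byte hhs_id_rest = round(mod(hhs_id, 1) * 10)

// Collect the distinct decimal parts, then summarize weight for each
levelsof hhs_id_rest, local(parts)
foreach p of local parts {
    display "Decimal part `p'"
    summarize weight if hhs_id_rest == `p'
}
```

For per-group statistics that do not need a loop, the by: prefix or egen may be simpler.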



                • #9
                  Thank you Carlo Lazzaro



                  • #10
                    Originally posted by Shailaja Tiwari
                    Maarten Buis I will indeed try to use more informative names.
                    My comment on using more informative names did not refer to your choice of names; they are fine. I referred to the names I chose for the variables I created (hhs_id_stem and hhs_id_rest). There I chose generic names, because I did not know the context, and I encouraged you to choose better names than my generic ones.



                    • #11
                      I will throw another idea into the pot, which is to think in terms of the equivalent string.

                      But first, as explained in #7 these look like composite identifiers for individuals within households.

                      So, what if there are 10, 20, ... members in a household? How does Stata know the difference between 123.1 and 123.10, and so on?
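The point can be illustrated with a small sketch: as numbers, 123.1 and 123.10 are the same value, while as strings they stay distinct and can be split at the decimal point. The string variable hhs_id_str and the names stem and rest are hypothetical; the thread does not say whether a string version of the identifier exists.

```stata
// Numerically the two literals are indistinguishable
display 123.1 == 123.10

clear
input str10 hhs_id_str
"123.1"
"123.10"
end

// Split the string at the decimal point into two numeric parts
gen long stem = real(substr(hhs_id_str, 1, strpos(hhs_id_str, ".") - 1))
gen long rest = real(substr(hhs_id_str, strpos(hhs_id_str, ".") + 1, .))

list
```

Once the identifier has been stored as a number, this distinction is lost and cannot be recovered from the numeric variable alone.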

