Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tab and unique id

    Since observations in the dataset are around 500 that's why I have attached an excel file. I have only added variables that are relevant to this question. There are 20 variables along with these in the main file.

    My question pertains to oneway tab. I'm trying to find descriptive statistics using these variable. In the dataset HHID is the household ID, S1-3-2 is the relationship (1: Single-head 2: Spouse, 3: Son/Daughter,,,10:Other) , S1-3-3 is the gender (1: Male, 2:Female).

    I want to know what percentage of single headed (S1-3-2) households were managed by males and how many by females (S1-3-3). Or percentage of sons or daughters across my entire dataset.

    My problem issue is when I run -bysort s133: tab s132-, the results that i get are inflated as variable hhid is not uniquely defined. I want my results to have information per household (N=116).
    I'm avoiding using -collapse-, because I want to retain information of some variables that are not added in this file.
    Note: after importing, variable names dropped -dash-. That's why i have written variable name differently.

    Is there a way I can use tab and unique at the same time?

    I hope my question is clear.

    Thank You,

    Ritika
    Attached Files

  • #2
    It would be easier to give you specific advice with an adequately presented data example. If you read the FAQ, you will see that attachments of spreadsheets are not a helpful way to share data. There are responders who do not use spreadsheet softwre in the first place. Among those that do, some, myself included, will not download one from a stranger because they can contain active malware. When you read the FAQ, please concentrate on #12, where you will learn about using the -dataex- command to show example data here. You can get it by running -ssc install dataex-. Then run -help dataex- to read the simple instructions for using it. Then use it to post again with an example of your data generated by -dataex-. I'm confident you will get a timely and helpful solution if yo do that.

    I should also point out that you need not be overly concerned about loss of information when you run -collapse- because you can -preserve- your data before you do it, and then -restore- the data after you are done with the collapsed version. This will add a little bit of time to your runs, because -preserve- writes to the hard drive and -restore- reads from it. But unless your data set is huge, the delay will not be great. Indeed, with only 116 households, I imagine the time difference will be barely noticeable.

    Comment


    • #3
      Alright. I will be more precise. Thank you.

      Ritika

      Comment


      • #4
        ----------------------- copy starting from the next line -----------------------
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str19 hhid str40 s11211 float buyers byte(s11212 s11213)
        "1075bd13ab5fe870-1"  "Trader"                     1 100 3
        "1075bd13ab5fe870-2"  "Cooperative"                2  95 2
        "1075bd13ab5fe870-2"  "Buyer"                      2   5 2
        "1075bd13ab5fe870-3"  "Cooperative"                2 100 3
        "1075bd13ab5fe870-3"  "Trader"                     2 100 3
        "1075bd13ab5fe870-4"  "Buyer"                      1 100 2
        "1075bd13ab5fe870-5"  "Cooperative"                2 100 3
        "1075bd13ab5fe870-5"  "Trader"                     2 100 3
        "1075bd13ab5fe870-6"  "VFC"                        2 100 3
        "1075bd13ab5fe870-6"  "Cooperative"                2 100 .
        "1075bd13ab5fe870-7"  "Cooperative"                3 100 3
        "1075bd13ab5fe870-7"  "Trader"                     3 100 3
        "1075bd13ab5fe870-7"  "Buyer"                      3  30 3
        "1075bd13ab5fe870-8"  "Cooperative"                2  90 3
        "1075bd13ab5fe870-8"  "Trader"                     2  10 3
        "1075bd13ab5fe870-9"  "VFC"                        2 100 3
        "1075bd13ab5fe870-9"  "Cooperative"                2 100 3
        "1075bd13ab5fe870-10" "Buyer"                      2 100 4
        "1075bd13ab5fe870-10" "Trader"                     2 100 1
        "1075bd13ab5fe870-11" "Trader"                     2  80 3
        "1075bd13ab5fe870-11" "Processing plant in town"   2 100 3
        "1075bd13ab5fe870-12" "Collection trader"          2 100 4
        "1075bd13ab5fe870-12" "Market"                     2 100 4
        "1075bd13ab5fe870-13" "Trader"                     2 100 4
        "1075bd13ab5fe870-13" "Market"                     2 100 4
        "1075bd13ab5fe870-14" "Cooperative"                2 100 2
        "1075bd13ab5fe870-14" "Trader"                     2 100 2
        "1075bd13ab5fe870-15" "Cooperative"                3 100 5
        "1075bd13ab5fe870-15" "Trader"                     3 100 5
        "1075bd13ab5fe870-15" "Market"                     3 100 5
        "1075bd13ab5fe870-16" "Trader"                     2  95 2
        "1075bd13ab5fe870-16" "SHG"                        2   5 2
        "1075bd13ab5fe870-17" "Trader"                     1 100 3
        "1075bd13ab5fe870-18" "Trader"                     1 100 3
        "1075bd13ab5fe870-19" "Buyer"                      2 100 4
        "1075bd13ab5fe870-19" "Trader"                     2 100 1
        "1075bd13ab5fe870-20" "Trader"                     1 100 3
        "1075bd13ab5fe870-21" "Cooperative"                2 100 3
        "1075bd13ab5fe870-21" "Buyer"                      2 100 4
        "1075bd13ab5fe870-22" ""                           1   . .
        "1075bd13ab5fe870-23" ""                           1   . .
        "1075bd13ab5fe870-24" "Trader"                     1 100 3
        "1075bd13ab5fe870-25" "Cooperative"                2 100 4
        "1075bd13ab5fe870-25" "Trader"                     2 100 4
        "1075bd13ab5fe870-26" "Cooperative"                2 100 3
        "1075bd13ab5fe870-26" "Trader"                     2 100 4
        "1075bd13ab5fe870-27" "Cooperative"                2 100 3
        "1075bd13ab5fe870-27" "Buyer"                      2 100 3
        "1075bd13ab5fe870-28" "Cooperative"                2 100 4
        "1075bd13ab5fe870-28" "Middlemen"                  2 100 4
        "13b393bcaf5a6704-2"  ""                           1   . .
        "13b393bcaf5a6704-3"  "Collection trader"          1 100 3
        "13b393bcaf5a6704-1"  "Collection trader"          3 100 3
        "13b393bcaf5a6704-1"  "Collection trader"          3 100 3
        "13b393bcaf5a6704-1"  "Collection trader"          3 100 3
        "13b393bcaf5a6704-4"  "Collection trader"          2 100 3
        "13b393bcaf5a6704-4"  "Collection trader"          2 100 3
        "13b393bcaf5a6704-5"  "Middlemen"                  4 100 3
        "13b393bcaf5a6704-5"  "Middlemen"                  4 100 3
        "13b393bcaf5a6704-5"  "Middlemen"                  4 100 3
        "13b393bcaf5a6704-5"  ""                           4   . .
        "13b393bcaf5a6704-6"  "VFC"                        1 100 3
        "13b393bcaf5a6704-10" "Collection trader"          1 100 3
        "13b393bcaf5a6704-11" "Collection trader"          1 100 3
        "13b393bcaf5a6704-12" "Collection trader"          2 100 3
        "13b393bcaf5a6704-12" ""                           2   . .
        "13b393bcaf5a6704-13" ""                           1   . .
        "13b393bcaf5a6704-14" "Collection trader"          3 100 3
        "13b393bcaf5a6704-14" "Collection trader"          3 100 3
        "13b393bcaf5a6704-14" "Collection trader"          3 100 3
        "13b393bcaf5a6704-15" ""                           1   . .
        "13b393bcaf5a6704-16" "VFC"                        2 100 3
        "13b393bcaf5a6704-16" "Middlemen"                  2 100 3
        "13b393bcaf5a6704-17" "Collection trader"          1 100 3
        "13b393bcaf5a6704-18" ""                           1   . .
        "13b393bcaf5a6704-19" ""                           1   . .
        "13b393bcaf5a6704-20" "Collection trader"          2 100 2
        "13b393bcaf5a6704-20" "Collection trader"          2 100 2
        "13b393bcaf5a6704-21" "Collection trader"          1 100 3
        "13b393bcaf5a6704-22" ""                           1   . .
        "13b393bcaf5a6704-23" "Collection trader"          1 100 3
        "13b393bcaf5a6704-24" ""                           1   . .
        "13b393bcaf5a6704-25" "Collection trader"          2 100 3
        "13b393bcaf5a6704-25" "Collection trader"          2 100 3
        "13b393bcaf5a6704-26" "Collection trader"          2 100 3
        "13b393bcaf5a6704-26" "Collection trader"          2 100 3
        "13b393bcaf5a6704-27" "Collection trader"          2 100 3
        "13b393bcaf5a6704-27" "Collection trader"          2 100 3
        "13b393bcaf5a6704-28" "Collection trader"          2 100 3
        "13b393bcaf5a6704-28" "Collection trader"          2 100 3
        "13b393bcaf5a6704-29" "Middlemen"                  2 100 3
        "13b393bcaf5a6704-29" "Middlemen"                  2 100 3
        "13b393bcaf5a6704-30" "Cooperative"                2 100 3
        "13b393bcaf5a6704-30" "Cooperative"                2 100 3
        "267324b030583819-1"  "Trader"                     2  70 3
        "267324b030583819-1"  "VFC"                        2  30 4
        "267324b030583819-2"  "Middlemen"                  1 100 3
        "267324b030583819-3"  "VFC"                        1 100 4
        "267324b030583819-4"  "Target crops are not sold." 1   . .
        "267324b030583819-5"  "Middlemen"                  1 100 2
        end
        ------------------ copy up to and including the previous line ------------------
        Note: these are only 100 observations of 158 observations generated by dataex.

        Variable names:
        hhid: household id
        s11211: Type of buyer
        buyers: no of buyers
        s11212: Current share of target crop
        s11213: Trend (1: Much less than before, 2: A bit less than before, 3: Unchanged, 4: A bit more then before, 5: A lot more than before, 6: Don't know)

        I would like to ask that same question again. This time I have pasted the dataset here.

        When I run tab command -tab s11211-, for example, percentage of Collection traders is inflated. That is because hhid is not unique. How can I calculate percentage of type of buyers without counting a single household more than once.

        I hope I could convey my question properly.

        Thank You,

        Ritika

        Comment


        • #5
          Thanks for the -dataex- example.

          I'm really having a hard time understanding what you want. You say you only want to count each hhid once, but each hhid in your example data only occurs once, so there is no way for it to be counted more than once!

          So then I thought, well if you look at the hhid values, they come in two parts, a long string, followed by a hyphen and what appears to be a sequence number (although it isn't quite that because sometimes it repeats in consecutive observations). So maybe when you say hhid, you really mean the part before the hyphen. But if that's the case, then you can't count each hhid once in any sensible way because all of the information in each hhid's observation is different, so I wouldn't have a clue which of the many observations to pick for inclusion in the table.

          So I'm going to take a stab in the dark here. I'm going to speculate that you do in fact consider only the pre-hyphen part of the hhid to be the household id. And I'm guessing that what you want is a table that gives a single row for each hhid, the columns being the percentage of each type of buyer for that household, where each row of the original data contributes a number of buyers given by the variable buyers. As far as I can tell, variables s11212 and s11213 are irrelevant here.

          Code:
          //    EXTRACT HOUSEHOLD FROM HHID
          gen hyphen_location = strpos(hhid, "-")
          gen household = substr(hhid, 1, hyphen_location-1)
          
          //    REDUCE TO ONE OBSERVATION PER HOUSEHOLD WITH NUMBERS OF BUYERS OF EACH TYPE
          collapse (sum) buyers, by(household s11211)
          
          //    CALCULATE PERCENTAGE OF BUYERS OF EACH TYPE
          by household, sort: egen total_buyers = total(buyers)
          gen percent_buyers = 100*buyers/total_buyers
          drop total_buyers
          
          //    DISPLAY A TABLE
          tabdisp household s11211, c(buyers percent_buyers) format(%2.1f)
          If you need your original data back after you create this table, do -preserve- before the beginning and -restore- after the end of this code.

          Comment


          • #6
            As I understand your question, you want to calculate the proportion of all households that have a certain type of buyer. For example, 3 of the 60 households or 5% have "Market" as buyer. If this interpretation is correct, you can get the numbers with the code below.

            You have to install fsum for the last command because fsum can show variable labels instead of variable names.
            Code:
            ssc install fsum
            I have a feeling that there is an easier way to do this but at least you get the result you are looking for with the code that follows.
            Code:
            * Generate separate variables for each type of buyer
            encode s11211, gen(buyer)
            levelsof(buyer), local(buyers)
            foreach b of local buyers {
              gen var`b' = (buyer == `b')
            }
            
            * Count total number of buyers per type of buyer
            collapse (sum) var1 - var10, by(hhid)
            
            * Replace all numbers > 1 by 1 so that for households with more
            * than one buyer of the same type the buyer is only counted once
            foreach var of varlist var1 - var10 {
              replace `var' = 1 if `var' > 1
            }
            
            * Label the variables with the type of buyer
            foreach b of local buyers {
              local varname : label buyer `b'
              lab var var`b' "`varname'"
            }
            
            * List proportion of households that have a certain buyer
            fsum, stats(mean) f(%9.3f) uselabel
            The output of fsum shows that 5% of households have a buyer of the type "Market" and 10% of household have a buyer of the type "VFC", for example.
            Code:
                               Variable |        N     Mean
            ----------------------------+------------------
                                  Buyer |       60    0.117
                      Collection trader |       60    0.267
                            Cooperative |       60    0.250
                                 Market |       60    0.050
                              Middlemen |       60    0.100
               Processing plant in town |       60    0.017
                                    SHG |       60    0.017
             Target crops are not sold. |       60    0.017
                                 Trader |       60    0.317
                                    VFC |       60    0.100
            Edit: I just saw Clyde's answer and I interpret the household ID differently. The variable hhid has 60 unique values and if I understand the data correctly, this means there are 100 observations from 60 households.
            Last edited by Friedrich Huebler; 12 Sep 2017, 19:39.

            Comment


            • #7
              Here is another way to do it building on Friedrich's excellent analysis. The main trick is to avoid double counting by tagging each cross-combination just once. The same trick is also one way to get the number of distinct (not unique) households.

              Code:
              * distinct households
              egen tag_hh = tag(hhid)
              count if tag_hh
              local n_hh = r(N)
              
               
              egen tag = tag(hhid s11211)
              bysort s11211 tag: gen freq = _N
              gen mean = string(freq/`n_hh', "%4.3f")  
              tabdisp s11211 if tag, c(freq  mean)  
              
              ---------------------------------------------------
                                  s11211 |       freq        mean
              ---------------------------+-----------------------
                                   Buyer |          7       0.117
                       Collection trader |         16       0.267
                             Cooperative |         15       0.250
                                  Market |          3       0.050
                               Middlemen |          6       0.100
                Processing plant in town |          1       0.017
                                     SHG |          1       0.017
              Target crops are not sold. |          1       0.017
                                  Trader |         19       0.317
                                     VFC |          6       0.100
              ---------------------------------------------------

              Comment


              • #8
                Thank you Clyde. I know they are slightly weird hhid. They all belong to distinct household. Those questionnaire tablet generated ids. But i appreciate you for answering my question.

                Thank you Friedrich and Nick for giving me a different perspective to solve this.

                Comment

                Working...
                X