Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate variable based on data in another data set

    Hello all,

    My question in a nutshell:

    How can I make a variable ‘farmer’ as shown in Data Set 2 with the information from Data Set 1? (Farmer being a variable that differentiates between non-apple farmers (like HH2), apple farmers (like HH1 and 4), and non-farmers (like HH3).

    (End goal is to compare income among non-apple farmers, apple farmers and other households in Data Set 2.)

    Sample Data:
    Data Set 1:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 A str7 B str6 C
    "hh_id" "product" "amt_kg"
    "1"     "apples"  "10"    
    "1"     "bananas" "30"    
    "1"     "pears"   "40"    
    "2"     "oranges" "50"    
    "2"     "grapes " "60"    
    "2"     "bananas" "70"    
    "4"     "grapes " "90"    
    "4"     "apples"  "100"   
    "4"     "pears"   "100"   
    end
    Data Set 2:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 A str6(B C)
    "hh_id" "income" "farmer"
    "1"     "1000"   "2"     
    "2"     "1500"   "0"     
    "3"     "2000"   "1"     
    "4"     "3000"   "2"     
    end

    More information:

    I have two data sets from a household survey. The first one contains only farming data, and lists products households produce with multiple rows for each household. (See Data Set 1 above)

    The second data set contains data from all households, including non-farming households and the income of each. How do I generate a variable that would say (as in the example above) if the household doesn’t farm=0, farms not apples=1 and farms apples=2.

    I’m not sure how to get this information from one data set to another to generate the variable 'farmer' without doing it by hand for the rest of the households.

    I think the answer might lie in generating a list through command 'levelsof' this is as much as I managed, but it didn't work: there's no error message but it doesn't replace any datapoints.

    Code:
    gen apple_farmer=1 if product==“apples”
    levelsof hh_id if apple_farmer==1, local(level)
    foreach level of local hh_id {
    replace apple_farmer=1 if hh_id=r(level)
    }
    Any help would be very appreciated!!

  • #2
    If, as you show in the data example, dataset 1 has multiple entries per hh_id and dataset 2 only has one entry per hh_id, just merge the two. You will need to do some preprocessing first to rename the variables using the values in the first row, or, if these data were imported from an Excel spreadsheet, reimport them using the option -firstrow-. Subsequently:

    Code:
    use dataset_1.dta, clear
    merge m:1 hh_id using dataset_2, keepusing(farmer)
    Last edited by Ali Atia; 12 Apr 2022, 12:03.

    Comment


    • #3
      The approach you are trying might ultimately work, but it is complicated and really amounts to hand-inputting the data through the code. It is also not scalable beyond a small number of cases.

      Stata has a very simple solution for this:
      Code:
      use dataset1, clear
      merge m:1 hh_id using dataset2, keep(master match)
      That said, however, your data sets are badly structured and you are going to have lots of difficulty working with them throughout your project. I'm inferring from the variable names A, B, and C that these data are imported from a spreadsheet. And the actually informative variable names are located in the first row of that spreadsheet. Well, you need to have those actually informative variable names used by Stata as its names for the variables, not A, B, and C. Not only will that make your code more understandable (so you won't have to constantly remember what A, B, or C means in which data set) but also it will enable you to have the numeric variables stored as numbers, rather than as strings. With strings you cannot do calculations. There are two ways you can fix this problem. One is to go back and reimport the data. The -import excel- command offers the -firstrow- option which will cause Stata to do that. Alternatively, you can fix the existing data sets with the following code:

      Code:
      use the_dataset, clear
      foreach v of varlist _all {
          rename `v' `=`v'[1]'
      }
      drop in 1
      destring _all, replace
      Thank you for using -dataex- on your very first post!

      Comment


      • #4
        I'm afraid I was unclear! I hand coded the variable farmer to illustrate what I was trying to accomplish. The rest of the data set doesn't have the farmer value. I was asking how to best generate this variable in dataset2 based on the data in dataset1. Any ideas for that?

        The merge command is already very helpful and I'll see what I can do with it! Thank you both!
        And thanks for the tips on the variable names, that was some sloppy reimporting on my part.

        Comment


        • #5
          Thanks for clarifying. In the dataset containing only farmers, you can generate the farmer variable like so:

          Code:
          bys hh_id: egen farmer = max(product=="apples")
          replace farmer = farmer+1
          Where 2 indicates an apple farmer and 1 indicates a non-apple farmer.

          Then, you can merge with your other dataset of all households, and replace your farmer variable based on non-matching hh_ids:

          Code:
          merge m:1 hhid using non_farmerdataset
          replace farmer = 0 if _merge==2
          Where 0 indicates non-farmer.

          Finally, to get to a dataset with a row for each hh_id, income, and farmer status:

          Code:
          collapse (first) farmer income, by(hh_id)
          Last edited by Ali Atia; 12 Apr 2022, 13:17.

          Comment


          • #6
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input byte hh_id str7 product byte amt_kg
            1 "apples"   10
            1 "bananas"  30
            1 "pears"    40
            2 "oranges"  50
            2 "grapes "  60
            2 "bananas"  70
            4 "grapes "  90
            4 "apples"  100
            4 "pears"   100
            end
            
            by hh_id, sort: egen farms_apples = max(product == "apples")
            replace farms_apples = farms_apples + 1
            by hh_id: keep if _n == 1
            keep hhid farms_apples
            label define farms_apples   1   "Farms, but not apples" 2   "Farms apples"
            label values farms_apples farms_apples
            will get you part of the way from dataset 1 to dataset 2. But the rest of the way is not possible with the information shown so far.

            Dataset 1 does not provide information about income. Presumably this could be calculated from the variable amt_kg if price information were available, but it is not.

            More important, it is not possible to create a variable for hh_id 3 in Dataset 2 because Dataset 1 doesn't even mention that person. I suppose your intent is that non-farmers are identified by virtue of their not appearing in dataset 1 at all. Fair enough. But then how do we know to create an observation for hh_id 3 and not also for hh_id 3333333 and hh_id 12345678901234567890, etc. There are infinitely many hh_id numbers that do not appear in the dataset 1 and no information about which of them are simply not valid hh_id numbers at all.

            Added: Crossed with #5.

            Comment


            • #7
              The code worked perfectly, Ali Atia, thank you so so much! Thank you both for your help, I really appreciate your time!

              Comment

              Working...
              X