  • Please Help with Code

    Hi All, I am trying to make a variable that identifies whether a person's second-to-last work history entry was employment (as opposed to being unemployed or abroad). The problem I am running into is that I do not have a traditional panel data set: each person's work history entries are identified by different start dates over a period of 30 years, and I want to identify each person's most recent and second most recent work history entries using the L. operator. So far, I have sorted my data by worker id (fwid) and work history entry start date, and I have tried to generate a variable that identifies each entry's number by generating a running sum of ones (called file_num) for each individual (after sorting the data). My idea was to use the xtset command with the worker's id as the panel variable and file_num as the time variable. Then I was going to identify the max value of file_num and use the L. operator to identify the second-to-last entry for each person. The weird thing is that when I run this exact same code from start to finish, my variable of interest (separated_to_employv2) winds up with different summary statistics every time, and I cannot figure out why. Any help you can provide would be greatly appreciated, as this issue is preventing me from replicating my regression results when I re-run the code.


    use "C:\Users\Zach\Dropbox\H-2A\Generated Data Files\NAWS Workgrid with Main File Merged 1989-2018.dta", clear
    rename *, lower
    encode c06, gen(work_type)
    gen abroad = work_type==1
    gen farm_work = work_type==2
    gen non_farm_work = work_type==3
    gen non_employed = work_type==4
    gen file_num = .
    gen ones = 1
    sort fwid start_date
    by fwid: replace file_num = sum(ones)
    xtset fwid file_num
    gen separated_to_employv2 = (l.farm_work==1 | l.non_farm_work==1) & l.end_date<start_date
    sum separated_to_employv2, d

    Here are the summary stats for the variable "separated_to_employv2" from two separate runs of the code from start to finish. Note the difference in the means.

    [Image: Stata Forum 1.PNG]
    [Image: Stata Forum 2.PNG]

  • #2
    You do not provide example data (summary statistics are nice, but insufficient) so it is hard to give concrete advice with certainty. But the likely problem is that you are misunderstanding your data.

    When you run -sort fwid start_date-, if fwid and start_date do not jointly identify unique observations in the data set, then the observations that share a given combination of fwid and start_date will be placed in random order relative to one another, and that order will not be reproducible from one run to the next. Because file_num is a running count in sort order, and the L. operator then follows file_num, anything computed from the lag can change between runs. My guess is that fwid and start_date simply are not unique identifiers. You can verify that with a simple command:
    Code:
    isid fwid start_date
    If you get no output from that, then they do uniquely identify observations and something else must be the problem. In that case, post back and use the -dataex- command to show example data. But as I think this is not likely, I will continue to comment here on the assumption that fwid and start_date do not jointly identify unique observations.

    I suspect, though, that -isid- will give you an error message saying that fwid and start_date do not uniquely identify the observations in the data. The presence of such duplicates is incompatible with your goal of identifying the penultimate observation for each worker. If the penultimate start_date for a given fwid has two or more different observations in the data set, there is no way to decide which of them to designate as the desired one.
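A minimal sketch of the tie problem described above, using made-up toy data (the values are hypothetical; only the variable names fwid, start_date, end_date and farm_work come from post #1). It counts the tied observations and shows the -stable- option of -sort-, which keeps tied observations in their existing order. Note that -stable- only makes the order repeatable for the same input order; it does not say which of the tied entries is the "right" one.

```stata
* Toy data (hypothetical): worker 1 has two entries starting on the same date
clear
input fwid start_date end_date farm_work
1 100 200 1
1 300 400 1
1 300 350 0
2 100 150 0
2 200 250 1
end

* Count how many observations share a fwid/start_date combination
duplicates report fwid start_date

* capture suppresses the error so a do-file keeps running;
* _rc is nonzero when the variables do not uniquely identify observations
capture isid fwid start_date
display "isid return code: " _rc

* A stable sort leaves tied observations in their current relative order
sort fwid start_date, stable
list, sepby(fwid)
```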

    The first thing you should do, in this case, is find the offending duplicate observations:

    Code:
    duplicates tag fwid start_date, gen(flag)
    browse if flag
    That will show them to you. Next you have to decide what is going on. There are three general cases:

    Case 1. The duplicates on fwid and start_date are actually complete duplicate observations agreeing on every variable in the data set. In that case, -duplicates drop- will eliminate them. However, before you rush to do that you have to ask yourself how those duplicates got there in the first place. It represents, at best, inefficient data management in the creation of that data set, or, more likely, coding errors along the way. So before you just purge the duplicates, you should do a thorough review of the way this data set came into existence, find the source of the duplicates and then fix that code and re-create the data set correctly. It is possible, even likely, that in the course of doing that you will uncover other programming errors, and this is a great opportunity to fix those as well before they bite you later.

    Case 2. The duplicates on fwid and start_date are not complete duplicates--they disagree on some other variables--but they are legitimate data and belong there. That is, the way the data were collected it is perfectly possible for the same person to have two employment experiences beginning on the same date and have those recorded as separate entries in your data. In that case, your goal of identifying the second to last experience is untenable. However, you might find some way, consistent with your research goals, to state a rule for working with these multiple entries. Perhaps you should prioritize among the economic sectors engaged, or perhaps choose the one of longest duration, or the one with the most work hours per week, or something like that. That is a substantive matter that you will have to decide on. If you are not comfortable with doing that, you will need to consult a colleague in your field--it is not a statistical question.

    Case 3. The duplicates on fwid and start_date are not complete duplicates and they aren't supposed to be there. Then you have the problem of chasing down which one is correct, and also chasing the data management (a la Case 1) to find out how the incorrect data got included.
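One way to start telling these cases apart can be sketched as follows, again on made-up toy data. The variable names dup_all, dup_key and duration are introduced here for illustration only, and the "keep the longest spell" rule is just one hypothetical Case 2 rule, not a recommendation; the right rule is a substantive decision.

```stata
* Toy data (hypothetical values)
clear
input fwid start_date end_date farm_work
1 100 200 1
1 300 400 1
1 300 400 1
2 100 150 0
2 200 260 1
2 200 230 0
end

* dup_all > 0: the observation is identical to another on every variable
duplicates tag, gen(dup_all)
* dup_key > 0: the observation shares fwid and start_date with another
duplicates tag fwid start_date, gen(dup_key)

* Case 1 candidates: every observation sharing the key is an exact copy
list if dup_key > 0 & dup_key == dup_all
* Case 2 or 3 candidates: same key, but some other variable differs
list if dup_key > dup_all

* Illustration of a Case 2 rule (an assumption): drop exact copies,
* then keep the longest spell within each fwid/start_date group
drop dup_all dup_key
duplicates drop
gen duration = end_date - start_date
bysort fwid start_date (duration): keep if _n == _N

* Errors out if the key is still not unique
isid fwid start_date
```

Ties on duration would still be broken arbitrarily under this rule, so a real rule would need a further tie-breaker.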

    On an optimistic note, let me assume that you manage to resolve these problems and come up with a data set for which you have fwid and start_date jointly identifying unique observations. Then there is a very simple way to identify the last and second-to-last experiences:

    Code:
    by fwid (start_date), sort: gen last_obs = (_n == _N)
    by fwid (start_date): gen second_to_last_obs = (_n == _N-1)
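Once fwid and start_date are unique, the variable from post #1 can also be built with explicit subscripts instead of -xtset- and the L. operator. This is a sketch on made-up toy data, assuming the same variable names as in post #1; it is meant to mirror the original -gen- statement, not to replace any substantive decision about the data.

```stata
* Toy data (hypothetical values); variable names follow post #1
clear
input fwid start_date end_date farm_work non_farm_work
1 100 200 1 0
1 250 300 0 0
1 350 400 0 1
2 100 150 0 0
2 150 250 1 0
end

* Stop here if the key is not unique; otherwise the order within worker is arbitrary
isid fwid start_date

* [_n-1] is the previous entry of the same worker; for a worker's first
* entry it evaluates to missing, so the expression is 0 there
by fwid (start_date), sort: gen separated_to_employv2 = ///
    (farm_work[_n-1] == 1 | non_farm_work[_n-1] == 1) & end_date[_n-1] < start_date

* Markers for the last and second-to-last entries, as in the code above
by fwid (start_date): gen last_obs = (_n == _N)
by fwid (start_date): gen second_to_last_obs = (_n == _N - 1)

list, sepby(fwid)
```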



    • #3
      Clyde, Thank you so much for your response. I am so grateful for you taking the time to respond. You are correct that the fwid and start_date variables do not uniquely identify the observations in my data set, so that appears to be the cause of my problem. From an initial inspection of the data (I'm working with farmworker survey data), it looks like some of these people have multiple entries because they were working on multiple crops at the same time, and the survey I am working with asks questions about the crop that they were working on. So my hunch is that they are getting multiple entries simply because the survey collects info about each of the different crops workers were working on (I can confirm this by checking with my contact at the DOL). Nevertheless, a preliminary analysis that drops the workers with duplicate start date entries produces results that are consistent when I run the code multiple times, so that appears to resolve my issue. I will have to think more deeply about how I want to classify the last and second-to-last entries for those workers moving forward. For now, I am just going to drop those observations (they only comprise about 3% of workers in my sample) to generate some preliminary results and then revisit that decision as I make more progress on the project. Thank you once again for your assistance. I am ever so grateful! Have a great weekend.
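The interim approach described in this post, dropping every worker who has at least one duplicated start date, could be sketched like this. The data are made-up toy values, and the names flag, any_dup and one_per_worker are introduced here for illustration.

```stata
* Toy data (hypothetical values): worker 1 has a duplicated start date
clear
input fwid start_date end_date
1 100 200
1 300 400
1 300 350
2 100 150
2 200 250
end

* flag > 0 marks entries that share fwid and start_date with another entry
duplicates tag fwid start_date, gen(flag)

* Mark every entry of a worker who has at least one flagged entry
bysort fwid: egen any_dup = max(flag > 0)

* Share of workers affected, counting each worker once
egen byte one_per_worker = tag(fwid)
tabulate any_dup if one_per_worker

* Drop the affected workers entirely, then confirm the key is unique
drop if any_dup
isid fwid start_date
```

Dropping whole workers, rather than only the tied entries, keeps each remaining work history intact, which matters when the variable of interest depends on the previous entry.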
