Monthly labor force status stored as a string variable

Youjin Choi

Join Date: Feb 2020

Posts: 1

23 Feb 2020, 08:16

Hello,

I am working with a data set that stores monthly labour force status over the survey reference period as a 24-digit string variable.

For simplicity, let’s say 1 means employed, 2 means unemployed, and 0 means the month is not covered by the survey.

For example, the data look like this:
pid dv
1001 111111111111222221111000
1002 222111111111111221111111
1003 111111112111111111111000
1004 111122221111111122221111
...

I’d like to clean this so that I can run a hazard analysis. That means deriving the initial state, the durations of employment and unemployment spells, the date each spell started, etc.

What would be the best approach or reference to start with?

Mike Lacy

Join Date: Apr 2014
Posts: 2416

23 Feb 2020, 08:34

Here's one approach:

Code:

forvalues i = 1/24 {
   gen byte molaborstat`i' = real(substr(dv,`i',1))
   label var molaborstat`i' "Labor force status in month `i'"
}
label define lflbl 0 "NA" 1 "employed" 2 "unemployed"
label values molaborstat* lflbl
// If you are going to do a discrete time analysis,
// the long format will be useful
reshape long molaborstat, i(pid) j(month)
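
For a discrete-time analysis, the long file usually needs a spell identifier, a count of months elapsed in the current spell, and an indicator for the month in which a spell ends. The following is only a sketch of one way to build those. The names newspell, spell, elapsed, and exit are made up for illustration, and it assumes that months coded 0 can simply be dropped and that a gap in coverage should start a new spell.

```stata
clear
input int pid str24 dv
1001 "111111111111222221111000"
1002 "222111111111111221111111"
1003 "111111112111111111111000"
1004 "111122221111111122221111"
end

// one byte variable per month, then long format (as in the code above)
forvalues i = 1/24 {
    generate byte molaborstat`i' = real(substr(dv, `i', 1))
}
reshape long molaborstat, i(pid) j(month)

// assumption: 0 means "not covered", so those person-months are dropped
drop if molaborstat == 0

// a new spell starts in a person's first month, when the status changes,
// or when the previous observed month is not the immediately preceding one
bysort pid (month): generate byte newspell = (_n == 1) ///
    | (molaborstat != molaborstat[_n-1]) | (month != month[_n-1] + 1)
by pid: generate int spell = sum(newspell)

// months elapsed so far within the current spell
bysort pid spell (month): generate int elapsed = _n

// exit = 1 in the last month of a spell that is followed by another spell
by pid spell: generate byte exit = (_n == _N)
// each person's last observed spell is right-censored, not an exit
bysort pid (month): replace exit = 0 if spell == spell[_N]

list if pid == 1001, sepby(spell) noobs
```

With this layout, exit can serve as the binary outcome and elapsed as the duration term in a discrete-time hazard model, fitted separately for employment and unemployment spells. Keep in mind that each person's first spell began at an unknown time before month 1.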


William Lisowski

Join Date: Dec 2014
Posts: 10150

23 Feb 2020, 08:42

Welcome to Statalist.

The following code may start you in a useful direction.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int pid str24 dv
1001 "111111111111222221111000"
1002 "222111111111111221111111"
1003 "111111112111111111111000"
1004 "111122221111111122221111"
end
// replace each digit with itself and a following space
generate dv2 = dv
replace dv2 = subinstr(dv2,"0","0 ",.)
replace dv2 = subinstr(dv2,"1","1 ",.)
replace dv2 = subinstr(dv2,"2","2 ",.)
// split it into status1...status 24
split dv2, generate(status) destring
drop dv dv2
// reshape it into one observation per month
reshape long status, i(pid) j(month)
list if pid==1001, clean noobs

Code:

. // replace each digit with itself and a following space
. generate dv2 = dv

. replace dv2 = subinstr(dv2,"0","0 ",.)
variable dv2 was str24 now str27
(2 real changes made)

. replace dv2 = subinstr(dv2,"1","1 ",.)
variable dv2 was str27 now str47
(4 real changes made)

. replace dv2 = subinstr(dv2,"2","2 ",.)
variable dv2 was str47 now str48
(4 real changes made)

. // split it into status1...status 24
. split dv2, generate(status) destring
variables born as string:
status1   status4   status7   status10  status13  status16  status19  status22
status2   status5   status8   status11  status14  status17  status20  status23
status3   status6   status9   status12  status15  status18  status21  status24
status1: all characters numeric; replaced as byte
status2: all characters numeric; replaced as byte
...
status24: all characters numeric; replaced as byte


. drop dv dv2

. // reshape it into one observation per month
. reshape long status, i(pid) j(month)
(note: j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        4   ->      96
Number of variables                  25   ->       3
j variable (24 values)                    ->   month
xij variables:
           status1 status2 ... status24   ->   status
-----------------------------------------------------------------------------

. list if pid==1001, clean noobs

     pid   month   status  
    1001       1        1  
    1001       2        1  
    1001       3        1  
    1001       4        1  
    1001       5        1  
    1001       6        1  
    1001       7        1  
    1001       8        1  
    1001       9        1  
    1001      10        1  
    1001      11        1  
    1001      12        1  
    1001      13        2  
    1001      14        2  
    1001      15        2  
    1001      16        2  
    1001      17        2  
    1001      18        1  
    1001      19        1  
    1001      20        1  
    1001      21        1  
    1001      22        0  
    1001      23        0  
    1001      24        0
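
From the person-month file, the quantities asked about in the original post (initial state, spell durations, spell start) can be obtained by collapsing to one observation per spell. This is a sketch under stated assumptions, not a definitive recipe: the names spell, censored, state, start, duration, and initstate are hypothetical, status 0 is treated as carrying no spell information, and start is a month index within the reference period rather than a calendar date.

```stata
clear
input int pid str24 dv
1001 "111111111111222221111000"
1002 "222111111111111221111111"
1003 "111111112111111111111000"
1004 "111122221111111122221111"
end

// long format, one observation per person-month
forvalues i = 1/24 {
    generate byte status`i' = real(substr(dv, `i', 1))
}
drop dv
reshape long status, i(pid) j(month)

// assumption: status 0 ("not covered") is dropped before building spells
drop if status == 0

// number the spells within person: a spell starts in the first month,
// at a change of status, or after a gap in the observed months
bysort pid (month): generate int spell = sum((_n == 1) ///
    | (status != status[_n-1]) | (month != month[_n-1] + 1))

// the last observed spell of each person is right-censored
by pid: generate byte censored = (spell == spell[_N])

// one observation per spell: state, first month, length in months
collapse (first) state = status (min) start = month ///
    (count) duration = month (max) censored, by(pid spell)

// initial state = state of the person's first spell
bysort pid (spell): generate byte initstate = state[1]

list, sepby(pid) noobs
```

The first spell of each person is left-censored as well, since the data do not say how long that state had already lasted before month 1, so its duration is a lower bound.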

Note that the construction of dv2 could be reduced from four very obvious commands to a single command using Stata's Unicode regular expression function ustrregexra(), at the cost of complete incomprehensibility to anyone not familiar with regular expressions. With a larger number of possible characters, though, it would avoid repetitive coding.

Code:

generate dv2 = ustrregexra(dv,"(.)","$1 ")
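
To make the regular expression a little less opaque: the pattern "(.)" matches any single character and captures it as group 1, and the replacement "$1 " writes that character back followed by a space. The short check below, run on the example data, confirms that this gives the same string as the subinstr() steps. The variable names dv3 and dv4 are only for illustration, and the stricter pattern "([012])" is an optional variant for data that should contain only those three codes.

```stata
clear
input int pid str24 dv
1001 "111111111111222221111000"
1002 "222111111111111221111111"
1003 "111111112111111111111000"
1004 "111122221111111122221111"
end

// regular expression version: every character is followed by a space
generate dv2 = ustrregexra(dv, "(.)", "$1 ")

// the subinstr() construction, nested into one expression for comparison
generate dv3 = subinstr(subinstr(subinstr(dv, "0", "0 ", .), "1", "1 ", .), "2", "2 ", .)
assert dv2 == dv3

// variant that only touches the codes 0, 1, and 2;
// any other character would be left without a following space
generate dv4 = ustrregexra(dv, "([012])", "$1 ")
assert dv4 == dv2

display dv2[1]
```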

Last edited by William Lisowski; 23 Feb 2020, 08:54.
