Randomly assign values of existing string variable to new variable

Philipp Schrauth

Join Date: Feb 2016

Posts: 31
#1

Randomly assign values of existing string variable to new variable

20 Apr 2016, 07:34

Hello,

I need to create a data file that resembles the structure of my original data file but does not reveal any true information about the original data.
I can shuffle and mask variables with numerical values (using e.g. the runiform() function), but I have problems with string variables.

There is, for example, one variable holding the country code, so observations have string values such as "DE", "AT", "US" and so on.
Now I want these values to be randomly reassigned to other observations in the dataset.
The ralpha() function from SSC does not help much in this case because it creates completely random values, which are not real country codes. So I want the real country codes randomly assigned across, for example, the business IDs.

There are also other string variables that contain a mixture of digits and other characters, like *21351 or *D009128571 ...

Another possibility would be to shift the observations down by 20, 30 or 80 lines, which I did for numerical variables with the help of lag operators as follows:

Code:

* Randomly shuffle observations
gen helpvar = runiform() if ID != ID[_n-1]
bysort ID: egen helpvar2 = max(helpvar)
sort helpvar2

* Duplicate the first 75 observations to the end of the dataset
gen count = _n
expand 2 if count <= 75, gen(exptag)

* tsset the dataset
gen nrr = _n
tsset nrr

* Use the lag operator to shift the observations of a certain varlist down by 75 lines in this case
foreach x of varlist var1 var2 var3 {
    gen hh_`x' = L75.`x'
    replace `x' = hh_`x'
}

* Drop the first 75 observations, which now do not contain any information in var1, var2 and var3
drop if exptag == 0 & count <= 75
drop exptag count nrr

But this only works for numerical variables.
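
The lag-operator route is tied to numeric variables, whereas explicit subscripting also accepts strings. A cyclic shift of a string variable could therefore be sketched as below. This is only a minimal sketch under assumptions: the variable name country, the toy data, and the shift of k = 2 rows are made up for illustration and do not come from the original data.

```stata
* Sketch: cyclic shift of a string variable by k rows via explicit subscripting
* (variable name, toy data, and k are illustrative assumptions)
clear
input str2 country
"DE"
"AT"
"US"
"FR"
"IT"
end

local k = 2
* Row i takes the value from row i-k; mod() wraps the first k rows around to the end
gen shifted = country[mod(_n - 1 - `k', _N) + 1]
list
```

Because the index wraps around, no observations have to be duplicated with expand and dropped again afterwards.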

Help would be much appreciated!
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3824
#2

20 Apr 2016, 07:56

About a month ago I wrote a program, churn, that destroys datasets. Maybe it helps. Download the files and put them somewhere along your adopath (probably in c:/ado/plus/c). Then type in Stata

Code:

discard
help churn

Best
Daniel
Attached Files

churn.ado (2.8 KB, 1 view)

churn.sthlp (2.9 KB, 1 view)

Last edited by daniel klein; 20 Apr 2016, 08:00.
Comment

Carole J. Wilson

Join Date: Jan 2015
Posts: 932

20 Apr 2016, 08:10

I haven't looked at Daniel's program, but if that doesn't solve your problem, this might.

Code:

encode old_code, gen(c_id)
gen u=runiform() 
bysort c_id: egen rmax=max(u)
sort rmax
egen new_id=group(rmax)
label val new_id c_id
decode new_id, gen(new_code)
drop c_id u rmax new_id

Example data:

Code:

clear
input str1 old_code
"A"
"A"
"A"
"B"
"B"
"B"
"C"
"C"
"C"
"D"
"D"
"D"
"E"
"E"
"E"
"F"
"F"
"F"
"G"
"G"
"G"
"H"
"H"
"H"
"I"
"I"
"I"
"J"
"J"
"J"
end

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1

Comment

Philipp Schrauth

Join Date: Feb 2016

Posts: 31
#4

20 Apr 2016, 08:15

Thanks to both of you.
Daniel's program worked fine and was exactly what I was looking for! Perfect. That saves a lot of time and lines of code.
Thanks a lot again.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3824
#5

20 Apr 2016, 08:29

Reading Carole's post, I am not sure whether churn actually works here. The code given preserves the "nested" structure, in the sense that each letter is replaced by exactly one other letter. The first three observations that had letter A will all have the same letter (e.g. B) afterwards. With churn, the first three observations might end up with three completely different letters.
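
One way to see this difference is to run Carole's steps on a small dataset and assert that the mapping between old and new codes is one-to-one. The following is a hedged sketch, not part of either posted solution: the toy data and the seed are assumptions, and double precision is requested only to make ties between groups less likely.

```stata
* Sketch: check that the encode/decode approach maps each old code to one new code
* (toy data and seed are illustrative assumptions)
clear
input str1 old_code
"A"
"A"
"B"
"B"
"C"
"C"
end
set seed 2016

* Carole's steps, with double precision to make tied random draws less likely
encode old_code, gen(c_id)
gen double u = runiform()
bysort c_id: egen double rmax = max(u)
egen new_id = group(rmax)
label values new_id c_id
decode new_id, gen(new_code)

* Within each old code, the new code is constant ...
bysort old_code (new_code): assert new_code[1] == new_code[_N]
* ... and each new code comes from a single old code
bysort new_code (old_code): assert old_code[1] == old_code[_N]
```

Both assertions would generally fail after a full shuffle of the observations, which is the behavior described for churn.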

What exactly you want depends on what exactly is meant by

resembl[ing] the structure of my original data-file

Best
Daniel
Comment
Philipp Schrauth

Join Date: Feb 2016

Posts: 31
#6

20 Apr 2016, 08:41

Hey. I was looking for code (or now a program) that completely shuffles these observations, so churn works better in this case. Carole's option might still be a solution for other things I have to do in a different context, though. So, thanks to both again!
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#7

20 Apr 2016, 08:42

Here's a short solution that permutes the country names, i.e., samples them without replacement:

Code:

putmata new = country
mata: _jumble(new)
getmata new
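
Since the code above permutes the values, each original value is used exactly once. If sampling with replacement were acceptable instead, a rough Stata-only sketch could look like this. The toy data, the seed, and the variable name new are assumptions for illustration only.

```stata
* Sketch: draw country codes WITH replacement (in contrast to a permutation)
* (toy data, seed, and variable names are illustrative assumptions)
clear
input str2 country
"DE"
"AT"
"US"
"FR"
"IT"
end
set seed 2016

* Each row copies the value of a uniformly chosen row;
* 1 + floor() keeps the index within 1.._N
gen new = country[1 + floor(_N * runiform())]
list
```

With this variant, the frequencies of the country codes are only preserved in expectation, not exactly, so the permutation is preferable when the marginal distribution must stay identical.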
Comment
