  • Randomly assign values of existing string variable to new variable

    Hello,

    I need to create a data file that resembles the structure of my original data file but does not reveal any true information about the original data.
    I have managed to shuffle and mask variables with numerical values (using e.g. the runiform() function), but I have problems with string variables.

    There is, for example, one variable that holds the country code, so that observations have string values such as "DE", "AT", "US" and so on.
    Now I want these values to be randomly reassigned to other observations in the dataset.
    The ralpha() function from SSC does not help much in this case because it creates completely random values, which are not real country codes. So I want the real country codes randomly assigned to, for example, any business ID.

    There are also other string variables, which contain a mixture of numbers and signs like *21351 or *D009128571 ...

    Another possibility would be to shift the observations down by 20, 30 or 80 lines, which I did for numerical variables with the help of lag operators as follows:

    Code:
    * Randomly shuffle observations (assumes the data are sorted by ID)
    gen helpvar = runiform() if ID != ID[_n-1]
    bysort ID: egen helpvar2 = max(helpvar)
    * Sort on the per-ID random number created above
    sort helpvar2
    
    * Duplicate the first 75 observations to the end of the dataset
    gen count = _n
    expand 2 if count <= 75, gen(exptag)
    
    * tsset the dataset
    gen nrr = _n
    tsset nrr
    
    * Use lag-Operator to shift down observations of a certain varlist by 75 lines in this case
    foreach x of varlist var1 var2 var3 {
     gen hh_`x' = L75.`x'
     replace `x' = hh_`x'
    }
    
    * Drop the first 75 observations, which now contain no information in var1, var2 and var3
    drop if exptag == 0 & count <=75
    drop exptag count nrr
    But this only works for numerical variables.

    Help would be much appreciated!
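
    The shift described above can also be sketched without tsset and lag operators, because explicit subscripting works for string variables as well. This is only a minimal sketch on toy data: the variable names country and shifted and the shift length k are assumptions for illustration, and the shift wraps around instead of duplicating and dropping observations.

    ```stata
    * Toy data (hypothetical): 10 observations with a string variable
    clear
    set obs 10
    gen str2 country = cond(mod(_n, 2), "DE", "AT")
    replace country = "US" in 10

    * Circular shift by k rows: row i takes the value from row i-k,
    * wrapping around to the end of the dataset for the first k rows
    local k = 3
    gen str2 shifted = country[mod(_n - 1 - `k', _N) + 1]
    list country shifted
    ```

    Note that a fixed shift is easy to undo for anyone who guesses k, so it masks less than a random permutation does.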

  • #2
    About a month ago I wrote a program, churn, that destroys datasets. Maybe it helps. Download the files and put them somewhere along your adopath (probably c:/ado/plus/c). Then type in Stata

    Code:
    discard
    help churn
    Best
    Daniel
    Attached Files
    Last edited by daniel klein; 20 Apr 2016, 08:00.



    • #3
      I haven't looked at Daniel's program, but if that doesn't solve your problem, this might.


      Code:
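      * Map each distinct string to an integer code with a value label
      encode old_code, gen(c_id)
      * Draw one random number per distinct code
      gen u=runiform()
      bysort c_id: egen rmax=max(u)
      sort rmax
      * Rank the codes by their random number
      egen new_id=group(rmax)
      * Reuse the original value label so each rank maps back to a real code
      label val new_id c_id
      decode new_id, gen(new_code)
      drop c_id u rmax new_id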
      Example data:
      Code:
      clear
      input str1 old_code
      "A"
      "A"
      "A"
      "B"
      "B"
      "B"
      "C"
      "C"
      "C"
      "D"
      "D"
      "D"
      "E"
      "E"
      "E"
      "F"
      "F"
      "F"
      "G"
      "G"
      "G"
      "H"
      "H"
      "H"
      "I"
      "I"
      "I"
      "J"
      "J"
      "J"
      end
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1



      • #4
        Thanks to both of you.
        Daniel's program worked fine and was exactly what I was looking for! Perfect. That saves a lot of time and lines of code.
        Thanks a lot again.



        • #5
          Reading Carole's post, I am not sure whether churn actually does what is wanted here. The code she gives preserves the "nested" structure, in the sense that each letter is replaced by exactly one other letter. The first three observations that had letter A will all have the same letter (e.g. B) afterwards. With churn, the first three letters might be completely different from each other.

          What exactly you want depends on what exactly is meant by

          resembl[ing] the structure of my original data-file
          Best
          Daniel
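
          Whether a masked variable keeps this nested structure can be checked directly. The following is a minimal sketch on toy data, applying the substitution code from #3 and then asserting that the new code is constant within each old code; the toy data set-up is an assumption for illustration.

          ```stata
          * Toy data (hypothetical): three observations per code "A", "B", "C"
          clear
          set obs 9
          gen str1 old_code = char(65 + floor((_n - 1)/3))

          * Group-wise substitution, following the code in #3
          encode old_code, gen(c_id)
          gen u = runiform()
          bysort c_id: egen rmax = max(u)
          egen new_id = group(rmax)
          label val new_id c_id
          decode new_id, gen(new_code)

          * Nesting is preserved if new_code is constant within old_code
          bysort old_code (new_code): assert new_code[1] == new_code[_N]
          tabulate old_code new_code
          ```

          After a full observation-level shuffle, the same assert would usually fail, which is the difference described above.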



          • #6
            Hey. I was looking for code (or, now, a program) that completely shuffles these observations, so churn works better in this case. Carole's option might still be a solution for other things I have to do in a different context. So, thanks to both again!



            • #7
              Here's a short solution that permutes the country names, i.e., samples them without replacement:

              Code:
              putmata new = country
              mata: _jumble(new)
              getmata new
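
              The same kind of permutation can be sketched in plain Stata, which also makes it easy to fix the seed for reproducibility. This is only an illustrative alternative, not a replacement for the Mata code above; the toy data, the seed value and the helper names obs, u and pick are assumptions.

              ```stata
              * Toy data (hypothetical): six country codes
              clear
              set obs 6
              gen str2 country = word("DE AT US FR IT ES", _n)

              set seed 20160420            // arbitrary seed, only for reproducibility

              gen long obs = _n            // remember the original order
              gen double u = runiform()
              sort u
              gen long pick = _n           // position in the random order
              sort obs                     // restore the original order
              gen str2 new = country[pick] // pick is a random permutation of 1.._N
              list country new
              ```

              Because pick contains every row number exactly once, each country code appears in new as often as it does in country, i.e. the values are sampled without replacement.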


