  • SMOTE in Stata

    Dear Statalist,

    I am currently training a random forest algorithm with a heavily imbalanced dataset. I have two groups, where the group of interest makes up about 10 percent of the sample.
    In order to deal with this, I wish to use a Synthetic Minority Over-sampling Technique (SMOTE). I have found applications of this in Weka and R. However, I would rather work solely out of Stata.

    Therefore, my question is: is there any way to import a SMOTE routine into Stata?

    Johan Karlsson
    Unfortunately, -search smote- does not return any entries.
    Kind regards,
    (StataNow 18.5)


      An update for those who venture down a similar path:

      I have since solved this issue by writing a SMOTE algorithm for Stata.

      Below is a rudimentary example of how to apply SMOTE in Stata, using Stata's "auto" dataset:
      clear all
      set more off

      sysuse auto
      gen id=_n

      rename price xvar1
      rename mpg xvar2
      rename rep78 xvar3

      keep id xvar*   /* "group" removed from this line: auto.dta has no such variable */

      /* Placeholder for scarce outcome - when applied to real data, just replace
      the "minority group" entries with the outcome that you seek to balance. */

      gen minority_group=0
      replace minority_group=1 if _n<=5 /* To be removed when applied to real data */

      gen byte i = 1

      /* Saved under a fixed file name because the loops below reload "orig.dta" by name */
      save "orig.dta", replace

      // Define how many times you need to draw synthetic samples (upper loop limit).
      // Typically you need fewer than 20. I have specified 50 here for convenience.

      forvalues i=1/50 {
      use orig.dta, clear
      keep if minority_group==1

      *----- all pairwise -----

      rename (id xvar*) =0

      joinby i using "orig.dta"
      drop i

      *----- compute Euclidean distance -----

      gen eucld = ((xvar10 - xvar1)^2 + (xvar20 - xvar2)^2 + (xvar30 - xvar3)^2) ^ (1/2)

      sort id eucld

      *-------- select k nearest neighbors (current=10)---------
      bysort id0 (eucld): gen nearest = _n <= 10
      keep if nearest!=0

      *-------- select random nearest neighbor ---------
      gen randomizer=runiform(0,1)
      bysort id0: egen max_randomizer=max(randomizer)   /* one random neighbor per minority observation */
      keep if randomizer==max_randomizer

      *-------- create synthetic values --------------
      * Synthetic value = var_id + (var_id0 - var_id)*gap
      * Where gap = runiform(0,1)

      forvalues j=1/3 {
      gen synth_xvar`j'=xvar`j'+(xvar`j'0-xvar`j')*randomizer
      }
      gen synth_id=1
      foreach var of varlist xvar1-xvar3 {
      replace `var'=synth_`var'
      drop synth_`var'
      }
      keep xvar1 xvar2 xvar3 synth_id
      gen minority_group=1

      save synthetic_`i'.dta, replace
      }   /* end of the loop over synthetic draws */

      /* Find how many synthetic files must be appended for the minority share to reach roughly 50% */
      use "orig.dta", clear
      forvalues i=1/50 {
      append using synthetic_`i'.dta
      qui sum minority_group
      qui gen reporter=r(mean)
      if reporter>=0.49 & reporter<=0.51 {
      disp in red "Append " `i' " files for balanced SMOTE"
      global SMOTES=`i'
      }
      drop reporter
      }

      /* Rebuild the balanced dataset from the number of files stored in SMOTES and report the shares */
      use "orig.dta", clear
      forvalues i=1/$SMOTES {
      qui append using synthetic_`i'.dta

      if `i'==$SMOTES {
      qui sum minority_group
      qui gen reporter=int(r(mean)*100)

      qui sum minority_group if synth_id==.
      qui gen pre_reporter=int(r(mean)*100)

      disp in red "Minority group was " pre_reporter "% of sample"
      disp in red "Minority group is now " reporter "% of sample"
      qui drop reporter pre_reporter
      }
      }

        Dear Johan,
        I tried to apply your algorithm, but (apart from an error in line 11, since no 'group' variable exists, which is readily solvable),
        I get an 'invalid syntax' error for the last group of command lines. Are you sure we can use the $SMOTES macro as a numerical index?
        Or do you have any idea on other possible reasons for this error?

        Best regards,


          Dear Francesca,

          Apologies for the late response. Sometimes Stata gets cranky about using a global macro to run a loop. If you still want to use the reporting function,
          you can essentially just replace the last few lines with:

          use "orig.dta", clear
          forvalues i=1/NFILES {   /* replace NFILES with the number of SMOTEs that you have run */
          qui append using synthetic_`i'.dta
          qui sum minority_group   /* -count- returns only r(N), so r(mean) has to come from -summarize- */
          scalar reporter=r(mean)
          local reporter=reporter

          disp "Minority group is now: " `reporter'
          }

          ------ End of code
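
          On the question of whether a global can serve as the numerical index at all: Stata substitutes the contents of a macro before it executes the line, so a global that holds a number should work as the upper limit of -forvalues-. One possible cause of the 'invalid syntax' message (an assumption, not something confirmed in this thread) is that no number of appended files ever put the minority share inside the 0.49 to 0.51 window, so $SMOTES was never set and the range was left as "1/". A minimal sketch with a guard against that case follows; the global name NFILES and the value 3 are made up for illustration.

          ```stata
          * Hypothetical global for illustration; in the code above it would be SMOTES
          global NFILES = 3

          * Guard: stop with a readable message if the global was never filled in
          if "$NFILES" == "" {
              display as error "global NFILES is empty; no upper loop limit available"
              exit 198
          }

          * The macro is expanded first, so Stata sees: forvalues i = 1/3 {
          forvalues i = 1/$NFILES {
              display "would append synthetic_`i'.dta"
          }
          ```

          With the global set as above, the loop runs three times; with the -global- line removed in a fresh session, the guard stops the do-file before -forvalues- is reached.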

          I usually like to add an if-clause that stops the process once the dataset is balanced, like:

          if reporter>=0.499 & reporter<=0.501 {
          continue, break
          }

          Hope this helps.
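
          For readers who want the stopping rule as one piece, here is a minimal sketch that avoids the global altogether. It assumes that orig.dta and synthetic_1.dta through synthetic_50.dta from the earlier code exist in the working directory. Note that it swaps the 0.499 to 0.501 window for a one-sided threshold of 0.5, so that the loop cannot step over the target; the local name n_used is made up for illustration.

          ```stata
          * Assumes orig.dta and synthetic_1.dta ... synthetic_50.dta already exist
          use "orig.dta", clear
          local n_used = 0

          forvalues i = 1/50 {
              quietly append using synthetic_`i'.dta
              local n_used = `i'
              quietly summarize minority_group
              * Stop as soon as the minority share reaches one half
              if r(mean) >= 0.5 {
                  continue, break
              }
          }

          * Report how many files were needed and the resulting share
          display "Appended `n_used' synthetic files"
          quietly summarize minority_group
          display "Minority share is now " %5.3f r(mean)
          ```

          If all 50 files are appended and the share is still below 0.5, n_used ends at 50 and the reported share shows how far off the balance is.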

