  • SMOTE in Stata

    Dear Statalist,

    I am currently training a random forest on a heavily imbalanced dataset. I have two groups, and the group of interest makes up about 10 percent of the sample.
    To deal with this, I would like to use the Synthetic Minority Over-sampling Technique (SMOTE). I have found implementations in Weka and R, but I would rather work solely in Stata.

    My question is therefore: is there any way to import a SMOTE routine into Stata?

    Sincerely,
    Johan Karlsson
    Last edited by Johan Karlsson; 04 Feb 2020, 03:33.

  • #2
    Johan:
    Unfortunately, -search smote- does not return any entries.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      An update for those who venture down a similar path:

      I have since solved this issue by writing a SMOTE algorithm for Stata.

      Below is a rudimentary example of how to apply SMOTE in Stata, using Stata's "Auto" dataset:
      clear all
      set more off

      sysuse auto
      gen id=_n

      rename price xvar1
      rename mpg xvar2
      rename rep78 xvar3

      keep id group xvar*

      /* Placeholder for scarce outcome - when applied to real data, just replace
      the "minority group" entries with the outcome that you seek to balance. */

      gen minority_group=0
      replace minority_group=1 if _n<=5 /* To be removed when applied to real data */

      gen byte i = 1

      tempfile orig
      save "orig.dta", replace

      // Define how many times you need to draw synthetic samples (upper loop limit).
      // Typically you need fewer than 20. I have specified 50 here for convenience.

      forvalues i=1/50 {
      use orig.dta, clear
      keep if minority_group==1

      *----- all pairwise -----

      rename (id xvar*) =0

      joinby i using "orig.dta"
      drop i

      *----- compute Euclidean distance -----

      gen eucld = ((xvar10 - xvar1)^2 + (xvar20 - xvar2)^2 + (xvar30 - xvar3)^2) ^ (1/2)

      sort id eucld

      *-------- select k nearest neighbors (current=10)---------
      bysort id0 (eucld): gen nearest = _n <= 10
      keep if nearest!=0

      *-------- select random nearest neighbor ---------
      * Keep one randomly chosen neighbor per minority observation (id0)
      gen randomizer=runiform(0,1)
      bysort id0: egen max_randomizer=max(randomizer)
      keep if randomizer==max_randomizer

      *-------- create synthetic values --------------
      * Synthetic value = var_id + (var_id0 - var_id)*gap
      * Where gap = runiform(0,1), drawn separately from the neighbor selection
      *------------------------------------------------

      gen gap=runiform(0,1)
      forvalues j=1/3 {
      gen synth_xvar`j'=xvar`j'+(xvar`j'0-xvar`j')*gap
      }
      gen synth_id=1
      foreach var of varlist xvar1-xvar3 {
      replace `var'=synth_`var'
      drop synth_`var'
      }
      keep xvar1 xvar2 xvar3 synth_id
      gen minority_group=1

      save synthetic_`i'.dta, replace
      }

      use "orig.dta", clear
      forvalues i=1/50 {
      append using synthetic_`i'.dta
      qui sum minority_group
      qui gen reporter=r(mean)
      if reporter>=0.49 & reporter<=0.51 {
      disp in red "Append " `i' " files for balanced SMOTE"
      global SMOTES=`i'
      }
      drop reporter
      }

      use "orig.dta", clear
      forvalues i=1/$SMOTES {
      qui append using synthetic_`i'.dta

      if `i'==$SMOTES {
      qui sum minority_group
      qui gen reporter=int(r(mean)*100)

      qui sum minority_group if synth_id==.
      qui gen pre_reporter=int(r(mean)*100)

      disp in red "Minority group was " pre_reporter "% of sample"
      disp in red "Minority group is now " reporter "% of sample"
      qui drop reporter pre_reporter
      }
      }
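
      The final loop uses $SMOTES as its upper bound, and that global is only set inside the if block of the search loop before it. A minimal guard, assuming you want the do-file to stop with a readable message when the macro was never set (the message text and the return code 198 are illustrative choices, not part of the original code):

      ```stata
      * Sketch: place this between the search loop and the final append loop.
      * SMOTES is the global set by the search loop; it stays empty if no
      * append produced a minority share between 0.49 and 0.51.
      if "$SMOTES" == "" {
          display as error "SMOTES was never set: no append reached a 49-51% minority share"
          exit 198
      }
      display "Appending $SMOTES synthetic files"
      ```

      Clearing the macro with global SMOTES "" before the search loop avoids picking up a value left over from an earlier run.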

      Last edited by Johan Karlsson; 21 Feb 2020, 08:34.



      • #4
        Dear Johan,
        I tried to apply your algorithm, but (apart from an error in line 11, since no 'group' variable exists, which is readily solvable),
        I get an 'invalid syntax' error for the last group of command lines. Are you sure we can use the $SMOTES macro as a numerical index?
        Or do you have any idea of other possible reasons for this error?

        Best regards,
        Francesca



        • #5
          Originally posted by Francesca Ghinami
          Dear Johan,
          I tried to apply your algorithm, but (apart from an error in line 11, since no 'group' variable exists, which is readily solvable),
          I get an 'invalid syntax' error for the last group of command lines. Are you sure we can use the $SMOTES macro as a numerical index?
          Or do you have any idea of other possible reasons for this error?

          Best regards,
          Francesca

          Dear Francesca,

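          Apologies for the late response. A global macro can be used as a loop bound, so the most likely cause is that $SMOTES was never set: it is only defined when an append brings the minority share to between 0.49 and 0.51. If no append lands in that window, the macro stays empty, forvalues i=1/$SMOTES has no upper limit, and Stata reports 'invalid syntax'. If you still want to use the reporting function,
          you can replace the last few lines with: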

          use "orig.dta", clear
          forvalues i=1/[Number of SMOTES that you have run] {
          qui append using synthetic_`i'.dta
          * The mean of the 0/1 indicator is the minority share
          qui sum minority_group
          scalar reporter=r(mean)
          local reporter=reporter

          disp "Minority group is now: " `reporter'
          }

          ------ End of code

          I usually like to add an if-clause inside the loop, right after the scalar is set, that stops the process once the dataset is balanced, like:

          if reporter>=0.499 & reporter<=0.501 {
          continue, break
          }
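
          A minimal sketch of the append loop with the stopping rule built in, assuming the synthetic_`i'.dta files from the earlier code already exist in the working directory. The locals max_files and used are hypothetical names introduced here, and the rule stops at the first append that reaches a 50% share rather than testing a narrow window:

          ```stata
          * Sketch: append synthetic files until the minority share reaches 50%.
          * max_files is a hypothetical cap; set it to the number of files you created.
          local max_files = 50
          local used = 0
          use "orig.dta", clear
          forvalues i = 1/`max_files' {
              * Stop if the next synthetic file does not exist
              capture confirm file "synthetic_`i'.dta"
              if _rc {
                  continue, break
              }
              quietly append using "synthetic_`i'.dta"
              local used = `i'
              * The mean of the 0/1 indicator is the minority share
              quietly summarize minority_group
              if r(mean) >= 0.5 {
                  continue, break
              }
          }
          quietly summarize minority_group
          display "Appended `used' files; minority share is now " %5.3f r(mean)
          ```

          Using a threshold instead of a window means the loop cannot skip past the stopping condition when a single append adds many rows, at the cost of possibly overshooting 50% slightly.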

          Hope this helps.
          Johan
