SMOTE in Stata

Johan Karlsson

Join Date: Jan 2020

Posts: 25
#1

04 Feb 2020, 02:27

Dear Statalist,

I am currently training a random forest on a heavily imbalanced dataset. I have two groups, and the group of interest makes up about 10 percent of the sample.
To deal with this, I would like to use the Synthetic Minority Over-sampling Technique (SMOTE). I have found implementations of it in Weka and R, but I would rather work solely in Stata.

My question is therefore: is there any way to import a SMOTE routine into Stata?

Sincerely,
Johan Karlsson

Last edited by Johan Karlsson; 04 Feb 2020, 02:33.
Tags: imbalanced data, machine learning, Oversampling, SMOTE
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

04 Feb 2020, 03:20

Johan:
Unfortunately, -search smote- does not return any entries.

Kind regards,
Carlo
(Stata 19.0)

Johan Karlsson

Join Date: Jan 2020
Posts: 25

21 Feb 2020, 07:32

An update for those who venture down a similar path:

I have since solved this issue by writing a SMOTE algorithm for Stata.

Below is a rudimentary example of how to apply SMOTE in Stata, using Stata's "auto" dataset:

clear all
set more off

sysuse auto
gen id=_n

rename price xvar1
rename mpg xvar2
rename rep78 xvar3

keep id xvar* // the original post also listed "group", a variable that auto.dta does not contain

/* Placeholder for scarce outcome - when applied to real data, just replace
the "minority group" entries with the outcome that you seek to balance. */

gen minority_group=0
replace minority_group=1 if _n<=5 /* To be removed when applied to real data */

gen byte i = 1

* Saved to the working directory so that every loop iteration can reload it
save "orig.dta", replace

// Define how many batches of synthetic samples to draw (upper loop limit).
// Typically you need fewer than 20. I have specified 50 here for convenience.

forvalues i=1/50 {
use orig.dta, clear
keep if minority_group==1

*----- all pairwise -----

rename (id xvar*) =0

joinby i using "orig.dta"
drop i

*----- compute Euclidean distance -----

gen eucld = sqrt((xvar10 - xvar1)^2 + (xvar20 - xvar2)^2 + (xvar30 - xvar3)^2)

sort id eucld

*-------- select k nearest neighbors (current=10)---------
bysort id0 (eucld): gen nearest = _n <= 10
keep if nearest!=0

*-------- select random nearest neighbor ---------
* One neighbor is drawn per minority observation (id0)
gen randomizer=runiform(0,1)
bysort id0: egen max_randomizer=max(randomizer)
keep if randomizer==max_randomizer

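*-------- create synthetic values --------------
* Synthetic value = var_id + (var_id0 - var_id)*gap
* Where gap = runiform(0,1), drawn separately from the neighbor selection
* (the selected randomizer is a maximum and therefore skewed toward 1)
*------------------------------------------------

gen gap=runiform(0,1)
forvalues j=1/3 {
    gen synth_xvar`j'=xvar`j'+(xvar`j'0-xvar`j')*gap
}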
gen synth_id=1
foreach var of varlist xvar1-xvar3 {
replace `var'=synth_`var'
drop synth_`var'
}
keep xvar1 xvar2 xvar3 synth_id
gen minority_group=1

save synthetic_`i'.dta, replace
}

use "orig.dta", clear
forvalues i=1/50 {
    append using synthetic_`i'.dta
    qui sum minority_group
    * Record the iteration at which the minority share is roughly 50%
    if r(mean)>=0.49 & r(mean)<=0.51 {
        disp in red "Append " `i' " files for balanced SMOTE"
        global SMOTES=`i'
    }
}

use "orig.dta", clear
forvalues i=1/$SMOTES {
    qui append using synthetic_`i'.dta
}

* Minority share after and before adding the synthetic observations
qui sum minority_group
local post=int(r(mean)*100)

qui sum minority_group if synth_id==.
local pre=int(r(mean)*100)

disp in red "Minority group was `pre'% of sample"
disp in red "Minority group is now `post'% of sample"
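
Editor's note: the routine above searches for neighbors among all observations, including the minority observation itself. SMOTE as usually described draws neighbors from the minority class only. The following is a minimal sketch of a single pass under that assumption, not the author's routine. The variable names (nid, nprice, nmpg, gap), the choice of k = 3, and the use of the first five cars as a stand-in minority group are illustrative assumptions.

```stata
* Sketch: one SMOTE pass with neighbors restricted to the minority class.
* Assumption: the first 5 cars stand in for the minority group.
set seed 12345
sysuse auto, clear
gen long id = _n
keep if _n <= 5
keep id price mpg

* Candidate neighbors: a copy of the minority rows under new names
preserve
rename (id price mpg) (nid nprice nmpg)
tempfile nbrs
save `nbrs'
restore

* All minority-by-minority pairs, excluding each point paired with itself
cross using `nbrs'
drop if id == nid

* Euclidean distance, then keep the k = 3 nearest neighbors per observation
gen double dist = sqrt((price - nprice)^2 + (mpg - nmpg)^2)
bysort id (dist): keep if _n <= 3

* Pick one of the k neighbors at random for each minority observation
gen double u = runiform()
bysort id (u): keep if _n == 1

* Interpolate between the observation and its neighbor with a fresh gap
gen double gap = runiform()
gen double synth_price = price + gap * (nprice - price)
gen double synth_mpg   = mpg   + gap * (nmpg - mpg)

list id nid gap synth_price synth_mpg
```

Each synthetic row lies on the line segment between a minority observation and one of its minority neighbors. Because price and mpg are on very different scales, price dominates the distance here; standardizing the variables before computing distances is a common precaution.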

Last edited by Johan Karlsson; 21 Feb 2020, 07:34.


Francesca Ghinami

Join Date: Jan 2022

Posts: 1
#4

20 Jan 2022, 06:06

Dear Johan,
I tried to apply your algorithm, but (apart from an error in line 11, since no 'group' variable exists, which is readily solvable)
I get an 'invalid syntax' error for the last group of command lines. Are you sure we can use the $SMOTES macro as a numerical index?
Or do you have any idea of other possible reasons for this error?

Best regards,
Francesca
Johan Karlsson

Join Date: Jan 2020

Posts: 25
#5

03 Feb 2022, 05:58

Dear Francesca,

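Apologies for the late response. A global macro can be used as a loop limit, so a likely cause is that $SMOTES was never defined: it is only set when the minority share lands between 0.49 and 0.51 after an append, and if no iteration hits that window, -forvalues i=1/$SMOTES- expands to -forvalues i=1/- and fails with 'invalid syntax'. If you still want to use the reporting function,
you can replace the last few lines with: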

use "orig.dta", clear
forvalues i=1/[Number of SMOTES that you have run] {
    qui append using synthetic_`i'.dta
    * Share of the minority group after appending file i
    qui sum minority_group
    local reporter=r(mean)

    disp "Minority group is now: " `reporter'
}

------ End of code

I usually like to add an if-clause inside the loop that stops the process once the dataset is balanced, like:

if `reporter'>=0.499 & `reporter'<=0.501 {
    continue, break
}

Hope this helps.
Johan
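
Editor's note: the stopping rule above can be sketched as a self-contained loop that exits as soon as the minority share reaches 50%, using -continue, break- to leave -forvalues-. This is an illustration under stated assumptions rather than the author's code: the toy data (85 majority rows, 10 minority rows, identical batches of 10 synthetic minority rows) and the local macro name used are hypothetical.

```stata
* Sketch: append synthetic batches until the minority share reaches 50%.
* Hypothetical toy data: one reusable batch of 10 synthetic minority rows.
clear
set obs 10
gen byte minority_group = 1
gen byte synth_id = 1
tempfile batch
save `batch'

* Hypothetical original data: 10 minority rows out of 95
clear
set obs 95
gen byte minority_group = _n <= 10

local used = 0
forvalues i = 1/50 {
    append using `batch'
    local used = `i'
    * Exact share from integer counts, avoiding a narrow tolerance window
    quietly count if minority_group == 1
    if r(N)/_N >= 0.5 {
        continue, break
    }
}

quietly count if minority_group == 1
display "Batches appended: `used'; minority share: " %5.3f r(N)/_N
```

Testing for a share of at least 0.5 rather than a narrow band such as 0.499 to 0.501 means the loop cannot step over the target when a batch adds many rows at once. With these toy numbers the loop stops after eight batches, at 90 minority rows out of 175.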
