Dear Statalists,
This may be a stupid question, as I am not very experienced with k-nearest neighbor algorithms and their implementation in Stata. However, I could not find a satisfactory answer in the help file, on Statalist, or on the web in general.
From my internet research, I understand that the k-nearest neighbor algorithm is a (simple) machine learning tool that takes in information from a learning sample to predict an outcome variable.
For example, in Stata's help file on discrim knn (which, as I understand it, implements the k-nearest neighbor algorithm in Stata), I worked through example 3. There, you have data on a number of mushroom species: whether they are poisonous or edible, as well as other characteristics (color, shape, etc.). Based on the latter information, discrim knn is able to generate an out-of-sample prediction for other mushrooms, namely whether they are edible or not. This is all very well.
What I do not understand is how I would need to use the command if the variable I would like to predict is continuous rather than binary. I understand from the first sentences on Wikipedia that this is generally a suitable task for a k-nearest neighbor algorithm. Is it possible to implement this in Stata, too, and if so, how?
Specifically, the example gives the following code:
Code:
use http://www.stata-press.com/data/r13/mushroom, clear
tab habitat poison
set seed 12345678
gen u = runiform()
sort u
xi, noomit: discrim knn i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor in 1/2000, k(15) group(poison) measure(dice)
Here, i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor is the set of variables from which the learning process takes its information, and poison is the variable to be predicted.
If I would like to predict a continuous variable based on the same set of variables, I do not understand how I would need to specify the code. Say I wanted to predict u (I know this is not meaningful, as I just generated u randomly). What would I type?
Code:
xi, noomit: discrim knn i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor in 1/50, k(15) group(u) measure(dice)
This cannot be meaningful, as u is not a group variable. (Indeed, if one runs the code, the sample in 1/50 is split into 50 groups, which cannot be useful.) How else is this done with a continuous variable?
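To make the goal concrete: as far as I understand, with a continuous outcome the k-nearest neighbor prediction is typically the average of the outcome over the k closest observations in the learning sample. The following is only a hand-rolled sketch of that idea in Mata, not a built-in Stata command and not something taken from the discrim knn documentation. Everything specific in it is an assumption made for illustration: it uses auto.dta instead of the mushroom data, price as the continuous outcome, weight and length as predictors, Euclidean distance on standardized predictors, k = 5, and a hypothetical new variable price_knn.

```stata
* Hypothetical hand-rolled k-NN regression sketch; not a built-in Stata command.
* Assumptions: auto.dta stands in for the data, price is the continuous outcome,
* weight and length are the predictors, distance is Euclidean on standardized
* predictors, and k = 5. None of this comes from the discrim knn example.
sysuse auto, clear
generate double price_knn = .

mata:
k = 5
X = st_data(., ("weight", "length"))
y = st_data(., "price")
// put the predictors on a common scale so neither dominates the distance
X = (X :- mean(X)) :/ sqrt(diagonal(variance(X)))'
n = rows(X)
yhat = J(n, 1, .)
for (i = 1; i <= n; i++) {
    d = rowsum((X :- X[i, .]) :^ 2)   // squared distance to observation i
    d[i] = .                          // missing sorts last, so i is never its own neighbor
    idx = order(d, 1)                 // row indices from nearest to farthest
    yhat[i] = mean(y[idx[|1 \ k|]])   // average outcome of the k nearest neighbors
}
st_store(., "price_knn", yhat)
end

summarize price price_knn
```

Each observation here is predicted from the other observations only (leave-one-out), which is one possible design choice. For categorical predictors such as those in the mushroom example, the Euclidean distance would presumably have to be replaced by a similarity measure on indicator variables, such as the Dice measure used above.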
Apologies in advance if I have misused terms. I am new to the topic and still learning. Please point out any errors.
Any insights are much appreciated.
Best,
Milan