  • Predicting a continuous variable with discrim knn

    Dear Statalists,

    this may be a stupid question as I am not too experienced with k-nearest neighbor algorithms and their implementation in stata. However, I could not find any satisfactory answer - be it in the help file, statalist or on the world wide web in general.

    From my internet research, I understand that the k-nearest neighbor algorithm is a (simple) machine learning tool that takes in information from a learning sample to predict an outcome variable.

    For example, in Stata's help file on discrim knn (which, as I understand it, implements the k-nearest neighbor algorithm in Stata), I worked through example 3. There, you have data on a number of mushroom species, whether they are poisonous or edible, as well as data on other characteristics (color, shape etc.). Based on the latter information, discrim knn is able to generate an out-of-sample prediction for other mushrooms - whether they are edible or not. This is all very well.

    Now, what I do not understand is how I would need to use the command if the variable I would like to predict is continuous rather than binary. I understand from the first sentences on Wikipedia that this is generally an adequate task for a k-nearest neighbor algorithm. Is it possible to implement this with Stata, too, and if so, how?

    Specifically, the example gives the following code:

    Code:
    use http://www.stata-press.com/data/r13/mushroom, clear
    
    tab habitat poison
    
    set seed 12345678
    
    gen u = runiform()
    sort u
    xi, noomit: discrim knn i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor in 1/2000, k(15) group(poison) measure(dice)
    where i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor is the set of variables the learning process takes the information from and poison is the variable that is to be predicted.

    Now, if I would like to predict a continuous variable based on the same set of variables, I do not understand how I would need to specify the code. Say, I wanted to predict u (I know this is not meaningful as I just generated u randomly), what would I type?

    Code:
    xi, noomit: discrim knn i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor in 1/50, k(15) group(u) measure(dice)
    cannot be meaningful as u is not a group variable. (And indeed, if one runs the code, the sample in 1/50 is split into 50 groups, which cannot be useful.) How else is this done with a continuous variable?

    Apologies in advance if I have misused terms. I am new to the topic and still learning. Please point out any errors.
    Any insights are much appreciated.

    Best,
    Milan

  • #2
    You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, Stata output, and sample data using dataex. Also try to cut down what you give us to what is essential to generate your problem.

    xi is no longer needed in Stata 14. You can just drop it.

    The short answer is that you can't use a continuous variable in group - it takes every different value as a separate group.
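To see why a continuous variable fails as a grouping variable, a minimal sketch is to count its distinct values. The dataset below is simulated purely for illustration (50 observations, mirroring the in 1/50 range in the question); it is not the mushroom data.

```stata
* Simulated illustration only: 50 observations with a uniform random draw
clear
set obs 50
set seed 12345678
gen u = runiform()

* tabulate stores the number of distinct values (rows) in r(r)
quietly tabulate u
display "observations: " r(N) "   distinct values: " r(r)
```

With a continuous draw, the two numbers will almost always coincide, so group(u) would define roughly one group per observation.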

    In the documentation, most Stata estimation commands are followed by a postestimation section. If you want predictions, then you probably need to use predict. But you can't just predict anything - the predictions deal with the classifications examined by discrim.
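A hedged sketch of what that looks like for the mushroom example from the original post: fit on the first 2,000 shuffled observations, then ask predict for the predicted class. The variable name cpoison is chosen here for illustration, and the xi prefix is kept exactly as in the question.

```stata
* Same setup as in the original post
use http://www.stata-press.com/data/r13/mushroom, clear
set seed 12345678
gen u = runiform()
sort u

* Fit the classifier on the first 2,000 observations only
xi, noomit: discrim knn i.population i.habitat i.bruises i.capshape i.capsurface i.capcolor in 1/2000, k(15) group(poison) measure(dice)

* Predicted group for all observations, not just the estimation sample
predict cpoison, classification

* Compare actual and predicted class on the held-out observations
tabulate poison cpoison in 2001/l
```

The prediction is always one of the values of the group() variable, which is why this route cannot return a continuous quantity.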



    • #3
      Thank you for your answer. I don't quite understand your advice on the FAQ. I provided an example that everyone can directly read into his or her stata using the above code.

      The problem I have is straightforward (although the solution seems more difficult): Can you use discrim knn for continuous variables like it is used for binary variables in the example? If so how?



      • #4
        -discrim- is for categorical outcomes; the "purpose" of discriminant analysis is classification. You might be able to force its use for a continuous variable, but you should not do so.



        • #5
          Okay, I see. Thank you very much for pointing this out. If I understand correctly, then discrim knn is used to implement the k nearest neighbor algorithm for classification. Do you know any other command that implements the k nearest neighbor for regression purposes?

          I apologize for all these questions. I am just not very experienced with the implementation in stata. Are there any such tools at all?
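For reference, k-nearest-neighbor regression predicts each observation's outcome as the average outcome of its k closest neighbors. The following is a hand-rolled sketch in Mata, not a built-in Stata command. It assumes the shipped auto dataset as a stand-in, with price as the continuous outcome, mpg and weight as predictors, k = 5, squared Euclidean distance, and leave-one-out prediction; all of these choices are illustrative assumptions.

```stata
* Illustrative stand-in data: predict price from mpg and weight
sysuse auto, clear
local k 5

* Standardize predictors so that no single variable dominates the distance
foreach v of varlist mpg weight {
    egen double z_`v' = std(`v')
}

* Variable that will hold the KNN regression prediction
gen double price_knn = .

mata:
X = st_data(., ("z_mpg", "z_weight"))
y = st_data(., "price")
k = strtoreal(st_local("k"))
n = rows(X)
yhat = J(n, 1, .)
for (i = 1; i <= n; i++) {
    d = rowsum((X :- X[i, .]):^2)      // squared Euclidean distance to obs i
    d[i] = .                           // drop obs i itself; missing sorts last
    idx = order(d, 1)                  // indices by ascending distance
    yhat[i] = mean(y[idx[|1 \ k|]])    // average outcome of the k nearest
}
st_store(., "price_knn", yhat)
end

* Compare the actual outcome with the leave-one-out prediction
summarize price price_knn
```

The loop is quadratic in the number of observations, so it is only a sketch for small datasets. For categorical predictors like those in the mushroom data, the distance would need to be replaced by a suitable dissimilarity measure.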



          • #6
            What's the goal? It's not clear what you want to predict. As pointed out, K-nearest neighbors will classify, non-parametrically, each observation into a discrete group. The groups are the categories of the variable you pass to group(); what you choose a priori is k, the number of neighbors consulted, and that choice is judgment-based. I read through the example, and I am pretty sure that KNN discriminant analysis can take continuous and categorical independent variables.

            It's also not clear what you mean by "outcome" variable. I would not say that KNN produces a binary outcome, unless you told it to classify the dataset into 2 groups. I would call it more of a categorical classification.

            When you then say you are looking for a continuous version of this, the first thing I think of is actually factor analysis. There, you have continuous observed variables, and you believe that they load on one or more continuous latent factors (i.e. you can't see these variables directly, but you can predict the level of those factors using SEM or exploratory factor analysis, or you can use something more like an item response theory model if your observed indicators are binary or ordinal). Does this sound more like what you want to do?

            Side note: the advice that you don't need the xi: prefix in Stata 14 is usually correct. However, I am pretty sure the example file is saying you must use xi: in this particular case, as the noomit option makes it create an indicator variable for every category, with no reference group omitted.

            Also, FWIW, when you say "machine learning", I tend to think more data science methods. From what I can tell, on Statalist, we tend to be much more in the realm of traditional statistical and econometric methods. If what you are looking for is more in the realm of data science, you might be better off posting elsewhere, maybe on an R forum.
            Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

            When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



            • #7
              Thank you for your answer. I will look into SEM and exploratory factor analysis to see if they do the job. Maybe discrim knn is just not the right tool, after all.
