  • Logit prediction in other dataset

    Greetings community,

    I am training a model to identify behavioural changes (captured by a dummy variable) in one dataset. I then want to apply it to another dataset to predict which of its observations will show the same behaviour, classifying them based on the information from the training data.

    For this, before using other prediction methods, I'm using a simple logit right now and Stata's -predict- command.
    I made sure that variable names and labels match each other and that all categories in the dataset where I want to make the predictions are also present in my training data set.
    I also verified that there are no missing values in the dataset where I want to predict.

    However, when I run the predictions, they are calculated for only a bit more than half of the observations.
    Codewise, I'm not using the same dataset with different samples, but two different frames.
    An example of the data where the predictions are to be made:
    Code:
    clear
    input float(preisBenzinBerSQ nettoEink_katMOP hhTypMOP) double(raumTyp anzPers) float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ jahr) double monat
    1.2682927 3 4 77 3  50.33333 3 3 0 0 2  5 2017 7
    1.2682927 5 2 74 2      65.5 4 3 0 1 3  2 2017 7
    1.2820513 8 1 77 2      50.5 5 2 0 1 3  3 2017 3
    1.2770138 8 3 73 3  40.66667 5 3 0 0 2  3 2017 4
    1.2682927 8 1 76 2      50.5 3 1 0 0 2  5 2017 7
    1.2820513 8 3 77 4        31 3 3 0 1 4  3 2017 3
    1.2770138 5 2 77 1        34 4 3 0 0 1 95 2017 4
    1.2820513 4 2 73 2        62 4 3 0 0 3  3 2017 3
    1.2682927 5 2 73 1         . 5 3 0 1 3  1 2017 7
    1.2820513 5 1 71 2        61 2 2 1 0 2  2 2017 3
    1.2820513 8 1 72 1        56 5 1 0 0 1 95 2017 3
    1.2820513 7 2 71 2      66.5 5 3 1 0 4  3 2017 3
    1.2820513 5 2 77 2        73 5 3 0 0 3  3 2017 3
    1.2820513 7 2 74 2        78 3 3 0 0 4  1 2017 3
    1.2820513 6 2 72 2        85 2 3 0 0 2  1 2017 3
    1.2820513 8 4 72 4      38.5 4 3 0 0 2  3 2017 3
    1.2820513 4 3 72 2        29 3 3 0 1 3  1 2017 3
    1.2820513 5 2 76 2      63.5 2 3 0 1 3  3 2017 3
    1.2820513 8 3 72 4      20.5 2 1 0 1 3  1 2017 3
    1.2820513 8 3 73 4        25 4 3 1 0 2  3 2017 3
    1.2820513 7 2 75 2        63 4 3 0 0 2  1 2017 3
    1.2820513 5 2 74 2      72.5 3 3 0 0 1  3 2017 3
    end
    The dataset which the model is trained on:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float preisBenzinBerSQ byte(nettoEink_katMOP hhTypMOP) float raumTyp byte anzPers float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ) int jahr float monat
    1.3171036 8 1 71 1 62 2 2 1 1 4  5 2000 10
    1.3924794 4 2 73 2 65 1 3 1 1 4  5 2004 10
     1.574468 2 2 75 1 66 3 3 1 1 3  5 2013 10
     1.574468 5 2 72 1 92 3 3 1 1 3  5 2013 10
    1.4930333 4 2 72 2 76 3 3 1 1 3  5 2010  9
      1.41895 4 2 73 2 67 1 3 1 1 4  5 2006  9
    1.3555777 6 3 71 3 41 5 2 1 0 3  5 2015  9
      1.41895 3 1 71 2 22 1 2 1 1 4  5 2006  9
      1.29802 6 2 71 1 66 5 3 0 0 3  5 2016  9
    1.5109574 2 1 73 1 33 2 1 0 1 4  5 2005 10
    1.1887255 8 2 71 1 50 3 3 1 1 4  1 2001 10
    1.1887255 8 2 71 1 53 2 3 1 1 4  1 2001 10
    1.1719902 8 3 71 3 42 1 1 1 1 4  1 2001 11
    1.1887255 8 2 71 2 65 3 3 1 1 4  3 2001 10
      1.38242 3 1 71 2 36 1 2 0 0 4  5 2006 10
    1.1689872 8 2 71 1 63 2 3 1 0 4  1 2001 11
    1.1887255 8 3 75 4 32 3 3 1 1 4  1 2001 10
    end
    The relevant part of the code:
    Code:
    clear all
    use "$data_where_to_predict", clear
    
    * change to other frame to make predictions
    cap frame create VMwahl
    frame change VMwahl
        
     use "$training_data", clear
      
     global listCovariates = "preisBenzinBerSQ i.nettoEink_katMOP i.hhTypMOP i.raumTyp anzPers Alter i.bildung i.beruf haltstrSQ haltzugSQ i.quali_nvSQ i.haltbusmSQ i.jahr i.monat"
    
            
    logit switch $listCovariates if jahr < 2020, cluster(id)
    
    frame change default   // change back to frame where to make the predictions
    
    
    cap drop probaSQ
        
    predict probaSQ


    Does someone have an idea what could be the problem here?

    Thanks a lot and have a nice day,

    Guest
    Last edited by sladmin; 25 May 2022, 13:08. Reason: anonymize original poster

  • #2
    global listCovariates = "preisBenzinBerSQ i.nettoEink_katMOP i.hhTypMOP i.raumTyp anzPers Alter i.bildung i.beruf haltstrSQ haltzugSQ i.quali_nvSQ i.haltbusmSQ i.jahr i.monat"

    logit switch $listCovariates if jahr < 2020, cluster(id)

    There are no variables "switch" and "id" in your example datasets, so your problem is not reproducible. You have too many variables for the number of observations that you present. Cut out unnecessary variables and present example data that replicates the issue.
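
    One way to narrow this down before posting a smaller example is to check which covariates are missing for the observations that received no prediction. The following is only a sketch under assumptions: it assumes missing covariate values are the cause (the prediction-data listing above does show one missing value of Alter), it is meant to be run in the frame holding the prediction data after -predict-, and nMissCov is a hypothetical variable name introduced here.

    ```stata
    * Plain variable names (no i. prefixes) so they can be passed as a varlist
    local covars preisBenzinBerSQ nettoEink_katMOP hhTypMOP raumTyp anzPers ///
        Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ jahr monat

    * How many observations received no prediction?
    count if missing(probaSQ)

    * Which covariates contain missing values, and how many?
    misstable summarize `covars'

    * Number of missing covariates per observation
    * (nMissCov is a hypothetical name, not part of the original code)
    egen nMissCov = rowmiss(`covars')
    tabulate nMissCov if missing(probaSQ)
    ```

    If nMissCov is 0 for observations with a missing probaSQ, missing covariates are not the explanation and something else is going on, which a reproducible example would help to pin down.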
