Greetings community,
I am training a model to identify behavioural changes (characterized by a dummy variable) in one dataset. I then want to apply it to another dataset to predict which observations there will show the same behaviour (by classifying them based on the information from the training data).
For this, before using other prediction methods, I'm using a simple logit right now and Stata's -predict- command.
I made sure that variable names and labels match each other and that all categories in the dataset where I want to make the predictions are also present in my training data set.
I also verified that there are no missing values in the dataset where I want to predict.
However, when I run the predictions, they are calculated for only a bit more than half of the observations.
Codewise, I'm not using the same dataset with different samples, but two different frames.
An example of the data where the predictions are to be made:
Code:
clear
input float(preisBenzinBerSQ nettoEink_katMOP hhTypMOP) double(raumTyp anzPers) float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ jahr) double monat
1.2682927 3 4 77 3 50.33333 3 3 0 0 2 5 2017 7
1.2682927 5 2 74 2 65.5 4 3 0 1 3 2 2017 7
1.2820513 8 1 77 2 50.5 5 2 0 1 3 3 2017 3
1.2770138 8 3 73 3 40.66667 5 3 0 0 2 3 2017 4
1.2682927 8 1 76 2 50.5 3 1 0 0 2 5 2017 7
1.2820513 8 3 77 4 31 3 3 0 1 4 3 2017 3
1.2770138 5 2 77 1 34 4 3 0 0 1 95 2017 4
1.2820513 4 2 73 2 62 4 3 0 0 3 3 2017 3
1.2682927 5 2 73 1 . 5 3 0 1 3 1 2017 7
1.2820513 5 1 71 2 61 2 2 1 0 2 2 2017 3
1.2820513 8 1 72 1 56 5 1 0 0 1 95 2017 3
1.2820513 7 2 71 2 66.5 5 3 1 0 4 3 2017 3
1.2820513 5 2 77 2 73 5 3 0 0 3 3 2017 3
1.2820513 7 2 74 2 78 3 3 0 0 4 1 2017 3
1.2820513 6 2 72 2 85 2 3 0 0 2 1 2017 3
1.2820513 8 4 72 4 38.5 4 3 0 0 2 3 2017 3
1.2820513 4 3 72 2 29 3 3 0 1 3 1 2017 3
1.2820513 5 2 76 2 63.5 2 3 0 1 3 3 2017 3
1.2820513 8 3 72 4 20.5 2 1 0 1 3 1 2017 3
1.2820513 8 3 73 4 25 4 3 1 0 2 3 2017 3
1.2820513 7 2 75 2 63 4 3 0 0 2 1 2017 3
1.2820513 5 2 74 2 72.5 3 3 0 0 1 3 2017 3
end
The dataset which the model is trained on:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float preisBenzinBerSQ byte(nettoEink_katMOP hhTypMOP) float raumTyp byte anzPers float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ) int jahr float monat
1.3171036 8 1 71 1 62 2 2 1 1 4 5 2000 10
1.3924794 4 2 73 2 65 1 3 1 1 4 5 2004 10
1.574468 2 2 75 1 66 3 3 1 1 3 5 2013 10
1.574468 5 2 72 1 92 3 3 1 1 3 5 2013 10
1.4930333 4 2 72 2 76 3 3 1 1 3 5 2010 9
1.41895 4 2 73 2 67 1 3 1 1 4 5 2006 9
1.3555777 6 3 71 3 41 5 2 1 0 3 5 2015 9
1.41895 3 1 71 2 22 1 2 1 1 4 5 2006 9
1.29802 6 2 71 1 66 5 3 0 0 3 5 2016 9
1.5109574 2 1 73 1 33 2 1 0 1 4 5 2005 10
1.1887255 8 2 71 1 50 3 3 1 1 4 1 2001 10
1.1887255 8 2 71 1 53 2 3 1 1 4 1 2001 10
1.1719902 8 3 71 3 42 1 1 1 1 4 1 2001 11
1.1887255 8 2 71 2 65 3 3 1 1 4 3 2001 10
1.38242 3 1 71 2 36 1 2 0 0 4 5 2006 10
1.1689872 8 2 71 1 63 2 3 1 0 4 1 2001 11
1.1887255 8 3 75 4 32 3 3 1 1 4 1 2001 10
end
The relevant part of the code:
Code:
clear all
use "$data_where_to_predict", clear
* change to other frame to make predictions
cap frame create VMwahl
frame change VMwahl
use "$training_data", clear
global listCovariates = "preisBenzinBerSQ i.nettoEink_katMOP i.hhTypMOP i.raumTyp anzPers Alter i.bildung i.beruf haltstrSQ haltzugSQ i.quali_nvSQ i.haltbusmSQ i.jahr i.monat"
logit switch $listCovariates if jahr < 2020, cluster(id)
frame change default // change back to frame where to make the predictions
cap drop probaSQ
predict probaSQ
Does someone have an idea what could be the problem here?
Thanks a lot and have a nice day,
Guest