Logit prediction in other dataset

Guest

23 Mar 2022, 09:29

Greetings community,

I am training a model to identify behavioural changes (captured by a dummy variable) in one dataset. I then want to apply that model to a second dataset to predict which of its observations will show the same behaviour, classifying them based on the information from the training data.

As a first step, before trying other prediction methods, I am using a simple logit and Stata's -predict- command.
I made sure that variable names and labels match across the two datasets and that every category in the dataset where I want to make the predictions is also present in my training dataset.
I also verified that there are no missing values in the dataset where I want to predict.
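
Both checks can be scripted rather than done by eye. The following is a minimal sketch, assuming the prediction data are in the frame default and the training data in the frame VMwahl, as in the code further down; nmissCov, covars, predLevels, trainLevels and onlyPred are hypothetical names chosen for illustration. Because -predict- returns a missing value whenever any covariate in the model is missing, a single missing covariate is enough to lose an observation. In the -dataex- excerpt of the prediction data, for instance, one row has a missing value for Alter.

```stata
* Sketch only: frame names follow the post; the new variable and macro
* names are hypothetical.
local covars preisBenzinBerSQ nettoEink_katMOP hhTypMOP raumTyp anzPers ///
    Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ jahr monat

frame change default

* 1. Missing values: report covariates with missings, then count per row
misstable summarize `covars'
egen byte nmissCov = rowmiss(`covars')
tabulate nmissCov

* 2. Factor levels: for each categorical covariate, list levels that occur
*    in the prediction data but not among training rows with jahr < 2020
*    (an approximation of the estimation sample)
foreach v in nettoEink_katMOP hhTypMOP raumTyp bildung beruf quali_nvSQ haltbusmSQ jahr monat {
    quietly levelsof `v'
    local predLevels `r(levels)'
    frame VMwahl: quietly levelsof `v' if jahr < 2020
    local trainLevels `r(levels)'
    local onlyPred : list predLevels - trainLevels
    display "`v': levels only in prediction data: `onlyPred'"
}
```

An empty list after the colon means that every level of that variable in the prediction data also occurs in the training rows.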

However, when I run the predictions, they are calculated for only a bit more than half of the observations.
Code-wise, I am not using one dataset with different samples, but two different frames.
An example of the data where the predictions are to be made:

Code:

clear
input float(preisBenzinBerSQ nettoEink_katMOP hhTypMOP) double(raumTyp anzPers) float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ jahr) double monat
1.2682927 3 4 77 3  50.33333 3 3 0 0 2  5 2017 7
1.2682927 5 2 74 2      65.5 4 3 0 1 3  2 2017 7
1.2820513 8 1 77 2      50.5 5 2 0 1 3  3 2017 3
1.2770138 8 3 73 3  40.66667 5 3 0 0 2  3 2017 4
1.2682927 8 1 76 2      50.5 3 1 0 0 2  5 2017 7
1.2820513 8 3 77 4        31 3 3 0 1 4  3 2017 3
1.2770138 5 2 77 1        34 4 3 0 0 1 95 2017 4
1.2820513 4 2 73 2        62 4 3 0 0 3  3 2017 3
1.2682927 5 2 73 1         . 5 3 0 1 3  1 2017 7
1.2820513 5 1 71 2        61 2 2 1 0 2  2 2017 3
1.2820513 8 1 72 1        56 5 1 0 0 1 95 2017 3
1.2820513 7 2 71 2      66.5 5 3 1 0 4  3 2017 3
1.2820513 5 2 77 2        73 5 3 0 0 3  3 2017 3
1.2820513 7 2 74 2        78 3 3 0 0 4  1 2017 3
1.2820513 6 2 72 2        85 2 3 0 0 2  1 2017 3
1.2820513 8 4 72 4      38.5 4 3 0 0 2  3 2017 3
1.2820513 4 3 72 2        29 3 3 0 1 3  1 2017 3
1.2820513 5 2 76 2      63.5 2 3 0 1 3  3 2017 3
1.2820513 8 3 72 4      20.5 2 1 0 1 3  1 2017 3
1.2820513 8 3 73 4        25 4 3 1 0 2  3 2017 3
1.2820513 7 2 75 2        63 4 3 0 0 2  1 2017 3
1.2820513 5 2 74 2      72.5 3 3 0 0 1  3 2017 3
end

The dataset which the model is trained on:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float preisBenzinBerSQ byte(nettoEink_katMOP hhTypMOP) float raumTyp byte anzPers float(Alter bildung beruf haltstrSQ haltzugSQ quali_nvSQ haltbusmSQ) int jahr float monat
1.3171036 8 1 71 1 62 2 2 1 1 4  5 2000 10
1.3924794 4 2 73 2 65 1 3 1 1 4  5 2004 10
 1.574468 2 2 75 1 66 3 3 1 1 3  5 2013 10
 1.574468 5 2 72 1 92 3 3 1 1 3  5 2013 10
1.4930333 4 2 72 2 76 3 3 1 1 3  5 2010  9
  1.41895 4 2 73 2 67 1 3 1 1 4  5 2006  9
1.3555777 6 3 71 3 41 5 2 1 0 3  5 2015  9
  1.41895 3 1 71 2 22 1 2 1 1 4  5 2006  9
  1.29802 6 2 71 1 66 5 3 0 0 3  5 2016  9
1.5109574 2 1 73 1 33 2 1 0 1 4  5 2005 10
1.1887255 8 2 71 1 50 3 3 1 1 4  1 2001 10
1.1887255 8 2 71 1 53 2 3 1 1 4  1 2001 10
1.1719902 8 3 71 3 42 1 1 1 1 4  1 2001 11
1.1887255 8 2 71 2 65 3 3 1 1 4  3 2001 10
  1.38242 3 1 71 2 36 1 2 0 0 4  5 2006 10
1.1689872 8 2 71 1 63 2 3 1 0 4  1 2001 11
1.1887255 8 3 75 4 32 3 3 1 1 4  1 2001 10
end

The relevant part of the code:

Code:

clear all
use "$data_where_to_predict", clear

* switch to a second frame that holds the training data
cap frame create VMwahl
frame change VMwahl

use "$training_data", clear

global listCovariates = "preisBenzinBerSQ i.nettoEink_katMOP i.hhTypMOP i.raumTyp anzPers Alter i.bildung i.beruf haltstrSQ haltzugSQ i.quali_nvSQ i.haltbusmSQ i.jahr i.monat"

* fit the logit on the training data
logit switch $listCovariates if jahr < 2020, cluster(id)

frame change default   // change back to the frame where the predictions are made

cap drop probaSQ

* predicted probability based on the logit fit above
predict probaSQ

Does someone have an idea what could be the problem here?
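
One way to narrow this down is to flag the observations that received no prediction and compare them with the rest. A minimal sketch, assuming it is run in the prediction frame right after -predict-; noPred is a hypothetical variable name, and the covariates inspected are arbitrary picks from the model.

```stata
* Sketch only: noPred is a hypothetical helper variable
generate byte noPred = missing(probaSQ)
tabulate noPred

* Cross-tabulate categorical covariates against the flag; a level whose
* observations all fall under noPred == 1 is a candidate explanation
tabulate haltbusmSQ noPred, missing
tabulate jahr noPred, missing

* For continuous covariates, check whether missing values line up with the flag
count if missing(Alter) & noPred == 1
```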

Thanks a lot and have a nice day,

Guest

Last edited by sladmin; 25 May 2022, 13:08. Reason: anonymize original poster

Andrew Musau

#2

23 Mar 2022, 14:16

global listCovariates = "preisBenzinBerSQ i.nettoEink_katMOP i.hhTypMOP i.raumTyp anzPers Alter i.bildung i.beruf haltstrSQ haltzugSQ i.quali_nvSQ i.haltbusmSQ i.jahr i.monat"

logit switch $listCovariates if jahr < 2020, cluster(id)

There are no variables "switch" and "id" in your example datasets, so your problem is not reproducible. You have too many variables for the number of observations that you present. Cut out unnecessary variables and present example data that replicates the issue.
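
A reduced example along these lines could be produced with -dataex-. The sketch below is only an illustration: the three covariates kept, the 20-row limit and the variable name probaSmall are arbitrary assumptions, and the frame names are those used in the original code.

```stata
* Training frame: include the outcome (switch) and the cluster variable (id)
frame VMwahl: dataex switch id Alter haltbusmSQ jahr in 1/20

* Prediction frame: the same covariates
frame default: dataex Alter haltbusmSQ jahr in 1/20

* Reduced model to check whether the problem still occurs
frame VMwahl: logit switch Alter i.haltbusmSQ i.jahr if jahr < 2020, cluster(id)
frame default: predict probaSmall
frame default: count if missing(probaSmall)
```

If the count of missing predictions is still nonzero with the reduced model, the posted excerpts are enough for others to reproduce the issue.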