Adding true zeroes into a dataset

Kate Pryce

Join Date: Jul 2020
Posts: 107

Adding true zeroes into a dataset

23 Feb 2024, 13:18

Hi all,

I'm running a pretty basic xtlogit model on the data below—looking at the relationship between being "on" (>0) or "off" (=0) in period 1 and being "on" or "off" in period 2. I've run into the issue where if the class (class1) is "off" for both periods for a company (ID) then that class is not included at all for that company. Classes appear only for IDs where there is a positive count in either period 1 or period 2. This results in being off in period 1 perfectly predicting success in period 2 since there are no cases where both periods are 0.

Is there a way to add in all the different class1 types for each ID and register them as having zero in both periods?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long ID int class1 double(countP1 countP2)
1013  16  0   2
1013  29  2   9
1013  40  0   1
1013  52  0   2
1013  57  0   1
1013  60  0   1
1013 118  0   1
1013 165  0   3
1013 174  0  39
1013 200  1  11
1013 211  0  14
1013 229  0   1
1013 235  1   1
1013 242  0   2
1013 248  2  14
1013 257  0   1
1013 264  0   1
1013 307  0   1
1013 312  0   7
1013 320  0   1
1013 323  0   1
1013 324  1   2
1013 326  0   1
1013 327  0   3
1013 330  1   3
1013 331  2   1
1013 333  4  13
1013 334  0   1
1013 337  0   1
1013 340  3   7
1013 341  1   4
1013 343  0   1
1013 348  0   1
1013 356  0   5
1013 359  0   9
1013 361  7  46
1013 362  0   1
1013 363  0   2
1013 370  4  64
1013 372  0  13
1013 375  2  13
1013 379  6  26
1013 380  0   1
1013 385 38 273
1013 398  0   5
1013 403  1   6
1013 427  1   1
1013 430  0   1
1013 438  0   4
1013 439 20 159
1013 451  1   5
1013 454  0   1
1013 455  5  42
1013 708  0   1
1013 709  0   5
1013 710  0   3
1013 711  0   1
1013 713  0   1
1013 714  3   5
1013 715  0   1
1013 717  1   1
1013 725  0  12
1055 136  1   0
1055 174  3   0
1055 198  1   0
1055 200  2   0
1055 206  1   0
1055 248  1   0
1055 320  4   0
1055 324  1   0
1055 326  3   0
1055 327  2   0
1055 333  4   0
1055 338  1   0
1055 341  4   0
1055 345  5   1
1055 361 20   0
1055 363  1   0
1055 369  1   0
1055 370  1   0
1055 377  1   0
1055 379  5   1
1055 382  3   1
1055 398  1   0
1055 439  2   0
1055 702  1   0
1055 707  1   0
1055 708  0   2
1055 709  1   0
1055 710 24   1
1055 711 14   0
1055 712  2   0
1055 713  6   0
1055 714  4   0
1055 715 12   0
1055 717  3   0
1055 719  2   0
1055 725  1   0
1055 726  2   0
1072  29  3  10
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

23 Feb 2024, 13:26

You can only accomplish that if there is some way to identify all the ID-class1 pairs that got omitted. For example, if we knew that the values of class1 for a given ID are always consecutive numbers with no gaps, we could infer, for example that ID 1013 is missing a whole lot of classes: everything from 17 through 28 and 30 through 39 and.... It would be possible to write code to fill those gaps and impute countP1 and countP2 to zero.

Alternatively, since I suspect it isn't really true that the class1 numbering system has no gaps, if whoever curated this data set for you has access to the original complete data and can provide you a data set with a list of ID class1 values for every instance, that data set could be appended to the current data set and, again the countP* values set to zero for those.

But if you can't somehow discover what is the list of ID-class1 pairs that got omitted (or at least how many got omitted from each ID), then you are out of luck.
Comment
Kate Pryce

Join Date: Jul 2020

Posts: 107
#3

23 Feb 2024, 13:33

Ah, alas. I don't have a list of all the ID-class1 pairs that were omitted. I can get a full list of class1s—I was hoping there would be some way that I could merge those into the data. Giving each ID all 603 of the class1s except where a class1 was already present.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29796

23 Feb 2024, 13:49

Yes! If you can get a complete list of class1s, then you are good to go.

Code:

use existing_dataset, clear
merge 1:1 ID class1 using full_list_of_class1s, assert(match using)
replace countP1 = 0 if _merge == 2 // THESE WERE OMITTED FROM EXISTING DATA SET
replace countP2 = 0 if _merge == 2

Comment

Kate Pryce

Join Date: Jul 2020

Posts: 107
#5

23 Feb 2024, 14:12

Thank you!!
Comment
Kate Pryce

Join Date: Jul 2020

Posts: 107
#6

23 Feb 2024, 14:21

Wait, actually. It's not quite working. When I try to run the above, I get the following error:

"variable ID not found"

I think it's happening because the full_list_of_class1s file only contains a list of all the possible classes and no IDs to tie them down.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#7

23 Feb 2024, 18:07

Well, this is a problem. Some values of class1 appear twice in your example data, associated with different IDs. Some others appear only once If your other data set has only a list of classes, you have no way of knowing how many times each of the classes that isn't in your original data set should be added into the original.
Comment
Kate Pryce

Join Date: Jul 2020

Posts: 107
#8

07 Mar 2024, 10:40

I've been thinking about this and what I really want to do is to make sure that each ID has one of each of the classes. If I have one dta file that has a list of all the IDs and another that has a list of all the class1s, is there a way to bring those together so that each ID has each of the 600 class1s? Perhaps using joinby? Ex. If I had IDs 1 - 10 and class1s 1 - 10 I would end up with:

ID class1

1 1

1 2

1 3

1 4

1 5

1 6

1 7

1 8

1 9

1 10

2 1

2 2

2 3

2 4

2 5

2 6

2 7

2 8

2 9

Etc.

And then, from there, I could merge that with my existing dataset and only merge in the ID-class1 pairs that are not already present—so that each ID will end up with a full list of the total class1 possibilities.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ID 1 2 3 4 5 6 7 8 9 10 end

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float class1 1 2 3 4 5 6 7 8 9 10 end
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#9

07 Mar 2024, 11:00

You are pointing in the right direction with -joinby-, but it is not directly applicable here because the two data sets have no shared variables. What you want is:

Code:

use data_set_with_ids, clear cross using data_set_with_classes
Comment
Kate Pryce

Join Date: Jul 2020

Posts: 107
#10

07 Mar 2024, 11:04

That worked perfectly! Thank you
Comment

Announcement

Adding true zeroes into a dataset

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment