Making a group of observations all share the same value for a variable?

Mariama Kam

Join Date: Aug 2024
Posts: 11

#16

03 Sep 2024, 06:25

Good morning,
Thank you for provided me with the ''amend'' code which allowed me to merge the datasets LB2019 and LB2016. However, given the number of variables is important (1605 in the master file and 848 in the using dataset), I wonder whether there is an alternative way to merge the two datasets without renaming some variables and reorganizing their contents to match those in the master dataset. Here are the restricted versions of the datasets I am merging. I have renamed some variables in LB2016r to match the corresponding variables in the master dataset LB2019r. However, the observations in the matching variables (by name) are not exactly similar. (I am also planning to undertaking a second merge using the dataset merged_LB2019r.LB2016r and LB2015r).

LB2016r

input str4 M5A str3 M5B str12 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV int E20A str3 Milieu_new str12(Niveau_instr BRANCH21_E1 TYPE_MENAGE)

M5A

M5B

M14

M15

HHH1

SOUSPREFIDD

HHH4

SEXE_INDIV

E20A

Milieu_new

Niveau_instr

BRANCH21_E1

TYPE_MENAGE

ABO

Publique

Non

DISTRICT AUT

ABO

Urbain

7500

ADJ

Primaire

Vente

Famille éla

ABO

Publique

Oui

ADJ

Primaire

Couple avec

ADIE

Confessionne

Non

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Couple sans

ABO

Privée laiq

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Ménage mono

DISTRICT AUT

ABO

Urbain

2000

ADJ

Aucun niveau

Vente

Famille éla

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

5000

ADJ

Secondaire g

Autres serv

Famille éla

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Famille éla

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Famille éla

MLI

Publique

Non

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire t

Couple avec

ANYA

Privée laiq

Oui

ADJ

Préscolaire

Couple avec

ANYA

DISTRICT AUT

ABO

Urbain

5000

ADJ

Aucun niveau

Vente

Couple avec

OURA

DISTRICT AUT

ABO

Urbain

1000

ADJ

Aucun niveau

Vente

Couple avec

ABO

Confessionne

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Ménage mono

ABO

Confessionne

Oui

ADJ

Préscolaire

Ménage mono

BKOU

Privée laiq

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Ménage mono

ABO

Confessionne

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Ménage mono

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Ménage mono

KOUM

DISTRICT AUT

ABO

Urbain

6000

ADJ

Aucun niveau

Vente

Couple avec

MLI

DISTRICT AUT

ABO

Urbain

1200

ADJ

Aucun niveau

Vente

Couple avec

and LB2019r (master file)

input str3 HH2 str7 M5A str8 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV str3 Milieu_new str12(Niveau_instr BRANCH21_E1) long REVENU_P str12 TYPE_MENAGE

HH2

M5A

M14

M15

HHH1

SOUSPREFIDD

HHH4

SEXE_INDIV

Milieu_new

Niveau_instr

BRANCH21_E1

REVENU_P

TYPE_MENAGE

ADJ

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Ménage mono

ADJ

KAT

DISTRICT AUT

ABO

Urbain

ADJ

Aucun niveau

Vente

150000

Ménage mono

ADJ

TENG

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Ménage mono

ADJ

ANYA

DISTRICT AUT

ABO

Urbain

ADJ

Aucun niveau

Ménage mono

ADJ

ABO

ADJ

Ménage mono

ADJ

ADJAM

Publique

Non

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Vente

225000

Famille éla

ADJ

ABO

Autres

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Famille éla

ADJ

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Secondaire g

Famille éla

ADJ

ABO

Publique

Oui

ADJ

Primaire

Famille éla

ADJ

ABO

Publique

Oui

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Famille éla

ADJ

ABO

Autres

Oui

ADJ

Primaire

Famille éla

ADJ

MAN

DISTRICT AUT

ABO

Urbain

ADJ

Aucun niveau

Vente

150000

Famille éla

ADJ

ABO

Publique

Non

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Vente

262500

Famille éla

ADJ

ABO

ADJ

Famille éla

ADJ

ADJAM

Publique

Non

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Produits des

300000

Famille éla

ADJ

ABO

Publique

Oui

ADJ

Primaire

Famille éla

ADJ

ABO

Publique

Oui

ADJ

Primaire

Ménage mono

ADJ

ANYA

DISTRICT AUT

ABO

Urbain

ADJ

Aucun niveau

Ménage mono

ADJ

ANYA

Publique

Non

DISTRICT AUT

ABO

Urbain

ADJ

Primaire

Vente

75000

Ménage mono

Many thanks

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#17

03 Sep 2024, 09:39

The example data you show were not generated by -dataex-. As best I can tell, it is -list- output to which you prefixed an -input- command. Please, please be considerate of the time that others spend helping you and use -dataex-. Time wasted wrestling ill-formed example data into Stata is time not spent helping you or helping somebody else.

The bigger obstacle to helping, however, is that I cannot figure out what you actually want to do here. First, I have not been working on -merge-ing these data sets: I have been working on -append-ing them. There is a big difference. -append- puts the data sets together "vertically," whereas -merge- puts them together "horizontally." Because the data sets have names like lb2016r and lb2019r, I have assumed that they are data sets representing the same variables measured in different years, possibly observed on the same households, possibly not. In general, data sets like that are best combined by -append-ing them.

But maybe I have assumed incorrectly. I do notice in this exampe that lb20164 contains two variables, m5b and e20a that do not occur in lb2019r, and l2019r contains hh2 and revenu_p, which are not found in lb2016r. So perhaps this is in fact two sets of (mostly) different variables measured on the same households, so that -merge- would be appropriate. But in order to -merge- data sets, there has to be a variable, or set of variables, that appear in both data sets and uniquely identify the different households (or whatever unit of analysis each observation represents) in at least one of the two data sets. But there is no such set of identifying variables in these examples.

So I need you to give a clearer explanation of how these two data sets are related to each other, what, if anything, are the variables that identify the same analytic units in them, and how you wish to combine them. I think it would also help if you made up a table that illustrates what you would like the combined result to look like.
Comment
Mariama Kam

Join Date: Aug 2024

Posts: 11
#18

04 Sep 2024, 01:21

Many Thanks, Clyde, for your prompt reaction and apologies for asking you additional questions. I aim to merge the two datasets LB2016r and LB2019r. They are both Labor Force datasets collected respectively in 2016 and 2019 for the same country. However, variable names are not similar and their contents are slightly different. I have first renamed some variables in LB2016r (using dataset) and worked on the observations to match the variables and names in LB2019r (master). In combining the datasets I get the following message: ''variable BRANCH21_E1 does not uniquely identify observations in the master data''. I have spent a lot of time trying to combine both datasets. I am not even sure whether it is possible to combine them. As requested earlier given the high number of variables (1605 in the master file and 848 in the using dataset), I wonder whether there is an alternative way to merge the two datasets without renaming some variables and reorganizing their contents of the using dataset to match those in the master dataset.

Here are the restricted versions of my datasets:

LB2016r

dataex M5A M5B M14 M15 HHH1 SOUSPREFIDD HHH4 SEXE_INDIV E20A Niveau_instr BRANCH21_E1 , count(19) varlabel

input str4 M5A str3 M5B str12 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV int E20A str12(Niveau_instr BRANCH21_E1) "ABO" "" "Publique" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 7500 "Primaire" "Vente"
"ABO" "" "Publique" "Oui" "" "" "" . . "Primaire" ""
"ADIE" "" "Confessionne" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
"ABO" "" "Privée laiq" "Oui" "DISTRICT AUT" "ABO" "Urbain" 1 . "Secondaire g" ""
"" "BF" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 2000 "Aucun niveau" "Vente"
"ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 5000 "Secondaire g" "Autres serv"
"ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
"ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
"" "MLI" "Publique" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire t" ""
"ANYA" "" "Privée laiq" "Oui" "" "" "" . . "Préscolaire" ""
"ANYA" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 5000 "Aucun niveau" "Vente"
"OURA" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 1000 "Aucun niveau" "Vente"
"ABO" "" "Confessionne" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
"ABO" "" "Confessionne" "Oui" "" "" "" . . "Préscolaire" ""
"BKOU" "" "Privée laiq" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
"ABO" "" "Confessionne" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
"ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 1 . "Secondaire g" ""
"KOUM" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 6000 "Aucun niveau" "Vente"
"" "MLI" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 1200 "Aucun niveau" "Vente

LB2019r

[CODE]
* Example generated by -dataex-. For more info, type help dataex
clear
input str3 HH2 str7 M5A str12 M6 str8 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD byte SEXE_INDIV str12(Niveau_instr BRANCH21_E1 TYPE_MENAGE)
"ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 1 "Secondaire g" "" "Ménage mono"
"ADJ" "KAT" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 1 "Aucun niveau" "Vente" "Ménage mono"
"ADJ" "TENG" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 2 "Secondaire g" "" "Ménage mono"
"ADJ" "ANYA" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "" "Ménage mono"
"ADJ" "ABO" "Ivoirienne" "" "" "" "" . "" "" "Ménage mono"
"ADJ" "ADJAM" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Autres" "Oui" "DISTRICT AUT" "ABO" 2 "Primaire" "" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 1 "Secondaire g" "" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "" "" . "Primaire" "" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 2 "Primaire" "" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Autres" "Oui" "" "" . "Primaire" "" "Famille éla"
"ADJ" "MAN" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "Vente" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "" "" "" "" . "" "" "Famille éla"
"ADJ" "ADJAM" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 1 "Primaire" "Produits des" "Famille éla"
"ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "" "" . "Primaire" "" "Famille éla"
"ADJ" "ABO" "Non ivoirien" "Publique" "Oui" "" "" . "Primaire" "" "Ménage mono"
"ADJ" "ANYA" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "" "Ménage mono"
"ADJ" "ANYA" "Non ivoirien" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Ménage mono
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#19

04 Sep 2024, 12:08

If these examples are representative of the two data sets, then, no it is not possible to -merge- them. Here is my explanation why this is the case:
To -merge- two data sets, there must be a merge key, that is, a variable or set of variables, that is common to both data sets and that tells us which observations in each data set should be paired with which observations in the other.

Moreover, to do this using the -merge- command, that merge key must uniquely identify variables in one of the two data sets. That is, in at least one of the two data sets, no combination of values of the merge key variable(s) can ever occur more than once in that data set.

The variables that are common to both data sets are: M5A, M14, M15, HHH1, SOUSPREFIDD, SEXE_INDIV, Niveau_instr, and BRANCH21_E1.

But even this entire set of variables fails to uniquely identify observations in either data set, so surely no subset of these can do that.

Now, part of the failure to uniquely identify observations arises from the fact that both data sets contain observations that are exact duplicates of each other in all variables. Such pure duplicate observations usually represent data management errors. So their presence implies that you should carefully review how these data sets were created, and how it is possible that the same observation managed to end up in the data set more than once. (You may find that the problem is already present in your initial source data, in which case you should contact the source about this.)

But, even after removing all pure duplicate observations, it is still true that M5A, M14, M15, HHH1, SOUSPREFIDD, SEXE_INDIV, Niveau_instr, and BRANCH21_E1 do not uniquely identify observations in either data set.

As I pointed out in 5., there is reason to think that these data sets contain incorrect data. But even if they don't, you still have to confront the lack of an appropriate merge key. You have stated that the real data sets contain many more variables. So it is possible, and I suspect it is the case, that the variable(s) needed to form a usable -merge- key are there in the full data sets, but are not among the ones shown in your examples. You will have to look into that.
Comment
Mariama Kam

Join Date: Aug 2024

Posts: 11
#20

04 Sep 2024, 12:19

Many thanks Clyde. Clear observations. All understood now. As you advised, I will drop the duplicated variables and amend the datasets. Very useful assistance. Many thanks
Comment
Mariama Kam

Join Date: Aug 2024

Posts: 11
#21

03 Nov 2024, 04:05

Hi All
I am using Principal Component Analysis (PCA) to narrow the wide range of my independent variables. In my regression, I use cluster. I have standardized all my variables ahead of the PCA as advised. However, in my regression, when I apply i.schooling (my education variable), I get the following message: schooling: factor variables may not contain noninteger values. Does this mean that i. cannot be used with standardized variables?
Comment
Mariama Kam

Join Date: Aug 2024

Posts: 11
#22

03 Nov 2024, 04:06

Hi All
I am using Principal Component Analysis (PCA) to narrow the wide range of my independent variables. In my regression, I use cluster. I have standardized all my variables ahead of the PCA as advised. However, in my regression, when I apply i.schooling (my education variable), I get the following message: schooling: factor variables may not contain noninteger values. Does this mean that i. cannot be used with standardized variables?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35449
#23

03 Nov 2024, 04:23

#21 and #22 have no relation to the title of this thread.

It's a little difficult to tell whether schooling is an original variable or some principal component mishmash of original variables.

But the answer is essentially Yes, yet if I understand correctly you're missing nothing you should want to do.

* If you have, or have calculated, predictor (you say independent) variables that are on continuous scales, then there is no call to treat them otherwise. They aren't levels of a factor.

* Conversely, if schooling is essentially counted or ordered there is no call for or advantage in standardizing it.

(By default, pca in Stata works with correlations not covariances, so in effect standardization is redundant. Using correlations does that automatically.)

Last edited by Nick Cox; 03 Nov 2024, 04:42.
Comment
Mariama Kam

Join Date: Aug 2024

Posts: 11
#24

04 Nov 2024, 05:31

Many thanks Nick my issue is now resolved. All variables in my regressions are standardized because of the principal components. I generate the continuous variable of years of schooling based on the number of years of completed education which I standardized.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment