Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Good morning,
    Thank you for provided me with the ''amend'' code which allowed me to merge the datasets LB2019 and LB2016. However, given the number of variables is important (1605 in the master file and 848 in the using dataset), I wonder whether there is an alternative way to merge the two datasets without renaming some variables and reorganizing their contents to match those in the master dataset. Here are the restricted versions of the datasets I am merging. I have renamed some variables in LB2016r to match the corresponding variables in the master dataset LB2019r. However, the observations in the matching variables (by name) are not exactly similar. (I am also planning to undertaking a second merge using the dataset merged_LB2019r.LB2016r and LB2015r).

    LB2016r

    input str4 M5A str3 M5B str12 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV int E20A str3 Milieu_new str12(Niveau_instr BRANCH21_E1 TYPE_MENAGE)
    M5A M5B M14 M15 HHH1 SOUSPREFIDD HHH4 SEXE_INDIV E20A Milieu_new Niveau_instr BRANCH21_E1 TYPE_MENAGE
    ABO Publique Non DISTRICT AUT ABO Urbain 2 7500 ADJ Primaire Vente Famille éla
    ABO Publique Oui ADJ Primaire Couple avec
    ADIE Confessionne Non DISTRICT AUT ABO Urbain 2 ADJ Secondaire g Couple sans
    ABO Privée laiq Oui DISTRICT AUT ABO Urbain 1 ADJ Secondaire g Ménage mono
    BF DISTRICT AUT ABO Urbain 2 2000 ADJ Aucun niveau Vente Famille éla
    ABO Publique Oui DISTRICT AUT ABO Urbain 2 5000 ADJ Secondaire g Autres serv Famille éla
    ABO Publique Oui DISTRICT AUT ABO Urbain 2 ADJ Secondaire g Famille éla
    ABO Publique Oui DISTRICT AUT ABO Urbain 2 ADJ Secondaire g Famille éla
    MLI Publique Non DISTRICT AUT ABO Urbain 2 ADJ Secondaire t Couple avec
    ANYA Privée laiq Oui ADJ Préscolaire Couple avec
    ANYA DISTRICT AUT ABO Urbain 2 5000 ADJ Aucun niveau Vente Couple avec
    OURA DISTRICT AUT ABO Urbain 2 1000 ADJ Aucun niveau Vente Couple avec
    ABO Confessionne Oui DISTRICT AUT ABO Urbain 2 ADJ Primaire Ménage mono
    ABO Confessionne Oui ADJ Préscolaire Ménage mono
    BKOU Privée laiq Oui DISTRICT AUT ABO Urbain 2 ADJ Primaire Ménage mono
    ABO Confessionne Oui DISTRICT AUT ABO Urbain 2 ADJ Primaire Ménage mono
    ABO Publique Oui DISTRICT AUT ABO Urbain 1 ADJ Secondaire g Ménage mono
    KOUM DISTRICT AUT ABO Urbain 2 6000 ADJ Aucun niveau Vente Couple avec
    MLI DISTRICT AUT ABO Urbain 2 1200 ADJ Aucun niveau Vente Couple avec


    and LB2019r (master file)

    input str3 HH2 str7 M5A str8 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV str3 Milieu_new str12(Niveau_instr BRANCH21_E1) long REVENU_P str12 TYPE_MENAGE
    HH2 M5A M14 M15 HHH1 SOUSPREFIDD HHH4 SEXE_INDIV Milieu_new Niveau_instr BRANCH21_E1 REVENU_P TYPE_MENAGE
    ADJ ABO Publique Oui DISTRICT AUT ABO Urbain 1 ADJ Secondaire g Ménage mono
    ADJ KAT DISTRICT AUT ABO Urbain 1 ADJ Aucun niveau Vente 150000 Ménage mono
    ADJ TENG Publique Oui DISTRICT AUT ABO Urbain 2 ADJ Secondaire g Ménage mono
    ADJ ANYA DISTRICT AUT ABO Urbain 2 ADJ Aucun niveau Ménage mono
    ADJ ABO ADJ Ménage mono
    ADJ ADJAM Publique Non DISTRICT AUT ABO Urbain 2 ADJ Primaire Vente 225000 Famille éla
    ADJ ABO Autres Oui DISTRICT AUT ABO Urbain 2 ADJ Primaire Famille éla
    ADJ ABO Publique Oui DISTRICT AUT ABO Urbain 1 ADJ Secondaire g Famille éla
    ADJ ABO Publique Oui ADJ Primaire Famille éla
    ADJ ABO Publique Oui DISTRICT AUT ABO Urbain 2 ADJ Primaire Famille éla
    ADJ ABO Autres Oui ADJ Primaire Famille éla
    ADJ MAN DISTRICT AUT ABO Urbain 2 ADJ Aucun niveau Vente 150000 Famille éla
    ADJ ABO Publique Non DISTRICT AUT ABO Urbain 2 ADJ Primaire Vente 262500 Famille éla
    ADJ ABO ADJ Famille éla
    ADJ ADJAM Publique Non DISTRICT AUT ABO Urbain 1 ADJ Primaire Produits des 300000 Famille éla
    ADJ ABO Publique Oui ADJ Primaire Famille éla
    ADJ ABO Publique Oui ADJ Primaire Ménage mono
    ADJ ANYA DISTRICT AUT ABO Urbain 2 ADJ Aucun niveau Ménage mono
    ADJ ANYA Publique Non DISTRICT AUT ABO Urbain 2 ADJ Primaire Vente 75000 Ménage mono
    Many thanks

    Comment


    • #17
      The example data you show were not generated by -dataex-. As best I can tell, it is -list- output to which you prefixed an -input- command. Please, please be considerate of the time that others spend helping you and use -dataex-. Time wasted wrestling ill-formed example data into Stata is time not spent helping you or helping somebody else.

      The bigger obstacle to helping, however, is that I cannot figure out what you actually want to do here. First, I have not been working on -merge-ing these data sets: I have been working on -append-ing them. There is a big difference. -append- puts the data sets together "vertically," whereas -merge- puts them together "horizontally." Because the data sets have names like lb2016r and lb2019r, I have assumed that they are data sets representing the same variables measured in different years, possibly observed on the same households, possibly not. In general, data sets like that are best combined by -append-ing them.

      But maybe I have assumed incorrectly. I do notice in this exampe that lb20164 contains two variables, m5b and e20a that do not occur in lb2019r, and l2019r contains hh2 and revenu_p, which are not found in lb2016r. So perhaps this is in fact two sets of (mostly) different variables measured on the same households, so that -merge- would be appropriate. But in order to -merge- data sets, there has to be a variable, or set of variables, that appear in both data sets and uniquely identify the different households (or whatever unit of analysis each observation represents) in at least one of the two data sets. But there is no such set of identifying variables in these examples.

      So I need you to give a clearer explanation of how these two data sets are related to each other, what, if anything, are the variables that identify the same analytic units in them, and how you wish to combine them. I think it would also help if you made up a table that illustrates what you would like the combined result to look like.

      Comment


      • #18
        Many Thanks, Clyde, for your prompt reaction and apologies for asking you additional questions. I aim to merge the two datasets LB2016r and LB2019r. They are both Labor Force datasets collected respectively in 2016 and 2019 for the same country. However, variable names are not similar and their contents are slightly different. I have first renamed some variables in LB2016r (using dataset) and worked on the observations to match the variables and names in LB2019r (master). In combining the datasets I get the following message: ''variable BRANCH21_E1 does not uniquely identify observations in the master data''. I have spent a lot of time trying to combine both datasets. I am not even sure whether it is possible to combine them. As requested earlier given the high number of variables (1605 in the master file and 848 in the using dataset), I wonder whether there is an alternative way to merge the two datasets without renaming some variables and reorganizing their contents of the using dataset to match those in the master dataset.

        Here are the restricted versions of my datasets:

        LB2016r

        dataex M5A M5B M14 M15 HHH1 SOUSPREFIDD HHH4 SEXE_INDIV E20A Niveau_instr BRANCH21_E1 , count(19) varlabel

        input str4 M5A str3 M5B str12 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD str6 HHH4 byte SEXE_INDIV int E20A str12(Niveau_instr BRANCH21_E1) "ABO" "" "Publique" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 7500 "Primaire" "Vente"
        "ABO" "" "Publique" "Oui" "" "" "" . . "Primaire" ""
        "ADIE" "" "Confessionne" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
        "ABO" "" "Privée laiq" "Oui" "DISTRICT AUT" "ABO" "Urbain" 1 . "Secondaire g" ""
        "" "BF" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 2000 "Aucun niveau" "Vente"
        "ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 5000 "Secondaire g" "Autres serv"
        "ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
        "ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire g" ""
        "" "MLI" "Publique" "Non" "DISTRICT AUT" "ABO" "Urbain" 2 . "Secondaire t" ""
        "ANYA" "" "Privée laiq" "Oui" "" "" "" . . "Préscolaire" ""
        "ANYA" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 5000 "Aucun niveau" "Vente"
        "OURA" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 1000 "Aucun niveau" "Vente"
        "ABO" "" "Confessionne" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
        "ABO" "" "Confessionne" "Oui" "" "" "" . . "Préscolaire" ""
        "BKOU" "" "Privée laiq" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
        "ABO" "" "Confessionne" "Oui" "DISTRICT AUT" "ABO" "Urbain" 2 . "Primaire" ""
        "ABO" "" "Publique" "Oui" "DISTRICT AUT" "ABO" "Urbain" 1 . "Secondaire g" ""
        "KOUM" "" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 6000 "Aucun niveau" "Vente"
        "" "MLI" "" "" "DISTRICT AUT" "ABO" "Urbain" 2 1200 "Aucun niveau" "Vente



        LB2019r

        [CODE]
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str3 HH2 str7 M5A str12 M6 str8 M14 str3 M15 str12 HHH1 str3 SOUSPREFIDD byte SEXE_INDIV str12(Niveau_instr BRANCH21_E1 TYPE_MENAGE)
        "ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 1 "Secondaire g" "" "Ménage mono"
        "ADJ" "KAT" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 1 "Aucun niveau" "Vente" "Ménage mono"
        "ADJ" "TENG" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 2 "Secondaire g" "" "Ménage mono"
        "ADJ" "ANYA" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "" "Ménage mono"
        "ADJ" "ABO" "Ivoirienne" "" "" "" "" . "" "" "Ménage mono"
        "ADJ" "ADJAM" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Autres" "Oui" "DISTRICT AUT" "ABO" 2 "Primaire" "" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 1 "Secondaire g" "" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "" "" . "Primaire" "" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "DISTRICT AUT" "ABO" 2 "Primaire" "" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Autres" "Oui" "" "" . "Primaire" "" "Famille éla"
        "ADJ" "MAN" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "Vente" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "" "" "" "" . "" "" "Famille éla"
        "ADJ" "ADJAM" "Ivoirienne" "Publique" "Non" "DISTRICT AUT" "ABO" 1 "Primaire" "Produits des" "Famille éla"
        "ADJ" "ABO" "Ivoirienne" "Publique" "Oui" "" "" . "Primaire" "" "Famille éla"
        "ADJ" "ABO" "Non ivoirien" "Publique" "Oui" "" "" . "Primaire" "" "Ménage mono"
        "ADJ" "ANYA" "Ivoirienne" "" "" "DISTRICT AUT" "ABO" 2 "Aucun niveau" "" "Ménage mono"
        "ADJ" "ANYA" "Non ivoirien" "Publique" "Non" "DISTRICT AUT" "ABO" 2 "Primaire" "Vente" "Ménage mono

        Comment


        • #19
          If these examples are representative of the two data sets, then, no it is not possible to -merge- them. Here is my explanation why this is the case:
          1. To -merge- two data sets, there must be a merge key, that is, a variable or set of variables, that is common to both data sets and that tells us which observations in each data set should be paired with which observations in the other.
          2. Moreover, to do this using the -merge- command, that merge key must uniquely identify variables in one of the two data sets. That is, in at least one of the two data sets, no combination of values of the merge key variable(s) can ever occur more than once in that data set.
          3. The variables that are common to both data sets are: M5A, M14, M15, HHH1, SOUSPREFIDD, SEXE_INDIV, Niveau_instr, and BRANCH21_E1.
          4. But even this entire set of variables fails to uniquely identify observations in either data set, so surely no subset of these can do that.
          5. Now, part of the failure to uniquely identify observations arises from the fact that both data sets contain observations that are exact duplicates of each other in all variables. Such pure duplicate observations usually represent data management errors. So their presence implies that you should carefully review how these data sets were created, and how it is possible that the same observation managed to end up in the data set more than once. (You may find that the problem is already present in your initial source data, in which case you should contact the source about this.)
          6. But, even after removing all pure duplicate observations, it is still true that M5A, M14, M15, HHH1, SOUSPREFIDD, SEXE_INDIV, Niveau_instr, and BRANCH21_E1 do not uniquely identify observations in either data set.
          As I pointed out in 5., there is reason to think that these data sets contain incorrect data. But even if they don't, you still have to confront the lack of an appropriate merge key. You have stated that the real data sets contain many more variables. So it is possible, and I suspect it is the case, that the variable(s) needed to form a usable -merge- key are there in the full data sets, but are not among the ones shown in your examples. You will have to look into that.

          Comment


          • #20
            Many thanks Clyde. Clear observations. All understood now. As you advised, I will drop the duplicated variables and amend the datasets. Very useful assistance. Many thanks

            Comment


            • #21
              Hi All
              I am using Principal Component Analysis (PCA) to narrow the wide range of my independent variables. In my regression, I use cluster. I have standardized all my variables ahead of the PCA as advised. However, in my regression, when I apply i.schooling (my education variable), I get the following message: schooling: factor variables may not contain noninteger values. Does this mean that i. cannot be used with standardized variables?

              Comment


              • #22
                Hi All
                I am using Principal Component Analysis (PCA) to narrow the wide range of my independent variables. In my regression, I use cluster. I have standardized all my variables ahead of the PCA as advised. However, in my regression, when I apply i.schooling (my education variable), I get the following message: schooling: factor variables may not contain noninteger values. Does this mean that i. cannot be used with standardized variables?

                Comment


                • #23
                  #21 and #22 have no relation to the title of this thread.

                  It's a little difficult to tell whether schooling is an original variable or some principal component mishmash of original variables.


                  But the answer is essentially Yes, yet if I understand correctly you're missing nothing you should want to do.

                  * If you have, or have calculated, predictor (you say independent) variables that are on continuous scales, then there is no call to treat them otherwise. They aren't levels of a factor.

                  * Conversely, if schooling is essentially counted or ordered there is no call for or advantage in standardizing it.

                  (By default, pca in Stata works with correlations not covariances, so in effect standardization is redundant. Using correlations does that automatically.)
                  Last edited by Nick Cox; 03 Nov 2024, 05:42.

                  Comment


                  • #24
                    Many thanks Nick my issue is now resolved. All variables in my regressions are standardized because of the principal components. I generate the continuous variable of years of schooling based on the number of years of completed education which I standardized.

                    Comment

                    Working...
                    X