
  • Value labels and number of observations

    Hi Statalist,
    I want to merge two datasets that I call parent data and child data in this post. Parent data has 208 observations and child data has 2623 observations.
    Both datasets have a variable named "npf" containing French names and a variable named "id" that I created and that is my key variable for merging.

    The _merge variable displays the following result:

    . tab _merge

                           _merge |      Freq.     Percent        Cum.
    ------------------------------+-----------------------------------
               only in using data |         14        0.53        0.53
    both in master and using data |      2,609       99.47      100.00
    ------------------------------+-----------------------------------
                            Total |      2,623      100.00

    An inspection of the child data (my using data) with "tab npf, nolabel" shows that the value labels go from 1 to 209 (instead of 208). In particular, the name "cesar" appears under two values, with these frequencies:

    181 césar 21
    182 cesar 1

    I tried to remove the value label "182", but this changes the number of observations in the child data. I also copied and pasted the original data into a new Excel file, but the problem remains.
    What I want to get is the name written without any accent.
    How can I fix this problem?

    I also inspected the parent data with "tab npf, nol", which gives 224 value labels while the number of observations remains 208. As far as I understand, these extra value labels arose from changes I made to the names in the variable "npf" in the original Excel file of the parent data, where I replaced names containing accented characters. It seems that Stata keeps a record of every change I made.
    Any suggestions for good practice to avoid this kind of problem in the future?

    Thank you!

    Chwen Chwen


  • #2
    The problem description is not clear, so it is hard to say anything.

    Originally posted by Chwen Chwen Chen
    An inspection of the child data (my using data) with "tab npf, nolabel" shows that the value labels go from 1 to 209 (instead of 208)
    Do you mean you see consecutive values (not value labels) from 1 to 209? Then there must be (at least) 209 observations in that dataset. tabulate does not show observations that aren't there.

    Show the results of

    Code:
    * "using" stands for the path to your using (child) dataset
    use using
    describe , short
    describe npf
    If there is no issue with data sensitivity, show the dataset using dataex. If you do not know how to do that, type

    Code:
    help dataex
    Last edited by daniel klein; 12 Jan 2024, 05:46.
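
    The difference between values and value labels can be seen on a toy example. This is only a minimal sketch, not the poster's data: the variable name name_str and the three names are made up for illustration. encode stores an integer for each distinct string and attaches a value label (named npf here) that maps the integers back to the text.

    Code:
    * toy data: three distinct strings, two of which differ only by an accent
    clear
    input str10 name_str
    "anna"
    "césar"
    "cesar"
    end
    * create a labeled numeric variable; the value label is also called npf
    encode name_str, generate(npf)
    * shows the label text
    tabulate npf
    * shows the underlying integer values
    tabulate npf, nolabel
    * shows the full mapping, including entries no observation uses
    label list npf

    Comparing the last two commands is a quick way to tell whether a label definition holds more entries than there are distinct values in the data.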



    • #3
      Hi Daniel,

      I mean something other than consecutive values. Because the data are sensitive, I regret that I cannot show the dataset. I will do my best to be as clear as possible.

      Below are the results of what you requested:

      . use "C:\DatiLocali\DatiLocali\0project\01Data_management\output\clean-child.dta"

      . describe , short

      Contains data from C:\DatiLocali\DatiLocali\0project\01Data_management\output\clean-child.dta
       Observations:         2,623
          Variables:           145                  12 Jan 2024 00:04
      Sorted by: npf  b_count


      . describe npf

      Variable      Storage   Display    Value
          name         type    format    label      Variable label
      ----------------------------------------------------------------
      npf             long    %101.0g    npf        parent name


      The variable "npf" contains the parent names in my example. If I tabulate this variable, I get all the names and their frequencies for a total of 2623 observations, which is correct.
      When I inspect the names, I find that the parent name "cesar" appears twice, with these frequencies:

      césar 21
      cesar 1

      The correct frequency is the one shown for "césar". What I would like, however, is the name "cesar" without the accented character.

      In one of my previous attempts to solve this problem, I dropped the value label "npf". As a result, all the parent names were removed from the variable "npf".

      In the original Excel file, there are 208 parent names for 2623 observations, and the variable "npf" contains the name "cesar".


      Thank you for your help.
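
      Dropping the value label removes the names because, as the describe output shows, npf is a long variable whose text lives only in the value label "npf"; without the label, only the integers remain. One possible way to obtain accent-free names without losing them is to decode the variable to a string, strip the accents there, and encode it again. The following is a sketch under assumptions: the names npf_str and npf_clean are hypothetical, and the Unicode functions ustrnormalize() and ustrto() require Stata 14 or newer.

      Code:
      * copy the label text into a string variable so the names survive
      decode npf, generate(npf_str)
      * split accented letters into base letter plus combining mark (NFD),
      * then convert to ASCII, skipping characters ASCII cannot represent (mode 2)
      replace npf_str = ustrto(ustrnormalize(npf_str, "nfd"), "ascii", 2)
      * build a fresh labeled variable; only names present in the data get a value
      encode npf_str, generate(npf_clean)
      tabulate npf_clean

      After this step "césar" and "cesar" are the same string, so they share one value. The integer values of npf_clean need not match those of the old npf, so the merge should keep relying on a separate key such as id rather than on these values.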





      • #4
        You could find out the underlying values for the labels "césar" and "cesar". Say those values were

        Code:
        42 césar
        73 cesar
        you could then combine the observations

        Code:
        replace npf = 73 if npf == 42
        and clean up the value label

        Code:
        label define npf 42 "" , modify
        Be careful when combining datasets with same-named value labels but different contents in subsequent steps.
        Last edited by daniel klein; 12 Jan 2024, 07:04. Reason: fixed terminology code -> value
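
        The same steps can be written without hard-coding the values, which in the actual data differ from the 42 and 73 used for illustration above. This is a sketch that assumes both label texts exist in the value label npf; it relies on Stata's "text":labelname expression syntax, which returns the value attached to a label text. The local macro names old and new are made up for this example.

        Code:
        * inspect the full mapping between values and label text
        label list npf
        * look up the values behind the two spellings
        local old = "césar":npf
        local new = "cesar":npf
        * stop if either spelling is not found in the value label
        assert !missing(`old', `new')
        * move the accented observations to the unaccented value
        replace npf = `new' if npf == `old'
        * remove the now unused entry from the value label
        label define npf `old' "", modify
        * check the result
        tabulate npf

        Because the lookup is by label text, the same lines could be run in both the parent and the child data even if the two datasets assign different values to the same name.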



        • #5
          Thank you Daniel! It works perfectly!
