Value labels and number of observations

Chwen Chwen Chen

Join Date: Mar 2023

Posts: 17
#1

12 Jan 2024, 03:50

Hi Statalist,
I want to merge two datasets that I call parent data and child data in this post. Parent data has 208 observations and child data has 2623 observations.
Both datasets have a variable named "npf" containing French names and a variable named "id" that I created and that is my key variable for merging.

The _merge variable displays the following result:

. tab _merge

                       _merge |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
           only in using data |         14        0.53        0.53
both in master and using data |      2,609       99.47      100.00
------------------------------+-----------------------------------
                        Total |      2,623      100.00

An inspection of the child data (my using data) with "tab npf, nolabel" shows that the value labels go from 1 to 209 (instead of 208). In particular, for the npf "cesar" I have two values with their respective frequencies:

181 césar 21
182 cesar 1

I tried to remove the value label "182", but this changes the number of observations in the child data. I also copied and pasted the original data into a new Excel file, but the problem remains.
What I want to get is the name written without any accent.
How can I fix this problem?

I also inspected the parent data with "tab npf, nol", which gives 224 value labels while the number of observations remains 208. As far as I understand, these extra value labels arose from changes I made to the names in the variable "npf" in the original Excel file of the parent data, to get rid of accented characters. It seems that Stata keeps a record of every change I made.
Any suggestions for good practice to avoid this kind of problem in the future?

Thank you!

Chwen Chwen
daniel klein

Join Date: Mar 2014

Posts: 3850
#2

12 Jan 2024, 04:44

The problem description is not clear, so it is hard to say anything.

Originally posted by Chwen Chwen Chen

An inspection of the child data (my using data) with "tab npf, nolabel" shows that the value labels go from 1 to 209 (instead of 208)

Do you mean you see consecutive values (not value labels) from 1 to 209? Then there must be (at least) 209 observations in that dataset. tabulate does not show observations that aren't there.

Show the results of

Code:

use using
describe , short
describe npf

If there is no issue with data sensitivity, show the dataset using dataex. If you do not know how to do that, type

Code:

help dataex
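
A minimal sketch of how values that occur in the data can be told apart from mappings stored in the value label. The toy data, names, and numbers below are assumptions for illustration only, not taken from the poster's files. tabulate reports only values that occur in at least one observation, whereas label list reports every mapping defined in the value label, whether it is used or not.

```stata
* Toy data (hypothetical): three observations, two distinct values
clear
input long npf
1
1
2
end

* The value label defines three mappings; value 3 never occurs in the data
label define npf 1 "anne" 2 "cesar" 3 "césar"
label values npf npf

* Values that actually occur in the data
tabulate npf, nolabel
display "distinct values in data: " r(r)

* All mappings stored in the value label, used or not
label list npf
display "mappings in label: " r(k) " (range " r(min) " to " r(max) ")"
```

If the two counts differ, the value label contains entries that no observation uses; this does not change the number of observations.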

Last edited by daniel klein; 12 Jan 2024, 04:46.
Chwen Chwen Chen

Join Date: Mar 2023

Posts: 17
#3

12 Jan 2024, 05:48

Hi Daniel,

I mean something other than consecutive values. Because of data sensitivity, I regret that I cannot show the dataset. I will do my best to be as clear as possible.

Below are the results of what you requested:

. use "C:\DatiLocali\DatiLocali\0project\01Data_management\output\clean-child.dta"

. describe , short

Contains data from C:\DatiLocali\DatiLocali\0project\01Data_management\output\clean-child.dta
Observations: 2,623
Variables: 145 12 Jan 2024 00:04
Sorted by: npf b_count

. describe npf

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------
npf             long    %101.0g    npf        parent name

The variable "npf" contains parent names in my example. If I tabulate this variable I actually get all the names and their frequencies for a total of 2623 observations, which is correct.
When I inspect the names, I find that the parent name "cesar" appears twice, with these frequencies:

césar 21
cesar 1

The correct frequency is the one shown for the name "césar". But what I would like to get is the name "cesar", without the accented character.

In one of my previous attempts to solve this problem I dropped value label "npf". As a result all the parent names are removed from the variable "npf".

In the original Excel file the number of parent names is 208 for 2623 observations, and the variable "npf" contains the name "cesar".

Thank you for your help.
daniel klein

Join Date: Mar 2014

Posts: 3850
#4

12 Jan 2024, 05:57

You could find out the underlying values for the labels "césar" and "cesar". Say those values were

Code:

42 césar
73 cesar

you could then combine the observations

Code:

replace npf = 73 if npf == 42

and clean up the value label

Code:

label define npf 42 "" , modify
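
Put together on toy data, the steps above might look like the following sketch. The values 42 and 73 are the hypothetical ones from the example, not the real values in the poster's data, and the three observations are made up.

```stata
* Toy data (hypothetical): two observations coded 42, one coded 73
clear
input long npf
42
42
73
end
label define npf 42 "césar" 73 "cesar"
label values npf npf

* Look up which numeric values carry which labels
label list npf

* Move the accented category into the unaccented one
replace npf = 73 if npf == 42

* Remove the now-unused mapping for value 42 from the value label
label define npf 42 "", modify

* All observations are kept; only their value has changed
tabulate npf
```

Because the observations are recoded rather than dropped, the number of observations stays the same.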

Be careful when combining datasets with same-named value labels but different contents in subsequent steps.
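
One possible precaution, sketched here under the assumption that the label text (not the numeric value) is what identifies a parent: store the text as a string variable before combining, so the names survive even if the numeric values mean different things in each file. The toy data and the variable name npf_name are hypothetical.

```stata
* Toy dataset (hypothetical) standing in for either file
clear
input long npf
1
2
end
label define npf 1 "anne" 2 "cesar"
label values npf npf

* Store the label text itself in a new string variable
decode npf, generate(npf_name)

* The string now travels with the observation, whatever the numeric value
list npf npf_name
```

After combining, a fresh numeric variable could be rebuilt from the string with encode if one is needed.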

Last edited by daniel klein; 12 Jan 2024, 06:04. Reason: fixed terminology code -> value
Chwen Chwen Chen

Join Date: Mar 2023

Posts: 17
#5

12 Jan 2024, 08:30

Thank you Daniel! It works perfectly!