Formatting survey data to remove duplicates IDS.

Beri Parfait

Join Date: Jun 2019

Posts: 54
#1

Formatting survey data to remove duplicates IDS.

15 Jan 2024, 16:38

Dear Stata Users,

This is my first time of working with survey data, and I am having some challenges with making my IDs uniquely identifiable. I am working on a survey dataset in Stata. The dataset contains an ID variable, another called preferred colour, and a host of other variables. Preferred colour is central to my study.

Regarding the preferred colour, IDs reported that they either prefer brown, white, purple, black or none. Since IDs listed different colours, the final dataset has a format where there is one observation per colour.
I need to reformat the data so that there is only one observation per ID. I read somewhere that I need to export separate excel sheets of the duplicates for each of the preferred colours, and then merge them into my main data, but how to even start. I will appreciate any hints that you can provide.

NB
I tried to get sample data using data dataex in 1/5 but it wasn't giving.

Last edited by Beri Parfait; 15 Jan 2024, 16:41.
Tags: categorical, foreach, string, survey
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

15 Jan 2024, 17:52

I have to say that despite your best efforts to describe your data, I cannot even begin to imagine what you have and how it is organized. If you tried running

Code:

data dataex in 1/5

it is no wonder that you got no results. There is no command -data-. The correct syntax for this is:

Code:

dataex in 1/5

Actually, you should probably show more than just the first 5 observations--that's a bit skimpy. Most people don't bother specifying an -in- clause when using -dataex-, and with no other specification for it, the default is the first 100 observations. That's usually an ample amount of observations for most purposes. If whatever regularities exist in the data can be reliably displayed with a smaller number, then, sure, go ahead and restrict with an -in- clause. But remember that if you unwittingly end up failing to show something that is actually important, you are likely to get a response with code that won't work for you.

Perhaps when you do this you will run into a problem where it tells you that there are too many variables. In that case, you should pick a shorter list of the variables that are most relevant to your problem and include those in the command:

Code:

dataex list_of_most_relevant_variables

That list should include, at a minimum, the id variable and the color variables.
1 like
Comment

Beri Parfait

Join Date: Jun 2019
Posts: 54

15 Jan 2024, 18:47

Thank you, Clyde, for your suggestion. As you can see, the current format of the dataset contains one observation per colour, but I want to format it to have one observation per id (participant in the survey).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int id long(colour region) byte age
1001 2 2 32
1001 1 2 32
1001 6 2 32
1002 5 1 32
1002 6 1 32
1003 2 2 30
1003 1 2 30
1004 2 1 68
1004 1 1 68
1004 5 1 68
1005 2 1 28
1005 6 1 28
1006 6 1 36
1007 6 2 66
1008 2 1 52
1008 5 1 52
1008 6 1 52
1009 6 1 37
1010 2 2 45
1010 1 2 45
1010 4 2 45
1011 2 1 25
1011 6 1 25
1012 3 1 25
1013 2 1 20
1013 6 1 20
1014 6 1 32
1015 2 1 31
1015 5 1 31
1015 6 1 31
1016 6 1 24
1017 3 2 20
1018 2 1 23
1018 5 1 23
1019 2 1 45
1019 5 1 45
1019 6 1 45
1020 3 1 70
1021 2 2 33
1021 1 2 33
1021 5 2 33
1022 2 1 40
1022 5 1 40
1022 6 1 40
1023 2 1 24
1024 2 1 23
1024 1 1 23
1025 3 1 27
1026 6 1 21
1027 2 1 21
1028 2 1 24
1028 6 1 24
1029 2 2 39
1030 3 1 26
1031 2 1 37
1032 2 1 39
1032 1 1 39
1032 6 1 39
1033 2 2 49
1033 1 2 49
1033 5 2 49
1033 6 2 49
1034 5 1 25
1035 2 1 50
1035 6 1 50
1036 2 1 44
1036 5 1 44
1037 5 1 40
1037 6 1 40
1038 2 1 24
1038 1 1 24
1038 6 1 24
1039 2 2 59
1039 1 2 59
1040 2 1 28
1040 6 1 28
1041 2 1 58
1042 2 1 26
1042 6 1 26
1043 2 1 60
1043 1 1 60
1044 2 1 24
1044 6 1 24
1045 2 1 65
1045 5 1 65
1046 2 1 35
1046 5 1 35
1046 6 1 35
1047 2 1 31
1047 1 1 31
1047 6 1 31
1048 5 1 60
1048 6 1 60
1049 5 1 47
1049 6 1 47
1050 6 1 48
1051 5 1 58
1051 6 1 58
1052 2 1 23
1052 6 1 23
end
label values colour colour
label def colour 1 "Black", modify
label def colour 2 "White", modify
label def colour 3 "None", modify
label def colour 4 "Purple", modify
label def colour 5 "Brown", modify
label def colour 6 "Green", modify
label values region region
label def region 1 "SW", modify
label def region 2 "NW", modify

Comment

Beri Parfait

Join Date: Jun 2019

Posts: 54
#4

15 Jan 2024, 18:58

To add more, I read on the this website that "The basic process behind tidying data is that------: first, identify all of the variables that were measured at the same level of observation; second, create separate data tables for each level of observation; and third, reshape the data and remove duplicate rows until each data table is uniquely and fully identified by the unique identifier that corresponds to its unit of observation. but the information wasn't enough to help me proceed. I believe these are the steps I needed to follow but do not know how to proceed in stata.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

15 Jan 2024, 21:50

The process you describe in #4 is doable, and probably is the best way to proceed in some other kinds of statistical packages. It is not the best way to proceed in Stata.

You have stated that your goal is to end up with one observation per id. So it is important, at the start, to know which variables are constant within id and which ones vary. In your example data, both region and age are constant within id. I will assume that this is true throughout your data. (If it isn't, the -reshape- command will notice the problem and halt with an error message.) The colour variable, however, does vary within id. So you will need to end up with multiple variables capturing colour. Actually, in your example data, there are always at most four observations for any id. So you will end up with four colour variables. (If this is not true of your larger data set, you may end up with more than four such variables. The code will not break if there are some id's with more than four: four is not built into this code.)

Now, we know that the key identifying variable here is id. We need another variable that indexes the observations within id. We can't use colour for that purpose because it is the variable we want to expand to four variables. And we can't use region or age because they are constant within id. So we have no indexing variable. That means we need to create one. We can create a new variable that simply counts up within id's starting at 1. There is a little detail we may have to pay attention to at this point. From your description of the problem, I cannot ascertain whether it makes any difference which colour a person chooses is taken to be colour1 and which colour2 and which colour3 and which colour4. If it does not matter, it is very easy. If it does matter, then the code will have to respect that, and depending on what the desired ordering is, the code will have to be modified accordingly. I'm going to assume that you want them in the order that they currently appear (vertically) in the data. So our indexing variable has to be set to 1 for what is currently the first observation for each id, 2 for the second, etc.

Code:

sort id, stable // PRESERVE THE EXISTING ORDER WITHIN ID by id: gen _j = _n // CREATE AN INDEXING VARIABLE _j reshape wide colour, i(id) j(_j) // DO THE RESHAPE

and you're done.
Comment

Beri Parfait

Join Date: Jun 2019
Posts: 54

16 Jan 2024, 00:25

Originally posted by Clyde Schechter View Post

In your example data, both region and age are constant within id. I will assume that this is true throughout your data. (If it isn't, the -reshape- command will notice the problem and halt with an error message.)

Thank you once again for your patience with me. As you rightly predicted, the code broke out because there was one other variable (colour_number== this variable is just 1, 2, 3 describing the colour numbers) that was coded in the same way as colour (names of colours 1, 2, 3, etc). I apologise for not including it in the sample dataset yesterday. When I try to reshape, I get the following error message . I dropped colour_num from the dataset the the error persisted. kindly see the error below:

reshape wide colour , i( id ) j(_j)
(note: j = 1 2 3 4 5)
variable colour_num not constant within id
variable colour not constant within id
Your data are currently long. You are performing a reshape wide. You typed something like

. reshape wide a b, i(id) j(_j)

There are variables other than a, b, id, _j in your data. They must be constant within id because that.
is the only way they can fit into wide data without loss of information.

The variable or variables listed above are not constant within id. Perhaps the values are in error. Type
reshape error for a list of the problem observations.

Either that, or the values vary because they should vary, in which case you must either add the variables to
the list of xij variables to be reshaped or drop them.

Although I have already drop colour_num from my dataset, this is the revised example data I generated with that variable

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int id byte colour_num long(colour region) byte age
1001 1 2 2 32
1001 2 1 2 32
1001 3 6 2 32
1002 1 5 1 32
1002 2 6 1 32
1003 1 2 2 30
1003 2 1 2 30
1004 1 2 1 68
1004 2 1 1 68
1004 3 5 1 68
1005 1 2 1 28
1005 2 6 1 28
1006 1 6 1 36
1007 1 6 2 66
1008 1 2 1 52
1008 2 5 1 52
1008 3 6 1 52
1009 1 6 1 37
1010 1 2 2 45
1010 2 1 2 45
1010 3 4 2 45
1011 1 2 1 25
1011 2 6 1 25
1012 1 3 1 25
1013 1 2 1 20
1013 2 6 1 20
1014 1 6 1 32
1015 1 2 1 31
1015 2 5 1 31
1015 3 6 1 31
1016 1 6 1 24
1017 1 3 2 20
1018 1 2 1 23
1018 2 5 1 23
1019 1 2 1 45
1019 2 5 1 45
1019 3 6 1 45
1020 1 3 1 70
1021 1 2 2 33
1021 2 1 2 33
1021 3 5 2 33
1022 1 2 1 40
1022 2 5 1 40
1022 3 6 1 40
1023 1 2 1 24
1024 1 2 1 23
1024 2 1 1 23
1025 1 3 1 27
1026 1 6 1 21
1027 1 2 1 21
1028 1 2 1 24
1028 2 6 1 24
1029 1 2 2 39
1030 1 3 1 26
1031 1 2 1 37
1032 1 2 1 39
1032 2 1 1 39
1032 3 6 1 39
1033 1 2 2 49
1033 2 1 2 49
1033 3 5 2 49
1033 4 6 2 49
1034 1 5 1 25
1035 1 2 1 50
1035 2 6 1 50
1036 1 2 1 44
1036 2 5 1 44
1037 1 5 1 40
1037 2 6 1 40
1038 1 2 1 24
1038 2 1 1 24
1038 3 6 1 24
1039 1 2 2 59
1039 2 1 2 59
1040 1 2 1 28
1040 2 6 1 28
1041 1 2 1 58
1042 1 2 1 26
1042 2 6 1 26
1043 1 2 1 60
1043 2 1 1 60
1044 1 2 1 24
1044 2 6 1 24
1045 1 2 1 65
1045 2 5 1 65
1046 1 2 1 35
1046 2 5 1 35
1046 3 6 1 35
1047 1 2 1 31
1047 2 1 1 31
1047 3 6 1 31
1048 1 5 1 60
1048 2 6 1 60
1049 1 5 1 47
1049 2 6 1 47
1050 1 6 1 48
1051 1 5 1 58
1051 2 6 1 58
1052 1 2 1 23
1052 2 6 1 23
end
label values colour colour
label def colour 1 "Black", modify
label def colour 2 "White", modify
label def colour 3 "None", modify
label def colour 4 "Purple", modify
label def colour 5 "Brown", modify
label def colour 6 "Green", modify
label values region region
label def region 1 "SW", modify
label def region 2 "NW", modify

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

16 Jan 2024, 08:39

OK. The principle remains the same. Those variables that vary within id should be listed in the variable list that precedes the comma, the id variable goes in -i()-, and the indexing variable goes in -j()-. Since colour_num and colour now both vary within id, we get:

Code:

sort id, stable by id: gen _j = _n reshape wide colour colour_num, i(id) j(_j)

I have tested this code with your example data and verified that it runs correctly. If it does not do so in your real data set, you will need to find a new data example that exhibits whatever problems you encounter and post back with that.
Comment
Beri Parfait

Join Date: Jun 2019

Posts: 54
#8

16 Jan 2024, 13:51

Dear Clyde, Thank you very much for your assistance. The code worked perfectly for me. The delayed response was because I wanted to ensure that the outcome (clean dataset) was a perfect replication of the original (raw) dataset. I sincerely appreciate.
Comment

Announcement