Replacing values based on multiple conditions

Alberto Camus

Join Date: Oct 2014
Posts: 47

Replacing values based on multiple conditions

11 Jan 2018, 06:25

Hi, I'm trying to get my head around how to best perform this operation. Here is a sample of my dataset:

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year
FRA	500	5	.		.	8	1	2014
FRA	400	5	Cost	7	.	.	1	2014
FRA	500	7	LOS	.	6	.	1	2014
GBR	600	3	.	10	9	.	A	2014
IRE	700	4	.	12	12	.	B	2014
ITA	800	4	.	20	22	.	C	2014
ISR	550	6	.	13	11	.	D	2014

I basically have a panel dataset per country, ID and year. I need to conform France's data to that of all the other observations -- this means that I need to condense those 3 observations into one, where cost = cost if threshold = cost, and LOS = LOS for threshold = LOS. Same for the number of observations.

I am having a hard time coming up with a formula that makes sense and doesn't erase data from other observations. The variable threshold exists only for France, but the others are shared across all observations.

Thanks in advance for your help!

Tags: None

Max Piper

Join Date: Dec 2015

Posts: 61
#2

11 Jan 2018, 06:40

It is not quite clear to me what you want. Can you show us what the dataset should look like after the transformation? Perhaps you can use underlining, bold letters, or color to show exactly what should go where!
Comment

Alberto Camus

Join Date: Oct 2014
Posts: 47

11 Jan 2018, 06:49

Sure:

Before

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year
FRA	500	5	.		.	8	1	2014
FRA	400	5	Cost	7	.	.	1	2014
FRA	500	7	LOS	.	6	.	1	2014
GBR	600	3	.	10	9	.	A	2014
IRE	700	4	.	12	12	.	B	2014
ITA	800	4	.	20	22	.	C	2014
ISR	550	6	.	13	11	.	D	2014

After:

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year
FRA	400	7	.	7	6	.	1	2014
GBR	600	3	.	10	9	.	A	2014
IRE	700	4	.	12	12	.	B	2014
ITA	800	4	.	20	22	.	C	2014
ISR	550	6	.	13	11	.	D	2014

Performed on all id's and years for France. Does this make it clearer?

Comment

Max Piper

Join Date: Dec 2015
Posts: 61

11 Jan 2018, 08:42

you could try to drop all observations except those of France (save beforehand, of course) and then use the reshape command. But maybe we can do it easier:

drop if Threshold == .
replace LOS = 0 if Threshold == "Cost"
and
replace Cost = 0 if Threshold == "LOS"

Then it should look like this (changes in blue):

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year

FRA	400	0	Cost	7	.	.	1	2014
FRA	0	7	LOS	.	6	.	1	2014

and then, try to sum them vertically.
That would go along (not sure about precise code):
new_Cost = sum (Cost) by Country ID Year
new_LOS = sum (LOS) by Country ID Year
new_N_cases cost = sum (N_cases cost) by Country ID Year
new_N_cases LOS = sum (N_cases LOS) by Country ID Year

it would look like this:

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year	new_Cost	new_LOS	new_N_cases cost	new_N_cases LOS

FRA	400	0	Cost	7	.	.	1	2014	400	7	7	6
FRA	0	7	LOS	.	6	.	1	2014	400	7	7	6

perhaps you have to replace the missing values with zeros, or the other way round, before adding.
Finally, drop the old variables and rename the new variables with the names of the ones you just dropped.

Country	Threshold	N_cases	ID	Year	Cost	LOS	N_cases cost	N_cases LOS
FRA	Cost	.	1	2014	400	7	7	6
FRA	LOS	.	1	2014	400	7	7	6

and then drop duplicates.

This only works if Country with ID and Year uniquely identifies each set of observations for which you want this, i.e. for every ID and year you have exactly three such lines in your dataset.

If I am not mistaken, you do not want the first row's content, right? You don't need N_cases? My code requires that the observation with that is dropped or set to zero.

Last edited by Max Piper; 11 Jan 2018, 08:48.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

11 Jan 2018, 11:19

Something like the following untested code might be a start on the code you need. I am assuming that your variable Threshold is a string variable rather than a numeric variable with value labels.

Code:

bysort country ID year: egen nc = max(cond(Threshold=="Cost",Cost,.)) bysort country ID year: replace Cost = nc if _N>1 bysort country ID year: drop if _n>1 drop nc
Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

11 Jan 2018, 17:18

collapse would give out a concise solution, or am I misunderstanding something?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 country int cost byte los str4 threshold byte(n_casescost n_caseslos n_cases) str1 id int year
"FRA" 500 5 "."     .  . 8 "1" 2014
"FRA" 400 5 "Cost"  7  . . "1" 2014
"FRA" 500 7 "LOS"   .  6 . "1" 2014
"GBR" 600 3 "."    10  9 . "A" 2014
"IRE" 700 4 "."    12 12 . "B" 2014
"ITA" 800 4 "."    20 22 . "C" 2014
"ISR" 550 6 "."    13 11 . "D" 2014
end

replace cost = . if threshold != "Cost"
replace los = . if threshold != "LOS"
collapse (firstnm) cost los n_casescost n_caseslos, by (country id year)

. list, noobs

  +--------------------------------------------------------+
  | country   id   year   cost   los   n_case~t   n_case~s |
  |--------------------------------------------------------|
  |     FRA    1   2014    400     7          7          6 |
  |     GBR    A   2014      .     .         10          9 |
  |     IRE    B   2014      .     .         12         12 |
  |     ISR    D   2014      .     .         13         11 |
  |     ITA    C   2014      .     .         20         22 |
  +--------------------------------------------------------+

Last edited by Romalpa Akzo; 11 Jan 2018, 17:40.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

11 Jan 2018, 17:46

Romalpa Akzo Yours would be a more complete solution if it left the other countries unchanged, as Alberto Camus suggested in post #3. I think if you modified your replace commands to something like

Code:

by country id year: replace cost = . if threshold != "Cost" & _N>1

your solution would include the other countries.

Another possible problem is that collapse changes variable labels.
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#8

11 Jan 2018, 18:13

William, thanks for your comments. Anyhow, in my understanding, FRA is just an example, whilst what Alberto needs is the same solution for every country, .

If he actually might in need only for FRA, the replace commands should be

Code:

replace cost =. if threshold != "Cost" & country =="FRA"

Your suggested code, while keeping info for other countries, seems providing the incorrect output for FRA itself.

Last edited by Romalpa Akzo; 11 Jan 2018, 18:16.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Jan 2018, 18:49

Romialpa, perhaps you misunderstand what I posted, which was a proposal of a direction for Alberto Camus to pursue, not a complete solution.

Below is my sample code, which only handled the Cost variable, tested on the sample data for Cost, with a second country XYZ to demonstrate that it works for multiple countries with multiple observations, as well as those Country/ID/Year combinations that have just a single observation.

Extending my code to handle the other three variables is straightforward.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 Country int Cost str4 Threshold str1 ID int Year
"FRA" 500 "."    "1" 2014
"FRA" 400 "Cost" "1" 2014
"FRA" 500 "LOS"  "1" 2014
"GBR" 600 "."    "A" 2014
"IRE" 700 "."    "B" 2014
"ISR" 550 "."    "D" 2014
"ITA" 800 "."    "C" 2014
"XYZ" 510 "."    "2" 2013
"XYZ" 410 "Cost" "2" 2013
"XYZ" 510 "LOS"  "2" 2013
end
bysort Country ID Year: egen nc = max(cond(Threshold=="Cost",Cost,.))
bysort Country ID Year: replace Cost = nc if _N>1
bysort Country ID Year: drop if _n>1
drop nc Threshold
list, clean noobs

which yields

Code:

. list, clean noobs

    Country   Cost   ID   Year  
        FRA    400    1   2014  
        GBR    600    A   2014  
        IRE    700    B   2014  
        ISR    550    D   2014  
        ITA    800    C   2014  
        XYZ    410    2   2013

Comment

Alberto Camus

Join Date: Oct 2014
Posts: 47

#10

12 Jan 2018, 03:40

Thanks everyone for your suggestions. A needed clarification that might help give some context to the issue:

What I'm working with is trimmed and untrimmed data. In my dataset, every country provides a trimmed observation where Cost is trimmed on costs (e.g. costs < X are truncated) and LOS is trimmed on length of stay (LOS < Y are truncated), along with their number of cases for each trim. They provide these values on one single observation (as shown in my example).

The only country that doesn't do it is France, which provides the cost and LOS trimmed values on two separate lines. I have separate worksheets for the dataset, therefore I can also work individually on France and then append it to the main dataset, if the provided solution is simpler.

Now, this introduces a potential problem that I didn't consider mentioning earlier -- that is, the ID and year are not unique to the trimmed data. What this means is that the dataset actually looks more like this:

Country	Cost	LOS	Threshold	N_cases cost	N_cases LOS	N_cases	ID	Year	Trimmed
FRA	500	5	.		.	8	1	2014	No
FRA	400	5	Cost	7	.	.	1	2014	Yes
FRA	500	7	LOS	.	6	.	1	2014	Yes
GBR	700	4	.	.	.	11	A	2014	No
GBR	600	3	.	10	9	.	A	2014	Yes
IRE	750	5	.	.	.	13	B	2014	No
IRE	700	4	.	12	12	.	B	2014	Yes
ITA	900	6	.	.	.	25	C	2014	No
ITA	800	4	.	20	22	.	C	2014	Yes
ISR	600	7	.	.	.	14	D	2014	No
ISR	550	6	.	13	11	.	D	2014	Yes

As I said, any country besides France can be ignored. They are well formatted and they serve only as a reference as to how the final data for France should look like. This means that the untrimmed data should not be lost in the process of merging the trimmed observations.

From your contributions, the possible solution I am seeing now is this:

Code:

import excel France.xls
drop if threshold !=.
append using Master.dta
save Master.dta, replace
clear all

import excel France.xls
replace cost = . if threshold != "Cost"
replace los = . if threshold != "LOS"
collapse (firstnm) cost los n_casescost n_caseslos
append using Master.dta
save Master.dta, replace

Although not very sexy, this should work, correct? The first part of the code allows me to append the untrimmed data, while the second part uses the code provided by Romalpa to collapse the trimmed data into one observation (and I don't care about losing the untrimmed one since it's already appended to the main dataset).

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#11

12 Jan 2018, 04:27

1. William, I got your point now, it does make sense in assumption that Alberto need to keep info for other country. May I thanks again for your further clarification.

2. Alberto, there are something suspicious in your code, for example: threshold !=. will not work since your threshold seems to be a string variable.

My suggestion is:

Code:

import excel "France.xls", firstrow clear replace Cost = . if Threshold != "Cost" & Trimmed =="Yes" replace LOS = . if Threshold != "LOS" & Trimmed =="Yes" collapse (firstnm) Cost LOS N_casescost N_casesLOS N_cases, by (Country ID Year Trimmed) append using "Master.dta" save "Master.dta", replace

By the way, in the future, please read FAQ of the forum to have better post, whereas, using dataex (to give a small sample of your data) is strongly recommended.

Last edited by Romalpa Akzo; 12 Jan 2018, 04:46.
2 likes
Comment
Max Piper

Join Date: Dec 2015

Posts: 61
#12

12 Jan 2018, 04:46

Originally posted by Alberto Camus View Post

As I said, any country besides France can be ignored. They are well formatted and they serve only as a reference as to how the final data for France should look like. This means that the untrimmed data should not be lost in the process of merging the trimmed observations.

You could export all countries to another file, then drop them in the file you're currently working on, do whatever you want to do on France, and then merge the other countries again.

If you have a solution that works on France, then whatever this solution does to the other countries can be ignored, because you then just have to combine the adjusted French data with the previous version of the other countries' data, and voilà, as they say.
Comment
Alberto Camus

Join Date: Oct 2014

Posts: 47
#13

12 Jan 2018, 06:56

Thank you again both for your help. As you said, Max, I just worked on a file with France and imported it into the main one at a later stage.

Romalpa, your code worked perfectly. For anyone browsing this at a later stage, remember that this line

Code:

collapse (firstnm) Cost LOS N_casescost N_casesLOS N_cases, by (Country ID Year Trimmed)

Must contain all of your other variables that don't need to be collapsed, otherwise they will be dropped altogether (losing many observations in the process). My working code ended up being:

Code:

import excel "France.xls", sheet("data") firstrow clear sort Country id Year drop if Threshold == "Both" replace ACOST = . if Threshold != "Cost" & Trimmed == "Yes" replace ALOS = . if Threshold != "LOS" & Trimmed == "Yes" collapse (firstnm) ACOST ALOS Nb_cases_TC Nb_cases_TL, by (Country id Year Trimmed Type Code_condition Code_subset Nb_cases) save Master.dta, replace

Romalpa, I was aware of dataex, but it required admin approval for installation at work and I needed a quick solution. Thanks for pointing it out though.
Comment

Announcement

Replacing values based on multiple conditions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment