Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing values based on multiple conditions

    Hi, I'm trying to get my head around how to best perform this operation. Here is a sample of my dataset:
    Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year
    FRA 500 5 . . 8 1 2014
    FRA 400 5 Cost 7 . . 1 2014
    FRA 500 7 LOS . 6 . 1 2014
    GBR 600 3 . 10 9 . A 2014
    IRE 700 4 . 12 12 . B 2014
    ITA 800 4 . 20 22 . C 2014
    ISR 550 6 . 13 11 . D 2014

    I basically have a panel dataset per country, ID and year. I need to conform France's data to that of all the other observations -- this means that I need to condense those 3 observations into one, where cost = cost if threshold = cost, and LOS = LOS for threshold = LOS. Same for the number of observations.

    I am having a hard time coming up with a formula that makes sense and doesn't erase data from other observations. The variable threshold exists only for France, but the others are shared across all observations.

    Thanks in advance for your help!

  • #2
    It is not quite clear to me what you want. Can you show us what the dataset should look like after the transformation? Perhaps you can use underlining, bold letters, or color to show exactly what should go where!

    Comment


    • #3
      Sure:

      Before
      Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year
      FRA 500 5 . . 8 1 2014
      FRA 400 5 Cost 7 . . 1 2014
      FRA 500 7 LOS . 6 . 1 2014
      GBR 600 3 . 10 9 . A 2014
      IRE 700 4 . 12 12 . B 2014
      ITA 800 4 . 20 22 . C 2014
      ISR 550 6 . 13 11 . D 2014
      After:
      Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year
      FRA 400 7 . 7 6 . 1 2014
      GBR 600 3 . 10 9 . A 2014
      IRE 700 4 . 12 12 . B 2014
      ITA 800 4 . 20 22 . C 2014
      ISR 550 6 . 13 11 . D 2014

      Performed on all id's and years for France. Does this make it clearer?

      Comment


      • #4
        you could try to drop all observations except those of France (save beforehand, of course) and then use the reshape command. But maybe we can do it easier:

        drop if Threshold == .
        replace LOS = 0 if Threshold == "Cost"
        and
        replace Cost = 0 if Threshold == "LOS"

        Then it should look like this (changes in blue):
        Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year
        FRA 400 0 Cost 7 . . 1 2014
        FRA 0 7 LOS . 6 . 1 2014
        and then, try to sum them vertically.
        That would go along (not sure about precise code):
        new_Cost = sum (Cost) by Country ID Year
        new_LOS = sum (LOS) by Country ID Year
        new_N_cases cost = sum (N_cases cost) by Country ID Year
        new_N_cases LOS = sum (N_cases LOS) by Country ID Year

        it would look like this:
        Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year new_Cost new_LOS new_N_cases cost new_N_cases LOS
        FRA 400 0 Cost 7 . . 1 2014 400 7 7 6
        FRA 0 7 LOS . 6 . 1 2014 400 7 7 6

        perhaps you have to replace the missing values with zeros, or the other way round, before adding.
        Finally, drop the old variables and rename the new variables with the names of the ones you just dropped.
        Country Threshold N_cases ID Year Cost LOS N_cases cost N_cases LOS
        FRA Cost . 1 2014 400 7 7 6
        FRA LOS . 1 2014 400 7 7 6
        and then drop duplicates.

        This only works if Country with ID and Year uniquely identifies each set of observations for which you want this, i.e. for every ID and year you have exactly three such lines in your dataset.

        If I am not mistaken, you do not want the first row's content, right? You don't need N_cases? My code requires that the observation with that is dropped or set to zero.
        Last edited by Max Piper; 11 Jan 2018, 08:48.

        Comment


        • #5
          Something like the following untested code might be a start on the code you need. I am assuming that your variable Threshold is a string variable rather than a numeric variable with value labels.
          Code:
          bysort country ID year: egen nc = max(cond(Threshold=="Cost",Cost,.))
          bysort country ID year: replace Cost = nc if _N>1
          bysort country ID year: drop if _n>1
          drop nc

          Comment


          • #6
            collapse would give out a concise solution, or am I misunderstanding something?
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str3 country int cost byte los str4 threshold byte(n_casescost n_caseslos n_cases) str1 id int year
            "FRA" 500 5 "."     .  . 8 "1" 2014
            "FRA" 400 5 "Cost"  7  . . "1" 2014
            "FRA" 500 7 "LOS"   .  6 . "1" 2014
            "GBR" 600 3 "."    10  9 . "A" 2014
            "IRE" 700 4 "."    12 12 . "B" 2014
            "ITA" 800 4 "."    20 22 . "C" 2014
            "ISR" 550 6 "."    13 11 . "D" 2014
            end
            
            replace cost = . if threshold != "Cost"
            replace los = . if threshold != "LOS"
            collapse (firstnm) cost los n_casescost n_caseslos, by (country id year)
            
            . list, noobs
            
              +--------------------------------------------------------+
              | country   id   year   cost   los   n_case~t   n_case~s |
              |--------------------------------------------------------|
              |     FRA    1   2014    400     7          7          6 |
              |     GBR    A   2014      .     .         10          9 |
              |     IRE    B   2014      .     .         12         12 |
              |     ISR    D   2014      .     .         13         11 |
              |     ITA    C   2014      .     .         20         22 |
              +--------------------------------------------------------+
            Last edited by Romalpa Akzo; 11 Jan 2018, 17:40.

            Comment


            • #7
              Romalpa Akzo Yours would be a more complete solution if it left the other countries unchanged, as Alberto Camus suggested in post #3. I think if you modified your replace commands to something like
              Code:
              by country id year: replace cost = . if threshold != "Cost" & _N>1
              your solution would include the other countries.

              Another possible problem is that collapse changes variable labels.

              Comment


              • #8
                William, thanks for your comments. Anyhow, in my understanding, FRA is just an example, whilst what Alberto needs is the same solution for every country, .

                If he actually might in need only for FRA, the replace commands should be
                Code:
                 replace cost =. if threshold != "Cost" & country =="FRA"
                Your suggested code, while keeping info for other countries, seems providing the incorrect output for FRA itself.
                Last edited by Romalpa Akzo; 11 Jan 2018, 18:16.

                Comment


                • #9
                  Romialpa, perhaps you misunderstand what I posted, which was a proposal of a direction for Alberto Camus to pursue, not a complete solution.

                  Below is my sample code, which only handled the Cost variable, tested on the sample data for Cost, with a second country XYZ to demonstrate that it works for multiple countries with multiple observations, as well as those Country/ID/Year combinations that have just a single observation.

                  Extending my code to handle the other three variables is straightforward.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str3 Country int Cost str4 Threshold str1 ID int Year
                  "FRA" 500 "."    "1" 2014
                  "FRA" 400 "Cost" "1" 2014
                  "FRA" 500 "LOS"  "1" 2014
                  "GBR" 600 "."    "A" 2014
                  "IRE" 700 "."    "B" 2014
                  "ISR" 550 "."    "D" 2014
                  "ITA" 800 "."    "C" 2014
                  "XYZ" 510 "."    "2" 2013
                  "XYZ" 410 "Cost" "2" 2013
                  "XYZ" 510 "LOS"  "2" 2013
                  end
                  bysort Country ID Year: egen nc = max(cond(Threshold=="Cost",Cost,.))
                  bysort Country ID Year: replace Cost = nc if _N>1
                  bysort Country ID Year: drop if _n>1
                  drop nc Threshold
                  list, clean noobs
                  which yields
                  Code:
                  . list, clean noobs
                  
                      Country   Cost   ID   Year  
                          FRA    400    1   2014  
                          GBR    600    A   2014  
                          IRE    700    B   2014  
                          ISR    550    D   2014  
                          ITA    800    C   2014  
                          XYZ    410    2   2013

                  Comment


                  • #10
                    Thanks everyone for your suggestions. A needed clarification that might help give some context to the issue:

                    What I'm working with is trimmed and untrimmed data. In my dataset, every country provides a trimmed observation where Cost is trimmed on costs (e.g. costs < X are truncated) and LOS is trimmed on length of stay (LOS < Y are truncated), along with their number of cases for each trim. They provide these values on one single observation (as shown in my example).

                    The only country that doesn't do it is France, which provides the cost and LOS trimmed values on two separate lines. I have separate worksheets for the dataset, therefore I can also work individually on France and then append it to the main dataset, if the provided solution is simpler.

                    Now, this introduces a potential problem that I didn't consider mentioning earlier -- that is, the ID and year are not unique to the trimmed data. What this means is that the dataset actually looks more like this:
                    Country Cost LOS Threshold N_cases cost N_cases LOS N_cases ID Year Trimmed
                    FRA 500 5 . . 8 1 2014 No
                    FRA 400 5 Cost 7 . . 1 2014 Yes
                    FRA 500 7 LOS . 6 . 1 2014 Yes
                    GBR 700 4 . . . 11 A 2014 No
                    GBR 600 3 . 10 9 . A 2014 Yes
                    IRE 750 5 . . . 13 B 2014 No
                    IRE 700 4 . 12 12 . B 2014 Yes
                    ITA 900 6 . . . 25 C 2014 No
                    ITA 800 4 . 20 22 . C 2014 Yes
                    ISR 600 7 . . . 14 D 2014 No
                    ISR 550 6 . 13 11 . D 2014 Yes
                    As I said, any country besides France can be ignored. They are well formatted and they serve only as a reference as to how the final data for France should look like. This means that the untrimmed data should not be lost in the process of merging the trimmed observations.

                    From your contributions, the possible solution I am seeing now is this:

                    Code:
                    import excel France.xls
                    drop if threshold !=.
                    append using Master.dta
                    save Master.dta, replace
                    clear all
                    
                    import excel France.xls
                    replace cost = . if threshold != "Cost"
                    replace los = . if threshold != "LOS"
                    collapse (firstnm) cost los n_casescost n_caseslos
                    append using Master.dta
                    save Master.dta, replace
                    Although not very sexy, this should work, correct? The first part of the code allows me to append the untrimmed data, while the second part uses the code provided by Romalpa to collapse the trimmed data into one observation (and I don't care about losing the untrimmed one since it's already appended to the main dataset).

                    Comment


                    • #11

                      1. William, I got your point now, it does make sense in assumption that Alberto need to keep info for other country. May I thanks again for your further clarification.

                      2. Alberto, there are something suspicious in your code, for example: threshold !=. will not work since your threshold seems to be a string variable.

                      My suggestion is:
                      Code:
                      import excel "France.xls", firstrow clear
                      replace Cost = . if Threshold != "Cost" & Trimmed =="Yes"
                      replace LOS = . if Threshold != "LOS" & Trimmed =="Yes"
                      collapse (firstnm) Cost LOS N_casescost N_casesLOS N_cases, by (Country ID Year Trimmed)
                      append using "Master.dta"
                      save "Master.dta", replace
                      By the way, in the future, please read FAQ of the forum to have better post, whereas, using dataex (to give a small sample of your data) is strongly recommended.
                      Last edited by Romalpa Akzo; 12 Jan 2018, 04:46.

                      Comment


                      • #12
                        Originally posted by Alberto Camus View Post
                        As I said, any country besides France can be ignored. They are well formatted and they serve only as a reference as to how the final data for France should look like. This means that the untrimmed data should not be lost in the process of merging the trimmed observations.
                        You could export all countries to another file, then drop them in the file you're currently working on, do whatever you want to do on France, and then merge the other countries again.

                        If you have a solution that works on France, then whatever this solution does to the other countries can be ignored, because you then just have to combine the adjusted French data with the previous version of the other countries' data, and voilĂ , as they say.

                        Comment


                        • #13
                          Thank you again both for your help. As you said, Max, I just worked on a file with France and imported it into the main one at a later stage.

                          Romalpa, your code worked perfectly. For anyone browsing this at a later stage, remember that this line
                          Code:
                          collapse (firstnm) Cost LOS N_casescost N_casesLOS N_cases, by (Country ID Year Trimmed)
                          Must contain all of your other variables that don't need to be collapsed, otherwise they will be dropped altogether (losing many observations in the process). My working code ended up being:

                          Code:
                          import excel "France.xls", sheet("data") firstrow clear
                          sort Country id Year
                          
                          drop if Threshold == "Both"
                          replace ACOST = . if Threshold != "Cost" & Trimmed == "Yes"
                          replace ALOS = . if Threshold != "LOS" & Trimmed == "Yes"
                          collapse (firstnm) ACOST ALOS Nb_cases_TC Nb_cases_TL, by (Country id Year Trimmed Type Code_condition Code_subset Nb_cases)
                          
                          save Master.dta, replace
                          Romalpa, I was aware of dataex, but it required admin approval for installation at work and I needed a quick solution. Thanks for pointing it out though.

                          Comment

                          Working...
                          X