Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapse by a variable, while preserving others and creating new variables

    Hi,

    I have a dataset on property transactions (a sample below, for those of you from the UK, this is from Land Registry). My data contains properties that have been sold twice in the time period.

    I wish to collapse the two observations per property (first and second sale) into a single observation. Some variables are obviously the same for each transaction, e.g. Prop_Type, Old_New etc, so I would like these to collapse into one (forgive me if I am loose on terminology). Other variables are different within each property however, e.g. Price, Date, and Holiday (referring to the SDLT holiday in the UK). As these are different, I would like to somehow create a new variable within the collapsed observation denoting the value taken in the second transaction of the property, if different to the value taken in the first transaction of the property (essentially a "Holiday 2" variable).

    Is there a way of doing this using the collapse command or is this a little too fiddly for that?

    Thanks for your help,

    Jack



    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long Price str17 Date str1(Prop_Type Old_New Duration) str35 Locality str22 City str35 County byte(Freehold Leasehold) float Holiday str77 Address
    345000 "08 April 2019"    "D" "N" "F" "STOKE ST MICHAEL" "RADSTOCK"   "SOMERSET"                   1 0 0 "(TUCKER)  WITHYBROOK BA3 5JQ"            
    352000 "27 August 2020"   "D" "N" "F" "STOKE ST MICHAEL" "RADSTOCK"   "SOMERSET"                   1 0 1 "(TUCKER)  WITHYBROOK BA3 5JQ"            
    382500 "02 December 2019" "F" "N" "L" ""                 "LONDON"     "GREATER LONDON"             0 1 0 "06-Oct FLAT 15 ABERDEEN PARK N5 2BN"     
    412000 "09 November 2020" "F" "N" "L" ""                 "LONDON"     "GREATER LONDON"             0 1 1 "06-Oct FLAT 15 ABERDEEN PARK N5 2BN"     
    150000 "30 July 2019"     "F" "N" "L" ""                 "FOLKESTONE" "KENT"                       0 1 0 "07-Sep FLAT 3 LANGHORNE GARDENS CT20 2EA"
    180000 "16 March 2020"    "F" "N" "L" ""                 "FOLKESTONE" "KENT"                       0 1 0 "07-Sep FLAT 3 LANGHORNE GARDENS CT20 2EA"
     57001 "20 February 2020" "S" "N" "F" ""                 "HULL"       "CITY OF KINGSTON UPON HULL" 1 0 0 "1  9TH AVENUE HU6 8HU"                   
    102000 "10 November 2020" "S" "N" "F" ""                 "HULL"       "CITY OF KINGSTON UPON HULL" 1 0 1 "1  9TH AVENUE HU6 8HU"                   
    312500 "05 July 2019"     "D" "N" "F" "BOROUGHBRIDGE"    "YORK"       "NORTH YORKSHIRE"            1 0 0 "1  ABEL CLOSE YO51 9US"                  
    299000 "05 July 2019"     "D" "N" "F" "BOROUGHBRIDGE"    "YORK"       "NORTH YORKSHIRE"            1 0 0 "1  ABEL CLOSE YO51 9US"                  
     92000 "28 January 2019"  "S" "N" "F" ""                 "LEEDS"      "WEST YORKSHIRE"             1 0 0 "1  ABERFIELD GARDENS LS10 3QL"           
    115000 "08 November 2019" "S" "N" "F" ""                 "LEEDS"      "WEST YORKSHIRE"             1 0 0 "1  ABERFIELD GARDENS LS10 3QL"           
    212500 "21 August 2020"   "S" "N" "F" ""                 "TROWBRIDGE" "WILTSHIRE"                  1 0 1 "1  ADCROFT DRIVE BA14 8PU"               
    210000 "27 August 2020"   "S" "N" "F" ""                 "TROWBRIDGE" "WILTSHIRE"                  1 0 1 "1  ADCROFT DRIVE BA14 8PU"               
    146000 "26 July 2019"     "T" "N" "F" ""                 "LEICESTER"  "LEICESTER"                  1 0 0 "1  ADELAIDE CLOSE LE4 2NZ"               
    153000 "04 October 2019"  "T" "N" "F" ""                 "LEICESTER"  "LEICESTER"                  1 0 0 "1  ADELAIDE CLOSE LE4 2NZ"               
    113000 "02 October 2020"  "S" "N" "F" "MILE OAK"         "TAMWORTH"   "STAFFORDSHIRE"              1 0 1 "1  AFFLECK AVENUE B78 3NJ"               
    130000 "26 November 2020" "S" "N" "F" "MILE OAK"         "TAMWORTH"   "STAFFORDSHIRE"              1 0 1 "1  AFFLECK AVENUE B78 3NJ"               
    end

  • #2
    The only difficulty here is the lack of a variable identifying the separate properties. It appears in your example that the observations for the same property are always shown in consecutive observations given in chronological order. If that is true, then it is fairly simple. First create a property ID variable, then -collapse- everything on property id, always retaining the final observation.

    Code:
    gen long property_id = ceil(_n/2)
    ds property_id, not
    collapse (last) `r(varlist)', by(property_id)
    If, however, in the full data set, there are some properties that only appear once, or others that appear more than twice, the above approach will produce seriously incorrect results. In that case, you need to explain how to identify when two observations apply to the same property and when they don't. I would guess that the combination of Address, County, City, and Locality will do that--but I would need you to confirm or correct that.

    Also, in the example data, the observations for the same property are always in chronological order. If that is not true throughout your data set, then before the -ds- and -collapse- commands you would need to sort them chronologically. Given that your dates are in a string variable, you would first need to convert that to a Stata internal format date variable by applying the -daily()- function so that you can get the correct sort order. (Sorting those string dates will put them in a non-chronological order.)

    Comment


    • #3
      An alternative approach:

      Code:
      ds Price Date Holiday,not
      foreach var of varlist Price Date Holiday{
          bysort `r(varlist)': gen `var'2 = `var'[_N] if _n!=_N
      }
      bysort `r(varlist)': drop if _n!=1
      
      . list Price* Date* Holiday*,sep(0) noobs
      
        +----------------------------------------------------------------------------+
        |  Price   Price2               Date              Date2   Holiday   Holiday2 |
        |----------------------------------------------------------------------------|
        | 299000   312500       05 July 2019       05 July 2019         0          0 |
        | 345000   352000      08 April 2019     27 August 2020         0          1 |
        | 150000   180000       30 July 2019      16 March 2020         0          0 |
        | 382500   412000   02 December 2019   09 November 2020         0          1 |
        |  57001   102000   20 February 2020   10 November 2020         0          1 |
        | 115000    92000   08 November 2019    28 January 2019         0          0 |
        | 212500   210000     21 August 2020     27 August 2020         1          1 |
        | 130000   113000   26 November 2020    02 October 2020         1          1 |
        | 153000   146000    04 October 2019       26 July 2019         0          0 |
        +----------------------------------------------------------------------------+
      This assumes that the only variables which differ within a given property are Price, Date, and Holiday. If that's untrue, you just need to add the nonconstant variables to the ds command and the loop. It also assumes the properties appear at most twice. If a property appears once, the new variables will be missing. If a property appears more than twice, all but the first and last values of Price Date and Holiday will be ignored. Depending on if you're concerned with the sort order (i.e., which value of Price, Date, Holiday appears in the first variable vs the second variable), you might want to re-order the constant variables before the initial ds command.
      Last edited by Ali Atia; 08 Apr 2021, 11:23.

      Comment


      • #4
        Clyde and Ali, thank you very much for your suggestions.

        Comment


        • #5
          I have tried both methods.

          After changing the date format as suggested by Clyde and running the code, the data is collapsing by property but not creating the required "date2", "price2" "holiday2" variables.

          Ali's alternative suggestion creates the new variables in the cases where values for transactions within each property are different (date, price, holiday), but an error arises possibly due to chronological ordering. In your example code, Ali, there are observations where date2 is earlier than date (and price2 represents the price of transaction 1), e.g. rows 6, 8 and 9. Date 2, price 2, holiday 2 etc must always represent the values of the second, later transaction.

          It is essentially a swapping exercise, i.e. for each occasion date 2 is earlier than date, they must be swapped along with price and price 2. Is there a set of commands for this kind of thing, or is it a case of fiddling about with the sorting process before running the code? Would -swapval- be of any use here?

          Thanks again for your help,

          Jack

          Comment


          • #6
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long Price str17 Date str1(Prop_Type Old_New Duration) str35 Locality str22 City str35 County byte(Freehold Leasehold) float Holiday str77 Address
            345000 "08 April 2019"    "D" "N" "F" "STOKE ST MICHAEL" "RADSTOCK"   "SOMERSET"                   1 0 0 "(TUCKER)  WITHYBROOK BA3 5JQ"            
            352000 "27 August 2020"   "D" "N" "F" "STOKE ST MICHAEL" "RADSTOCK"   "SOMERSET"                   1 0 1 "(TUCKER)  WITHYBROOK BA3 5JQ"            
            382500 "02 December 2019" "F" "N" "L" ""                 "LONDON"     "GREATER LONDON"             0 1 0 "06-Oct FLAT 15 ABERDEEN PARK N5 2BN"     
            412000 "09 November 2020" "F" "N" "L" ""                 "LONDON"     "GREATER LONDON"             0 1 1 "06-Oct FLAT 15 ABERDEEN PARK N5 2BN"     
            150000 "30 July 2019"     "F" "N" "L" ""                 "FOLKESTONE" "KENT"                       0 1 0 "07-Sep FLAT 3 LANGHORNE GARDENS CT20 2EA"
            180000 "16 March 2020"    "F" "N" "L" ""                 "FOLKESTONE" "KENT"                       0 1 0 "07-Sep FLAT 3 LANGHORNE GARDENS CT20 2EA"
             57001 "20 February 2020" "S" "N" "F" ""                 "HULL"       "CITY OF KINGSTON UPON HULL" 1 0 0 "1  9TH AVENUE HU6 8HU"                   
            102000 "10 November 2020" "S" "N" "F" ""                 "HULL"       "CITY OF KINGSTON UPON HULL" 1 0 1 "1  9TH AVENUE HU6 8HU"                   
            312500 "05 July 2019"     "D" "N" "F" "BOROUGHBRIDGE"    "YORK"       "NORTH YORKSHIRE"            1 0 0 "1  ABEL CLOSE YO51 9US"                  
            299000 "05 July 2019"     "D" "N" "F" "BOROUGHBRIDGE"    "YORK"       "NORTH YORKSHIRE"            1 0 0 "1  ABEL CLOSE YO51 9US"                  
             92000 "28 January 2019"  "S" "N" "F" ""                 "LEEDS"      "WEST YORKSHIRE"             1 0 0 "1  ABERFIELD GARDENS LS10 3QL"           
            115000 "08 November 2019" "S" "N" "F" ""                 "LEEDS"      "WEST YORKSHIRE"             1 0 0 "1  ABERFIELD GARDENS LS10 3QL"           
            212500 "21 August 2020"   "S" "N" "F" ""                 "TROWBRIDGE" "WILTSHIRE"                  1 0 1 "1  ADCROFT DRIVE BA14 8PU"               
            210000 "27 August 2020"   "S" "N" "F" ""                 "TROWBRIDGE" "WILTSHIRE"                  1 0 1 "1  ADCROFT DRIVE BA14 8PU"               
            146000 "26 July 2019"     "T" "N" "F" ""                 "LEICESTER"  "LEICESTER"                  1 0 0 "1  ADELAIDE CLOSE LE4 2NZ"               
            153000 "04 October 2019"  "T" "N" "F" ""                 "LEICESTER"  "LEICESTER"                  1 0 0 "1  ADELAIDE CLOSE LE4 2NZ"               
            113000 "02 October 2020"  "S" "N" "F" "MILE OAK"         "TAMWORTH"   "STAFFORDSHIRE"              1 0 1 "1  AFFLECK AVENUE B78 3NJ"               
            130000 "26 November 2020" "S" "N" "F" "MILE OAK"         "TAMWORTH"   "STAFFORDSHIRE"              1 0 1 "1  AFFLECK AVENUE B78 3NJ"               
            end
            gen date = date(Date,"DMY")
            drop Date
            ds Price date Holiday,not
            foreach var of varlist Price date Holiday{
                bysort `r(varlist)' (date): gen `var'2 = `var'[_N] if _n!=_N
            }
            bysort `r(varlist)': drop if _n!=1
            format date* %td
            
            . list Price* date* Holiday*,sep(0) noobs
            
              +--------------------------------------------------------------+
              |  Price   Price2        date       date2   Holiday   Holiday2 |
              |--------------------------------------------------------------|
              | 299000   312500   05jul2019   05jul2019         0          0 |
              | 345000   352000   08apr2019   27aug2020         0          1 |
              | 150000   180000   30jul2019   16mar2020         0          0 |
              | 382500   412000   02dec2019   09nov2020         0          1 |
              |  57001   102000   20feb2020   10nov2020         0          1 |
              |  92000   115000   28jan2019   08nov2019         0          0 |
              | 212500   210000   21aug2020   27aug2020         1          1 |
              | 113000   130000   02oct2020   26nov2020         1          1 |
              | 146000   153000   26jul2019   04oct2019         0          0 |
              +--------------------------------------------------------------+

            Comment

            Working...
            X