Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Vertical Data into Horizontal Data with Duplicate IDs and Create New Column Variables

    Hi,

    I am attempting to merge automotive retail sales data with automotive service data so that I can see how buyers act post purchase. The only unique identifier between the service data and the sales data is the VIN. The issue is that many of the buyers/owners have come in for service on more than one occasion so there are multiple records with the same VIN.

    I can (and already have) performed some analysis by removing duplicate VINs and drawing some basic insights into those who have come in for service (at least once) and those who have not come in at all.

    However, I really would like to re-organize the data so that I can get more clear insights into those who have come in for service (ie. avg number of service visits (conditional on numerous factors). the amount of time between their visits (conditional on numerous factors), mileage between services, etc. etc. etc.). Further, this would allow me to run logistic regressions estimating odds of coming in for a next service based on difference variables.

    Currently the data is organized vertically, so that it is all within a few columns. I would like to change this structure so that second service visits (and third, fourth, fifth, etc. etc.) and their corresponding information are moved to new columns and in a row that corresponds to the VIN.

    Below is a basic example of what I would like to accomplish:

    I want to turn this:
    veh_id mileage service_date
    ABC 8000 14Jan14
    DEF 7500 15Jan14
    GHI 6900 19Jan14
    JKL 8200 22Jan14
    MNO 7900 30Jan14
    ABC 14000 23Jun14
    DEF 13000 28Aug14
    GHI 14150 23Jul14
    JKL 16200 27May14
    MNO 13625 6Nov14
    ABC 20000 30Nov14
    DEF 18500 10Apr15
    GHI 21400 24Jan15
    JKL 24200 29Sep14
    MNO 19350 13Aug15


    Into This:
    veh_id mileage1 service_date1 mileage2 service_date2 Mileage3 service_date3
    ABC 8000 14Jan14 14000 23Jun14 20000 30Nov14
    DEF 7500 15Jan14 13000 28Aug14 18500 10Apr15
    GHI 6900 19Jan14 14150 23Jul14 21400 24Jan15
    JKL 8200 22Jan14 16200 27May14 24200 29Sep14
    MNO 7900 30Jan14 13625 6Nov14 19350 13Aug15




    Thank you in advance of any help you can provide!

    Best,
    Eric


    Current Data Structure Desired Data Structure

  • #2
    After thinking about this, I suppose I am trying to change time series data into panel data so that it can be merged with another data set and avoid a non-unique identifier error

    Comment


    • #3
      Stata advice so common as to be called standard is that your data structure is fine. If you want to merge with anything else you just need to merge on identifier and date. Even in Stata a wide shape is more awkward than otherwise.

      But it can be done.

      Code:
      clear
      input str3 veh_id mileage str7 Sservice_date
      ABC    8000    14Jan14
      DEF    7500    15Jan14
      GHI    6900    19Jan14
      JKL    8200    22Jan14
      MNO    7900    30Jan14
      ABC    14000    23Jun14
      DEF    13000    28Aug14
      GHI    14150    23Jul14
      JKL    16200    27May14
      MNO    13625    6Nov14
      ABC    20000    30Nov14
      DEF    18500    10Apr15
      GHI    21400    24Jan15
      JKL    24200    29Sep14
      MNO    19350    13Aug15
      end
      gen service_date = daily(Sservice_date, "DMY", 2050)
      bysort veh_id (service_date) : gen time = _n
      drop S
      reshape wide mileage service_date, i(veh_id) j(time)
      list
      
           +--------------------------------------------------------------------------+
           | veh_id   mileage1   servic~1   mileage2   servic~2   mileage3   servic~3 |
           |--------------------------------------------------------------------------|
        1. |    ABC       8000      19737      14000      19897      20000      20057 |
        2. |    DEF       7500      19738      13000      19963      18500      20188 |
        3. |    GHI       6900      19742      14150      19927      21400      20112 |
        4. |    JKL       8200      19745      16200      19870      24200      19995 |
        5. |    MNO       7900      19753      13625      20033      19350      20313 |
           +--------------------------------------------------------------------------+
      NB: please use dataex (SSC) for your examples. Your date listing needed surgery. See FAQ Advice #12.

      Comment

      Working...
      X