Vertical Data into Horizontal Data with Duplicate IDs and Create New Column Variables

Eric Bartlett

Join Date: Dec 2015
Posts: 6

Vertical Data into Horizontal Data with Duplicate IDs and Create New Column Variables

01 Dec 2015, 11:59

Hi,

I am attempting to merge automotive retail sales data with automotive service data so that I can see how buyers act post purchase. The only unique identifier between the service data and the sales data is the VIN. The issue is that many of the buyers/owners have come in for service on more than one occasion so there are multiple records with the same VIN.

I can (and already have) performed some analysis by removing duplicate VINs and drawing some basic insights into those who have come in for service (at least once) and those who have not come in at all.

However, I really would like to re-organize the data so that I can get more clear insights into those who have come in for service (ie. avg number of service visits (conditional on numerous factors). the amount of time between their visits (conditional on numerous factors), mileage between services, etc. etc. etc.). Further, this would allow me to run logistic regressions estimating odds of coming in for a next service based on difference variables.

Currently the data is organized vertically, so that it is all within a few columns. I would like to change this structure so that second service visits (and third, fourth, fifth, etc. etc.) and their corresponding information are moved to new columns and in a row that corresponds to the VIN.

Below is a basic example of what I would like to accomplish:

I want to turn this:

veh_id	mileage	service_date
ABC	8000	14Jan14
DEF	7500	15Jan14
GHI	6900	19Jan14
JKL	8200	22Jan14
MNO	7900	30Jan14
ABC	14000	23Jun14
DEF	13000	28Aug14
GHI	14150	23Jul14
JKL	16200	27May14
MNO	13625	6Nov14
ABC	20000	30Nov14
DEF	18500	10Apr15
GHI	21400	24Jan15
JKL	24200	29Sep14
MNO	19350	13Aug15

Into This:

veh_id	mileage1	service_date1	mileage2	service_date2	Mileage3	service_date3
ABC	8000	14Jan14	14000	23Jun14	20000	30Nov14
DEF	7500	15Jan14	13000	28Aug14	18500	10Apr15
GHI	6900	19Jan14	14150	23Jul14	21400	24Jan15
JKL	8200	22Jan14	16200	27May14	24200	29Sep14
MNO	7900	30Jan14	13625	6Nov14	19350	13Aug15

Thank you in advance of any help you can provide!

Best,
Eric

2 Photos

Tags: None

Eric Bartlett

Join Date: Dec 2015

Posts: 6
#2

01 Dec 2015, 12:07

After thinking about this, I suppose I am trying to change time series data into panel data so that it can be merged with another data set and avoid a non-unique identifier error
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

01 Dec 2015, 12:17

Stata advice so common as to be called standard is that your data structure is fine. If you want to merge with anything else you just need to merge on identifier and date. Even in Stata a wide shape is more awkward than otherwise.

But it can be done.

Code:

clear
input str3 veh_id mileage str7 Sservice_date
ABC    8000    14Jan14
DEF    7500    15Jan14
GHI    6900    19Jan14
JKL    8200    22Jan14
MNO    7900    30Jan14
ABC    14000    23Jun14
DEF    13000    28Aug14
GHI    14150    23Jul14
JKL    16200    27May14
MNO    13625    6Nov14
ABC    20000    30Nov14
DEF    18500    10Apr15
GHI    21400    24Jan15
JKL    24200    29Sep14
MNO    19350    13Aug15
end
gen service_date = daily(Sservice_date, "DMY", 2050)
bysort veh_id (service_date) : gen time = _n
drop S
reshape wide mileage service_date, i(veh_id) j(time)
list

     +--------------------------------------------------------------------------+
     | veh_id   mileage1   servic~1   mileage2   servic~2   mileage3   servic~3 |
     |--------------------------------------------------------------------------|
  1. |    ABC       8000      19737      14000      19897      20000      20057 |
  2. |    DEF       7500      19738      13000      19963      18500      20188 |
  3. |    GHI       6900      19742      14150      19927      21400      20112 |
  4. |    JKL       8200      19745      16200      19870      24200      19995 |
  5. |    MNO       7900      19753      13625      20033      19350      20313 |
     +--------------------------------------------------------------------------+

NB: please use dataex (SSC) for your examples. Your date listing needed surgery. See FAQ Advice #12.

Announcement

Vertical Data into Horizontal Data with Duplicate IDs and Create New Column Variables

Comment

Comment