use one variable to search another

Anna Heard

Join Date: Mar 2022

Posts: 8
#1

use one variable to search another

25 Mar 2022, 10:33

Hi. I'm trying to take the values in one variable and find them in another variable:
There is a variable called "Replacement Package". It contains a value if that item was replaced, it is blank if it was not replaced.
I would like to take the replacement package and find the "Package ID" that matches it and then label that "Package ID" as a "replacement"

In other words, the dataset is a list of "Packages" (with ID numbers) some of them are replacement packages and some of them are not. When a package is replaced, a second variable captures which Package replaced it, and then that Package is entered into the list of Packages, but there is no variable that labels a package as a "replacement".

How can I create this variable?
Thanks
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

25 Mar 2022, 12:20

Describing data sets in words is difficult at best, and almost never sufficiently successful to enable somebody to write code for a problem. That is why Stata has the -dataex- command. Please use it to post back with an actual example from your actual Stata data set. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

I suspect if you do that, it will be fairly simple for somebody to help you.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#3

29 Mar 2022, 19:11

good point.
The data look like:
PackageID var1 var2....replacement_package
1. abcd x y
2. bcde x y
3. cdef x y defg
4. defg x y
5. efgh x y
6. fghi x y ijkl
7. ghij x y
8. hijk x y
9. ijkl x y

I want to create a variable, "IsAReplacement" that labels PackageID's ijkl and defg as "replacement packages", but they are only a variable in the line for another ID.
(In the real dataset the PackageID and Replacement_Package are string variables with about 8 numbers or letters starting with 3B) (for those that are not a replacement, the variable can contain either 0 or . and for those that are a replacement it should be a 1

PackageID var1 var2....replacement_package IsAReplacement?
1. abcd x y 0
2. bcde x y 0
3. cdef x y defg 0
4. defg x y 1
5. efgh x y 0
6. fghi x y ijkl 0
7. ghij x y 0
8. hijk x y 0
9. ijkl x y 1

Thanks!!!
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#4

29 Mar 2022, 19:17

ok, those spaces didn't remain. Column1: PackageID; Column2: Var1, Column3: Var2....Column...+1: replacement_package Column...+2: IsAReplacement
Those without a letter string after "y" are blank in the "replacement_package" column.
Let me try that again?
good point.
The data look like:
PackageID var1 var2....replacement_package
1. abcd x y
2. bcde x y
3. cdef x y defg
4. defg x y
5. efgh x y
6. fghi x y ijkl
7. ghij x y
8. hijk x y
9. ijkl x y

I want to create a variable, "IsAReplacement" that labels PackageID's ijkl and defg as "replacement packages", but they are only a variable in the line for another ID.
(In the real dataset the PackageID and Replacement_Package are string variables with about 8 numbers or letters starting with 3B) (for those that are not a replacement, the variable can contain either 0 or . and for those that are a replacement it should be a 1

PackageID var1 var2....replacement_package IsAReplacement?
1. abcd x y 0
2. bcde x y 0
3. cdef x y defg 0
4. defg x y 1
5. efgh x y 0
6. fghi x y ijkl 0
7. ghij x y 0
8. hijk x y 0
9. ijkl x y 1

Thanks!!!
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#5

29 Mar 2022, 19:18

So sorry, the spaces don't convey!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

29 Mar 2022, 21:10

You would save us both a lot of time and trouble if you followed the instructions I gave you in #2. I specifically asked you to use the -dataex- command to show your example data. Having not done that, you have produced a tableau which, although it gives me some sense of what your data looks like, is not usable for developing and testing code. In particular, are the numbers 1 through 9 part of the variable packageid? Your tableau makes it appear that they are. But that seems inconsistent with your statement that in the real data, they are 8 numbers or letters that start with 3B. So I don't know what to think. If I use your example data, it is clear that none of the entries in PackageID exactly matches anything in replacement_package. So that would require for some additional code to reconcile the differences between them. But that code could easily backfire and destroy correct matches in the real data if what you have presented misrepresents this matter.

So, let's back up: USE -datatex-. See the first paragraph in #2 for how to access it. Read -help dataex- for instructions on how to use it. -dataex- will give a faithful and usable representation of your data here and will avoid the need for endless questions about the data. Added bonus: you won't have to worry about the spacing. -dataex- takes care of it, and it doesn't actually matter anyway since nobody is going to read the -dataex- results with their eyes: it's for Stata to input.
Comment

Anna Heard

Join Date: Mar 2022
Posts: 8

06 Apr 2022, 13:08

I was hoping not to download so many observations, but I guess that was not a good idea. I also didn't realize this box was essentially "coded."
So I used dataex Work_Package Replacement_Package:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6(BatchID Replacement_Batch)
"4BP001" "4BP025"
"4BP002" "4BP023"
"4BP003" "4BP021"
"4BP004" ""      
"4BP005" ""      
"4BP006" "4BP022"
"4BP007" ""      
"4BP008" ""      
"4BP009" ""      
"4BP010" ""      
"4BP011" "4BP019"
"4BP012" "4BP031"
"4BP013" "4BP024"
"4BP014" "4BP028"
"4BP015" ""      
"4BP016" "4BP020"
"4BP017" "4BP029"
"4BP018" ""      
"4BP019" ""      
"4BP020" "4BP026"
"4BP021" "4BP027"
"4BP022" ""      
"4BP023" ""      
"4BP024" ""      
"4BP025" "4BP030"
"4BP026" ""      
"4BP027" ""      
"4BP028" ""      
"4BP029" ""      
"4BP030" ""      
"4BP031" ""      
"4BP032" ""      
"4BP033" ""      
end

as you can see, for some batches, there is a replacement batch, and for others there isn't. All the "Replacement_Batch" numbers are also in the BatchID column, because as they are added as a replacement, they have their own listing. I would like to tag their own entry to say "these are replacement Batches". So, create a new variable, "IsReplacement" that finds all the IDs in Replacement Batch" and tags all the matching IDs in BatchID as "replacements.

For "BatchID" an easy workaround is that all batches 19 and over are "replacements" but let's assume that we can't just tag all the BatchID's over a certain number. (because I want to do it with another variable too, that is not systematic).

The finished product should look something like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6(BatchID Replacement_Batch) float IsReplacement
"4BP001" "4BP025" 0
"4BP002" "4BP023" 0
"4BP003" "4BP021" 0
"4BP004" ""       0
"4BP005" ""       0
"4BP006" "4BP022" 0
"4BP007" ""       0
"4BP008" ""       0
"4BP009" ""       0
"4BP010" ""       0
"4BP011" "4BP019" 0
"4BP012" "4BP031" 0
"4BP013" "4BP024" 0
"4BP014" "4BP028" 0
"4BP015" ""       0
"4BP016" "4BP020" 0
"4BP017" "4BP029" 0
"4BP018" ""       0
"4BP019" ""       1
"4BP020" "4BP026" 1
"4BP021" "4BP027" 1
"4BP022" ""       1
"4BP023" ""       1
"4BP024" ""       1
"4BP025" "4BP030" 1
"4BP026" ""       1
"4BP027" ""       1
"4BP028" ""       1
"4BP029" ""       1
"4BP030" ""       1
"4BP031" ""       1
"4BP032" ""       1
"4BP033" ""       1
end

Sorry for the poor direction following, but I wasn't familiar with the command and I wasn't sure how to only output a limited amount of data.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

06 Apr 2022, 13:39

So, I think the following does what you ask for:

Code:

frame put Replacement_Batch if !missing(Replacement_Batch), into(replacements) frame replacements: duplicates drop frlink m:1 BatchID, frame(replacements Replacement_Batch) gen byte byte IsReplacement = !missing(replacements) frame drop replacements drop replacements

The reason I am cautious is that there is an inconsistency between the example data given in #7 and your assertion that every BatchID 19 and over is a replacement. 4BP032 and 4BP033 do not appear in the Replacement_Batch column in your data. Moreover, in your -dataex- for what you want, you show them with IsReplacement = 1. So perhaps there is some aspect of this problem I am missing that applies to those two.
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#9

07 Apr 2022, 12:28

Ack! Good catch. I am checking with my data manager about the batches. But I thought they should have appeared in the Replacement_Batch column somewhere.

I can't get the first line of code to work. It says "frame is not recognized"

. frame put Replacement_Batch if !missing(Replacement_Batch), into(replacements)
command frame is unrecognized

Is it missing some punctuation? Does it need a preceding line?
Thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

07 Apr 2022, 12:38

Stata introduced frames in version 16. The Forum FAQ, which all are asked to read before posting, state that if you are using any version earlier than the current (which is presently 17) you are supposed to say that in your post so nobody wastes time writing code you can't run and you don't waste time trying to run it. What version of Stata are you using?

Last edited by Clyde Schechter; 07 Apr 2022, 12:42.
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#11

08 Apr 2022, 13:01

So sorry! 14.2.

However, when you do "help frames" it provides information, it just seems to suggest that you need provide a name. However, when I redid the help, it says "not found".
apparently I need to go back and read all the guidance before posting.
Comment
Ali Atia

Join Date: May 2020

Posts: 737
#12

08 Apr 2022, 13:21

If your dataset (or at least the list of replacement values) is relatively small, you can do this in two lines without using frames:

Code:

levelsof Replacement_Batch, sep("|") clean local(match) gen replacement = ustrregexm(BatchID,"`match'")

If the dataset is larger (i.e., if the above code takes a very long time to run), you can do a similar operation to what is shown in #8 using temporary datasets and merge instead of frames and frlink:

Code:

preserve keep Replacement_Batch rename Replacement_Batch BatchID drop if BatchID=="" tempfile temp save `temp' restore merge 1:1 BatchID using `temp' gen replacement = _merge==3 drop _merge

Last edited by Ali Atia; 08 Apr 2022, 13:26.
Comment
Anna Heard

Join Date: Mar 2022

Posts: 8
#13

12 Apr 2022, 06:04

Thanks Ali Atia
Comment

Announcement