  • Issues running multivariate regression

    I'm trying to run a multivariate regression analysis and I've found multiple variations of how to do so online, but none have yielded any usable results for me. I'm still quite new to Stata and teaching myself along the way, so hopefully this is a relatively easy fix.

    My data is on drug use between two groups (A and B) of patients. The dependent variables are short-term drug use (TOTAL_SH) and long-term drug use (TOTAL_L), which I figured I would need to analyze separately. Independent variables include age, gender, and length of stay (TOTAL_LOS).

    My attempt:

    manova TOTAL_SH = c.AGE c.TOTAL_LOS GENDER
    Stata said "GENDER: string variables may not be used as factor variables" (I didn't import any of the data as strings, so I'm a little confused about this)

    When I run this without GENDER and follow with mvreg, it seems to work. I'm not sure how to de-string a variable, or why gender was imported as a string variable in the first place. Any help with this would be appreciated! If there is an easier / better way I should be doing this, I'm open to suggestions with that as well!

  • #2
    Show us the results of

    Code:
    describe GENDER 
    
    tab GENDER, missing



    • #3
      Ariana:
      as an aside to Nick's helpful guidance, and unless I'm missing something in your post: why use -manova- or -mvreg- if you want to analyse the two dependent variables separately?
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        @Ariana:
        To change from string to float variable
        Recently I came across the same problem. I used the following command and it works for me. (Note: I'm still learning from experts here)
        gen GENDER2= real(GENDER)
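
        A quick check of what real() does may help when deciding whether this applies to your data. As far as I understand it, real() converts strings that look like numbers and returns missing for anything else, so it suits a variable holding values such as "12" but not one holding letters such as "M" or "F". The two strings below are made-up examples, not values from this thread's dataset.

        ```stata
        * a string that looks like a number is converted to that number
        display real("12")
        * a string that is not a number yields numeric missing, shown as a period
        display real("M")
        ```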



        • #5
          Originally posted by Nick Cox View Post
          Show us the results of

          Code:
          describe GENDER
          
          tab GENDER, missing
          [Attached screenshot of the output: Screenshot 2023-01-12 083917.jpg]



          • #6
            Originally posted by Carlo Lazzaro View Post
            Ariana:
            as an aside to Nick's helpful guidance, and unless I'm missing something in your post: why use -manova- or -mvreg- if you want to analyse the two dependent variables separately?
            Hi Carlo, I didn't realize that I could analyze them together with manova/mvreg -- but now I see that it works! Thank you.



            • #7
              So there is no obvious deep problem with your gender variable other than 99.85% of values being missing. The solution in #4 won't work here, because real() returns missing for non-numeric strings such as M and F, but if it is worth pursuing you could do something like

              Code:
              label def female 1 F 0 M 
              
              encode GENDER, gen(female) label(female)
              and then run your model on the 1537 or so observations with information.
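
              A minimal sketch of that model step, assuming the encoded variable is named female as above and that both outcomes from #1 are modelled jointly with mvreg. The exact command is an assumption; the thread does not show the one finally used.

              ```stata
              * both outcomes on the left of the equals sign; i.female treats the
              * encoded variable as categorical, c. marks the continuous predictors
              mvreg TOTAL_SH TOTAL_L = c.AGE c.TOTAL_LOS i.female
              ```

              Observations with missing values on any variable in the model are left out of the estimation, so only the rows with gender information contribute.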

              Alternatively, it may be worth going back to the original data source to see if there is more information there.



              • #8
                Originally posted by Nick Cox View Post
                So there is no obvious deep problem with your gender variable other than 99.85% of values being missing. The solution in #4 won't work here, because real() returns missing for non-numeric strings such as M and F, but if it is worth pursuing you could do something like

                Code:
                label def female 1 F 0 M
                
                encode GENDER, gen(female) label(female)
                and then run your model on the 1537 or so observations with information.

                Alternatively, it may be worth going back to the original data source to see if there is more information there.
                Thank you Nick! I think this seems to work.



                • #9
                  It's your project but I would push at why you have all those missing values. Many people have grounds to dislike gender being reduced to two categories, and/or may feel that it is a private or personal matter, but 99.85% of values being missing is a big deal and -- although I am neither social nor medical scientist -- seems to be very high indeed.



                  • #10
                    Originally posted by Nick Cox View Post
                    It's your project but I would push at why you have all those missing values. Many people have grounds to dislike gender being reduced to two categories, and/or may feel that it is a private or personal matter, but 99.85% of values being missing is a big deal and -- although I am neither social nor medical scientist -- seems to be very high indeed.
                    At the risk of further revealing my Stata naivety, what exactly might make a value "missing"? On the original data sheet, there aren't any blank responses. Our sample size is roughly 1500 patients (equal to the number of the M and F categories combined), so I'm not sure where the million+ missing values are coming from.



                    • #11
                      As you know, I have no evidence on your dataset beyond this thread, but #5 shows clearly that Stata thinks you have 1046367 observations with values on GENDER other than M or F.

                      They could be empty -- and they could also be one or more spaces -- but they are all the same either way. I can tell this because there is one and only one category represented in the table other than F or M and it sorts before F or M.
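
                      One way to tell the two cases apart, as a sketch that assumes GENDER is still the original string variable:

                      ```stata
                      * observations where GENDER is a truly empty string
                      count if GENDER == ""
                      * observations where GENDER is not empty but holds only spaces
                      count if GENDER != "" & trim(GENDER) == ""
                      ```

                      If the second count is above zero, those values are not treated as missing strings until they are trimmed.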

                      It may be a coincidence but I note that the total number of observations in the dataset is just less than a particular power of 2:

                      Code:
                      . di 2^20
                      1048576
                      and such powers of 2 often feature in limits on dataset size.

                      Where did the data come from? A wild, wild guess is that you or someone else imported the data from somewhere, cleaned up metadata partially or wholly, but with a side-effect of leaving observations that are missing on all variables.

                      If so, the good news is that you should just get rid of them and move on.

                      You could look at the data with edit and might expect to see vast oceans of missing values (which for numeric variables would be a dot or period by itself).

                      A fairly safe tool is to calculate

                      Code:
                      egen nmissing = rowmiss(*)
                      and to drop observations for which the number of missing values equals the number of your variables minus 1. The minus 1 is there because the variable count now includes nmissing, which itself will never be missing.
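
                      Put together, the steps could look like the sketch below. It assumes nmissing does not exist yet (skip the egen line if you have already run it) and uses c(k), which holds the current number of variables in the dataset. The count line is only a check before anything is dropped.

                      ```stata
                      * count missing values across all existing variables, row by row
                      egen nmissing = rowmiss(*)
                      * c(k) now includes nmissing, hence the minus 1
                      count if nmissing == c(k) - 1
                      * drop the observations that are missing on every original variable
                      drop if nmissing == c(k) - 1
                      * the helper variable is no longer needed
                      drop nmissing
                      ```

                      String values that contain only spaces do not count as missing here, so they would need trimming first for this to catch them.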






                      • #12
                        Originally posted by Ariana Black View Post
                        At the risk of further revealing my Stata naivety, what exactly might make a value "missing"? On the original data sheet, there aren't any blank responses.
                        Just a guess, but it's probably your worksheet software and not Stata that you're having the problem with.

                        What you're seeing is a fairly common problem with Microsoft Excel workbooks when importing data into Stata: someone has inadvertently created and left blank rows or blank columns in the worksheet before saving the workbook. Because they're blank, they're not easy to detect (they're invisible) from within the workbook software application.



                        • #13
                          Originally posted by Ariana Black View Post
                          I'm trying to run a multivariate regression analysis and I've found multiple variations of how to do so online, but none have yielded any usable results for me. I'm still quite new to Stata and teaching myself along the way, so hopefully this is a relatively easy fix.

                          My data is on drug use between two groups (A and B) of patients. The dependent variables are short-term drug use (TOTAL_SH) and long-term drug use (TOTAL_L), which I figured I would need to analyze separately. Independent variables include age, gender, and length of stay (TOTAL_LOS).

                          My attempt:

                          manova TOTAL_SH = c.AGE c.TOTAL_LOS GENDER
                          Stata said "GENDER: string variables may not be used as factor variables" (I didn't import any of the data as strings, so I'm a little confused about this)

                          When I run this without GENDER and follow with mvreg, it seems to work. I'm not sure how to de-string a variable, or why gender was imported as a string variable in the first place. Any help with this would be appreciated! If there is an easier / better way I should be doing this, I'm open to suggestions with that as well!
                          You can use encode to turn GENDER into a numeric variable and then run the regression.
                          encode GENDER, gen(gender)

                          On the issue of the million-plus missing values: many data tools (SQL databases, Access, and even Excel) tend to add rows if you click in the lower rows or type in data and then delete it. The best thing to do in this case is to drop the rows that contain no data.

                          keep if gender!=.
                          then proceed with your regression



                          • #14
                            #12 and #13 seem consistent with my guesses.

                            On the details of encode see #7 where I recommended using a new variable named for the category coded 1, say female.



                            • #15
                              Originally posted by Broline Sagini View Post
                              On the issue of the million-plus missing values: many data tools (SQL databases, Access, and even Excel) tend to add rows if you click in the lower rows or type in data and then delete it.
                              I've really only seen this with Excel workbooks given to me as data sources, and there the problem is pretty common.

                              I suppose that you could coax Microsoft Access to allow blank rows in a table if you deliberately allow NULLs or zero-length strings in all of its columns, but I suspect that people who would do that are using Excel instead.

                              And it's hard to imagine anyone who's using Microsoft SQL Server to set up a table that would permit blank rows.

                              In both, DELETE FROM deletes; it doesn't leave blank rows.

