Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing and standardising embedded blanks

    Hello,

    I'm working with admin data where one of my variables (Item Category) contains embedded blanks for one particular group (Children), but only before a certain year (2005). Here is an example of my data:

    ID Year Child Item number Item Category
    001 2000 N 1 a
    002 2000 Y 1
    001 2001 N 2 a
    002 2001 Y 2
    001 2002 N 3 b
    002 2002 Y 3
    001 2003 N 4 b
    002 2003 Y 4
    001 2004 N 5 c
    002 2004 Y 5
    001 2005 N 1 a
    002 2005 Y 1 a
    001 2006 N 2 a
    002 2006 Y 2 a
    001 2007 N 3 b
    002 2007 Y 3 b
    001 2008 N 4 b
    002 2008 Y 4 b
    001 2009 N 5 c
    002 2009 Y 5 c

    How do I standardise the embedded blanks according to Item Number which are matched for Children from 2005 on, and for adults for all years? I'm unable to transform the embedded blanks at all, let alone to standardise them as I would like. I first thought that the reason they don't behave like regular empty strings or missing observations is due to their type, but after playing around with that I'm still no closer to resolving things. Any help is appreciated.

    Matt

  • #2
    Sorry, but I don't follow this. Please use dataex as requested in FAQ Advice #12 to show explicitly which variables are numeric, which numeric with value labels, and which string.

    You are, I think, saying that you have "embedded blanks" that are not empty strings. If so, it is hard to know from your data examples what they are. They could be plain spaces or some more exotic character.

    trim() would remove plain spaces.

    Downloading chartab from SSC would enable you to identify unusual characters.

    Comment


    • #3
      Code:
      egen itemcat_g = group(itemcategory)
      egen itemcat2 = max(itemcat_g), by(itemnumber)
      tabstat itemcat_g, by(itemcategory)
      label values itemcat2 ITEMCAT2 
      label def ITEMCAT2 1 "a", modify
      label def ITEMCAT2 2 "b", modify
      label def ITEMCAT2 3 "c", modify

      Comment


      • #4
        Hi Nick,

        Sorry for not initially following the proper posting procedure. I should also have previously mentioned I tried using trim() although this didn't solve the issue.

        Here is the output from dataex:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(exid lsac_member) long(date itemno) strL itemcatgrp
        1 1 13443 69217 "Pathology Services"                      
        1 1 13448    23 "Professional Attendances"                
        1 1 13161     3 "Professional Attendances"                
        1 1 13432    23 "Professional Attendances"                
        1 1 13443 73907 "Pathology Services"                      
        1 1 13443    23 "Professional Attendances"                
        1 1 13462     3 "Professional Attendances"                
        1 1 13443 65007 "Pathology Services"                      
        1 1 13443 66211 "Pathology Services"                      
        1 1 13455    23 "Professional Attendances"                
        1 1 13457 73055 "Pathology Services"                      
        1 1 13457 73901 "Pathology Services"                      
        1 1 13656 65007 "Pathology Services"                      
        1 1 13871 73907 "Pathology Services"                      
        1 1 13577 58903 "Diagnostic Imaging Services"            
        1 1 13566 65007 "Pathology Services"                      
        1 1 13566 66329 "Pathology Services"                      
        1 1 13664   110 "Professional Attendances"                
        1 1 13657    23 "Professional Attendances"                
        1 1 13592    23 "Professional Attendances"                
        1 1 13871 65007 "Pathology Services"                      
        1 1 13577 58921 "Diagnostic Imaging Services"            
        1 1 13566 73907 "Pathology Services"                      
        1 1 13656 73907 "Pathology Services"                      
        1 1 13546    23 "Professional Attendances"                
        1 1 13871    23 "Professional Attendances"                
        1 1 13811     3 "Professional Attendances"                
        1 1 13669    23 "Professional Attendances"                
        1 1 14062 73901 "Pathology Services"                      
        1 1 14062    23 "Professional Attendances"                
        1 1 13973 73915 "Pathology Services"                      
        1 1 13926 73901 "Pathology Services"                      
        1 1 13973    36 "Professional Attendances"                
        1 1 13973 65007 "Pathology Services"                      
        1 1 13973 66327 "Pathology Services"                      
        1 1 14062 73053 "Pathology Services"                      
        1 1 13984    36 "Professional Attendances"                
        1 1 13926    36 "Professional Attendances"                
        1 1 13926 73053 "Pathology Services"                      
        1 1 14244    23 "Professional Attendances"                
        1 1 14551    36 "Professional Attendances"                
        1 1 14411 11312 "Diagnostic Procedures And Investigations"
        1 1 14551 73055 "Pathology Services"                      
        1 1 14258    23 "Professional Attendances"                
        1 1 14556 73907 "Pathology Services"                      
        1 1 14453    23 "Professional Attendances"                
        1 1 14411   105 "Professional Attendances"                
        1 1 14556 65069 "Pathology Services"                      
        1 1 14551 73901 "Pathology Services"                      
        1 1 14720 73806 "Pathology Services"                      
        1 1 14720    23 "Professional Attendances"                
        1 1 14899    23 "Professional Attendances"                
        1 1 15208 72816 "Pathology Services"                      
        1 1 15218 69462 "Pathology Services"                      
        1 1 15218 57521 "Diagnostic Imaging Services"            
        1 1 15141    23 "Professional Attendances"                
        1 1 15215    36 "Professional Attendances"                
        1 1 15223    23 "Professional Attendances"                
        1 1 15126    23 "Professional Attendances"                
        1 1 15124    53 "Professional Attendances"                
        1 1 15005    23 "Professional Attendances"                
        1 1 15218 66716 "Pathology Services"                      
        1 1 15208 31205 "Therapeutic Procedures"                  
        1 1 15218 73907 "Pathology Services"                      
        1 1 15218 69333 "Pathology Services"                      
        1 1 15208    53 "Professional Attendances"                
        1 1 15208 73905 "Pathology Services"                      
        1 1 15293    23 "Professional Attendances"                
        1 1 15692 55704 "Diagnostic Imaging Services"            
        1 1 15406 73915 "Pathology Services"                      
        1 1 15532    23 "Professional Attendances"                
        1 1 15663 69411 "Pathology Services"                      
        1 1 15690    23 "Professional Attendances"                
        1 1 15406 69312 "Pathology Services"                      
        1 1 15406    23 "Professional Attendances"                
        1 1 15663    23 "Professional Attendances"                
        1 1 15663 73806 "Pathology Services"                      
        1 1 15663 73915 "Pathology Services"                      
        1 1 15663 69333 "Pathology Services"                      
        1 1 15412 55731 "Diagnostic Imaging Services"            
        1 1 15663 65111 "Pathology Services"                      
        1 1 15749 55706 "Diagnostic Imaging Services"            
        1 1 16040 73901 "Pathology Services"                      
        1 1 15818 11700 "Diagnostic Procedures And Investigations"
        1 1 15756 16500 "Therapeutic Procedures"                  
        1 0 15907 55820 ""                                        
        1 1 15818    23 "Professional Attendances"                
        1 0 16041    23 ""                                        
        1 0 15910 30653 ""                                        
        1 0 15910    36 ""                                        
        1 1 15719 16500 "Therapeutic Procedures"                  
        1 1 15859 65070 "Pathology Services"                      
        1 1 16040 73055 "Pathology Services"                      
        1 1 15789 65070 "Pathology Services"                      
        1 1 15775 16500 "Therapeutic Procedures"                  
        1 1 15719 66740 "Pathology Services"                      
        1 1 15859 65111 "Pathology Services"                      
        1 1 15789 66545 "Pathology Services"                      
        1 1 15789 65111 "Pathology Services"                      
        1 1 16040    36 "Professional Attendances"                
        end
        format %td date
        label values lsac_member lsac_member_labels
        label def lsac_member_labels 0 "Study child", modify
        label def lsac_member_labels 1 "Parent 1", modify
        And below is the output from chartab. Hopefully this helps diagnose the issue.

        Code:
        . chartab itemcatgrp
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+----------------------------------------
                32       \u0020             |     2,871,166    SPACE
                65       \u0041       A     |     1,262,982    LATIN CAPITAL LETTER A
                67       \u0043       C     |            68    LATIN CAPITAL LETTER C
                68       \u0044       D     |       191,187    LATIN CAPITAL LETTER D
                73       \u0049       I     |       167,548    LATIN CAPITAL LETTER I
                77       \u004d       M     |        96,163    LATIN CAPITAL LETTER M
                79       \u004f       O     |        39,167    LATIN CAPITAL LETTER O
                80       \u0050       P     |     2,316,817    LATIN CAPITAL LETTER P
                83       \u0053       S     |     1,109,274    LATIN CAPITAL LETTER S
                84       \u0054       T     |       221,747    LATIN CAPITAL LETTER T
                97       \u0061       a     |     4,056,127    LATIN SMALL LETTER A
                98       \u0062       b     |        20,777    LATIN SMALL LETTER B
                99       \u0063       c     |     3,061,207    LATIN SMALL LETTER C
               100       \u0064       d     |     1,534,128    LATIN SMALL LETTER D
               101       \u0065       e     |     6,998,060    LATIN SMALL LETTER E
               102       \u0066       f     |     1,255,341    LATIN SMALL LETTER F
               103       \u0067       g     |     1,345,345    LATIN SMALL LETTER G
               104       \u0068       h     |     1,113,779    LATIN SMALL LETTER H
               105       \u0069       i     |     3,213,723    LATIN SMALL LETTER I
               108       \u006c       l     |     2,381,864    LATIN SMALL LETTER L
               109       \u006d       m     |       138,994    LATIN SMALL LETTER M
               110       \u006e       n     |     4,118,634    LATIN SMALL LETTER N
               111       \u006f       o     |     4,675,598    LATIN SMALL LETTER O
               112       \u0070       p     |       221,883    LATIN SMALL LETTER P
               114       \u0072       r     |     3,066,119    LATIN SMALL LETTER R
               115       \u0073       s     |     5,304,919    LATIN SMALL LETTER S
               116       \u0074       t     |     3,752,800    LATIN SMALL LETTER T
               117       \u0075       u     |       549,821    LATIN SMALL LETTER U
               118       \u0076       v     |     1,137,896    LATIN SMALL LETTER V
               120       \u0078       x     |        39,167    LATIN SMALL LETTER X
               121       \u0079       y     |       871,255    LATIN SMALL LETTER Y
        ------------------------------------+----------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =      57,133,556         31
        Multibyte UTF-8 characters    =               0          0
        Unicode replacement character =               0          0
        Total Unicode characters      =      57,133,556         31
        
        
        .

        Comment


        • #5
          Originally posted by George Ford View Post
          Code:
          egen itemcat_g = group(itemcategory)
          egen itemcat2 = max(itemcat_g), by(itemnumber)
          tabstat itemcat_g, by(itemcategory)
          label values itemcat2 ITEMCAT2
          label def ITEMCAT2 1 "a", modify
          label def ITEMCAT2 2 "b", modify
          label def ITEMCAT2 3 "c", modify
          Thank you for the suggestion here, although it does not quite fit my problem - largely because I failed to explain my data clearly.

          Comment


          • #6
            On this evidence you have empty strings and no unusual characters.

            Comment


            • #7
              Code:
              sort itemno itemcatgrp
              bys itemno: replace itemcatgrp =itemcatgrp[_N]
              or

              Code:
              bys itemno (itemcatgrp): replace itemcatgrp =itemcatgrp[_N]

              Comment

              Working...
              X