[EpiData-list] How to: checking for duplicate records

epidata-list at lists.umanitoba.ca epidata-list at lists.umanitoba.ca
Mon Jan 16 16:57:33 CST 2006


Great, but...
Conceptually I understand and in our situation B applies but I run into 
an error while playing with your syntax right there: what do I do wrong? 
Many Tx Mx
---------
. list if k = (( k[_n-1]) or (k[_n-1]))
Operator TRUNC is incompatible with String
Operation aborted
* An alternative would be to create a frequency table, but with many 
records (here 4000)
* this is not going to work. So we use instead:
. define x #
Var Name X of type Integer
Var length: 1 decimals 0
. x = 1
* now count how many times a given value in k was there
* do not show the table (takes too long) therefore /notable aggregate k 
/sum=x /close /notable.
. list n k if n > 1
Unknown Identifier N
Operation aborted
----------------
epidata-list at lists.umanitoba.ca wrote:

> The decision should be to define the time for this:
> a. At entry of data
> b. As part of the analysis or quality control during entry
>
> For A:
> ....................................................................................................................................... 
>
> Two commands are important here:
> Key unique
> Autosearch list
>
> The  "Key Unique" will prevent more than one instance of a value in a 
> given field to be entered.
> "Autosearch list" will show you any other record with the same value 
> as the current field already entered in the file.
>
> A "key unique" status can also be used when creating combined fields. 
> E.g. if you wish to combine id + area + date of visit then at the 
> combination of these the status of key will  be investigated.
> qes
> id     person id    ####     area township    #  visit  date of visit #
> chk file:
> id and area are defined mustenter and date of visit.  * control 
> assumed a string variable
> visit
> mustenter
> after entry
> control = "id" + string(id) + " area:" + string(area) + "visit:" + visit
> end
> end
>
> control
> * notice as well "key unique" as "noenter"
> key unique
> noenter
> end
> ....................................................................................................................................... 
>
> Analysis or quality control.
> To test this I added in the bromar.rec file five double records by 
> copy and paste, but with the same id numbers. I also added 5 records 
> but changed the id to a new value.
>
> First the easy one. Let us find the five where id number is the same:
>
> * Assume you wish to see variables with same value of id sex age and 
> dectime
> * I prefer the "gen" to the "define k _____ "  but this is not important.
> gen s(80) k = "id: " + string(id)             k = trim(k) + "s:" + 
> trim(string(sex) )     // the first "trim(k)  is important !!!
> k = trim(k) + "a:" + trim(string(age) )
> k = trim(k) + "d:" + trim(string(dectime) )
> sort k
> * list now any two records where two consecutive records have the same 
> value
> * notice [ _n-1] indicates previous record   [ _n+1] next record
> * notice also the many parenthesis. They could be removed but makes 
> sure of what is what.
> list if k =  ( ( k[_n-1] )   or  ( k[_n-1])  )   
> * An alternative would be to create a frequency table, but with many 
> records (here 4000)
> * this is not going to work. So we use instead:
> define x #
> x = 1
> * now count how many times a given value in k was there
> * do not show the table (takes too long) therefore /notable aggregate 
> k  /sum=x /close /notable
> . list n k if n > 1
> N       
> K                                                                                
>
> 2       id30 
> s:2a:55d:4.4833                                                             
>
> 3       id5 
> s:2a:51d:3.8667                                                              
>
> 2       id75 s:2a:52d:4.9167
>
> So we know that records are the same for id 30, 5 and 75.
>
> The same principle can be used for any number of variables which will 
> fit within 80 characters.
> The problem is how many records should be  controlled for possible 
> double entry:
>
> For my example the number of records to control in roughly 4100 
> records is:
> Combine id and 3 variables: 5 (correct)
> Three variables only (sex age dectime) :  741 combinations with more 
> than one record
> Four variables only (sex age dectime km) :  154 combinations with more 
> than one record
>
> So the question "Is there an easy way" - is answered. It is easy if 
> you have an id or a number of variables to compare by. Most likely 
> more than 4-5 variables.
>
> Jens Lauritsen
> EpiData Association
>
>------------------------------------------------------------------------
>
>_______________________________________________
>EpiData-list mailing list
>EpiData-list at lists.umanitoba.ca
>http://lists.umanitoba.ca/mailman/listinfo/epidata-list
>  
>



More information about the EpiData-list mailing list