How to: checking for duplicate records

16 Jan 2006


      The decision should be to define the time for this:
a. At entry of data
b. As part of the analysis or quality control during entry
For A:
.......................................................................................................................................
Two commands are important here:
Key unique
Autosearch list
The  "Key Unique" will prevent more than one instance of a value in a 
given field to be entered.
"Autosearch list" will show you any other record with the same value as 
the current field already entered in the file.
A "key unique" status can also be used when creating combined fields. 
E.g. if you wish to combine id + area + date of visit then at the 
combination of these the status of key will  be investigated.
qes
id     person id    ####     area township    #  visit  date of visit #
chk file:
id and area are defined mustenter and date of visit.  * control assumed 
a string variable
visit
 mustenter
 after entry
 control = "id" + string(id) + " area:" + string(area) + "visit:" + visit
 end
end
control
 * notice as well "key unique" as "noenter"
 key unique
 noenter
end
.......................................................................................................................................
Analysis or quality control.
To test this I added in the bromar.rec file five double records by copy 
and paste, but with the same id numbers. I also added 5 records but 
changed the id to a new value.
First the easy one. Let us find the five where id number is the same:
* Assume you wish to see variables with same value of id sex age and dectime
* I prefer the "gen" to the "define k _____ "  but this is not important.
gen s(80) k = "id: " + string(id)             
k = trim(k) + "s:" + trim(string(sex) )     // the first "trim(k)  is 
important !!!
k = trim(k) + "a:" + trim(string(age) )
k = trim(k) + "d:" + trim(string(dectime) )
sort k
* list now any two records where two consecutive records have the same value
* notice [ _n-1] indicates previous record   [ _n+1] next record
* notice also the many parenthesis. They could be removed but makes sure 
of what is what.
list if k =  ( ( k[_n-1] )   or  ( k[_n-1])  )
* An alternative would be to create a frequency table, but with many 
records (here 4000)
* this is not going to work. So we use instead:
define x #
x = 1
* now count how many times a given value in k was there
* do not show the table (takes too long) therefore /notable 
aggregate k  /sum=x /close /notable
. list n k if n > 1
N       
K
2       id30 
s:2a:55d:4.4833
3       id5 
s:2a:51d:3.8667
2       id75 s:2a:52d:4.9167
So we know that records are the same for id 30, 5 and 75.
The same principle can be used for any number of variables which will 
fit within 80 characters.
The problem is how many records should be  controlled for possible 
double entry:
For my example the number of records to control in roughly 4100 records is:
Combine id and 3 variables: 5 (correct)
Three variables only (sex age dectime) :  741 combinations with more 
than one record
Four variables only (sex age dectime km) :  154 combinations with more 
than one record
So the question "Is there an easy way" - is answered. It is easy if you 
have an id or a number of variables to compare by. Most likely more than 
4-5 variables.
Jens Lauritsen
EpiData Association

epidata-list＠lists.umanitoba.ca

epidata-list＠lists.umanitoba.ca

tags (0)

participants (1)