How to: checking for duplicate records
The first decision is when to check for duplicates:
a. At entry of data
b. As part of analysis or quality control after data entry
For A:
Two commands are important here: "key unique" and "autosearch list".
The "Key Unique" will prevent more than one instance of a value in a given field to be entered. "Autosearch list" will show you any other record with the same value as the current field already entered in the file.
A "key unique" status can also be used when creating combined fields. E.g. if you wish to combine id + area + date of visit then at the combination of these the status of key will be investigated. qes id person id #### area township # visit date of visit # chk file: id and area are defined mustenter and date of visit. * control assumed a string variable visit mustenter after entry control = "id" + string(id) + " area:" + string(area) + "visit:" + visit end end
  control
    * notice "key unique" as well as "noenter"
    key unique
    noenter
  end

For B (analysis or quality control):
To test this I added five duplicate records to the bromar.rec file by copy and paste, keeping the same id numbers. I also added 5 records but changed the id to a new value.
First the easy one. Let us find the five records where the id number is the same:
* Assume you wish to see variables with the same value of id, sex, age and dectime.
* I prefer "gen" to "define k _____", but this is not important.
gen s(80) k = "id: " + string(id)
k = trim(k) + "s:" + trim(string(sex))      // the first trim(k) is important !!!
k = trim(k) + "a:" + trim(string(age))
k = trim(k) + "d:" + trim(string(dectime))
sort k
* now list any two consecutive records that have the same value
* notice [_n-1] indicates the previous record, [_n+1] the next record
* notice also the many parentheses. They could be removed, but they make sure of what is what.
list if k = ((k[_n-1]) or (k[_n+1]))
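If the line above is rejected with an operator error on string fields (as in the follow-up message below), a variant worth trying (my own suggestion, not from the original post) is to apply the equality test on each side of the "or", which expresses the same idea of comparing each record with both of its neighbours after sorting:

  * compare the key with the previous and the next record separately
  list if (k = k[_n-1]) or (k = k[_n+1])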
* An alternative would be to create a frequency table, but with many records (here 4000)
* this is not going to work. So we use instead:
define x #
x = 1
* now count how many times a given value of k occurs
* do not show the table (takes too long), therefore /notable
aggregate k /sum=x /close /notable
list n k if n > 1

 N  K
2 id30 s:2a:55d:4.4833
3 id5 s:2a:51d:3.8667
2 id75 s:2a:52d:4.9167
So we know that duplicate records exist for id 30, 5 and 75.
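As a possible follow-up (my own addition, not in the original post), the full records behind those keys can then be inspected with an ordinary list, re-reading the data file first if the aggregate step has replaced the dataset in memory:

  * re-read the data file if needed after the aggregate step
  read "bromar.rec"
  list if (id = 5) or (id = 30) or (id = 75)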
The same principle can be used for any number of variables, as long as the combined key fits within 80 characters. The problem is how many records must then be checked for possible double entry:
For my example the numbers to check, out of roughly 4100 records, are:
  Combine id and 3 variables: 5 (correct)
  Three variables only (sex age dectime): 741 combinations with more than one record
  Four variables only (sex age dectime km): 154 combinations with more than one record
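For reference, a sketch (my own, simply following the pattern above; the variable name k3 is made up) of how the three-variable count could be produced: build the key from sex, age and dectime only, then aggregate as before:

  gen s(80) k3 = "s:" + trim(string(sex))
  k3 = trim(k3) + "a:" + trim(string(age))
  k3 = trim(k3) + "d:" + trim(string(dectime))
  define x #
  x = 1
  aggregate k3 /sum=x /close /notable
  list n k3 if n > 1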
So the question "Is there an easy way?" is answered: it is easy if you have an id, or a set of variables to compare by, most likely more than 4-5 variables.
Jens Lauritsen
EpiData Association
Great, but... Conceptually I understand, and in our situation B applies, but I run into an error while playing with your syntax right there; what do I do wrong? Many Tx, Mx

---------
. list if k = (( k[_n-1]) or (k[_n-1]))
Operator TRUNC is incompatible with String
Operation aborted

* An alternative would be to create a frequency table, but with many records (here 4000)
* this is not going to work. So we use instead:
. define x #
Var Name X of type Integer  Var length: 1  decimals 0
. x = 1
* now count how many times a given value in k was there
* do not show the table (takes too long) therefore /notable
aggregate k /sum=x /close /notable
. list n k if n > 1
Unknown Identifier N
Operation aborted
----------------