How to: Select observations based on string variables
To work efficiently with string variables a few techniques are needed:
To work with string variables some functions are needed: Length of a string: length("Abcde") = 5 Convert to upper: upper("Abcde") = "ABCDE" Remove blanks: trim("Abcde ") = "Abcde" Look for a given text in a string: pos("Abcde","cd") = 3
A. How do I select those with empty contents of a variable S (string).
In person language this would mean: 1. Remove all blanks/spaces from my variable 2. Is the trimmed length of the variable greater than zero ? Yes: The string is NOT empty No: The string is empty
So in analysis language this is: 1: trim(S) 2: length(trim(S)) and to use this as select : "YES" from above: Select if (length(trim(S)) > 0) "NO" from above: Select if (length(trim(S)) = 0)
alternative: "YES" from above: Select if ( trim(S) <> "") "NO" from above: Select if ( trim(S) = "")
The user will notice the many parenthesis ( ). The reason I add an outer parenthesis is to be sure that the whole sentence is handled correctly.
B. How to select observations with a certain string value:
In person language this would mean: 1. Make sure not to confuse lower and upper case entry of the text 2. Identify the string I wish to search on 3. Does the variable contain the string ?
And in analysis language this is: 1: upper(S) 2: e.g. "DOG" 3: pos(upper(S),"DOG") and to use this as select : "YES" from above: Select if (pos(upper(S),"DOG") > 0 ) "NO" from above: Select if (pos(upper(S),"DOG") = 0 )
C. Complicated string values: Assume you wish to select all text strings with content DOG, but not if these also contained the word "WHITE" - that is find all dogs, but exclude the white ones. Select if ( (pos(upper(S),"DOG") > 0 ) and (pos(upper(S),"WHITE") = 0 ))
If you wished to find all DOGs or CATs: Select if ( (pos(upper(S),"DOG") > 0 ) or (pos(upper(S),"CAT") = 0 ) )
It is often difficult to get the correct number of parenthesis. But start with the innermost one and build from there. Handling all types of strings with different national characters is difficult, therefore it is good practice to use upper() and many parenthesis to assist the programme in working correctly.
Jens Lauritsen EpiData Association
participants (1)
-
epidata-list@lists.umanitoba.ca