I would like to hear opinions about using epidata labels when importing into R. Here's a fragment of XML showing a value label set from sample.epx that comes with the data manager.
<ValueLabelSet id="valuelabelset_id_1">
  <Type>3</Type>
  <Name>Float Set</Name>
  <Internal>
    <ValueLabel order="1" value="1,11">
      <Label lang="en">Float value A</Label>
    </ValueLabel>
    <ValueLabel order="2" value="2,22">
      <Label lang="en">Float value B</Label>
    </ValueLabel>
    <ValueLabel order="3" value="3,33">
      <Label lang="en">Float value C</Label>
    </ValueLabel>
    <ValueLabel order="4" value="8,88" missing="true">
      <Label lang="en">Second last missing</Label>
    </ValueLabel>
    <ValueLabel order="5" value="9,99" missing="true">
      <Label lang="en">Float missing</Label>
    </ValueLabel>
  </Internal>
</ValueLabelSet>
This says that a numeric (float) variable that uses this label set should give the numeric value 1.11 the label "Float value A", etc., and defines two types of missing value: 8.88 and 9.99. R does not have a concept of labelling values in this way (unless things have changed recently; I have not kept up with the R list for a couple of years or so). R also does not have the concept of multiple missing values: you either have a value to work with, or you don't. It is possible to attach a comment to an R object, but a comment is stored as an attribute, and many operations (subsetting, for example) drop attributes, so comments can easily become lost.
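As a rough sketch of what the import could hand back for this label set, here is one possible in-memory form using base R only. The data frame layout and the column names (value, label, missing) are my own assumptions, not something the import code already does. Note that the XML stores the values with a decimal comma ("1,11"), which as.numeric() will not parse, so the comma has to be swapped for a point first.

```r
## Attribute values copied by hand from the XML fragment above.
raw.values <- c("1,11", "2,22", "3,33", "8,88", "9,99")

## Hypothetical representation of one label set: one row per ValueLabel.
float.set <- data.frame(
  ## Swap the decimal comma for a point before converting to numeric
  value   = as.numeric(sub(",", ".", raw.values, fixed = TRUE)),
  label   = c("Float value A", "Float value B", "Float value C",
              "Second last missing", "Float missing"),
  ## TRUE where the XML has missing="true"
  missing = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

## The values that are defined as missing in this set
float.set$value[float.set$missing]
```

Keeping the missing flag as its own column means the two kinds of missing value stay distinguishable until the user decides to collapse them to NA.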
I think we have the following options:
1) use the actual stored value (e.g. 1.11), ignore the labels except for the missing value definitions and for these just make them NA. This loses information that might be useful, but means that numeric values (and dates, etc) keep their data type and can be analysed appropriately, i.e. it is possible to calculate the mean of 1.11, but not the mean of "Float value A";
2) use the labels in all cases. The result of this is the opposite of option 1: we keep the extra information that the labels provide, but coerce all data to characters/factors and can no longer do many statistical analyses;
3) create a new column for each variable that uses a label set, so that we have both the original data and next to it a column with the labels. This could result in a much larger data set in R, and possible confusion, especially if analyses change the value of one column and not the other.
4) Implement options 1-3 and let the user specify which approach to take;
5) Implement options 1-3 and let the user specify, for each column, which approach to take;
6) Do nothing at all to the data, just return the label information as part of the result of the read function (in the same way that the table structure is returned and the study metadata will be returned) so that the user can then easily read it in R and use it to recode variables/values appropriately. This would mean manually recoding missing values. I could write functions that work with this information so that users can fairly easily query the label sets. In R this could work like this, assuming that x is an object that contains label information:
epidata.label.value(1.11, x, "Float Set")
"Float value A"
epidata.label.value(8.88, x, "Float Set")
"Second last missing"
## If the value is not in the value set, return NA
epidata.label.value(1.66, x, "Float Set")
NA
is.epidata.label.na(1.11, x, "Float Set")
FALSE
is.epidata.label.na(8.88, x, "Float Set")
TRUE
is.epidata.label.na(1.66, x, "Float Set")
NA
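
To show that the behaviour above is easy to get, here is a minimal sketch of the two functions in base R. The function names are the ones proposed above, but the implementation and the shape of x (a named list of data frames with columns value, label and missing) are assumptions made for illustration only.

```r
## Assumed shape of the label information: a named list with one data
## frame per label set (hypothetical, for illustration).
x <- list("Float Set" = data.frame(
  value   = c(1.11, 2.22, 3.33, 8.88, 9.99),
  label   = c("Float value A", "Float value B", "Float value C",
              "Second last missing", "Float missing"),
  missing = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
))

## Return the label for each value, or NA if the value is not in the set.
epidata.label.value <- function(values, labels, set.name) {
  set <- labels[[set.name]]
  ## match() returns NA for values not in the set, so the label is NA too.
  ## The comparison is exact, so computed floats may fail to match.
  set$label[match(values, set$value)]
}

## TRUE if the value is defined as missing, FALSE if it is an ordinary
## labelled value, NA if the value is not in the set at all.
is.epidata.label.na <- function(values, labels, set.name) {
  set <- labels[[set.name]]
  set$missing[match(values, set$value)]
}

epidata.label.value(c(1.11, 8.88, 1.66), x, "Float Set")
## "Float value A" "Second last missing" NA
is.epidata.label.na(c(1.11, 8.88, 1.66), x, "Float Set")
## FALSE TRUE NA
```

Both functions are vectorised, so they can be applied to a whole column at once.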
As I work through this I am tending towards option 6; I would probably have to write functions like this anyway, and I think that making them visible to the user provides the greatest flexibility and probably limits the risk of being surprised. Recoding a variable in R using the labels would then work like this, assuming that x is an object created by importing an epidata XML file into R, and that the dataframe contains a field called "height":
## Extract the data frame and labels into new objects to make the code easier to read
y <- x[[1]]
z <- x[["value.labels"]]
## Note which values are missing while the column still holds the stored values
height.na <- which(is.epidata.label.na(y$height, z, "Float Set"))
## Apply the labels
y$height <- epidata.label.value(y$height, z, "Float Set")
## Set the missing values
y$height[height.na] <- NA
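
For anyone who wants to try the recoding end to end, here is a self-contained sketch that uses match() directly instead of the proposed functions. The toy object x, its layout and the column values are all made up for illustration; a real import may well return something different. It keeps a numeric copy of the column as well, so the trade-off between options 1 and 2 is visible side by side.

```r
## Toy stand-in for an imported object: the data frame first, then the
## label sets under "value.labels" (assumed layout).
x <- list(
  data.frame(id = 1:5, height = c(1.11, 3.33, 8.88, 2.22, 9.99)),
  value.labels = list("Float Set" = data.frame(
    value   = c(1.11, 2.22, 3.33, 8.88, 9.99),
    label   = c("Float value A", "Float value B", "Float value C",
                "Second last missing", "Float missing"),
    missing = c(FALSE, FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
  ))
)

y   <- x[[1]]
set <- x[["value.labels"]][["Float Set"]]

## Position of each stored value in the label set (NA if not in the set)
idx <- match(y$height, set$value)
## TRUE only where the value is defined as missing in the label set
is.miss <- set$missing[idx] %in% TRUE

## Numeric version: stored values kept, missing codes set to NA
height.num <- y$height
height.num[is.miss] <- NA

## Labelled version: labels applied, missing codes set to NA
height.lab <- set$label[idx]
height.lab[is.miss] <- NA

mean(height.num, na.rm = TRUE)  # still possible on the numeric version
height.lab
```

The missing flags are worked out before the labels are applied, because once the column holds labels the stored values can no longer be looked up in the set.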
David

--
David Whiting, PhD | Senior Epidemiology & Public Health Specialist
tel +32-2-6437945 | mob +32-496-266436 | David.Whiting@idf.org