Epidata XML to R: how to handle labels?
I would like to hear opinions about using epidata labels when importing into R. Here's a fragment of XML showing a value label set from sample.epx that comes with the data manager.
<ValueLabelSet id="valuelabelset_id_1"> <Type>3</Type> <Name>Float Set</Name> <Internal> <ValueLabel order="1" value="1,11"> <Label lang="en">Float value A</Label> </ValueLabel> <ValueLabel order="2" value="2,22"> <Label lang="en">Float value B</Label> </ValueLabel> <ValueLabel order="3" value="3,33"> <Label lang="en">Float value C</Label> </ValueLabel> <ValueLabel order="4" value="8,88" missing="true"> <Label lang="en">Second last missing</Label> </ValueLabel> <ValueLabel order="5" value="9,99" missing="true"> <Label lang="en">Float missing</Label> </ValueLabel> </Internal> </ValueLabelSet>
This says that a numeric (float) variable that uses this label set should give the numeric value 1.11 the label "Float value A", etc. and defines two types of missing value: 8.88 and 9.99. R does not have a concept of labelling values in this way (unless things have changed recently; I have not kept up with the R list for a couple of years or so). R also does not have the concept of multiple missing values, you either have a value to work with, or you don't. It is possible to attach a comment to an R object, but these are not transferred when an object is copied, so they can easily become lost.
I think we have the following options:
1) use the actual stored value (e.g. 1.11), ignore the labels except for the missing value definitions and for these just make them NA. This loses information that might be useful, but means that numeric values (and dates, etc) keep their data type and can be analysed appropriately, i.e. it is possible to calculate the mean of 1.11, but not the mean of "Float value A";
2) use the labels in all cases. The result of this is the opposite of option 1: we keep the extra information that the labels provide, but coerce all data to characters/factors and can no longer do many statistical analyses;
3) create a new column for each variable that uses a label set, so that we have both the original data and next to it a column with the labels. This could result in a much larger data set in R, and possible confusion, especially if analyses change the value of one column and not the other.
4) Code in options 1-3, allowing the user to specify which approach to take;
5) Code in options 1-3, allowing the user to specify for each column which approach to take.
6) Do nothing at all to the data, just return the label information as part of the result of the read function (in the same way that the table structure is returned and the study meta data will be returned) so that the user can then easily read it in R and use it to recode variables/values appropriately. This would mean manually recoding missing values. I could write functions that are able to work with this information so that users can then fairly easily query the label set representation to find out the label information. In R this could work like this, assuming that x is an object that contains label information:
epidata.label.value(1.11, x, "Float Set")
"Float value A"
epidata.label.value(8.88, x, "Float Set")
"Second last missing"
## If the value is not in the value set, return NA epidata.label.value(1.66, x, "Float Set")
NA
is.epidata.label.na(1.11, x, "Float Set")
FALSE
is.epidata.label.na(8.88, x, "Float Set")
TRUE
is.epidata.label.na(1.66, x, "Float Set")
NA
As I work through this I am tending towards option 6; I would probably have to write functions like this anyway, and I think that making them visible to the user provides the greatest flexibility and probably limits the risk of being surprised. Recoding a variable in R using the labels would then work like this, assuming that x is an object created by importing an epidata XML file into R, and that the dataframe contains a field called "height":
## Extract the data frame and labels into new objects to make the code easier to read y <- x[[1]] z <- x[["value.labels"]]
## Apply the labels y$height <- epidata.label.value(y$height, z, "Float Set")
## Set the missing values is.epidata.label.na(y$height, z, "Float Set") <- NA
David -- David Whiting, PhD | Senior Epidemiology & Public Health Specialist tel +32-2-6437945 | mob +32-496-266436 | David.Whiting@idf.org
International Diabetes Federation 166 Chaussée de La Hulpe, B-1170 Brussels, Belgium tel +32-2-5385511 | fax +32-2-5385114 info@idf.org | www.idf.org | VAT BE 0433.674.528
IDF | Promoting diabetes care, prevention and a cure worldwide
I think the answer lies in this: http://www.statmethods.net/input/valuelabels.html
Which by the way is a very good source for solutions in R.
Partial cut from that is:
--- To understand value labels in *R*, you need to understand the data structure factor http://www.statmethods.net/input/datatypes.html.
You can use the factor function to create your own value lables.
|# variable v1 is coded 1, 2 or 3 # we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels = c("red", "blue", "green")) |
|So I think the answer to use factor principles in R.
What might have to be handled differently are floats, which seldom would be needed (but can).
---
| One alternative to use the xml directly would be to export to Stata format with Manager and then use in R |# input Stata file library(foreign) mydata <- read.dta("c:/mydata.dta") |
Regarding Missing values: What I would do is to compare the direct (simplest) way with the XML with the "import" from Stata and then test the behaviour of missing values. But do be aware that currently there are some problems with import/export in Manager to and from Stata format with missing values.
Stata has many missing values (.a to .z) and most likely the import to R from a Stata file have solved that.
regards Jens Lauritsen EpiData Association
Quick reply: yes, strings are easy, converting to factors, it's the other data types that need some consideration. Regarding using the stata export, I did at one point wonder if what I was doing was completely unecessary and that using the stata export would be sufficient, but I think that having something that is native to R provides more possibilities.
I'm proceeding with the idea of reporting the labels and providing functions to ease working with them. I'm nearly there ...
David --
On Mon, Jun 13, 2011 at 02:11:36PM +0200, epidata-list@lists.umanitoba.ca wrote:
I think the answer lies in this: http://www.statmethods.net/input/valuelabels.html
Which by the way is a very good source for solutions in R.
Partial cut from that is:
To understand value labels in *R*, you need to understand the data structure factor http://www.statmethods.net/input/datatypes.html.
You can use the factor function to create your own value lables.
|# variable v1 is coded 1, 2 or 3 # we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels = c("red", "blue", "green")) |
|So I think the answer to use factor principles in R.
What might have to be handled differently are floats, which seldom would be needed (but can).
| One alternative to use the xml directly would be to export to Stata format with Manager and then use in R |# input Stata file library(foreign) mydata <- read.dta("c:/mydata.dta") |
Regarding Missing values: What I would do is to compare the direct (simplest) way with the XML with the "import" from Stata and then test the behaviour of missing values. But do be aware that currently there are some problems with import/export in Manager to and from Stata format with missing values.
Stata has many missing values (.a to .z) and most likely the import to R from a Stata file have solved that.
regards Jens Lauritsen EpiData Association
David Whiting, PhD | Senior Epidemiology & Public Health Specialist tel +32-2-6437945 | mob +32-496-266436 | David.Whiting@idf.org
International Diabetes Federation 166 Chaussée de La Hulpe, B-1170 Brussels, Belgium tel +32-2-5385511 | fax +32-2-5385114 info@idf.org | www.idf.org | VAT BE 0433.674.528
IDF | Promoting diabetes care, prevention and a cure worldwide
_______________________________________________
EpiData-list mailing list EpiData-list@lists.umanitoba.ca http://lists.umanitoba.ca/mailman/listinfo/epidata-list
Great work on this so far, David. This will be useful, because you can get more of the metadata into R directly. I'm not sure what you mean by "other data types". As I see it, if an EpiData field uses value labels, then that field (variable) should be a factor in R.
Not well documented, because you cannot create them in EpiData Manager, are boolean fields. These are a holdover from the current version of EpiData and so far they persist in the new version if imported from old .rec files. FIeld type is 0 (zero) and valid values are Y, N or missing.
I don't think there was an answer to someone's question about the field ST. This is record status (0=active, 1=deleted). In the future, another code will be used for "verified". An import option would be to ignore deleted records (default for EpiData Analysis).
Since date, time and decimal separators are metadata, the R function should read these from the XML and convert to whatever R requires. This should not require user choice.
I'll be trying this out, even though I don't use R much anymore. We taught EpiData to a group last week, among which were several R users. They liked the simplicity of EpiData Analysis, but will be very happy with the direct import to R.
Jamie Hockin Ottawa
Quick reply: yes, strings are easy, converting to factors, it's the other data types that need some consideration. Regarding using the stata export, I did at one point wonder if what I was doing was completely unecessary and that using the stata export would be sufficient, but I think that having something that is native to R provides more possibilities.
I'm proceeding with the idea of reporting the labels and providing functions to ease working with them. I'm nearly there ...
David
Jamie,
On Tue, Jun 14, 2011 at 03:58:10PM +0200, epidata-list@lists.umanitoba.ca wrote:
Great work on this so far, David. This will be useful, because you can get more of the metadata into R directly. I'm not sure what you mean by "other data types". As I see it, if an EpiData field uses value labels, then that field (variable) should be a factor in R.
It seems from the sample.epx file that any data type can have labels. The snippet I included was from the sample file and was for numeric values. So, it seems you could have a field for systolic blood pressure and a label for 160 that says "a bit too high". You can't do this in R (except by using attr() but that has limitations).
Once I have the approach I'm playing with working I'll upload it and seek feedback.
Not well documented, because you cannot create them in EpiData Manager, are boolean fields. These are a holdover from the current version of EpiData and so far they persist in the new version if imported from old .rec files. FIeld type is 0 (zero) and valid values are Y, N or missing.
I don't think there was an answer to someone's question about the field ST. This is record status (0=active, 1=deleted). In the future, another code will be used for "verified". An import option would be to ignore deleted records (default for EpiData Analysis).
Yes, I asked about this. Thanks, this is clear.
Since date, time and decimal separators are metadata, the R function should read these from the XML and convert to whatever R requires. This should not require user choice.
Yes, I plan to use the metadata.
I'll be trying this out, even though I don't use R much anymore. We taught EpiData to a group last week, among which were several R users. They liked the simplicity of EpiData Analysis, but will be very happy with the direct import to R.
Please note that it is still a bit experimental, and some of the code will need cleaning up.
Thanks for your comments.
David --
Jamie Hockin Ottawa
Quick reply: yes, strings are easy, converting to factors, it's the other data types that need some consideration. Regarding using the stata export, I did at one point wonder if what I was doing was completely unecessary and that using the stata export would be sufficient, but I think that having something that is native to R provides more possibilities.
I'm proceeding with the idea of reporting the labels and providing functions to ease working with them. I'm nearly there ...
David
David Whiting, PhD | Senior Epidemiology & Public Health Specialist tel +32-2-6437945 | mob +32-496-266436 | David.Whiting@idf.org
International Diabetes Federation 166 Chaussée de La Hulpe, B-1170 Brussels, Belgium tel +32-2-5385511 | fax +32-2-5385114 info@idf.org | www.idf.org | VAT BE 0433.674.528
IDF | Promoting diabetes care, prevention and a cure worldwide
_______________________________________________
EpiData-list mailing list EpiData-list@lists.umanitoba.ca http://lists.umanitoba.ca/mailman/listinfo/epidata-list
When you create value labels in EpiData, you are also restricting the values that a field can hold. So value labels never apply to a continuous variable like systolic blood pressure. Value labels are exclusively for coded values.
Jamie
On 2011-06-14, at 10:30 AM, epidata-list@lists.umanitoba.ca wrote:
It seems from the sample.epx file that any data type can have labels. The snippet I included was from the sample file and was for numeric values. So, it seems you could have a field for systolic blood pressure and a label for 160 that says "a bit too high". You can't do this in R (except by using attr() but that has limitations).
On Tue, Jun 14, 2011 at 07:49:36PM +0200, epidata-list@lists.umanitoba.ca wrote:
When you create value labels in EpiData, you are also restricting the values that a field can hold. So value labels never apply to a continuous variable like systolic blood pressure. Value labels are exclusively for coded values.
OK, now I get it. I've just added some code to help deal with labels. Not completed yet, but now when an epx file is read in the labels are also added to the object. At the moment the object contains the data, a summary of the table structure and a summary of the labels, e.g.:
x <- read.epidata.xml("sample.epx", dec.sep = ".")
names(x)
[1] "datafile_id_0" "field.info" "labels"
x$labels
$`Integet Set` $`Integet Set`$type [1] "1"
$`Integet Set`$labels value order label missing 1 10 1 Value A FALSE 2 20 2 Value B FALSE 3 30 3 Value C FALSE 4 99 4 A missing TRUE
$`Float Set` $`Float Set`$type [1] "3"
$`Float Set`$labels value order label missing 1 1,11 1 Float value A FALSE 2 2,22 2 Float value B FALSE 3 3,33 3 Float value C FALSE 4 8,88 4 Second last missing TRUE 5 9,99 5 Float missing TRUE
$`String Set` $`String Set`$type [1] "12"
$`String Set`$labels value order label missing 1 AAA 1 First string label FALSE 2 BBB 2 Second string label FALSE 3 CCC 3 Third string label FALSE 4 ZZZ 4 Missing string label TRUE
x$field.info
id name type length decimals question 1 f19 VLAST 1 2 0 Last field... 2 f22 M3 1 5 0 The No-Enter field 3 f23 M2 1 4 0 Confirm-entry field 4 f24 M1 1 2 0 Just a field 5 f14 A1 2 4 0 Auto Incrementing (4 digits) 6 f9 A2 7 10 0 Date Field Auto (dmy) 7 f10 A3 8 10 0 Date Field Auto (mdy) 8 f11 A4 9 10 0 Date Field Auto (ymd) 9 f13 A5 11 8 0 Time Field Auto 10 f0 VL1 1 2 0 Integer Field (2 digits) 11 f2 VL2 3 4 2 Float Field (1.2 digits) 12 f16 VL3 12 3 0 String Field (3 characters) 13 f8 S3 13 40 0 Sample text 14 f4 S2 12 40 0 Sample text 15 f15 S1 12 20 0 Language 16 f1 I1 1 8 0 Integer Field (8 digits) 17 f3 F1 3 9 4 Float Field (4.4 digits) 18 f17 I2 1 3 0 Integer field (3 digits) 19 f5 D1 4 10 0 Date Field (DD/MM/YYYY) 20 f6 D2 5 10 0 Date Field (MM/DD/YYYY) 21 f12 T1 10 8 0 Time Field (HH:MM:SS) 22 f7 D3 6 10 0 Date Field (YYYY/MM/DD) 23 f18 J1 1 2 0 Jump field 24 f20 J2 1 2 0 With valuelabel 25 f21 J3 3 4 2 Float with second max 26 f26 O2 1 3 0 Default value 27 f25 O1 1 3 0 Repeat Value 28 f27 O3 1 3 0 Compare field 29 f32 O4 1 2 0 Notes field 30 f28 C1 1 2 0 Day 31 f29 C2 1 2 0 Month 32 f30 C3 1 4 0 Year 33 f31 C4 6 10 0 Result Date Field (YMD)
summary(x$datafile_id_0)
VL1 I1 A3 A4 Min. :10.00 Min. : 42 01/05/2011:12 2011/01/05:12 1st Qu.:10.00 1st Qu.: 928037 Median :25.00 Median :29012340 Mean :44.67 Mean :36010019 3rd Qu.:99.00 3rd Qu.:62317068 Max. :99.00 Max. :90123456
T1 A5 A1 S1 VL3 01.00.00:1 14.22.55:1 Min. :100.0 ภาษาไทย :1 AAA:3 02.00.00:1 14.25.36:1 1st Qu.:102.8 Dansk :1 BBB:1 03.00.00:1 14.26.38:1 Median :105.5 English :1 CCC:4 04.00.00:1 14.27.40:1 Mean :105.5 Español :1 ZZZ:4 05.00.00:1 14.28.23:1 3rd Qu.:108.2 Français:1 06.00.00:1 14.30.54:1 Max. :111.0 Íslenska:1 (Other) :6 (Other) :6 (Other) :6 I2 J1 VLAST VL2 Min. : 50.00 Min. :0.000 Min. : 1.000 Min. :1.110 1st Qu.: 52.75 1st Qu.:0.000 1st Qu.: 3.500 1st Qu.:1.110 Median : 55.50 Median :1.500 Median : 7.000 Median :2.775 Mean : 76.25 Mean :1.583 Mean : 6.636 Mean :4.625 3rd Qu.: 58.25 3rd Qu.:3.000 3rd Qu.: 9.500 3rd Qu.:9.990 Max. :250.00 Max. :4.000 Max. :12.000 Max. :9.990 NA's : 1.000 J2 J3 M3 M2 M1 Min. :10.0 Min. :1.110 Min. : NA Min. : 2.0 Min. : 1.000 1st Qu.:10.0 1st Qu.:6.938 1st Qu.: NA 1st Qu.: 6.0 1st Qu.: 3.000 Median :54.5 Median :8.880 Median : NA Median : 8.0 Median : 7.000 Mean :54.5 Mean :7.354 Mean :NaN Mean : 143.6 Mean : 6.556 3rd Qu.:99.0 3rd Qu.:9.990 3rd Qu.: NA 3rd Qu.: 11.0 3rd Qu.: 9.000 Max. :99.0 Max. :9.990 Max. : NA Max. :1234.0 Max. :12.000 NA's : 6.0 NA's :4.000 NA's : 12 NA's : 3.0 NA's : 3.000 O1 O2 O3 C1 Min. :2.000 Min. : 1.00 Min. : 2.00 Min. : 1.00 1st Qu.:3.000 1st Qu.: 4.25 1st Qu.: 5.25 1st Qu.: 3.75 Median :4.500 Median : 5.50 Median : 6.50 Median : 6.50 Mean :4.833 Mean : 13.25 Mean : 15.08 Mean : 8.75 3rd Qu.:6.250 3rd Qu.: 8.25 3rd Qu.: 9.25 3rd Qu.:10.00 Max. :9.000 Max. :100.00 Max. :111.00 Max. :25.00
C2 F1 C3 C4 Min. : 1.000 Min. : 25.02 Min. : 3.00 1920/02/02:1 1st Qu.: 3.000 1st Qu.: 113.71 1st Qu.: 4.75 2003/03/03:1 Median : 4.500 Median :2901.23 Median : 7.50 2003/03/22:1 Mean : 5.333 Mean :3613.54 Mean : 174.92 2004/04/04:1 3rd Qu.: 7.250 3rd Qu.:6231.71 3rd Qu.: 11.00 2005/05/05:1 Max. :12.000 Max. :9012.35 Max. :2012.00 2006/06/06:1 (Other) :6 O4 S2 D1 Min. :0.00 นี้คือข้อความตัวอย่าง :1 Min. :2001-01-01 1st Qu.:1.00 Đây là một văn bản mẫu :1 1st Qu.:2003-12-26 Median :1.50 Dette er en prøve tekst:1 Median :2007-07-08 Mean :1.75 échantillon de texte :1 Mean :2007-01-21 3rd Qu.:2.25 ejemplo de un texto :1 3rd Qu.:2010-09-10 Max. :4.00 This is a sample text :1 Max. :2011-09-11 (Other) :6 D2 D3 S3 A2 01/01/2001:1 1912/09/10:1 สำหรับไฟล์ :1 05/01/2011:12 01/13/2011:1 2001/01/01:1 Đây là một văn bản mẫu :1 02/02/2002:1 2002/02/02:1 ÐETTA ER A SÝNISHORN :1 03/03/2003:1 2003/03/03:1 DETTE ER EN PRØVE TEKST:1 04/04/2004:1 2004/04/04:1 ÉCHANTILLON DE TEXTE :1 05/05/2005:1 2005/05/05:1 EJEMPLO DE UN TEXTO :1 (Other) :6 (Other) :6 (Other) :6 st 0:12
David --
Jamie
On 2011-06-14, at 10:30 AM, epidata-list@lists.umanitoba.ca wrote:
It seems from the sample.epx file that any data type can have labels. The snippet I included was from the sample file and was for numeric values. So, it seems you could have a field for systolic blood pressure and a label for 160 that says "a bit too high". You can't do this in R (except by using attr() but that has limitations).
EpiData-list mailing list EpiData-list@lists.umanitoba.ca http://lists.umanitoba.ca/mailman/listinfo/epidata-list
participants (1)
-
epidata-list@lists.umanitoba.ca