David
AFAIK Epidata is using the XML file as the backend. So to do what you suggest we would need a way of exporting the XML to SQLite. Then it could be accessed from within R. This route would not use the R XML package. Doing this would certainly be possible (but I wouldn't know how to). However, this still means doing an export.
No, sorry, I wasn't suggesting exporting to SQLite. I was just saying that with an SQLite backend you can query the database directly (i.e. without loading it all into R), and asking whether there was a way to query the XML database in the same way. If you go back to my OP that started this thread, that was what I was asking, but it then developed into a question about running R code from Epidata analysis.
I have started working on an R package that imports an epidata XML file directly into R using the R XML package. So far it creates a dataframe in R and uses the field information to convert the data to appropriate R data types. I haven't started working with the value labels yet.
This would be great and I look forward to it being available.
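For anyone following along, the type-conversion step described above might look roughly like the sketch below. It uses only base R. The field type labels ("integer", "float", "date"), the `field_types` vector and the `converters` list are all made up for illustration; they are not taken from the epidata file format or from the package being written.

```r
# Raw values as they might come out of an XML file: everything is character.
raw <- data.frame(
  age    = c("34", "61"),
  height = c("1.70", "1.82"),
  dob    = c("1980-02-14", "1953-11-03"),
  stringsAsFactors = FALSE
)

# Hypothetical field metadata: one type label per column (assumed names).
field_types <- c(age = "integer", height = "float", dob = "date")

# Map each assumed type label to a base R conversion function.
converters <- list(
  integer = as.integer,
  float   = as.numeric,
  date    = function(v) as.Date(v, format = "%Y-%m-%d")
)

# Convert column by column, looking up the converter from the metadata.
typed <- raw
for (col in names(field_types)) {
  convert <- converters[[field_types[[col]]]]
  typed[[col]] <- convert(raw[[col]])
}

str(typed)
```

Keeping the conversions in a lookup list means a new field type only needs one extra entry rather than another branch in the loop.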
One potential disadvantage of this route is that the R XML approach reads the whole XML file into memory (I think) and this could create problems with large files on machines with limited resources. However, as a test I have created an XML file with 6 columns and 16,000 records, and a netbook with 1 GB of memory reads it into R in 7 seconds. So, not exactly instantaneous, but then any export is also going to take a few seconds. I haven't tried to optimize it yet, so it might be possible to improve this.
I suspect that unless there is some way of querying the XML file directly, this is something we have to live with. There is an R package for running SQL queries on text files, though I can't find it now, and I'm not sure whether the query runs before or after the file is loaded into R.
Personally, I feel the advantages of loading the XML file directly, without needing to convert it, outweigh the speed issue, certainly at the sample sizes I would be looking at, but it would be more of an issue for large studies.
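On querying the XML directly: the R XML package does support XPath, so something like the sketch below may be possible. The element names (`records`, `record`, `name`, `age`) are a made-up, simplified layout, not the real epidata .epx schema. Note that `xmlParse()` still parses the whole document into memory, so this does not remove the memory concern; it only means that just the selected values are turned into R objects.

```r
library(XML)

# Hypothetical, simplified record layout; NOT the real epidata .epx schema.
xml_text <- '<records>
  <record><name>Ann</name><age>34</age></record>
  <record><name>Bob</name><age>61</age></record>
  <record><name>Cy</name><age>47</age></record>
</records>'

# Parse the text into an internal (C-level) document.
doc <- xmlParse(xml_text, asText = TRUE)

# The XPath predicate keeps only records whose age is above 40,
# and xmlValue extracts the text of each matching <name> node.
older <- xpathSApply(doc, "//record[age > 40]/name", xmlValue)

# Release the memory held by the parsed document.
free(doc)

print(older)
```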
It actually creates an R list object with a list of dataframes (because an epidata file may potentially contain multiple tables).
The names of the dataframes come from the epidata XML file.
In R it is used like this:
library(epidata)
x <- read.epidata.xml("myepidatafile.epx")
names(x)
[1] "datafile_id_0"
names(x$datafile_id_0)
[1] "Name" "age" "height" "dob" "uppercasetest" "st"
If you know that there is only one table in the XML file you could do this so that the dataframe is stored in the object x:
library(epidata)
x <- read.epidata.xml("myepidatafile.epx")[[1]]
names(x)
[1] "Name" "age" "height" "dob" "uppercasetest" "st"
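If it is not certain that a file holds only one table, taking `[[1]]` could silently ignore the rest. A small base R sketch of a more defensive version follows. Since the epidata package is not available yet, `tables` here is a hand-built stand-in for whatever `read.epidata.xml()` returns, assuming it is a named list of data frames as described above.

```r
# Stand-in for the result of read.epidata.xml(): a named list of data frames.
tables <- list(
  datafile_id_0 = data.frame(Name = c("Ann", "Bob"), age = c(34, 61))
)

# Stop with an error if the file holds more than one table,
# rather than quietly dropping the others.
stopifnot(length(tables) == 1)
x <- tables[[1]]

# Alternatively, keep the list and look at every table in turn.
for (table_name in names(tables)) {
  cat(table_name, ":", nrow(tables[[table_name]]), "rows,",
      ncol(tables[[table_name]]), "columns\n")
}
```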
This sounds excellent, and I would be happy to give it a trial once you get to the stage of sharing, if that would be useful.
Having said that, at the moment I can't even get the XML package to install. There is some issue with the xml2-config file not being available, but that is only on my main computer running Ubuntu; it seems fine on my MacBook.
Thanks,
Graham