[EpiData-list] Reading the new XML files into R

epidata-list at lists.umanitoba.ca epidata-list at lists.umanitoba.ca
Sat Jun 11 02:33:57 CDT 2011

On Sat, Jun 11, 2011 at 08:22:48AM +0200, epidata-list at lists.umanitoba.ca wrote:
> > Maybe it would be nice to add that implementation to Manager as well.
> > That way users of other operating systems (Windows, GNU/Linux, and
> > Mac) can benefit.
> >
> I am assuming that EpiData Analysis will eventually become available to
> Linux and Mac users, but it would still be useful to allow direct access to
> the EpiData files from R.  If EpiData used SQLite as a backend, you would be
> able to access the data directly from R using SQL queries and not need to
> export/import anything.
> I was hoping someone knew whether this could be done with the XML
> database.
> Graham

AFAIK EpiData uses the XML file itself as the backend. So to do what you suggest we would need a way of exporting the XML to SQLite; the data could then be queried from within R, without involving the R XML package.
Doing this would certainly be possible (though I wouldn't know how to do it). However, it still means doing an export.
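
For reference, the R side of the SQLite route is short once the data are in a database. What follows is only a minimal sketch, assuming the DBI and RSQLite packages are installed. The table name "patients" and its contents are invented for illustration, and an in-memory database stands in for a file exported from EpiData:

```r
library(DBI)

# Hypothetical data standing in for an exported EpiData table
patients <- data.frame(
  Name   = c("Ann", "Bob", "Cy"),
  age    = c(34L, 51L, 27L),
  height = c(1.62, 1.80, 1.75),
  stringsAsFactors = FALSE
)

# An in-memory database is used here; a path to an SQLite file
# would be passed in the same position
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "patients", patients)

# Query the table with SQL directly from R
older <- dbGetQuery(
  con,
  "SELECT Name, age FROM patients WHERE age > 30 ORDER BY age"
)

dbDisconnect(con)
```

The result of dbGetQuery() is an ordinary dataframe, so nothing further needs converting on the R side.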

I have started working on an R package that imports an epidata XML file directly into R using the R XML package. So far it creates a dataframe in R and uses the field information to convert the data to appropriate R data types. I haven't started working with the value labels yet. 
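
To illustrate the type-conversion step, here is a rough sketch in base R. It is not the code from the package: the function name convert_fields, the type names ("integer", "float", "date", "string") and the date format are all assumptions made for the example, not what is actually stored in the EpiData XML file.

```r
# Convert the columns of an all-character dataframe according to a
# named vector of field types. The type names are assumptions for
# illustration, not the codes used in an EpiData XML file.
convert_fields <- function(df, types, date_format = "%d/%m/%Y") {
  for (field in names(types)) {
    values <- df[[field]]
    # Treat empty strings as missing values
    values[!is.na(values) & values == ""] <- NA
    df[[field]] <- switch(types[[field]],
      integer = as.integer(values),
      float   = as.numeric(values),
      date    = as.Date(values, format = date_format),
      values  # default: leave the column as character
    )
  }
  df
}

# Everything arrives as text when read from XML
raw <- data.frame(
  Name   = c("Ann", "Bob"),
  age    = c("34", ""),
  height = c("1.62", "1.80"),
  dob    = c("02/03/1977", "15/11/1959"),
  stringsAsFactors = FALSE
)
types <- c(Name = "string", age = "integer", height = "float", dob = "date")

converted <- convert_fields(raw, types)
str(converted)
```

Keeping the field types in a separate named vector means the conversion does not depend on how the values themselves were extracted from the XML.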

One potential disadvantage of this route is that the R XML approach reads the whole XML file into memory (I think), which could cause problems with large files on machines with limited resources. However, as a test I created an XML file with 6 columns and 16,000 records, and a netbook with 1 GB of memory reads it into R in 7 seconds. That is not exactly instantaneous, but any export is also going to take a few seconds. I haven't tried to optimize it yet, so it might be possible to improve this.

It actually creates an R list containing one dataframe per table (because an EpiData file may contain multiple tables). The names of the dataframes come from the EpiData XML file.
In R it is used like this:

x <- read.epidata.xml("myepidatafile.epx")

names(x)
[1] "datafile_id_0"

names(x[[1]])
[1] "Name"      "age"       "height"    "dob"       "uppercasetest" "st"

If you know that there is only one table in the XML file, you can store the dataframe directly in the object x like this:

x <- read.epidata.xml("myepidatafile.epx")[[1]]

names(x)
[1] "Name"      "age"       "height"    "dob"       "uppercasetest" "st"
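
For files with more than one table, the returned list can be handled with the usual list tools. A small sketch using a hand-built list in place of the real return value (the second table name and all of the data are invented for illustration):

```r
# A stand-in for the object returned by the import: a named list of
# dataframes. The second table and its contents are invented.
x <- list(
  datafile_id_0 = data.frame(Name = c("Ann", "Bob"), age = c(34L, 51L)),
  datafile_id_1 = data.frame(visit = 1:3, weight = c(60.5, 61.0, 59.8))
)

names(x)                       # names of the tables
first <- x[["datafile_id_0"]]  # pick a table by name rather than position
dims  <- sapply(x, nrow)       # number of records in each table
cols  <- lapply(x, names)      # column names of every table
```

Selecting by name is a little safer than [[1]] if the order of tables in the file ever changes.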

