Reading the new XML files into R


epidata-list＠lists.umanitoba.ca

9 Jun 2011

1:32 p.m.

To save me maybe re-inventing the wheel: has anyone looked at reading the new native XML files directly into R, ideally with SQL queries?

There is an XML package for R (http://www.omegahat.org/RSXML/c), but it would be really useful if someone more capable than me had already put together a method.

Many thanks,

Graham


epidata-list＠lists.umanitoba.ca

9 Jun

2:49 p.m.


It would be very welcome if someone could write a specific EpiData-to-R conversion, since my plan is that the rewrite of EpiData Analysis will include the possibility (if we succeed) of writing command files which start R and return with the results (behind the scenes, as far as the user is concerned).

regards

Jens Lauritsen
EpiData Association

epidata-list＠lists.umanitoba.ca

10 Jun

6:16 a.m.


Is this example data file one that I could work with to develop such code: http://www.epidata.org/dokuwiki/doku.php/documentation:datafileformat:exampl...

David

epidata-list＠lists.umanitoba.ca

6:30 a.m.


Ah, seems not. I've just downloaded the data entry client and manager and entered some data. It seems that it is different. I'll have a play with it and see what I can come up with.

David

epidata-list＠lists.umanitoba.ca

10:35 a.m.

Unfortunately the wiki is not completely up to date at this point. Within a month a correct wiki description will be in place.

The best approach is to create an XML file, as you did, and use that one.

regards

Jens Lauritsen

_______________________________________________
EpiData-list mailing list
EpiData-list@lists.umanitoba.ca
http://lists.umanitoba.ca/mailman/listinfo/epidata-list

epidata-list＠lists.umanitoba.ca

1:07 p.m.

Perhaps Duncan Temple Lang, the author of the R XML package, might be willing to do this, Jens.

http://www.stat.ucdavis.edu/~duncan/

Why not ask him to look at the new EpiData XML format and get his opinion/help?

Pete Geddes


epidata-list＠lists.umanitoba.ca

6:17 p.m.

Maybe it would be nice to add that implementation to Manager as well. That way, users of operating systems other than Windows (GNU/Linux and Mac) can also benefit.

Regards,


-- Omar Bautista González - Community self-management collaborator from the Dominican Republic

epidata-list＠lists.umanitoba.ca

11 Jun

1:22 a.m.


I am assuming that EpiData Analysis will eventually become available to Linux and Mac users, but it would still be useful to allow direct access to the EpiData files from R. If EpiData used SQLite as a backend, you would be able to access the data directly from R using SQL queries and would not need to export or import anything.

What I was hoping was that someone knew whether this could be done with the XML database.
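For comparison, here is a minimal sketch of what querying an SQLite file from R looks like. It assumes the DBI and RSQLite packages are installed; the `birds` table and its columns are invented for illustration, and nothing here implies that EpiData stores its data this way.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real database file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented example data, written to a table called "birds".
birds <- data.frame(
  Species = c("Blackbird", "Thrush", "Blackbird", "Sparrow"),
  Count   = c(34, 57, 130, 24),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "birds", birds)

# Only the result of the query comes back to R as a data frame.
res <- dbGetQuery(
  con,
  "SELECT Species, SUM(Count) AS total
     FROM birds
    GROUP BY Species
    ORDER BY Species"
)
dbDisconnect(con)

print(res)
```

The point of the sketch is that the filtering and aggregation happen inside the database engine, so R only ever holds the query result rather than the whole table.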

Graham

epidata-list＠lists.umanitoba.ca

2:33 a.m.


AFAIK Epidata is using the XML file as the backend. So to do what you suggest we would need a way of exporting the XML to SQLite. Then it could be accessed from within R. This route would not use the R XML package. Doing this would certainly be possible (but I wouldn't know how to). However, this still means doing an export.

I have started working on an R package that imports an epidata XML file directly into R using the R XML package. So far it creates a dataframe in R and uses the field information to convert the data to appropriate R data types. I haven't started working with the value labels yet.
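As a rough illustration of that conversion step (not the actual code from the package), the sketch below takes a data frame of character columns plus a per-field type description and converts each column. The type names ("string", "integer", "float", "date") and the helper `convert_fields()` are assumptions made up for this example; the real EpiData XML file may describe field types differently.

```r
# Hypothetical helper: convert character columns according to a named
# vector of type codes. The type codes are invented for this sketch.
convert_fields <- function(raw, field_types) {
  out <- raw
  for (nm in names(field_types)) {
    out[[nm]] <- switch(
      field_types[[nm]],
      integer = as.integer(raw[[nm]]),
      float   = as.numeric(raw[[nm]]),
      date    = as.Date(raw[[nm]], format = "%Y-%m-%d"),
      as.character(raw[[nm]])  # default: keep as text
    )
  }
  out
}

# Everything arrives from the XML as text.
raw <- data.frame(
  Name   = c("Ann", "Bob"),
  age    = c("34", "57"),
  height = c("1.62", "1.80"),
  dob    = c("1977-02-03", "1954-11-23"),
  stringsAsFactors = FALSE
)
field_types <- c(Name = "string", age = "integer",
                 height = "float", dob = "date")

x <- convert_fields(raw, field_types)
str(x)
```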

One potential disadvantage of this route is that the R XML approach reads the whole XML file into memory (I think), and this could create problems with large files on machines with limited resources. However, as a test I created an XML file with 6 columns and 16,000 records, and a netbook with 1 GB of memory reads it into R in 7 seconds. So, not exactly instantaneous, but then any export is also going to take a few seconds. I haven't tried to optimize it yet, so it might be possible to improve this.

It actually creates an R list object with a list of dataframes (because an epidata file may potentially contain multiple tables). The names of the dataframes come from the epidata XML file. In R it is used like this:

library(epidata)
x <- read.epidata.xml("myepidatafile.epx")

names(x)
[1] "datafile_id_0"

names(x$datafile_id_0)
[1] "Name" "age" "height" "dob" "uppercasetest" "st"

If you know that there is only one table in the XML file you could do this so that the dataframe is stored in the object x:

library(epidata)
x <- read.epidata.xml("myepidatafile.epx")[[1]]

names(x)
[1] "Name" "age" "height" "dob" "uppercasetest" "st"

David

epidata-list＠lists.umanitoba.ca

3:53 a.m.

David

> AFAIK Epidata is using the XML file as the backend. So to do what you suggest we would need a way of exporting the XML to SQLite. Then it could be accessed from within R. This route would not use the R XML package. Doing this would certainly be possible (but I wouldn't know how to). However, this still means doing an export.

No, sorry, I wasn't suggesting exporting to SQLite. I was just saying that with an SQLite backend you can query the database directly (i.e. without loading it all into R), and asking whether there was a way to query the XML database in the same way. If you go back to my original post that started this thread, that was what I was asking, but it then developed into a question about running R code from EpiData Analysis.

> I have started working on an R package that imports an epidata XML file directly into R using the R XML package. So far it creates a dataframe in R and uses the field information to convert the data to appropriate R data types. I haven't started working with the value labels yet.

This would be great and I look forward to it being available.

> One potential disadvantage of this route is that the R XML approach reads the whole XML file into memory (I think) and this could create problems with large files on machines with limited resources. However, as a test I have created an XML file with 6 columns and 16,000 records and a netbook with 1 GB of memory reads it into R in 7 seconds, so not exactly instantaneous, but then any export is also going to take a few seconds. I haven't tried to optimize it yet, so it might be possible to improve this.

I suspect that unless there is some way of querying the XML file directly, this is something we have to live with. There is an R package for running SQL queries on text files, but I am not sure whether that happens before or after they are loaded into R, and I can't find it now.

Personally, I feel the advantages of loading the XML file directly, without needing to convert it, outweigh the speed issue, certainly at the sample sizes I would be looking at, but it would be more of an issue for large studies.

> It actually creates an R list object with a list of dataframes (because an epidata file may potentially contain multiple tables). The names of the dataframes come from the epidata XML file. In R it is used like this:
>
> library(epidata)
> x <- read.epidata.xml("myepidatafile.epx")
>
> names(x)
> [1] "datafile_id_0"
>
> names(x$datafile_id_0)
> [1] "Name" "age" "height" "dob" "uppercasetest" "st"
>
> If you know that there is only one table in the XML file you could do this so that the dataframe is stored in the object x:
>
> library(epidata)
> x <- read.epidata.xml("myepidatafile.epx")[[1]]
>
> names(x)
> [1] "Name" "age" "height" "dob" "uppercasetest" "st"

This sounds excellent, and I would be happy to give it a trial once you get to the stage of sharing, if that would be useful.

Having said that, at the moment I can't even get XML to install. There is some issue with the xml2-config file not being available, but that is just on my main computer running Ubuntu; it seems fine on my MacBook.

Thanks,

Graham

epidata-list＠lists.umanitoba.ca

8:44 a.m.


OK, here you go. I've put the code on github: https://github.com/daudi/Epidata-XML-to-R

It's not a proper package yet, just some functions in a single file for now. And it probably isn't particularly great code (I'm still learning how to use XML), but it works. I've got some code in there for logging and debugging that can come out later (i.e. save(), status.log()). git clone it or just download the R file.

Some TODOs:

i) handle value labels;

ii) map the remaining data types;

iii) tighten up the code, e.g. replace for loops;

iv) deal with records of different lengths.

The last one is an issue that needs some thinking about. If all rows/records have the same number of columns it works okay. But if you create a screen, enter some data, then add a new field and only enter data in the subsequent records then the first set of records will have fewer fields than the second set of records. R will recycle values and issue a warning. Detecting this and then dealing with it will mean changing the function that gets the records and I need to ponder this one.

David

epidata-list＠lists.umanitoba.ca

9:44 a.m.

David,

Many thanks for this. I have imported my tiny sample file, but don't seem able to do anything with it. Is this related to you saying you haven't started working with value labels?

Here is some output from R

> x <- read.epidata.xml("test.epx", dec.sep = ".")
> x$datafile_id_0
        Date   Species Count st
1 2009-11-23 Blackbird    34  0
2 2010-11-23    Thrush    57  0
3 2011-12-24 Blackbird   130  0
4 2006-11-23 Blackbird   134  0
5 2011-06-23    Thrush    34  0
6 2005-05-23   Sparrow    24  0
> summary(x)
              Length Class      Mode
datafile_id_0 4      data.frame list
> boxplot(x$Count)
Error in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs) :
  need finite 'ylim' values
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In min(x) : no non-missing arguments to min; returning Inf
5: In max(x) : no non-missing arguments to max; returning -Inf

I'm also not sure what the st column is.

You will gather I am right on the edge of my understanding of R here.

Graham


epidata-list＠lists.umanitoba.ca

10:34 a.m.


Graham,

read.epidata.xml() returns a list of dataframes. To use other R functions like summary() you just need to get/use the dataframe from the list using double square brackets to extract an element from a list, i.e. [[]]. There is only one dataframe in the list, so it will be the first one, so we use [[1]]:

x <- read.epidata.xml("test.epx", dec.sep = ".")[[1]]
x
summary(x)
table(x$Species)
barplot(table(x$Species))
etc.
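The difference between the list and the dataframe inside it can be seen with base R alone. In this sketch a hand-built list stands in for the object returned by read.epidata.xml(); a few values are borrowed from the sample output earlier in the thread, but the construction itself is only an assumption about the shape of that object.

```r
# Stand-in for the returned object: a named list holding one data frame.
df <- data.frame(
  Species = c("Blackbird", "Thrush", "Blackbird"),
  Count   = c(34, 57, 130),
  stringsAsFactors = FALSE
)
x <- list(datafile_id_0 = df)

# The list has no element called "Count", so this is NULL.
# Passing NULL to boxplot() is what triggers the 'ylim' error.
is.null(x$Count)

# Extract the data frame first; these three forms are equivalent.
d <- x[[1]]
identical(d, x$datafile_id_0)
identical(d, x[["datafile_id_0"]])

# Now the column is reachable and the usual functions work.
summary(d$Count)
table(d$Species)
```

Note that single brackets, x[1], would return a list of length one rather than the data frame itself, which is why double brackets are needed.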

BTW, the field st comes from epidata, I don't know what it is for, I haven't checked the docs yet.

David

epidata-list＠lists.umanitoba.ca

10:53 a.m.

David,


Thanks again, I did see the commented out bit of code but didn't grasp the relevance of it.

All looking good so far, and I am learning new things so many thanks.

Graham

epidata-list＠lists.umanitoba.ca

12 Jun

9:44 a.m.

David,

I have just come across XPath: http://www.w3schools.com/xpath/default.asp

There are also questions on Stack Overflow which "suggest" that you can use XPath with the XML package in R to query an XML database directly, without importing it.

e.g. http://stackoverflow.com/questions/3876571/how-can-i-use-xpath-querying-usin... and http://stackoverflow.com/questions/4870207/xpath-within-r-using-xml-package

I admit I don't really follow it, but thought I should pass it on in case it was useful.

Graham

epidata-list＠lists.umanitoba.ca

9:53 a.m.

And another link showing some XPath code in R:

http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-...

Apologies if this is of no interest, and I now see there are several posts on Stack Overflow covering this issue.

Graham


epidata-list＠lists.umanitoba.ca

4:08 p.m.


Thanks for these. To do any useful R work we'll still need to import the records into R objects, but it might be possible to use this to limit the number of records that are read in from the XML file, e.g. selecting the first n records, records that match a pattern or perhaps a random sample of records.
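A minimal sketch of that idea with the R XML package follows. The toy document, with one `Record` element per row and field values held in attributes, is invented for illustration and is not the real EpiData schema. As far as I understand, xmlParse() still parses the whole document; the XPath expression only limits which nodes are turned into R objects.

```r
library(XML)

# Invented toy document, not the real .epx layout.
txt <- '<Records>
  <Record Species="Blackbird" Count="34"/>
  <Record Species="Thrush" Count="57"/>
  <Record Species="Blackbird" Count="130"/>
  <Record Species="Sparrow" Count="24"/>
</Records>'
doc <- xmlParse(txt, asText = TRUE)

# Select only the first two records.
first_two <- getNodeSet(doc, "//Record[position() <= 2]")

# Select records whose Species attribute matches a value.
blackbirds <- getNodeSet(doc, "//Record[@Species='Blackbird']")

# Pull one attribute out of the selected nodes and convert it.
counts <- as.integer(sapply(blackbirds, xmlGetAttr, "Count"))
print(counts)
```

A random sample could be drawn in a similar way by selecting all record nodes and then subsetting the resulting node list with sample() before converting it.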

David

epidata-list＠lists.umanitoba.ca

4:28 p.m.

David,


Glad they might be of some use.

Even if it only restricts the number of records imported, that would seem to be really useful when dealing with very large data sets.

Graham

epidata-list＠lists.umanitoba.ca

9 a.m.

Interesting to see how the "read into R" work has progressed in recent days. Just a few comments:

Basically, any external system reading the XML files should read at least items 1-3 of:

1. The data file structure (type and number of fields).
2. The contained data.
3. The metadata, that is, defined value labels, defined missing variables (now contained as part of the value labels), and the questions (or variable labels).

But possibly also, depending on purpose:

4. System data, such as the defined delimiters for decimals and dates (written in the header section of the XML).
5. Project information.

Mark Myatt, who was active in the early EpiData development, has written an introduction to R, including data examples. You can find it at: http://www.brixtonhealth.com/Rex.zip

Regarding the specific discussions here lately:

a. David writes:

> iv) deal with records of different lengths.

The reason we decided to write only the variables that contain data into the XML structure is that this makes data files much smaller. However, I think we could consider whether this can be defined by the user, such that all fields are written in each record, regardless of whether they contain data or not.

b. Once you are done with the script/principles, please write them up in a document that we can either link to or save in the wiki under examples.

regards

Jens Lauritsen
EpiData Association

epidata-list＠lists.umanitoba.ca

4:05 p.m.

Jens,

Thanks for the guidance.

On Sun, Jun 12, 2011 at 04:00:20PM +0200, epidata-list@lists.umanitoba.ca wrote:

> Interesting to see how the "read into R" is progressing in latest days. Just a few comments:
>
> Basically any external system reading xml files should read at least 1-3 of:
>
> 1. The data file structure (type and number of fields).

I've added an implementation of this. The list object now has a field info table.

> 2. The contained data

Yep.

> 3. The metadata - that is defined value labels, defined missing variables (now contained as part of the value labels), the questions (or variable labels).

Still need to do this.

> But possibly also depending on purpose:
> 4. System data, such as defined delimiters for decimals and dates (written in the header section of the xml).

Agreed. Will be easy.

> 5. Project information

Agreed. Will be easy.

> Mark Myatt active in the early EpiData development has written an introduction to R, including data examples. You find this on: http://www.brixtonhealth.com/Rex.zip
>
> Regarding the specific discussions lately here: a. David writes:
>
> iv) deal with records of different lengths.
>
> The reason we have decided to only write variables containing data in the xml structure is that this makes data files much smaller. However I think we could consider whether this can be defined by the user, such that all fields are written in each record, regardless of whether they contain data or not.

This makes sense. Anyway, I've changed the code to work around this issue, adding NA where the fields are not in the attributes. One side effect is that the field order may be changed as I need to sort them to ensure that they are all in the same order.
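The workaround can be sketched in base R roughly as follows. This is an illustration of the idea only, not the code in the repository: each record is represented as a named character vector (as one might get from the attributes of one XML record), and the record contents are invented.

```r
# Invented records; the third has a field that was added later.
records <- list(
  c(Name = "Ann", age = "34"),
  c(Name = "Bob", age = "57"),
  c(Name = "Cy",  age = "41", height = "1.75")
)

# Union of all field names, sorted so every row uses the same order.
all_fields <- sort(unique(unlist(lapply(records, names))))

# Indexing a named vector by an absent name yields NA, which pads
# the shorter records out to the full set of fields.
rows <- lapply(records, function(r) {
  r <- r[all_fields]
  names(r) <- all_fields
  r
})

# Stack the equal-length rows; nothing gets recycled any more.
x <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(x)
```

Sorting the names is what causes the side effect mentioned above: the columns come out in sorted order (which depends on the locale) rather than in the order the fields were defined.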

> b. Once you are done with the script/principles, please write up in a document that we can either link to or save in the wiki under examples.

Will do. It is all very experimental at the moment. Once I feel it has settled down I'll do this.

David

epidata-list＠lists.umanitoba.ca

11 Jun

10:45 a.m.

Ok, Graham.

Regards,


-- Omar Bautista González - Community self-management collaborator from the Dominican Republic
