R News

xml2 1.0.0

by RStudio | Open source & professional software for data science teams on RStudio · July 5, 2016

This article is originally published at https://www.rstudio.com/blog/

We are pleased to announced that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install the latest version with:

install.packages("xml2")

There are three major improvements in 1.0.0:

You can now modify and create XML documents.
xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes.
Improved namespace handling when working with XPath.

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Modification and creation

xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text().

The basic process of creating an XML document by hand looks something like this:

root <- xml_new_document() %>% xml_add_child("root")

root %>%
  xml_add_child("a1", x = "1", y = "2") %>%
  xml_add_child("b") %>%
  xml_add_child("c") %>%
  invisible()

root %>%
  xml_add_child("a2") %>%
  xml_add_sibling("a3") %>%
  invisible()

cat(as.character(root))
#> <?xml version="1.0"?>
#> <root><a1 x="1" y="2"><b><c/></b></a1><a2/><a3/></root>

For a complete description of creation and mutation, please see vignette("modification", package = "xml2").

`xml_find_first()`

xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object.

This makes it much easier to work with ragged/inconsistent hierarchies:

x1 <- read_xml("<a>
  <b></b>
  <b><c>See</c></b>
  <b><c>Sea</c><c /></b>
</a>")

c <- x1 %>%
  xml_find_all(".//b") %>%
  xml_find_first(".//c")
c
#> {xml_nodeset (3)}
#> [1] <NA>
#> [2] <c>See</c>
#> [3] <c>Sea</c>

Missing nodes are replaced by missing values in functions that return vectors:

xml_name(c)
#> [1] NA  "c" "c"
xml_text(c)
#> [1] NA    "See" "Sea"

XPath and namespaces

XPath is challenging to use if your document contains any namespaces:

x <- read_xml('
 <root>
   <doc1 xmlns = "http://foo.com"><baz /></doc1>
   <doc2 xmlns = "http://bar.com"><baz /></doc2>
 </root>
')
x %>% xml_find_all(".//baz")
#> {xml_nodeset (0)}

To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*():

x %>% xml_ns()
#> d1 <-> http://foo.com
#> d2 <-> http://bar.com
x %>% xml_find_all(".//d1:baz")
#> {xml_nodeset (1)}
#> [1] <baz/>

If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip():

xml_ns_strip(x)
x %>% xml_find_all(".//baz")
#> {xml_nodeset (2)}
#> [1] <baz/>
#> [2] <baz/>

Thanks for visiting r-craft.org
This article is originally published at https://www.rstudio.com/blog/
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

xml2 1.0.0

You may also like...

Categories

xml2 1.0.0

Modification and creation

xml_find_first()

XPath and namespaces

You may also like...

Patchwork R package goes nerd viral

R vs. Python: What’s the best language for Data Science?

bigrquery 1.1.0

Categories

`xml_find_first()`