Pitfall of XML package: issues specific to cp932 locale, Japanese Shift-JIS, on Windows
This article is originally published at https://tomizonor.wordpress.com
The CRAN package XML has problems parsing HTML pages encoded in cp932 (Shift-JIS). In this report I show the issues and workarounds that can be applied on the user side.
I found the issues on both Windows 7 and Windows 10 with the Japanese language setting. Other versions and languages were not checked, but the issues are probably common to Windows systems worldwide that use non-European multibyte languages encoded in national locales rather than UTF-8.
Versions on my machines:
Windows 7 + R 3.2.3 + XML 3.98-1.3
Mac OS X 10.9.5 + R 3.2.0 + XML 3.98-1.3
Locales:
# Windows
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
# Mac
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"
1. incident
# Mac
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t1 <- as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
)
> t1 # good
[1] "最新の観測情報 (2016年1月17日 8時)"
I wrote the small R script above when I improved my PM2.5 script in the previous article. It worked on my Mac, but not on the Windows PC at my office.
Of course, a small change was needed on Windows to handle the difference in locales.
# Windows
t2 <- iconv(as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis')
> t2 # bad
[1] NA
It failed completely.
I found that this problem occurs depending on the text in the HTML page, so we need to know when and how to avoid the error. This report presents the solutions; the technical details will be shown in the next report.
2. solutions
2-1. No-Break Space (U+00A0)
Unicode character No-Break Space (U+00A0): \xc2\xa0 in UTF-8; &#160; or &#xa0; in HTML
When a Shift-JIS encoded HTML page contains U+00A0 as an HTML entity such as &nbsp;, the package XML runs into trouble. Strictly speaking, the problem does not originate in the package XML but in the function iconv: iconv returns NA when it tries to convert U+00A0 into Shift-JIS. We still have to be aware of it when using the package XML, because U+00A0 routinely appears in HTML as that familiar entity.
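The core of the problem can be reproduced with iconv alone. A minimal sketch (the one-character string is only an illustration, not taken from the page):
# U+00A0 has no Shift-JIS equivalent, so the conversion gives up
> iconv('\u00a0', from='utf-8', to='shift-jis')
[1] NA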
A solution is the sub= option of iconv, which replaces non-convertible characters with a given string instead of returning NA, for example:
sub=''
sub=' '
sub='byte'
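With a toy string (again just an illustration, not taken from the page), the variants behave like this; sub=' ', used in the calls below, replaces the unconvertible bytes with spaces in the same way:
> x <- 'a\u00a0b'                                     # 'a', No-Break Space, 'b'
> iconv(x, from='utf-8', to='shift-jis', sub='')      # drop the unconvertible bytes
[1] "ab"
> iconv(x, from='utf-8', to='shift-jis', sub='byte')  # show them as hex codes instead
[1] "a<c2><a0>b"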
# Windows
t3 <- iconv(as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis', sub=' ')
> t3 # bad
The result is a broken string and is not shown here.
Using sub= solves the U+00A0 issue in a Shift-JIS encoded page. Unfortunately, t3 above still fails, because there is another issue with that HTML page.
2-2. trim
The trim= option is common to the text functions of package XML, such as readHTMLTable and xmlValue. With trim=TRUE, the node text is returned with whitespace characters such as \t and \r stripped from both ends. This option is very useful for HTML pages, because they usually contain plenty of spaces and line feeds.
But trim=TRUE is not safe when a Shift-JIS encoded HTML page is read on a Windows PC with the Shift-JIS (cp932) locale. This issue is serious: the text string is completely destroyed.
Additionally, we must be aware of the default value of this option: trim=FALSE for xmlValue, and trim=TRUE for readHTMLTable.
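The defaults can be checked on a tiny document; a minimal sketch (the HTML fragment is made up for illustration):
> doc <- htmlParse('<p>  hello  </p>', asText=TRUE)
> node <- getNodeSet(doc, '//p')[[1]]
> xmlValue(node)              # trim=FALSE is the default for xmlValue
[1] "  hello  "
> xmlValue(node, trim=TRUE)   # whitespace at both ends is removed
[1] "hello"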
A solution is to use trim=FALSE and to remove the spaces with gsub after we get the untrimmed string.
# Windows
t4 <- gsub('\\s', '', iconv(
readHTMLTable(src, which=4, trim=F, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
, from='utf-8', to='shift-jis', sub=' '))
> t4 # good
[1] "最新の観測情報(2016年1月17日8時)"
The regular expression used with gsub is safe regardless of the platform locale.
More precisely, t4 above is not the same as the result of trim=TRUE: that regular expression removes every space in the sentence, although this hardly matters in Japanese text.
We may want to improve this as:
gsub('(^\\s+)|(\\s+$)', '', x)
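The difference between the two calls is easy to see on a toy string (illustration only):
> x <- ' 最新の観測情報 (2016年1月17日 8時) '
> gsub('\\s', '', x)                  # removes the inner spaces as well
[1] "最新の観測情報(2016年1月17日8時)"
> gsub('(^\\s+)|(\\s+$)', '', x)      # trims only the ends
[1] "最新の観測情報 (2016年1月17日 8時)"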
# Windows
t5 <- gsub('(^\\s+)|(\\s+$)', '', iconv(
readHTMLTable(src, which=4, trim=F, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
, from='utf-8', to='shift-jis', sub=' '))
> t5 # very good
[1] "最新の観測情報 (2016年1月17日 8時)"
Finally, both issues are solved and we have a script that works on Windows.
Strictly speaking, t1 and t5 are different: the spaces in t5 are U+0020, while those in t1 are U+00A0.
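This can be verified by listing the code points (a sketch; enc2utf8 converts from the platform encoding, so it is a no-op for t1 on the Mac and converts t5 from cp932 on Windows):
> utf8ToInt(enc2utf8(t1))    # Mac: 160 (U+00A0) appears at the space positions
> utf8ToInt(enc2utf8(t5))    # Windows: 32 (U+0020) appears instead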