Pitfall of XML package: issues specific to cp932 locale, Japanese Shift-JIS, on Windows
This article is originally published at https://tomizonor.wordpress.com
The CRAN package XML has problems parsing HTML pages encoded in cp932 (Shift-JIS). In this report I show the issues and workarounds that can be applied on the user side.
I found the issues on both Windows 7 and Windows 10 with the Japanese language setting. Other versions and languages were not checked, but the issues are probably common to Windows systems worldwide that use non-European multibyte languages encoded in national locales rather than UTF-8.
Versions on my machines:
Windows 7 + R 3.2.3 + XML 3.98-1.3
Mac OS X 10.9.5 + R 3.2.0 + XML 3.98-1.3
Locales:
# Windows
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
# Mac
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"
1. incident
# Mac
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t1 <- as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
)
> t1 # good
[1] "最新の観測情報 (2016年1月17日 8時)"
I wrote the small R script above when I improved my PM2.5 script in the previous article. It worked on my Mac, but not on the Windows PC at my office.
Of course, a small change was needed on Windows to handle the difference in locales.
# Windows
t2 <- iconv(as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis')
> t2 # bad
[1] NA
It failed completely.
I found that this problem occurs depending on the text in the HTML page, so we need to know when and how to avoid the error. This report presents the solutions; the technical details will be shown in the next report.
2. solutions
2-1. No-Break Space (U+00A0)
Unicode character No-Break Space (U+00A0): \xc2\xa0 in UTF-8; &#160; or &#xa0; in HTML
When a Shift-JIS encoded HTML page contains U+00A0 as an HTML entity such as &nbsp;, the package XML runs into trouble. Strictly speaking, the problem does not originate in the package XML but in the function iconv: iconv returns NA when it tries to convert U+00A0 into Shift-JIS. We still have to be aware of it when using the package XML, because U+00A0 routinely appears in HTML as that familiar entity.
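The core of the problem can be reproduced with iconv alone. A minimal sketch (the one-character string is only an illustration, not taken from the page):
# U+00A0 has no Shift-JIS equivalent, so the conversion gives up
> iconv('\u00a0', from='utf-8', to='shift-jis')
[1] NA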
A solution is the sub= option of iconv, which replaces non-convertible characters with a given string instead of returning NA, for example:
sub=''
sub=' '
sub='byte'
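With a toy string (again just an illustration, not taken from the page), the variants behave like this; sub=' ', used in the calls below, replaces the unconvertible bytes with spaces in the same way:
> x <- 'a\u00a0b'                                     # 'a', No-Break Space, 'b'
> iconv(x, from='utf-8', to='shift-jis', sub='')      # drop the unconvertible bytes
[1] "ab"
> iconv(x, from='utf-8', to='shift-jis', sub='byte')  # show them as hex codes instead
[1] "a<c2><a0>b"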
# Windows
t3 <- iconv(as.character(
readHTMLTable(src, which=4, trim=T, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis', sub=' ')
> t3 # bad
The result is a broken string and is not shown here.
Using sub= solves the U+00A0 issue in a Shift-JIS encoded page. Unfortunately, t3 above still fails, because there is another issue with that HTML page.
2-2. trim
The trim= option is common to the text functions of package XML, such as readHTMLTable and xmlValue. With trim=TRUE, the node text is returned with whitespace characters such as \t and \r stripped from both ends. This option is very useful for HTML pages, because they usually contain plenty of spaces and line feeds.
But trim=TRUE is not safe when a Shift-JIS encoded HTML page is read on a Windows PC with the Shift-JIS (cp932) locale. This issue is serious: the text string is completely destroyed.
Additionally, we must be aware of the default value of this option: trim=FALSE for xmlValue, and trim=TRUE for readHTMLTable.
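The defaults can be checked on a tiny document; a minimal sketch (the HTML fragment is made up for illustration):
> doc <- htmlParse('<p>  hello  </p>', asText=TRUE)
> node <- getNodeSet(doc, '//p')[[1]]
> xmlValue(node)              # trim=FALSE is the default for xmlValue
[1] "  hello  "
> xmlValue(node, trim=TRUE)   # whitespace at both ends is removed
[1] "hello"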
A solution is to use trim=FALSE and to remove the spaces with gsub after we get the untrimmed string.
# Windows
t4 <- gsub('\\s', '', iconv(
readHTMLTable(src, which=4, trim=F, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
, from='utf-8', to='shift-jis', sub=' '))
> t4 # good
[1] "最新の観測情報(2016年1月17日8時)"
The regular expression used with gsub is safe regardless of the platform locale.
More precisely, t4 above is not the same as the result of trim=TRUE: that regular expression removes every space in the sentence, although this hardly matters in Japanese text.
We may want to improve this as:
gsub('(^\\s+)|(\\s+$)', '', x)
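The difference between the two calls is easy to see on a toy string (illustration only):
> x <- ' 最新の観測情報 (2016年1月17日 8時) '
> gsub('\\s', '', x)                  # removes the inner spaces as well
[1] "最新の観測情報(2016年1月17日8時)"
> gsub('(^\\s+)|(\\s+$)', '', x)      # trims only the ends
[1] "最新の観測情報 (2016年1月17日 8時)"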
# Windows
t5 <- gsub('(^\\s+)|(\\s+$)', '', iconv(
readHTMLTable(src, which=4, trim=F, header=F,
skip.rows=2:48, encoding='shift-jis')[1,1]
, from='utf-8', to='shift-jis', sub=' '))
> t5 # very good
[1] "最新の観測情報 (2016年1月17日 8時)"
Finally, both issues are solved and we have a script that works on Windows.
Strictly speaking, t1 and t5 are different: the spaces in t5 are U+0020, while those in t1 are U+00A0.
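This can be verified by listing the code points (a sketch; enc2utf8 converts from the platform encoding, so it is a no-op for t1 on the Mac and converts t5 from cp932 on Windows):
> utf8ToInt(enc2utf8(t1))    # Mac: 160 (U+00A0) appears at the space positions
> utf8ToInt(enc2utf8(t5))    # Windows: 32 (U+0020) appears instead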