Loading “wide data” in Oracle Database using Oracle R Enterprise
This article is originally published at https://blogs.oracle.com/r/compendium/rss
Data sets come in many shapes and sizes. Some are tall and thin, others are short and wide. Some take on the form of dense data, a.k.a., single-record case, where each row represents one entity, such as a customer or vehicle. Others take on the form of sparse data, a.k.a., transactional data, where each row typically consists of an identifier, variable name, and value, and a single "case" is represented by multiple rows sharing the same identifier.
R provides a variety of ways to morph data from one form into another, e.g., using the reshape2 package. What does this have to do with Oracle R Enterprise and Oracle Database?
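As a small illustration of the two shapes, here is a base-R sketch (the column names and values are invented for illustration) that pivots a dense data.frame into transactional format; reshape2's melt produces the same (id, variable, value) shape:

```r
# Dense form: one row per entity. Names and values are made up.
dense <- data.frame(id = 1:2, age = c(34, 51), income = c(72000, 58000))

# Transactional form: one (id, variable, value) row per cell,
# built here with base R only (melt from reshape2 does the same).
vars  <- setdiff(names(dense), "id")
trans <- data.frame(
  id       = rep(dense$id, times = length(vars)),
  variable = rep(vars, each = nrow(dense)),
  value    = unlist(dense[vars], use.names = FALSE)
)
nrow(trans)   # 4: one row per (id, variable) pair
```

The two representations carry identical information; the transactional one simply trades width for height, which is what makes it loadable into a narrow database table.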
As data science practitioners, we often receive data in flat files that we want to load into Oracle Database as a table. However, if that data is wide, i.e., has more than 1000 columns, you cannot simply create a database table from the data.frame, since Oracle Database limits a table to 1000 columns. In my previous post, Consolidating wide and shallow data with ORE Datastore, I discussed how you could use the Oracle R Enterprise datastore feature to store wide tables directly; however, this does not allow direct manipulation in Oracle Database, since the data is an R object serialized within an Oracle Database table.
While Oracle Database has the concept of nested columns, and the Oracle Advanced Analytics in-database machine learning algorithms can use nested columns to build models and score data with tens of thousands of predictors, ORE does not support nested columns. But we can use ORE to convert the data into transactional format and easily populate a table that can be used to create nested columns within Oracle Database using SQL, or to further manipulate the data using ORE or SQL.
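For instance, once a transactional table like the TRANS_DATA table built below exists, a nested-column view can be constructed with SQL along these lines. This is a sketch only: it assumes a TRANS_DATA(id, variable, value) table and uses the DM_NESTED_NUMERICAL/DM_NESTED_NUMERICALS types that ship with Oracle Data Mining; the view name is illustrative.

```sql
-- Sketch: collect each id's (variable, value) pairs into one nested column.
-- Assumes TRANS_DATA(id, variable, value) and Oracle's DM_NESTED_NUMERICAL types.
CREATE OR REPLACE VIEW trans_data_nested AS
SELECT id,
       CAST(COLLECT(DM_NESTED_NUMERICAL(variable, value))
            AS DM_NESTED_NUMERICALS) AS attributes
FROM trans_data
GROUP BY id;
```

The resulting view has one row per id, with all 9,001 attributes packed into a single nested column that the in-database algorithms can consume.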
An Example

To illustrate, we'll first create an R data.frame with 10,000 rows and 9,000 numeric columns (plus target and id columns, for 9,002 in all) and write it to a file:
rr <- 10000
cc <- 9000
widematrix <- matrix(rnorm(rr*cc), nrow=rr, ncol=cc)
widedata <- data.frame(widematrix)
widedata$target <- rnorm(rr)*10
widedata$id <- 1:rr
dim(widedata)
write.table(widedata, file="WIDE_DATA.csv", row.names=FALSE)

Then, we'll read that file and pivot the data into transactional format using the melt function from the reshape2 package. Finally, we create that table in Oracle Database.

widedata <- read.table("WIDE_DATA.csv", header=TRUE)
dim(widedata)

library(reshape2)
transdata <- melt(widedata, id.vars="id")
head(transdata)
dim(transdata)

library(ORE)
ore.connect(...)

ore.drop(table="TRANS_DATA")
ore.create(transdata, table="TRANS_DATA")
dim(TRANS_DATA)

Running this, we see that reading the 90,020,000 cells worth of data took about 22 minutes in open source R on an Exadata X2 machine. Performing the pivot is very fast - under 2 seconds. Creating the 90-million-row database table took under 7 minutes!

> dim(widedata)
[1] 10000  9002
> write.table(widedata, file="WIDE_DATA.csv")
> system.time(widedata <- read.table("WIDE_DATA.csv", header=TRUE))
    user   system  elapsed
1332.363    9.708 1343.965
> dim(widedata)
[1] 10000  9002
> system.time(transdata <- melt(widedata, id.vars="id"))
   user  system elapsed
  1.231   0.001   1.236
> dim(transdata)
[1] 90010000        3
> ore.drop(table="TRANS_DATA")
> system.time(ore.create(transdata, table="TRANS_DATA"))
   user  system elapsed
 62.080   5.648 413.627
> dim(TRANS_DATA)
[1] 90010000        3

Now, it's possible you don't have enough RAM on your machine to do this all at once. There is another option that uses R and ROracle to batch the data reads and table writes.
First, we'll create an empty table, TRANS_DATA_NEW, with the same structure as our result table above. Then, we connect to Oracle Database using ROracle directly, which lets us append to an existing table from within a for loop. To show forward progress, we commit after each iteration; this isn't strictly necessary, but be sure to commit at the end.
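The skip/nrows batching mechanics can be tried out at small scale against a temporary file in base R. All names and sizes below are invented for illustration; note that using ceiling() for the batch count also handles a row count that isn't an exact multiple of the batch size:

```r
# Toy demonstration of batched reads with skip/nrows; file, sizes, and
# column names are illustrative, not the article's data set.
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(id = 1:10, x1 = rnorm(10), x2 = rnorm(10))
write.table(df, tmp, row.names = FALSE)

header    <- names(read.table(tmp, header = TRUE, nrows = 1))
batchsize <- 4
nbatches  <- ceiling(nrow(df) / batchsize)   # handles a ragged last batch

rows_read <- 0
for (i in 0:(nbatches - 1)) {
  # skip the header line plus all previously read rows
  batch <- read.table(tmp, header = FALSE,
                      skip = i * batchsize + 1, nrows = batchsize)
  names(batch) <- header
  rows_read <- rows_read + nrow(batch)
}
rows_read   # 10
```

Each pass re-opens the file and skips ahead, so memory use is bounded by the batch size rather than the full table.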
ore.drop(table="TRANS_DATA_NEW")
ore.create(head(TRANS_DATA,0), table="TRANS_DATA_NEW")

library(ROracle)
con <- dbConnect(dbDriver("Oracle"), username="rquser", password="pwdoaa",
  dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=host)(PORT=port))(CONNECT_DATA=(SERVICE_NAME=svcname)))")

header <- names(read.table("WIDE_DATA.csv", header=TRUE, nrows=1))
batchsize <- 100   # rows per batch
for(i in 0:(rr/batchsize-1)) {
  widedata <- read.table("WIDE_DATA.csv", header=FALSE,
                         skip=i*batchsize+1, nrows=batchsize)
  names(widedata) <- header
  transdata <- melt(widedata, id.vars="id")
  dbWriteTable(con, "TRANS_DATA_NEW", transdata,
               row.names=FALSE, append=TRUE)
  dbCommit(con)
}

Once the data is in Oracle Database in transactional format, it's easy to use SQL to construct nested columns for use with the Oracle Advanced Analytics SQL API. Here's a link to an example involving setting up data for association rules mining.