Data Scientists – scale your R applications with Oracle Machine Learning
This article is originally published at https://blogs.oracle.com/r/compendium/rss
You may have seen that Oracle R Distribution 3.6.1 was recently released, along with compatibility support in Oracle Machine Learning for R (OML4R) 1.5.1, formerly Oracle R Enterprise.
What you may not realize is that for the past year OML4R has been included with your Oracle Database license, so if you have Oracle Database, you can immediately start using OML4R in your production applications.
With OML4R, data scientists and other R users can access and manipulate database data and invoke in-database algorithms using familiar R syntax and semantics, with Oracle Database acting as a high-performance compute engine while minimizing or eliminating data movement to the client. You can augment in-database functionality with open source packages and easily deploy user-defined R functions using SQL. Here are a few highlights of OML4R functionality.
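To make this concrete, here is a minimal sketch, assuming a configured OML4R client and the ONTIME_S demo table that ships with the product; the connection details are placeholders for your own environment.

    library(ORE)      # loads the OML4R (Oracle R Enterprise) client packages

    # Connect to the database (user, password, host, service name, and port
    # below are placeholders)
    ore.connect(user = "oml_user", password = "welcome1",
                host = "dbhost", service_name = "pdb1", port = 1521)

    ore.sync()      # refresh ore.frame proxy objects for visible tables and views
    ore.attach()    # attach the schema so proxies are accessible by name

    class(ONTIME_S)             # an "ore.frame" proxy, not an in-memory copy
    dim(ONTIME_S)               # dimensions computed in the database
    summary(ONTIME_S$ARRDELAY)  # summary statistics pushed down to the database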
Upgraded R version compatibility
OML4R 1.5.1 is certified with R 3.6.1 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions.
OREdplyr
The widely used dplyr package provides a grammar for data manipulation on data.frame-like objects, both in memory and out of memory. It also serves as an interface to database management systems, operating on data.frame or numeric vector objects.
OREdplyr provides a subset of dplyr functionality, extending the OML4R transparency layer. OREdplyr functions accept ore.frame proxy objects, which map to database tables and views, instead of data.frames, giving in-database performance and scalability. OREdplyr allows users to avoid costly movement of data while scaling to larger data volumes because operations are not constrained by R client memory or data access latency.
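As a rough sketch, and assuming the connection and ONTIME_S proxy from the earlier example, the familiar dplyr verbs (select, filter, group_by, summarise, and arrange are assumed here to be among the supported subset) operate directly on the ore.frame and are translated to SQL that runs in the database.

    library(OREdplyr)
    library(magrittr)   # for the %>% pipe, if not already attached

    # Each verb below is applied to the ore.frame proxy; the work happens in
    # the database, and only the aggregated result is pulled by head()
    delays <- ONTIME_S %>%
      select(DEST, ARRDELAY) %>%
      filter(ARRDELAY > 15) %>%
      group_by(DEST) %>%
      summarise(avg_delay = mean(ARRDELAY),
                max_delay = max(ARRDELAY)) %>%
      arrange(desc(avg_delay))

    head(delays)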
Featured in-database algorithms
OML4R provides a range of in-database algorithms supporting various machine learning techniques, including:
Expectation Maximization (EM) - a popular probability density estimation technique used to implement a distribution-based clustering algorithm. Special features of this implementation include: an automated model search that finds the number of clusters or components up to a stated maximum; protection against overfitting; support for numeric and multinomial distributions; models with high-quality probability estimates; generation of cluster hierarchy, rules, and other statistics; support for both Gaussian and multi-valued Bernoulli distributions; and heuristics that automatically choose distribution types.
Explicit Semantic Analysis (ESA) - designed to improve text categorization, this algorithm computes "semantic relatedness" using cosine similarity between vectors representing the text, collectively interpreted as a space of concepts explicitly defined and described by humans. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA) because ESA uses a knowledge base that makes it possible to assign human-readable labels to concepts comprising the vector space.
Singular Value Decomposition (SVD) - this feature extraction algorithm uses orthogonal linear transformations to capture the underlying variance of the data by decomposing a rectangular matrix into three matrices: U, D, and V. Matrix D is a diagonal matrix whose singular values reflect the amount of data variance captured by the bases. Special features of this implementation include support for narrow data via Tall and Skinny solvers and for wide data via stochastic solvers, as well as traditional SVD for more stable results and eigensolvers for faster analysis with sparse data. A brief usage sketch follows this list.
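To illustrate calling an in-database algorithm, here is a short sketch based on the ore.odmSVD interface in the OREdm package; it pushes the open source USArrests data set into a temporary database table purely for demonstration, and the exact predict arguments may vary by release.

    # Copy an R data.frame to a temporary database table, returning an ore.frame
    USARRESTS <- ore.push(USArrests)

    svd.mod <- ore.odmSVD(~ ., USARRESTS)   # the model is built inside the database
    summary(svd.mod)                        # singular values and basis vectors
    head(predict(svd.mod, USARRESTS))       # project rows onto the extracted features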
Automated Text Processing
For select algorithms in the OREdm package (Support Vector Machine, Singular Value Decomposition, Non-negative Matrix Factorization, Explicit Semantic Analysis), users can identify columns that should be treated as text, similar to how OML4SQL enables automated text processing as a precursor to model building and scoring. Users supply Oracle Text attribute-specific settings as a list, where each element names a column to be treated as text and its value specifies the text transformation. Tokens and themes are automatically extracted and combined with other structured data for model building and scoring.
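As a hedged sketch, assume a hypothetical DEMOGRAPHICS ore.frame with a free-text COMMENTS column and an AFFINITY_CARD target; the ctx.settings argument carries the list described above, and the transformation string shown is illustrative only, so consult the User's Guide for the exact setting syntax.

    # COMMENTS is declared as a text column; the value is an illustrative
    # Oracle Text transformation spec (see the User's Guide for exact syntax)
    svm.mod <- ore.odmSVM(AFFINITY_CARD ~ .,
                          DEMOGRAPHICS,
                          "classification",
                          ctx.settings = list(COMMENTS = "TEXT(TOKEN_TYPE:THEME)"))
    summary(svm.mod)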
Partitioned Models
OML4R enables building a type of ensemble model in which each model consists of multiple sub-models. A sub-model is automatically built for each partition of the data, where partitions are determined by the unique values found in user-specified columns. Partitioned models automate scoring by allowing users to reference only the top-level model, ensuring the proper sub-model is selected based on the values of the partition column(s) for each row to be scored.
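As a sketch, assume a hypothetical CUSTOMERS ore.frame with a REGION column; the partition column is passed through the odm.settings list using the OML4SQL setting ODMS_PARTITION_COLUMNS, and which algorithms support partitioning is documented in the User's Guide.

    # One GLM sub-model is built per distinct REGION value; the setting name
    # mirrors the OML4SQL setting ODMS_PARTITION_COLUMNS
    glm.mod <- ore.odmGLM(LTV ~ .,
                          CUSTOMERS,
                          odm.settings = list(odms_partition_columns = "REGION"))

    # Score against the top-level model; each row is routed to the sub-model
    # matching its REGION value
    pred <- predict(glm.mod, CUSTOMERS)
    head(pred)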
For a complete list of features, see the User's Guide. To learn more, visit Oracle Machine Learning for R and explore the Oracle R Technologies blog.