
Are all algorithm implementations created equal?

by Mark Hornick · July 26, 2019

This article was originally published at https://blogs.oracle.com/r/compendium/rss

Unfortunately, not all machine learning algorithm implementations are the same, and the differences can have a significant impact on data science project success. Too often, a data science project that shows promise in the lab runs into scalability, performance, and deployment issues when it moves to production.

At one level, we can view an algorithm as a set of instructions that performs a particular computation. Ideally, these instructions are unambiguous, but even when they are clear, their implementation can take many forms – from research prototype to enterprise-ready software. From a role perspective, a scientist has the insight to design the algorithm, but often an engineer needs to implement it to meet certain production specifications.

Other things being equal, most algorithm implementations work effectively on small or even moderate-size data sets. However, when placed into production and scaled up to enterprise workloads, many experience performance and scalability issues. These often result from two design choices: single-threaded execution and the requirement that all the data reside in memory for processing. Many open source packages offer an extensive and highly valuable set of machine learning algorithms and techniques, yet they often share both limitations.

State-of-the-art engineering

This is where applying state-of-the-art engineering techniques pays off. Many successful and useful algorithms can be redesigned to take advantage of parallelism (multi-threading) and distributed execution (across multiple nodes). This overcomes performance issues because model building and scoring can use multiple CPUs and compute nodes on large-scale hardware, e.g., Oracle Exadata, and cloud-based solutions, e.g., Oracle Autonomous Database.
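As a rough illustration of the redesign described above, the following Python sketch fits a straight line by splitting the data into partitions, computing partial sums for each partition on a separate worker, and combining the results. It is a minimal sketch and not how any Oracle product is implemented; the function names `partial_stats` and `fit_line_parallel` are hypothetical. Threads are used to keep the example short; in CPython, a process pool or several machines would be needed for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Sufficient statistics for a least-squares line fit on one partition."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def fit_line_parallel(rows, n_partitions=4):
    """Fit y = slope * x + intercept from per-partition partial sums."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]

    # Each partition is summarized independently, so workers never
    # need to see each other's data.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = list(pool.map(partial_stats, chunks))

    # Combining is a plain element-wise sum of the partial statistics.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Synthetic data on the line y = 2x + 1 (made up for this example).
rows = [(float(x), 2.0 * x + 1.0) for x in range(1000)]
slope, intercept = fit_line_parallel(rows)
print(slope, intercept)
```

The design point is that the expensive pass over the data is expressed as independent partial computations plus a cheap combine step, which is what lets the same algorithm spread across threads or nodes.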

To scale to larger data volumes, especially those that do not fit in memory, machine learning algorithms need to be redesigned to improve memory utilization. This may occur through, e.g., working on smaller batches of data incrementally and having efficient internal data representations to minimize memory consumption, especially for sparse data.
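One way to picture incremental processing is a statistic that is updated batch by batch, so that only the current batch and a few running totals are ever held in memory. The sketch below uses Welford's online algorithm for the mean and variance; it is an illustrative assumption about how such a redesign might look, and the names `batches` and `RunningStats` are hypothetical.

```python
def batches(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class RunningStats:
    """Mean and sample variance maintained incrementally (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, batch):
        for x in batch:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# The generator stands in for a data source too large to load at once.
stats = RunningStats()
for batch in batches((float(v) for v in range(1, 101)), batch_size=16):
    stats.update(batch)

print(stats.n, stats.mean, stats.variance)
```

Memory use here depends on the batch size rather than on the total number of rows, which is the property the paragraph above is after.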

A kinder, gentler interface

We can go one step further and reflect on the requirements of the algorithm inputs. These come in two flavors: data and algorithm settings, which include hyperparameters (see What's the difference between a parameter and a hyperparameter?). Regarding data, many algorithms have explicit requirements on data format or representation. For example, neural networks normally require all data to be numeric and normalized. While data scientists may be familiar with the details for each algorithm, less expert users are often stymied by individual algorithm peculiarities. In terms of hyperparameters, some algorithms provide few, if any, “knobs” that can be adjusted to affect model quality or performance, while others provide a wide range of such knobs that may not be well understood by typical users. The combinatorial space of possible settings, which can include feature selection, can make the machine learning task tedious or mundane.
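To give a feel for how quickly this space grows, the short Python sketch below counts the candidate models for a small grid of settings combined with feature selection. The hyperparameter names, their values, and the feature count are all made up for illustration and do not come from any particular algorithm.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.3],
    "max_depth": [3, 6, 9],
    "regularization": [0.0, 0.1, 1.0],
    "normalize": [True, False],
}
n_features = 10  # assumed number of candidate input columns

# Every combination of one value per hyperparameter.
settings = list(product(*grid.values()))
n_settings = len(settings)

# Every non-empty subset of the candidate features.
n_feature_subsets = 2 ** n_features - 1

total_models = n_settings * n_feature_subsets
print(n_settings, n_feature_subsets, total_models)
```

Even this modest grid gives 72 settings and 1,023 feature subsets, or 73,656 candidate models if every pairing were tried exhaustively.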

To address this, algorithm implementations can be augmented with support for automatic data preparation, so that data scientists do not need to perform perfunctory transformations on every data set (unless they want to) and so that those transformations are automatically applied when scoring data. Since each algorithm may have a specific data input representation, the set of required transformations can become part of the model building process. Any statistics can be maintained with the model and used during scoring. Typical transformations include binning, normalization, and outlier treatment. See Understanding Automatic Data Preparation for more details. Further, a degree of intelligence can be built into the algorithm to automatically and efficiently tune the hyperparameters, so the data scientist can minimize time spent on building and comparing many models that span a wide range of possible hyperparameters.
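The idea of keeping preparation statistics with the model can be sketched in a few lines of Python. The class below is a hypothetical toy, not Oracle's Automatic Data Preparation: it learns per-column minimum and maximum values while fitting a nearest-centroid classifier, then reuses those stored statistics to clip and normalize every row it scores.

```python
import math


class AutoPrepNearestCentroid:
    """Toy classifier that learns and stores its own data preparation."""

    def fit(self, rows, labels):
        # Preparation statistics are computed once, at build time.
        cols = list(zip(*rows))
        self.mins = [min(c) for c in cols]
        self.maxs = [max(c) for c in cols]

        sums, counts = {}, {}
        for row, label in zip(rows, labels):
            vec = self._prepare(row)
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self

    def _prepare(self, row):
        out = []
        for v, lo, hi in zip(row, self.mins, self.maxs):
            v = min(max(v, lo), hi)  # clip to the training range
            out.append((v - lo) / (hi - lo) if hi > lo else 0.0)
        return out

    def predict(self, row):
        # The same stored statistics are applied automatically when scoring.
        vec = self._prepare(row)
        return min(
            self.centroids,
            key=lambda label: math.dist(vec, self.centroids[label]),
        )


# Made-up rows of (age, income) on very different scales.
rows = [(25, 30000), (30, 35000), (55, 90000), (60, 95000)]
labels = ["low", "low", "high", "high"]
model = AutoPrepNearestCentroid().fit(rows, labels)

print(model.predict((28, 32000)))
print(model.predict((58, 200000)))  # income is clipped before scoring
```

The caller passes raw values at both build and scoring time; the normalization and the simple outlier treatment travel with the model instead of being repeated by hand.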

Oracle Machine Learning

As such, not all algorithm implementations are created equal. Oracle Machine Learning, through the Oracle Advanced Analytics option to Oracle Database, addresses the need of enterprises for scalable and performant algorithms, while also providing automatic data preparation and support for hyperparameter tuning. Oracle software engineers apply decades of experience with Oracle Database parallelism and software optimization to achieve machine learning algorithms that execute in the Oracle Database kernel. Since enterprise data often reside in an Oracle database, there is no need to move data to external servers to perform machine learning. This eliminates data access latency, duplication, and the corresponding security, backup, and recovery issues that ensue. The Oracle Machine Learning algorithms are available directly from a SQL API (OML4SQL), R API (OML4R), notebook interface, and the Oracle Data Miner user interface. See Oracle Machine Learning for more information.


