
Demystifying data science terminology

by Mango Solutions · August 15, 2018

This article is originally published at https://www.mango-solutions.com

The language used by data scientists can be confusing to anyone encountering it for the first time. Ever-changing best practices and constantly evolving technologies and methodologies have given rise to a range of nuanced terms used throughout casual data conversation. Unfamiliarity with these terms often leads to mismatched expectations across different parts of a business when undertaking projects involving data and analytics. To make the most of any data science project, it is important that participants have a shared vocabulary and an understanding of key terms at the level their role requires.

Mango Solutions is regularly involved in data science projects spanning different levels of a business. Below, we’ve outlined the most common data science terms that act as communication barriers in such projects:

Terms (common examples)	Definition for a data scientist	Definition for a data science manager	Definition for a business director
Data Science	An interdisciplinary field spanning mathematics, statistics and computer science aimed at delivering insights from data using a variety of technologies and methodologies.	An interdisciplinary business function making use of predictive and prescriptive analytics to make better business decisions.	The proactive use of data and advanced analytics to drive better decision making.
Descriptive Analytics	Examination of historical data to understand the changes occurring to a business. Used to answer the question “what happened?”
Diagnostic Analytics	Examination of historical data to understand why changes have occurred within a business. Used to answer the question “why did something happen?”
Predictive Analytics	The use of historical data to make predictions about future events. Used to answer the question “what will happen next?”
Prescriptive Analytics	The use of data and the above forms of analytics to determine the best course of action for a business. Used to answer the question “what’s the best decision we can make based on the data we have?”
Model	The mathematical relationships describing how a sample of data is generated from other observations.	A data science product where mathematical and statistical relationships are estimated from historical data and later used to make predictions.	The mathematical and statistical relationships used to make predictions about key business metrics (e.g. future sales or probability a customer will make a purchase).
Artificial Intelligence (AI)	In practice, this term is generally used to refer to “narrow AI” and encompasses the types of problems that can be solved with machine learning. AI usually encompasses topics like machine learning, natural language processing and computer vision among others.
Machine Learning (e.g. random forest, xgboost, neural networks)	A variety of computational methods implementing supervised and unsupervised learning to predict class labels or continuous measures.	Typically, regression and classification algorithms for building models, with many open-source implementations.	A broad range of leading predictive modelling methodologies.
Deep Learning	A generalisation of artificial neural networks that makes use of many intermediate layers of representation to better capture relationships between the observed data and predictions.	A subcategory of machine learning well-suited for complex models and particularly successful in image classification and speech translation.
Supervised Learning	Machine learning algorithms where data exists for both the prediction target and the observations from which the prediction will be made.		Machine learning problems where models are estimated from known examples (e.g. identifying fraudulent credit card transactions from reported cases).
Unsupervised Learning	A category of machine learning problems where labels or prediction targets are unknown and must be discovered from patterns in the data.	The class of machine learning problems where object groupings need to be discovered (e.g. clusters/labels for pieces of text).
Over-fitting	Estimation error where the model fits the noise in the data. This is often the result of using models that are too complex for either the problem or available data.	e.g. A complex image classifier trained using 20 photographs will likely have 100% classification accuracy on those images but otherwise perform poorly on new images.
Cross-Validation	An iterative approach for splitting data into train and test sets to ensure robust model estimation.	Critical strategy to ensure machine learning models don’t overfit the data and provide misleading predictions. This is needed to ensure models are general enough to be useful for making future predictions.
Training/Test Data	A division of data that allows unbiased model validation. Typically, models are estimated on training data and validated on “test” data that is withheld until the end of the analysis.
Classification	A general term for a class of predictive problems where the target of the prediction is a label (e.g. if an observation belongs to one of two categories).
Regression	Statistical and mathematical procedures for estimating the relationship between a set of variables and a target quantity while minimising the prediction errors.		A broad term often used to refer to model estimation where the target variable is a continuous value (e.g. weekly sales).
Forecasting	The prediction of future events using mathematical or statistical models.
Cloud (AWS, GCP, Azure, Cloudera)	A shared set of computational resources allowing on-demand scaling of infrastructure to meet business or project computational requirements.	A broad term for scalable, on-demand infrastructure and computing.	A shared set of computational resources that allow businesses to avoid upfront infrastructure costs.
Version/Source Control (Git, SVN, GitHub, GitLab)	A system for tracking, managing, and integrating code changes through a process involving branching and merging code repositories.	A system for tracking, managing, and integrating code changes while ensuring a full history of code changes is preserved along with comments from the individuals making those changes.	Framework for tracking code changes and allowing rollback to previous versions of software.
Unit Testing	The automation of code validation through tests designed to ensure the correct functioning of small components of code.	An often time-consuming step during development that helps programmers test code functionality and protect against future bugs. The benefit of unit testing is often realised in the long term.	Development practice that helps ensure correct code functionality.
Continuous Integration	A development practice where code changes are committed to a shared repository and validated by an automated build and testing process.	A practice used by a team of developers that helps protect against code integration failures and code changes that break existing or expected functionality.
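
To make a few of the modelling terms above more concrete, here is a minimal sketch in Python (chosen purely for illustration; it is not taken from the original article). It builds a small synthetic classification problem, splits it into training and test data, shows how a model that memorises its training data can over-fit, and estimates accuracy with a simple cross-validation loop. All names (make_data, predict_knn, cross_validate), the nearest-neighbour classifier and the 20% label noise are assumptions made for this example.

```python
import random


def make_data(n, noise, rng):
    """Synthetic supervised data: x in [0, 1), label is 1 when x > 0.5.

    Each label is flipped with probability `noise` to simulate noisy data.
    """
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def predict_knn(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, test, k):
    """Share of points in `test` that are classified correctly."""
    hits = sum(predict_knn(train, x, k) == y for x, y in test)
    return hits / len(test)


def cross_validate(data, k, folds=5):
    """Average held-out accuracy over `folds` different train/test splits."""
    scores = []
    for i in range(folds):
        held_out = data[i::folds]
        fitted_on = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(accuracy(fitted_on, held_out, k))
    return sum(scores) / folds


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = make_data(200, noise=0.2, rng=rng)
train, test = data[:150], data[150:]  # test data is withheld from fitting

# With k=1 every training point is its own nearest neighbour, so the
# model reproduces the training labels exactly, noise included.
train_acc_k1 = accuracy(train, train, k=1)
test_acc_k1 = accuracy(train, test, k=1)

# Cross-validation compares a very flexible model (k=1) with a smoother
# one (k=15) using the training data only.
cv_k1 = cross_validate(train, k=1)
cv_k15 = cross_validate(train, k=15)

print("train accuracy, k=1:", train_acc_k1)
print("test accuracy,  k=1:", test_acc_k1)
print("cross-validated accuracy, k=1 vs k=15:", cv_k1, cv_k15)
```

The gap between the training and test accuracy for k=1 is the over-fitting described in the table: the model has fitted the noise. On data like this, the smoother k=15 model would typically score better under cross-validation, which is the kind of evidence a data scientist uses to choose between models before the test data is touched.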

Mango Solutions can help you build a shared language around data science in your organisation. Based on our experience working with the world’s leading companies, we have developed 3 workshops to build a common language.

