Azure / Python / R / R News / Spark

SQL vs. NoSQL for Data Science

by tomaztsql · February 14, 2022

This article is originally published at https://tomaztsql.wordpress.com

Data come in variety of form, at different pace, and at different volume. And if all three criteria define the difference between SQL and NoSQL and there, all three are still irrelevant for data science.

My theorem is, that no matter what shape, size, frequeny, value and trustworthiness, SQL type of presenting the data is still the number one player.

Before you all jump, and start writing comments, hold on and continue reading.

Technology

SQL Databases as we have known them and as we know them nowadays have improved over the years in terms of velocity, performance, volume, frequency, and up until this day, keep the relational model, data ACID, isolations, and so on. All this – if you would like to address them – fix, rigid, complex, … tendencies have not only kept the databased number one, but provided the business, digitalization wave and cloud computing with structure and data visibility.

NoSQL database technologies are, on the other hand as described by many, flexible, can provide horizontal scaling, give better query performances due to data normalization and can do parallel computations.

I cannot argue with parallel computation and horizontal scaling. But with rest, I have big issues.

Offerings in Microsoft Azure

Let’s take a look into the different offerings in comparison between SQL and NoSQL:

Feature	SQL	NoSQL
Storage	Predefined tables	Key-value pair storage, Document Storage, Columnstore storage, Graph based storage
Data Model	Tables of rows and columns with relations between other tables, related data stored separately, joined to form complex queries	Stores data depending on database type: Key-value, documents, columnar type and graph databases
Example	SQL Server, Azure SQL Server, Azure Database	Azure Cosmos DB ( Table API, Cassandra API, Graph API), HBase in HDInsight, Azure Cache for Redis, Azure Table Storage, Azure Blob Store, Azure Data Lake store, Azure File Store, Azure Time Series Insights
Business Purpose	General purpose systems, CRM, Accounting, Finance, Human resources, Planning, Inventory management, Transactional management System	Mobile Apps, IoT Apps, Real-time data Stream and Analytics, Content management
Scale	Vertically by increasing server load	Horizontally by sharding across multiple
Analytics	SQL, SQL with R, Python, and Java, Analysis Services with Data Mining, Integration Services with data profiling, Azure Machine Learning	Spark, Scala, Python, R, Hive, SQL, .NET Core, Azure Machine Learning for Python, Stream Analytics

Analytical hiccups and trade-offs with NoSQL

Horizontal scaling and parallel computations are great. They both provide elasticity of resources and faster analytical results. When delivering data to analytical department, the ends must meet.

Strong consistency

You have probably heard the phrase “from eventual to strong consistency”. This simply means that the schema is defined on read and when a document, key-value pair binary file, columnar file is copied between different transformation zones it will eventually become consistent with schema, ACID rules, integrity constraints. But with eventual consistent data, you are gaining fast response (analytics) with the cost of potentials errors or stale data. And delivering this type of data to data science department will always result in back and forth communication full of nagging questions and clearing many data inconsistencies.

Data transformation

Keeping data in original format is great; Transforming it to a readable dataset is a big trade-off for data architect, data engineers and a huge must for data science.

Faster query performance can be achieved with in NoSQL without complex query joins. Normalized data would mean, that all transactional data, along with all dimensional data (with names, explanations) are included. Making datasets inadvertently large, but with cheap storage and scalability, this is not a problem. Problem lies with data consistency, updated, well, ACID. And it’s orchestration. Making shards and copies of data every time, something is changed, updated or deleted, can be tedious, invisible and result of many storage issues. But all orchestration issues are also solvable with right software and some coding.

Data usage and delivery

Data scientists love Python Pandas, Numpy, R’s Dplyr, data.table, Spark’s dataframe, datasets, Julia’s dataframe.jl. You get the picture. All are column and row based. In other words, all NoSQL data are delivered as SQL typed data or columnar typed. Even graph data (with edges and vertices) are transformed in this format.

Flexibility

I tend to calculate flexibility of NoSQL data through “time to market” KPI. How fast can add a new type of data (image, alter schema, new KPI) or a new change on the data and deliver it to the data science department is key to flexibility. There are for sure also other key factors regarding flexibility, for example through deployment and operations, replication or even availability.

Final thoughts

I am happy that NoSQL concepts, technologies and statistical approached have penetrated world of data science. Not only it helped developed new ways of calculation, improved and developed new algorithms, but also opened new ways of analysing formats, brought it to community faster that we have envisioned. But there are caveats to these concepts. Data still need to be cleaned, harmonised, consolidated. In order to bring the consistency, accuracy through transformation, wrangling and orchestration. And these processes must (!) not be neglected, overlooked nor underestimated. If they are, the complete NoSQL paradigm will go down the drain. And this will cost company precious resources, energy and unhappiness.

Thanks for visiting r-craft.org
This article is originally published at https://tomaztsql.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

SQL vs. NoSQL for Data Science

You may also like...

Categories

SQL vs. NoSQL for Data Science

Technology

Offerings in Microsoft Azure

Analytical hiccups and trade-offs with NoSQL

Strong consistency

Data transformation

Data usage and delivery

Flexibility

Final thoughts

You may also like...

Deep Learning with R and Keras: Build a Handwritten Digit Classifier in 10 Minutes

Testing rounded data for a circular uniform distribution

R Esquisse: How to Explore Data in R Through a Tableau-like Drag-and-Drop Interface

Categories