Advent of 2020, Day 13 – Using Python Databricks Koalas with Azure Databricks
This article is originally published at https://tomaztsql.wordpress.com
Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
- Dec 08: Using Databricks CLI and DBFS CLI for file upload
- Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
- Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
- Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
- Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
So far we have looked into SQL, R and Python, and this post will be about the Python Koalas package, an implementation of the pandas DataFrame API on Apache Spark. Data engineers and data scientists love Python pandas, since it makes data preparation easier, faster and more productive, and Koalas is a direct “response” that makes writing code on Spark easier and more familiar. Also check the official documentation for a full description of the package.
Koalas comes pre-installed on Databricks Runtime 7.1 and above, so we can use the package directly in an Azure Databricks notebook. Let us check the Runtime version: launch your Azure Databricks environment, go to Clusters, and there you should see the version:
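If you prefer to check the version from a notebook instead of the UI, a minimal sketch like the one below should print the runtime string; the spark.databricks.clusterUsageTags.sparkVersion configuration key is an assumption on my part, so treat the Clusters page as the authoritative source.
# Sketch: read the runtime version from the Spark configuration
# (the configuration key is an assumption; verify it on your cluster)
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))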
My cluster is rocking Databricks Runtime 7.3. So create a new notebook, name it Day13_Py_Koalas, select Python as the language and attach the notebook to your cluster.
1. Object Creation
Before going into the sample Python code, we must import the pandas and numpy packages, so that we can create data with them or convert it to/from Databricks Koalas.
import databricks.koalas as ks
import pandas as pd
import numpy as np
Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:
s = ks.Series([1, 3, 5, np.nan, 6, 8])
Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like.
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
with the result:
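If the screenshot does not come through, the output should look roughly like this (exact formatting may differ slightly between Koalas versions):
     a    b      c
10   1  100    one
20   2  200    two
30   3  300  three
40   4  400   four
50   5  500   five
60   6  600    six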
Now, let’s create a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:
dates = pd.date_range('20200807', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
and getting the results as a pandas DataFrame:
A pandas DataFrame can easily be converted to a Koalas DataFrame:
kdf = ks.from_pandas(pdf)
type(kdf)
with the resulting type:
Out[67]: databricks.koalas.frame.DataFrame
And we can output the DataFrame to get the same result as with the pandas DataFrame:
kdf
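As a side note, the same conversion should also work through the Koalas DataFrame constructor directly; a minimal sketch, assuming the pdf created above (kdf2 is just an illustrative name):
# Equivalent conversion using the constructor instead of from_pandas
kdf2 = ks.DataFrame(pdf)
type(kdf2)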
It is also possible to create a Koalas DataFrame from a Spark DataFrame. We need to load the additional pyspark package first, then create a SparkSession and a Spark DataFrame.
# Load package
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
Since Spark is lazy, we need to explicitly call the show function in order to see the Spark DataFrame.
sdf.show()
Creating a Koalas DataFrame from a Spark DataFrame: to_koalas() is automatically attached to the Spark DataFrame and available as an API when Koalas is imported.
kdf = sdf.to_koalas()
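The conversion also works in the other direction: to_koalas() has a counterpart, to_spark(), which returns a Spark DataFrame. A minimal sketch, assuming the kdf from above:
# Convert the Koalas DataFrame back to a Spark DataFrame
sdf2 = kdf.to_spark()
sdf2.show()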
2. Viewing data
See the top rows of the frame. The results may not be the same as with pandas, though: unlike pandas, the data in a Spark DataFrame is not ordered and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark will just take the requested number of rows from a partition.
kdf.head()
You can also display the index, columns, and the underlying numpy data.
kdf.index
kdf.columns
kdf.to_numpy()
And you can also use the describe function to get a statistical summary of your data:
kdf.describe()
You can also transpose the data by using T:
kdf.T
and many other functions. Grouping is another great way to get a summary of your data. Grouping can be done by “chaining” or adding a group-by clause. The internal process, when grouping is applied, happens in three steps:
- splitting the data into groups (based on criteria),
- applying a function to each group, and
- combining the results back into a data structure.
kdf.groupby('A').sum()
# or
kdf.groupby(['A', 'B']).sum()
Both group the data, the first time on column A and the second time on both columns A and B:
3. Plotting data
Databricks Koalas is also compatible with matplotlib and inline plotting. We need to load the package:
%matplotlib inline
from matplotlib import pyplot as plt
And we can continue by creating a simple pandas Series:
pser = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
that can simply be converted to a Koalas Series:
kser = ks.Series(pser)
Once we have a Koalas Series, we can compute the cumulative maximum of its values and plot it:
kser = kser.cummax()
kser.plot()
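The same idea extends to a whole DataFrame. A minimal sketch, loosely following the “Koalas in 10 minutes” notebook (pdf_plot and kdf_plot are hypothetical names): build four random series on the same date index, convert to Koalas, take the cumulative maximum and plot all columns at once.
# Sketch: cumulative maximum of four random series, plotted together
pdf_plot = pd.DataFrame(np.random.randn(1000, 4),
                        index=pser.index, columns=['A', 'B', 'C', 'D'])
kdf_plot = ks.from_pandas(pdf_plot).cummax()
kdf_plot.plot()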
And there are many other variations of plots. You can also load the seaborn package, the bokeh package and many others.
This blog post is a shorter version of a larger “Koalas in 10 minutes” notebook, which is available in the same Azure Databricks repository as all the samples from this blog post series. The notebook also briefly touches on data conversion to/from CSV, Parquet (*.parquet data format) and Spark IO (*.orc data format). All of the files will be persisted and visible on DBFS.
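For orientation, those input/output calls look roughly like the sketch below; the paths are placeholder DBFS locations, so adjust them to your workspace:
# Hypothetical DBFS paths; adjust to your workspace
kdf.to_csv('/tmp/day13_koalas.csv')
ks.read_csv('/tmp/day13_koalas.csv').head(3)

kdf.to_parquet('/tmp/day13_koalas.parquet')
ks.read_parquet('/tmp/day13_koalas.parquet').head(3)

kdf.to_spark_io('/tmp/day13_koalas.orc', format='orc')
ks.read_spark_io('/tmp/day13_koalas.orc', format='orc').head(3)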
Tomorrow we will explore Databricks jobs, from configuration to execution and troubleshooting, so stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!