Data Aggregation with PySpark
This article was originally published at https://statcompute.wordpress.com
Import CSV File into Spark Dataframe
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SQLContext(SparkContext())
sdf1 = sc.read.csv("Documents/nycflights13.csv", header = True, inferSchema = True)
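For readers without a Spark installation, the load step can be sketched with the standard library's csv module. The sample rows below are invented stand-ins for nycflights13.csv (only the three columns used later are included), and the manual int() coercion plays the role of Spark's inferSchema option:

```python
import csv
import io

# Tiny in-memory stand-in for nycflights13.csv (values invented for illustration).
sample = io.StringIO(
    "month,dep_time,arr_time\n"
    "1,1340,1520\n"
    "3,1360,1510\n"
)
rows = list(csv.DictReader(sample))

# DictReader yields strings; coerce by hand, mimicking inferSchema=True.
rows = [{k: int(v) for k, v in r.items()} for r in rows]
print(rows[0]["month"])  # 1
```

With a real file you would pass an open file handle instead of the StringIO object; everything else stays the same.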
Data Aggregation with Spark Dataframe
import pyspark.sql.functions as fn

sdf1.cache() \
    .filter("month in (1, 3, 5)") \
    .groupby("month") \
    .agg(fn.mean("dep_time").alias("avg_dep"), fn.mean("arr_time").alias("avg_arr")) \
    .show()

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+
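The filter-groupby-agg chain above is ordinary split-apply-combine. As an illustration of what Spark computes (not how it computes it), here is the same logic in pure Python over a few invented records:

```python
from collections import defaultdict

# Toy records standing in for the flights data (values invented).
flights = [
    {"month": 1, "dep_time": 1300, "arr_time": 1500},
    {"month": 1, "dep_time": 1400, "arr_time": 1550},
    {"month": 3, "dep_time": 1360, "arr_time": 1510},
    {"month": 2, "dep_time": 900,  "arr_time": 1100},  # dropped by the filter
]

# filter("month in (1, 3, 5)") -> groupby("month")
groups = defaultdict(list)
for r in flights:
    if r["month"] in (1, 3, 5):
        groups[r["month"]].append(r)

# agg(mean(dep_time), mean(arr_time))
result = {
    m: {
        "avg_dep": sum(r["dep_time"] for r in g) / len(g),
        "avg_arr": sum(r["arr_time"] for r in g) / len(g),
    }
    for m, g in groups.items()
}
print(result[1]["avg_dep"])  # 1350.0
```

The Spark version distributes exactly this computation across partitions; cache() additionally pins the DataFrame in memory so repeated aggregations avoid re-reading the CSV.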
Data Aggregation with Spark SQL
sc.registerDataFrameAsTable(sdf1, "tbl1")
sc.sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month").show()
sc.dropTempTable(sc.tableNames()[0])

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+
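The query itself is plain ANSI aggregation SQL, so it can be tried against any SQL engine. As a sketch, the same statement run on an in-memory SQLite table (toy rows with invented values):

```python
import sqlite3

# In-memory table playing the role of the registered temp table "tbl1".
con = sqlite3.connect(":memory:")
con.execute("create table tbl1 (month int, dep_time real, arr_time real)")
con.executemany(
    "insert into tbl1 values (?, ?, ?)",
    [(1, 1300, 1500), (1, 1400, 1550), (3, 1360, 1510), (2, 900, 1100)],
)

# The same query string passed to sc.sql() above.
rows = con.execute(
    "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr "
    "from tbl1 where month in (1, 3, 5) group by month"
).fetchall()
print(sorted(rows))  # [(1, 1350.0, 1525.0), (3, 1360.0, 1510.0)]
```

The difference is purely operational: Spark SQL parses the same statement into the distributed plan that the DataFrame API built explicitly in the previous section.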