Hat tip: join two Spark DataFrames on multiple columns (PySpark)
This article is originally published at https://hameddaily.blogspot.com/
df1.show()
+----+------+-------+
|id_a|time_a|value_a|
+----+------+-------+
| 1| 1| CA|
| 1| 2| CA|
| 2| 1| TX|
| 3| 5| NE|
| 4| 6| WA|
+----+------+-------+
df2.show()
+----+------+-----------+
|id_b|time_b| value_b|
+----+------+-----------+
| 1| 1| San Jose|
| 2| 1|Los Angeles|
| 2| 2| Austin|
+----+------+-----------+
Suppose we want to join df1 and df2 on both the id columns and the time columns. This can easily be done in PySpark; note that the join condition must reference the actual column names (id_a/time_a and id_b/time_b), and the join type is passed via the how keyword:

df = df1.join(df2, (df1.id_a == df2.id_b) & (df1.time_a == df2.time_b), how="inner")
df.show()
+----+------+-------+----+------+-----------+
|id_a|time_a|value_a|id_b|time_b| value_b|
+----+------+-------+----+------+-----------+
| 1| 1| CA| 1| 1| San Jose|
| 2| 1| TX| 2| 1|Los Angeles|
+----+------+-------+----+------+-----------+
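The inner join keeps only rows where both key columns agree. As a quick, framework-free sketch of that composite-key semantics (plain Python mirroring the df1/df2 data above, no Spark required), the same matching can be done with a lookup keyed on the (id, time) pair:

```python
# Rows mirroring df1 and df2 above: (id, time, value) tuples.
rows_a = [(1, 1, "CA"), (1, 2, "CA"), (2, 1, "TX"), (3, 5, "NE"), (4, 6, "WA")]
rows_b = [(1, 1, "San Jose"), (2, 1, "Los Angeles"), (2, 2, "Austin")]

# Build a lookup keyed on the composite (id, time) join key.
lookup = {(id_b, time_b): value_b for id_b, time_b, value_b in rows_b}

# Inner join: keep only rows whose (id, time) pair appears in both tables.
joined = [
    (id_a, time_a, value_a, id_a, time_a, lookup[(id_a, time_a)])
    for id_a, time_a, value_a in rows_a
    if (id_a, time_a) in lookup
]

for row in joined:
    print(row)
# (1, 1, 'CA', 1, 1, 'San Jose')
# (2, 1, 'TX', 2, 1, 'Los Angeles')
```

The two surviving rows match the df.show() output above: only the (1, 1) and (2, 1) key pairs exist in both tables.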