Advent of 2023, Day 5 – Getting data into lakehouse
This article is originally published at https://tomaztsql.wordpress.com
In this Microsoft Fabric series:
- Dec 01: What is Microsoft Fabric?
- Dec 02: Getting started with Microsoft Fabric
- Dec 03: What is lakehouse in Fabric?
- Dec 04: Delta lake and delta tables in Microsoft Fabric
We have learned about delta lake and delta tables. But since we uploaded the file directly last time, let’s now explore how else we can get data into the lakehouse.
One way to get the data into lakehouse is to use the data pipelines:
I have named my pipeline “Day5-Ingesting data”. After naming it, select “Copy data”. You will be presented with multiple choices for copying data from different sources. Explore the categories of data sources, from workspaces to databases, files, and services/apps. With all these sources, you get a sense of how many drivers and connectors are available to data engineers.
Select the “Public holidays” sample dataset (for the sake of brevity) and proceed until you choose the data destination. There, select the “Lakehouse” workspace and proceed:
The root folder is set to Tables, but if you have multiple files on a blob store, Files may be the more suitable option. For now, check the table name and the column mappings. Continue to “Review and save” and start the data transfer immediately.
A super simple data copy object will be created with the pipeline run ID and activity status.
Go to your Advent2023 lakehouse and refresh the tables; the “Public_holidays” delta table should appear. If you check the files behind the table, you will see the _delta_log folder and the parquet file(s):
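As an alternative to browsing the files in the user interface, you can inspect the table’s metadata from a notebook. A minimal sketch, using the table name from the copy step above; DESCRIBE DETAIL is standard Delta Lake SQL that returns, among other columns, the table format, storage location, and number of files:

%%sql
-- inspect the delta table's format, storage location and file count
DESCRIBE DETAIL Public_holidays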
Click the ellipsis next to the table (Public_holidays …) and select “Open in notebook >> New notebook”, and start exploring the delta log.
Before running any DML operations, check the history of the table:
%%sql
DESCRIBE HISTORY Public_holidays
Since there has only been a single initial load, the history will show just that one entry. With the following query, see the number of records per country:
%%sql
SELECT
count(*) as Nof_records
,countryRegionCode
FROM Public_holidays
GROUP BY countryRegionCode
And now we will delete all the records for one country; the country will be “SI”:
%%sql
DELETE
FROM Public_holidays
WHERE countryRegionCode = 'SI'
And now we will check the history again, where we can see the change that was made.
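The same DESCRIBE HISTORY statement as before will do the job; the output should now show a second version, with a DELETE operation on top of the initial write:

%%sql
-- the history should now list the DELETE as a new table version
DESCRIBE HISTORY Public_holidays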
Because delta tables keep this history, we can always go back in time: revert to a previous state and simply insert the deleted data back into the table, or restore the whole table to the desired version. Either approach will do the job!
%%sql
INSERT INTO Public_holidays
SELECT * FROM Public_holidays VERSION AS OF 1
WHERE countryRegionCode = 'SI'

-- or, as the alternative, restore the whole table to the previous version:
-- RESTORE TABLE Public_holidays TO VERSION AS OF 1
And finally, check the results:
%%sql
SELECT
count(*) as Nof_records
,countryRegionCode
FROM Public_holidays
GROUP BY countryRegionCode
ORDER BY countryRegionCode ASC
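As a sanity check, you can also query the previous version directly with time travel. A short sketch, assuming (as in the insert/restore step above) that version 1 is the pre-delete state of the table:

%%sql
-- count the 'SI' rows as they existed before the delete
SELECT count(*) AS Nof_records
FROM Public_holidays VERSION AS OF 1
WHERE countryRegionCode = 'SI'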
Tomorrow we will continue exploring delta tables and look at SQL endpoints.
The complete set of code, documents, notebooks, and all other materials will be available in the GitHub repository: https://github.com/tomaztk/Microsoft-Fabric
Happy Advent of 2023!