Lab 2 – Create a Notebook
- Download the data files from https://github.com/MicrosoftLearning/dp-data/raw/main/orders.zip.
- Extract the zipped archive and verify that you have a folder named orders which contains three CSV files: 2019.csv, 2020.csv, and 2021.csv.
- Return to the lakehouse LH_Fabric_Bootcamp that you created in Lab 1. In the Explorer pane, select the … menu next to the Files folder, and then select Upload > Upload folder. Navigate to the orders folder on your local computer and select Upload.
- After the files have been uploaded, expand Files and select the orders folder. Check that the CSV files have been uploaded, as shown below:

Create a notebook
You can now create a Fabric notebook to work with your data. Notebooks provide an interactive environment where you can write and run code. There are multiple ways to create a new notebook: you can either go back to the workspace main page and select New item in the top-left corner, or you can open the notebook directly from the lakehouse UI, as we are going to do in this lab.
- From the top ribbon, locate the Open notebook option. Click the down arrow, and then select New notebook from the dropdown menu.

A new notebook named Notebook 1 is created and opened. Fabric assigns a sequential name to each notebook you create, such as Notebook 1, Notebook 2, and so on. Click the name panel above the Home tab on the menu to change the name to something more descriptive. Let’s rename the notebook to Fabric Bootcamp.
- Select the first cell (which is currently a code cell), and then in the top-right toolbar, use the M↓ button to convert it to a markdown cell. The text contained in the cell will then be displayed as formatted text.
Use the 🖉 (Edit) button to switch the cell to editing mode, then modify the markdown as shown below.
```
# Sales order data exploration
Use this notebook to explore sales order data
```
When you are finished, click anywhere in the notebook outside of the cell to stop editing it.
Create a DataFrame
Now that you’ve created a notebook, you are ready to work with the data. We will use PySpark, the default language for Fabric notebooks, which lets you work with Spark from Python.
- If you’ve closed the notebook, select the Fabric Bootcamp workspace from the left bar. You will see a list of items contained in the workspace, including your lakehouse and notebook.
- Select the lakehouse LH_Fabric_Bootcamp and look for the orders folder in the Files area.
- From the top menu, select Open notebook, Existing notebook, and then open the notebook Fabric Bootcamp. The notebook should now be open next to the Explorer pane. Expand Lakehouses, expand the Files list, and select the orders folder. The CSV files that you uploaded are listed next to the notebook editor, like this:

From the … menu for 2019.csv, select Load data > Spark. The following code is automatically generated in a new code cell:
```python
df = spark.read.format("csv").option("header","true").load("Files/orders/2019.csv")
# df now is a Spark DataFrame containing CSV data from "Files/orders/2019.csv".
display(df)
```
- Select ▷ Run cell to the left of the cell to run the code.
NOTE: The first time you run Spark code, a Spark session is started. This can take a few seconds or longer. Subsequent runs within the same session will be quicker.
When the cell has finished running, review the output below it, which should look like this:

The output shows the data from the 2019.csv file displayed in columns and rows. Notice that the column headers contain the first row of data, because the file has no header row. To correct this, modify the first line of the code as follows:
```python
df = spark.read.format("csv").option("header","false").load("Files/orders/2019.csv")
```
- Run the code again, so that the dataframe correctly treats the first row as data. Notice that the column names have now been replaced by default names: _c0, _c1, and so on.
Descriptive column names help you make sense of data. To create meaningful column names, you need to define the schema and data types. You also need to import a standard set of Spark SQL types to define the data types.
- Replace the existing code with the following:
```python
from pyspark.sql.types import *

orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")
display(df)
```
Run the cell and review the output:

This DataFrame includes only the data from the 2019.csv file.
- Modify the code so that the file path uses a * wildcard to read all the files in the orders folder:
```python
from pyspark.sql.types import *

orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

df = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv")
display(df)
```
When you run the modified code, you should see sales for 2019, 2020, and 2021. Only a subset of the rows is displayed, so you may not see rows for every year.
NOTE: You can hide or show the output of a cell by selecting … next to the result. This makes it easier to work in a notebook.
Explore data in a dataframe
The dataframe object provides additional functionality such as the ability to filter, group, and manipulate data.
Filter a dataframe
- Add a code cell by selecting + Code, which appears when you hover the mouse above or below the current cell or its output. Alternatively, from the ribbon menu, select Edit and then + Add code cell below.
- The following code filters the data so that only two columns are returned. It also uses count and distinct to summarize the number of records:
```python
customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
```
- Run the code, and examine the output:
- The code creates a new dataframe called customers, which contains a subset of columns from the original df dataframe. When you perform a dataframe transformation, you do not modify the original dataframe; instead, a new dataframe is returned (an optional check demonstrating this appears after this list).
- Another way of achieving the same result is to use the select method:
```python
customers = df.select("CustomerName", "Email")
```
- The dataframe functions count and distinct are used to provide totals for the number of customers and unique customers.
- Modify the first line of the code by using select with a where function as follows:
```python
customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
```
- Run the modified code to select only the customers who have purchased the Road-250 Red, 52 product. Note that you can “chain” multiple functions together so that the output of one function becomes the input for the next. In this case, the dataframe created by the select method is the source DataFrame for the where method that is used to apply filtering criteria.
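If you want to convince yourself that the original dataframe is unchanged, you can add an optional cell like the following (a quick check, not part of the lab steps):

```python
# Optional check: the original dataframe still has all of its columns,
# while the customers dataframe contains only the two selected columns.
print(df.columns)
print(customers.columns)
```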
Aggregate and group data in a dataframe
- Add a code cell, and enter the following code:
```python
productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)
```
- Run the code. The results show the sum of order quantities grouped by product. The groupBy method groups the rows by Item, and the subsequent sum aggregate function is applied to the remaining numeric columns – in this case, Quantity. (An alternative that names the aggregated columns explicitly is sketched after this list.)
- Add another code cell to the notebook, and enter the following code:
```python
from pyspark.sql.functions import *

yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)
```
- Run the cell. Examine the output. The results now show the number of sales orders per year:
- The import statement enables you to use functions from the Spark SQL library.
- The select method is used with a SQL year function to extract the year component of the OrderDate field.
- The alias method is used to assign a column name to the extracted year value.
- The groupBy method groups the data by the derived Year column.
- The count of rows in each group is calculated before the orderBy method is used to sort the resulting dataframe.
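Both aggregations above rely on Spark’s default result column names (such as sum(Quantity) and count). If you prefer to name the aggregated columns explicitly, you can use agg with an alias – a minimal alternative sketch, not part of the lab steps:

```python
from pyspark.sql import functions as F

# Explicitly named aggregate columns instead of the default sum(Quantity)/count names
productSales = df.groupBy("Item").agg(F.sum("Quantity").alias("TotalQuantity"))
yearlySales = (df.groupBy(F.year(F.col("OrderDate")).alias("Year"))
                 .agg(F.count("*").alias("OrderCount"))
                 .orderBy("Year"))

display(productSales)
display(yearlySales)
```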

Use Spark to transform data files
A common task for data engineers and data scientists is to transform data for further downstream processing or analysis.
Use dataframe methods and functions to transform data
- Add a code cell to the notebook, and enter the following:
```python
from pyspark.sql.functions import *

# Create Year and Month columns
transformed_df = df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate")))

# Create the new FirstName and LastName fields
transformed_df = transformed_df.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1))

# Filter and reorder columns
transformed_df = transformed_df["SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month", "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax"]

# Display the first five orders
display(transformed_df.limit(5))
```
- Run the cell. A new dataframe is created from the original order data with the following transformations:
- Year and Month columns added, based on the OrderDate column.
- FirstName and LastName columns added, based on the CustomerName column.
- The columns are filtered and reordered, and the CustomerName column is removed.
- Review the output and verify that the transformations have been made to the data.
You can also use the Spark SQL function library to transform the data by filtering rows, deriving new columns, removing or renaming columns, and applying other data modifications.
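For example, the following sketch shows a few of these operations applied to transformed_df (the variable names and the derived LineTotal column are illustrative, not part of the lab steps):

```python
from pyspark.sql import functions as F

# Illustrative transformations on the dataframe created above
filtered_df = transformed_df.filter(F.col("Quantity") > 1)                 # filter rows
with_total_df = transformed_df.withColumn(                                 # derive a new column
    "LineTotal", (F.col("UnitPrice") * F.col("Quantity")) + F.col("Tax"))
trimmed_df = transformed_df.drop("Email")                                  # remove a column
renamed_df = transformed_df.withColumnRenamed("Item", "Product")           # rename a column

display(with_total_df.limit(5))
```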
Save the transformed data
At this point you might want to save the transformed data so that it can be used for further analysis.
Parquet is a popular data storage format because it stores data efficiently and is supported by most large-scale data analytics systems. Indeed, sometimes the data transformation requirement is simply to convert data from one format, such as CSV, to Parquet.
- To save the transformed dataframe in Parquet format, add a code cell with the following code:
```python
transformed_df.write.mode("overwrite").parquet('Files/transformed_data/orders')
print("Transformed data saved!")
```
- Run the cell and wait for the message that the data has been saved. Then, in the Explorer pane on the left, in the … menu for the Files node, select Refresh. Select the transformed_data folder to verify that it contains a new folder named orders, which in turn contains one or more Parquet files.
- Add a cell with the following code:
```python
orders_df = spark.read.format("parquet").load("Files/transformed_data/orders")
display(orders_df)
```
- Run the cell. A new dataframe is created from the parquet files in the transformed_data/orders folder. Verify that the results show the order data that has been loaded from the parquet files.

Save data in partitioned files
When dealing with large volumes of data, partitioning can significantly improve performance and make it easier to filter data.
- Add a cell with code to save the dataframe, partitioning the data by Year and Month:
```python
orders_df.write.partitionBy("Year","Month").mode("overwrite").parquet("Files/partitioned_data")
print("Transformed data saved!")
```
- Run the cell and wait for the message that the data has been saved. Then, in the Lakehouses pane on the left, in the … menu for the Files node, select Refresh and expand the partitioned_data folder to verify that it contains a hierarchy of folders named Year=xxxx, each containing folders named Month=xxxx. Each month folder contains a parquet file with the orders for that month.

- Add a new cell with the following code to load a new dataframe containing only the 2021 orders from the partitioned files:
```python
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)
```
- Run the cell and verify that the results show the order data for sales in 2021. Notice that the partitioning columns specified in the path (Year and Month) are not included in the dataframe.
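As an optional comparison, you could instead load the whole partitioned dataset and filter on the partition columns; in that case Spark infers Year and Month from the folder names and can prune the partitions it does not need to read. A minimal sketch (the variable names are illustrative, not part of the lab steps):

```python
from pyspark.sql import functions as F

# Load the root of the partitioned dataset; Year and Month are inferred from the folder names
all_orders_df = spark.read.format("parquet").load("Files/partitioned_data")

# Filtering on a partition column lets Spark skip (prune) the other Year folders
orders_2021_filtered_df = all_orders_df.where(F.col("Year") == 2021)
display(orders_2021_filtered_df)
```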
Work with tables and SQL
You’ve now seen how the native methods of the dataframe object enable you to query and analyze data from a file. However, you may be more comfortable working with tables using SQL syntax.
The Spark SQL library supports the use of SQL statements to query tables. This provides the flexibility of a data lake with the structured data schema and SQL-based queries of a relational data warehouse – hence the term “data lakehouse”.
Create a table
Tables in a Spark metastore are relational abstractions over files in the data lake. Tables can be managed, in which case both the metadata and the data files are controlled by the metastore, or external, in which case the data files are stored in a location that you manage independently of the metastore.
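The difference shows up in how you save the dataframe – a conceptual sketch only (the table names and path below are illustrative, and an external table in Fabric typically requires the full ABFS path to the target folder):

```python
# Managed table: the metastore controls both the metadata and the data files
df.write.format("delta").saveAsTable("salesorders_managed")

# External table: the metadata is registered in the metastore, but the data files
# live in a location you specify and manage yourself (illustrative path)
df.write.format("delta").option("path", "<abfs_path>/external_salesorders").saveAsTable("salesorders_external")
```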
- Add a code cell to the notebook and enter the following code, which saves the dataframe of sales order data as a table named salesorders:
```python
# Create a new table
df.write.format("delta").saveAsTable("salesorders")

# Get the table description
spark.sql("DESCRIBE EXTENDED salesorders").show(truncate=False)
```
- Run the code cell and review the output, which describes the definition of the new table. In the Explorer pane, in the … menu for the Tables folder, select Refresh. Then expand the Tables node and verify that the salesorders table has been created.
- In the … menu for the salesorders table, select Load data > Spark. A new code cell is added containing code similar to the following:
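The generated query may vary slightly; the lakehouse reference below is an assumption based on the name used in this bootcamp:

```python
# Auto-generated code typically looks something like this
df = spark.sql("SELECT * FROM LH_Fabric_Bootcamp.salesorders LIMIT 1000")
display(df)
```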

- Run the new code, which uses the Spark SQL library to embed a SQL query against the salesorders table in PySpark code and load the results of the query into a dataframe.
Run SQL code in a cell
While it’s useful to be able to embed SQL statements into a cell containing PySpark code, we often just want to work directly in SQL.
- Add a new code cell to the notebook, and enter the following code:
```sql
%%sql
SELECT YEAR(OrderDate) AS OrderYear,
       SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;
```
Run the cell and review the results. Observe that:
- The %%sql command at the beginning of the cell (called a magic) changes the language to Spark SQL instead of PySpark.
- The SQL code references the salesorders table that you created previously.
- The output from the SQL query is automatically displayed as the result under the cell.
Congratulations! In this lab, you created your first Fabric notebook. Not only that, you also explored the data in the lakehouse using both PySpark and Spark SQL. You should be proud of your work in this lab! Keep up the momentum and see you soon in the next lab 😊