Conceptualizing the Processing Model for Azure Databricks Service

 

Course Overview

Course Overview

Welcome to my course, Conceptualizing the Processing Model for Azure Databricks Service. Modern data pipelines often include streaming data that needs to be processed in real time. While Spark Structured Streaming can help us build streaming pipelines, managing the Spark environment is no cakewalk. Wouldn't it be great to have a cloud service just to do that? This course walks you through Azure Databricks, a Spark-based Unified Analytics Platform running on Microsoft Azure, and you will see how it can be used as a platform for Spark Structured Streaming. Some of the major topics that we will cover include understanding the architecture and components of Azure Databricks, setting up the Azure Databricks environment, building an end-to-end streaming pipeline, customizing the cluster to suit your own requirements, and other aspects like pricing, best practices, and how it compares to other managed offerings. By the end of this course, you will be comfortable building streaming pipelines on Azure Databricks using Spark Structured Streaming. Before beginning the course, I would recommend being familiar with the basics of Microsoft Azure. I hope you will join me on this journey to learn to build streaming pipelines with my course, Conceptualizing the Processing Model for Azure Databricks Service, here at Pluralsight.

Getting Started with Structured Streaming on Azure Databricks

Module Overview

Conceptualizing the Processing Model for Azure Databricks Service. Before you even begin the course, you might be thinking, what's the need for Azure Databricks? The data pipelines today have gone well beyond simple batch ETL jobs. With growing data volumes, changing requirements, and the need for real-time processing, there is a need to modernize data platforms. And Azure Databricks can help you do just that, on top of Apache Spark. Hey! Wait a second. Where does Spark come into the picture now? Apache Spark is the underlying in-memory data processing engine for Azure Databricks. So to better understand Databricks, you must first have a good understanding of what Spark is. So in this module, you'll also learn about the basics of Spark. And what about streaming pipelines? Here, you'll also see how Spark Structured Streaming works. Sounds good? All right! Then what exactly is Databricks? Databricks is an independent product. It is a fast, easy, and collaborative analytics platform which runs in the cloud and allows you to build Spark applications without worrying about the challenges that come with Spark. So here we're going to dive in and understand the architecture and features of Databricks and how easy it is to get started. All good so far. But is Databricks just a hosted solution on Microsoft Azure? No. The teams at Azure and Databricks came together to make it a first-party service on Azure, bringing enterprise-grade security and high-speed connectors for Azure services. By the end of this module, you will be all set to learn how to build your streaming pipeline using Azure Databricks. So sit back, relax, and enjoy the journey.

Course Outline

Before we jump into the course, let's take a look at what you're going to learn. The focus here is to learn about the Azure Databricks service, its architecture, features, and components, and how it can be used as a platform for Spark Structured Streaming. So this course will only cover the basics of Spark Structured Streaming, and it'll deep dive into the Azure Databricks platform so that you can build basic streaming pipelines on top of it. This course is being followed up with an in-depth course on Spark Structured Streaming using the same platform. So in this module, you'll start by learning about the Azure Databricks platform: why we need it, its architecture, and its features. You'll learn about Spark basics, RDDs, and DataFrames, and how the Structured Streaming processing model works. And finally, you'll see what Azure brings to the table and how deployment happens in Azure. Next, you'll see how to set up the Databricks environment by launching a workspace, setting up a cluster, and creating a notebook. And you'll also see how to set up security. Then we'll start the journey of developing the streaming pipeline. You'll see what sources and sinks can be used and how to configure them. Following this, you'll learn how to build the streaming pipeline, handle streaming data, visualize the output on a dashboard, and load the processed data into a data lake. We'll then go ahead and build a production-ready pipeline and schedule it as a job in Databricks. You'll also see some of the best practices here. Then we'll talk about different workloads and tiers and how pricing works for Azure Databricks, and then see how it compares to other managed services in the market. And finally, we'll learn how to customize the cluster for your own requirements using initialization scripts and Docker containers. Excited? All set? So let's get going!

Modern Data Pipelines on Databricks

Let's understand how Databricks can help us build modern data pipelines, especially streaming pipelines. A typical pipeline involves doing significant ETL operations. ETL stands for extract, transform, and load. This means you extract the data from a source system, like customer data, apply business-specific transformations, like combining their first name and last name, and load the data into the target repository. Now let's see how we can do ETL operations on modern data pipelines. You may need to extract data from a variety of data sources. It can be structured data coming from business applications or relational databases. But it could also be semi-structured or unstructured data like CSV and JSON files, log and telemetry data, or data coming from NoSQL databases. And modern data processes often include real-time and streaming data, like data coming from IoT devices. You typically store this raw data in a data lake, or, if it's streaming data, in a stream ingestion service like Kafka or Azure Event Hubs. This raw data helps to maintain history. Then you need to process this data and store it in a data warehouse. This data warehouse can be a relational database, or it can be a data lake as well. And finally, you can visualize the data, build reports, or use it in downstream applications. Remember that this is just a reference architecture. There are many other ways in which you can define it, but it typically consists of these layers. There are two types of data pipelines, a batch pipeline and a streaming pipeline. Let's take an example to understand this. Assume you are building an e-commerce solution. Let's see what kind of solutions you can build with batch and streaming pipelines. In a batch pipeline, you might want to figure out how much in sales happened this week across different product categories, compare it with historical data, and check what the growth in revenue is, say month on month or year on year, and what the impact of the promotions you have run on the site has been. So this means in a batch pipeline, you work with finite datasets to provide solutions. It typically involves a lot of historical data, so datasets are large and pipelines take a lot of time to complete. Event time is usually not important here. For example, precisely at what time a sale happened may not be that useful. And the data is processed periodically here; it could be weekly, daily, or once every 6 hours. On the other hand, a streaming pipeline works with infinite datasets. The dataset is continuously getting updated with new data, and there is no finite boundary here. It involves real-time data and not much historical data. The precise time at which the event happened, or the event time, is very important here. And you process this data continuously, as soon as it arrives. Using this, you can provide recommendations to users based on the products they are currently looking at on your e-commerce site, or use it to monitor application logs and identify system failures. Carefully notice that event time is very important here. You can use historical delivery information to analyze and optimize delivery processes; that's the batch pipeline. But use the streaming pipeline to track the current deliveries. Makes sense, right? Now, this also brings us to another observation. Batch and streaming pipelines need not be totally separate. They follow a similar architecture and work on nearly the same sets of data. Also, a streaming application does not always mean real time.
It can be a near real-time application where speed is important, but you don't need the output immediately. For example, you're okay with 10 seconds to 10 minutes of latency. These applications could be movie recommendations to users, tracking social media for posts and comments, monitoring applications for performance, and providing weather updates. On the other hand, you might want to build real-time applications where information needs to be processed immediately, and the output should be available, say, within 100 ms to 10 seconds, or even better. These kinds of applications could be built for financial fraud detection, processing data from a self-driving car, for online games, monitoring networks, and much more. The important point here is that the time window for output totally depends on your application requirements. But building a fast and robust stream processing solution is difficult. Let's see what complexities are involved. Batch and stream pipelines are similar, but building and managing separate pipelines for both adds to complexity. You need to extract data from diverse data sources and handle a variety of formats. Data may reach your system late, or it may be corrupt. Also, you may need to run interactive queries on your streaming data for analysis. And in modern pipelines, it's a common requirement to apply machine learning, even on the streaming data, like while providing recommendations to users. And of course, pipelines should be robust and fault tolerant. And this is where Apache Spark comes in. It is open source, and it's very popular in the big data community, whether you want to process batch or streaming data. Apache Spark is an extremely fast and powerful in-memory analytics engine for large-scale data processing, be it structured, semi-structured, or unstructured data. It allows you to build unified batch and streaming pipelines. It has a highly scalable and fault-tolerant architecture that allows it to run on hundreds of machines and still recover quickly from failures. And the great part is it is natively integrated with advanced processing libraries for machine learning, graph processing, etc. So Apache Spark allows us to build unified modern data pipelines. Sounds great, right? While Spark has got great features, a lot of developers feel it's hard to work with. The biggest challenge is infrastructure management. Though Spark can run on hundreds of machines, handling the physical hardware, patching the machines, managing the disks, or scaling out to meet growing demands is an extremely costly and complex affair. Spark also needs to be installed and configured on all the machines. And all this makes it difficult to upgrade to a newer version of Spark in production. And since Spark is only an engine, it requires setting up an ecosystem of tools for activities like development, deployment, security, etc. Spark does not have a native user interface, but there are IDEs that can be used for development. And in big team setups, it's difficult to collaborate on projects. That's why we need an intuitive and collaborative environment in which we can easily work with Spark without worrying about infrastructure and upgrades. And this is where Databricks comes in. It was founded by the same set of engineers who started the Spark project. While Spark is just an engine, Databricks is a completely managed and optimized platform for running Apache Spark.
It provides a whole bunch of tools out of the box, so you don't have to plug in the basic components for Spark to work. It also means you can quickly start building your Spark-based applications. It also provides an intuitive UI and an integrated workspace, where you can write the code and collaborate in real time with your colleagues. And finally, the best part: it allows you to set up and configure the infrastructure with just a few clicks and manages the rest on its own, be it scalability, failure recovery, upgrades, and much more. So the processing capabilities of Spark power the Databricks platform. And Databricks runs on top of the Microsoft Azure cloud platform. So Azure brings all the features provided by an enterprise-grade cloud to the mix. Together they form a natively integrated, first-party service on Azure called Azure Databricks. That's amazing, right?

Spark 101

Since Azure Databricks is built on top of Apache Spark, to better understand it, you must first have a good understanding of what Spark is. In the next few minutes, we'll do a quick walk-through of some of the basics of Spark. Apache Spark is an extremely powerful in-memory analytics engine for large-scale data processing, which is built on cluster computing technology. So to work with Spark, you need to set up a cluster. You can install Spark on a single machine, also called a node, or you can install and run it across multiple nodes. Together, these nodes constitute a cluster. At the core of Spark are RDDs. An RDD, or Resilient Distributed Dataset, is the fundamental data structure of Spark. All your data is in the form of RDDs, and it is stored in the memory of the cluster. So RDDs are in-memory objects in Spark. Think of an RDD as a collection of elements or data, which is distributed to multiple nodes and stored in their memory. So when you write code to process the data, this processing happens on RDDs. There are four important features of RDDs. They are in-memory, partitioned, read-only, and resilient. Let's understand these properties with the help of an example. Let's say you have three nodes in the cluster. Now when you read the data from the data source, it comes into the memory of the cluster and is called an RDD. But since there are multiple nodes, the data is partitioned, and each partition is stored in the memory of a separate node. And all partitions together constitute an RDD. Make sense? And as you saw, RDDs are read-only. This means if you apply an operation, it creates a new RDD. For example, if the first RDD contains customers' first name and last name and you want to combine them into a full name, it creates a new RDD. This type of operation that produces a new RDD is called a transformation operation. Now when you want to store this data into the destination, you apply another operation. Since you are storing it, this time there is no new RDD created, and this is called an action operation, which is responsible for returning the final result of RDD computations. Sounds good? And finally, think about it, what if there is a failure in one node? RDDs know exactly how they were constructed by looking at their lineage graph, which means from where they started and what operations were applied. This helps in restarting and automatically reprocessing the data. So to summarize, RDDs are partitioned, which means the input dataset is split into partitions. They reside in memory. This means the partitions are stored on multiple nodes in the cluster and processed in parallel. RDDs are read-only objects, so once they are created, they cannot be modified, but you can apply operations on top of them. And they're resilient. They can track their creation, from where the data came and what operations were applied. This is called their lineage graph, and because of this they can be reconstructed in case of a failure. And that's how RDDs provide fault tolerance. You also saw two types of operations. A transformation operation is like a function that takes an RDD as input and creates one or more RDDs as output. And because each transformation operation creates a new RDD, you are defining a chain of transformations on a dataset, and this chain, as you saw, is called a lineage graph. So loading the dataset from a source, converting the sales amount from INR to USD, or merging the first and last names into a full name are all examples of a transformation operation.
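To make this concrete, here is a minimal PySpark sketch of the pattern just described. It is illustrative only: the sample data and names are made up, and it assumes a standard SparkSession rather than anything from the course demos.

```python
# A minimal PySpark sketch of RDD transformations vs. actions.
# The data and names are illustrative, not taken from the course demos.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Create an RDD of (first_name, last_name) pairs; on a cluster, these
# elements are partitioned across the worker nodes' memory.
customers = sc.parallelize([("Ada", "Lovelace"), ("Alan", "Turing")])

# Transformation: produces a new RDD (the original RDD is read-only).
full_names = customers.map(lambda c: c[0] + " " + c[1])

# Action: triggers execution of the lineage graph and returns a result.
print(full_names.collect())   # ['Ada Lovelace', 'Alan Turing']
```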
Now comes the interesting part. Transformations are lazy operations. What does this mean? This means a transformation, or a chain of transformations, which is called a lineage graph, is never executed until the second type of operation, the action operation, is performed. So the action operation is responsible for returning the final result of RDD computations. It optimizes the transformations applied to the dataset and then triggers the execution using the lineage graph. This helps in running a highly optimal execution plan. So if you want to load the data into a destination, show the output on a screen, or display a count, all these are examples of an action operation. Now that you know about RDDs, let's talk about another concept, DataFrames. So what's a DataFrame? A DataFrame is just like a table you have in a relational database. It has got columns and rows. So using the DataFrame API allows you to work with a table-like structure, which is much simpler and more straightforward. But you may be wondering, what's the relation between RDDs and DataFrames? A DataFrame is a high-level API built on top of RDDs. Wow! That means all the great features of RDDs also apply to DataFrames. So DataFrames are also in-memory, partitioned, read-only, and resilient. But not just this. In a DataFrame, data is organized into named columns, and it imposes a tabular structure on the data. Because it imposes a structure, Spark can now go ahead and apply a lot of optimizations. This gives you much better performance. So if you want more control over your dataset, you use RDDs directly. But if you're looking for better performance and less development effort, you use the DataFrame API. For most structured data processing needs like data pipeline development, DataFrames are good enough. So throughout the course, we'll be using the Spark SQL library, the Python language, and the DataFrame API.
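And here is a similar sketch of the same first/last-name example using the DataFrame API, again with made-up column names and data, assuming a standard SparkSession:

```python
# The same example expressed with the DataFrame API.
# Column names and data are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

df = spark.createDataFrame(
    [("Ada", "Lovelace"), ("Alan", "Turing")],
    ["first_name", "last_name"])

# Transformation: adds a derived column; still lazy at this point.
result = df.withColumn("full_name",
                       concat_ws(" ", "first_name", "last_name"))

# Action: show() triggers the optimized execution and prints the rows.
result.show()
```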

Structured Streaming Processing Model

Now that we have gone through the basics of Spark, let's understand the processing model for Spark Structured Streaming. So what is Spark Structured Streaming? It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. And as you know, batch processing in Spark also uses Spark SQL. Now that's interesting. So Structured Streaming provides a unified set of APIs for batch and stream processing using Spark SQL. And it's not just the APIs that are common. The underlying runtime is the same as well. The Spark SQL engine takes care of doing the computation in an incremental fashion on the streaming data. You'll see what that means in just a moment. Just like for batch processing, you can use DataFrame APIs for stream processing as well, and you can use any language to do that, be it Scala, Python, R, or Java. Structured Streaming can help you achieve end-to-end latencies as low as 100 ms, and you can run interactive queries or even apply machine learning seamlessly on the streaming data. Now let's see what the components of a structured streaming pipeline are. The first one is an unbounded dataset, or the stream of data. Then there is the ETL query, where you define the sources from which you want to extract the streaming data, the transformations that you want to apply on this data, and, finally, the sink where you want to load this transformed data. Then there are the execution plans, a logical execution plan and a physical execution plan, based on the ETL query you have specified. Following this, there is the trigger, where you specify the interval for processing. And there are more, like Checkpoint, Watermark, Windows, Aggregations, OutputMode, etc. We'll discuss some of these components in this course. Alright, let's first see what an unbounded dataset is. In batch processing, you extract data from the source. Once extracted, this is the static set of data that you work with. This is what is called a bounded dataset, where you have a defined boundary. Now when you're working with streaming data, let's say a new record or event arrives at the source. This new record is appended to the dataset. If another one arrives, that also gets appended to the dataset. As the records keep arriving, they are continuously getting added to the dataset. This is important to understand. Here you're always working with continuously changing data, and this type of dataset, which is dynamically changing, is called an unbounded dataset. Now, an unbounded dataset is also referred to as an input table. As you saw, new data in the data stream becomes new rows that are appended to the input table. Now let's take a step back. Even though the dataset is unbounded, at any point when you run a query, it runs on the data available at that moment. The next time you run it, it runs on a changed dataset. And this is the logic behind Spark Structured Streaming, that every streaming execution becomes like a batch execution. Interesting, right? And this is why, internally, this input table is a streaming DataFrame. So many of the operations that you perform on a regular DataFrame can also be performed on a streaming DataFrame. Now let's see the second component, the query for structured streaming. First, let's extract the data from the source by using spark.readStream. In the format, you need to provide the source from which you want to extract, for example, Event Hubs to extract from Azure Event Hubs. This will create an unbounded dataset and load it into a streaming DataFrame, inputDF. Next, let's transform the data.
On the streaming DataFrame, you can apply various types of transformations, like selecting only limited columns, adding a new derived column, renaming columns, doing aggregations, and more. The output of these transformations on the input DataFrame will provide you with a new streaming DataFrame, resultDF. And finally, you can load the output data continuously into the sink. For this, you need to use writeStream on resultDF, provide the sink information using format, and once this is ready, use the start method to run the query continuously. Simple, right? Now Spark SQL internally uses the Catalyst Optimizer to prepare an optimized query plan to execute the query operations. Let's understand this at a high level. Once you submit this query to the Spark execution engine, the Catalyst Optimizer analyzes the query to create a logical query plan. It then tries to optimize this plan using pre-defined rule-based optimizations to generate an optimized logical plan. Great. Now at this point, it uses the optimized logical plan to generate multiple physical plans. A physical plan determines how data will be computed. And finally, it selects one of those plans based on the lowest cost of execution. This selected physical plan is the one that is used to execute the operations. Sounds good? Now Spark Structured Streaming runs queries using a micro-batch processing engine at specific trigger intervals. Let's see how that works. Let's assume we specify a trigger interval of 5 seconds. So in 15 seconds, the first trigger interval is 0 to 5 seconds, the second from 5 to 10, and the third from 10 to 15 seconds. So there will be three executions, which will happen at the end of every trigger interval. Now let's say there are new events getting added to the source. The first trigger will happen at the fifth second. At this point, there are two events. These two events, E1 and E2, will be part of micro-batch 1 that will be extracted and processed. While the first micro-batch is getting processed, new events are arriving. And now, at the second trigger, t2, events E3, E4, and E5 will get processed, and this is the second micro-batch. In the same way, new events, E7 and E8, will be processed at t3 as the third micro-batch. This is how the micro-batch engine processes data streams as a series of small batch jobs and helps to achieve end-to-end latencies as low as 100 ms. Interesting, right? And how can we add a trigger to the query? In the previous query, we created an input DataFrame by extracting from the source. Then we applied some transformations. And finally, we specified the sink information to write the processed data. So a trigger can be specified just before starting the query by using the trigger method and then specifying the interval. And there are various trigger options, as you will see in later modules. Finally, let's see how end-to-end execution works. Using the previous example, let's say events are arriving at the source. At the first trigger, these events, E1 and E2, will be part of the unbounded dataset, or the input table. Spark will run the query on this input table using the selected query execution plan. This is also called the incremental execution plan. The output of this query is called the result table. This result table will be written to the sink. When you're writing the data to the sink, there are different output modes. In this case, we are going to take only the new rows and append them to the sink. Now we have three more events. At the second trigger, only these new events will be extracted from the source as part of the micro-batch.
But the input table will contain the old records and these new ones, since the previous records were already present in memory. And now you know why this is called an unbounded dataset. Spark will now run the query on the complete input table, on events E1 to E5. Again, a result table is generated, but only the newly generated rows will be appended to the sink since we are using append mode, and this continues. At the third trigger, E7 and E8 will become part of the input table, and the new rows in the result table will be appended to the sink. Awesome, right? So to summarize, Structured Streaming allows you to run batch-like queries on streaming data using incremental execution plans. Before we wrap up this discussion, you should be aware that there are two types of streaming support in Spark. The first is called Spark Streaming, which was the first implementation for streaming data. It works on RDDs and was not very high performing. Also, it had separate batch and streaming APIs. The second type is called Spark Structured Streaming, which we have been discussing. It was introduced with version 2 of Spark, and it works on DataFrames. Because of this, it is highly optimized. It also supports the unified batch and streaming APIs that you have just seen. And it comes with two modes: trigger-based, which is the focus of this course and provides latencies of around 100 ms, and continuous mode, which can achieve latencies as low as 1 ms. Continuous mode was introduced with version 2.3 and is marked as experimental at the time of recording this course.
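Pulling the pieces of this section together, here is a minimal, hedged sketch of a trigger-based Structured Streaming query. To keep it self-contained, it uses Spark's built-in rate source and console sink instead of the Azure Event Hubs source and the sinks used later in the course; the column logic is made up.

```python
# Minimal Structured Streaming sketch: extract, transform, load, with a trigger.
# Spark's built-in "rate" source and console sink keep this runnable anywhere;
# the course's pipeline uses Azure Event Hubs and other sinks instead.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Extract: an unbounded dataset loaded into a streaming DataFrame.
inputDF = (spark.readStream
           .format("rate")             # emits (timestamp, value) rows continuously
           .option("rowsPerSecond", 5)
           .load())

# Transform: derive a new column on the streaming DataFrame (lazy).
resultDF = inputDF.withColumn("value_doubled", col("value") * 2)

# Load: write continuously, processing one micro-batch every 5 seconds.
query = (resultDF.writeStream
         .format("console")            # console sink for illustration
         .outputMode("append")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination()
```

The shape matches the query described above: readStream defines the unbounded input, the transformation defines the result table, and writeStream with a trigger controls the micro-batch execution.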

What Is Databricks?

Now that you have a good understanding of Spark, let's understand what Databricks is. Databricks is a fast, easy, and collaborative Apache Spark-based unified analytics platform that has been optimized for the cloud. Let me repeat that. It's an Apache Spark-based unified analytics platform that has been optimized for the cloud. It was founded by the same set of engineers who started the Spark project. Because it is based on Apache Spark, the data is distributed and processed in the memory of multiple nodes in a cluster. All the languages supported by Spark are also supported on Databricks, be it Scala, Python, SQL, R, or Java. And it has support for all the Spark use cases: batch processing, stream processing, machine learning, and advanced analytics. But along with all the Spark functionality, Databricks brings a host of features to the table. First, and I believe the most important one, is infrastructure management. Since Spark is just an engine, to work with it you need to set up a cluster, install Spark, and handle scalability, physical hardware failures, upgrades, and much more. But with Databricks, you can launch an optimized Spark environment with just a few clicks and autoscale it on demand. With Databricks, you also get a workspace where different users on the data analytics team, like data engineers, data scientists, and business analysts, can work together. They can share code and datasets, explore and visualize the data, post comments, and integrate with source control. Databricks also helps you easily execute data pipelines on demand or automate them to run on a schedule. And Databricks comes with built-in access control and enterprise-grade security, so you can securely deploy your applications to production. Let's have a look at the architecture of Databricks. It is divided into three important layers: the cloud service, the runtime, and the workspace. And security is available across all these layers. Let's understand these layers and their components one by one. First, the cloud service. Databricks is available on the most popular cloud platforms, Microsoft Azure and Amazon Web Services. Later in the module, we'll discuss why Azure is the preferred provider for Databricks. Because Databricks runs on the cloud, it can easily provision the VMs or nodes of a cluster after you select their configuration. Databricks also allows you to launch multiple clusters at a time. This means you can work with clusters having different configurations, making it easier to upgrade your applications or test performance. And whenever you create a cluster, it comes pre-installed with the Databricks runtime. We'll talk about the runtime in just a minute. And one of the great features of Databricks is the native support for a distributed file system. A file system is required to persist the data. So whenever you create a cluster in Databricks, it comes pre-installed with the Databricks File System, or DBFS. An important point to note is that DBFS is just an abstraction layer, and it uses Azure Blob Storage at the back end to persist the data. So if users are working with some files, they can store the files in DBFS. Those files will actually be persisted in Azure Storage. Using this, the files are also cached in the cluster. So even after the cluster is terminated, all the data is safe in Azure Storage. You'll see that in detail in upcoming modules. The second layer is the Databricks Runtime. Databricks Runtime is a collection of core components that run on Databricks clusters.
So whenever you are creating a cluster, you select a Databricks Runtime version. Each runtime version comes bundled with a specific version of Apache Spark and some additional optimizations on top of Spark. In Azure, Databricks runs on Ubuntu OS, so the runtime comes with the system libraries of Ubuntu. All the languages with their corresponding libraries are preinstalled. If you are interested in doing machine learning, it pre-installs machine learning libraries. And if you provision GPU-enabled clusters, GPU libraries are installed. And it also installs the Delta Lake component. The good thing is that the versions of these libraries installed with the runtime work well with each other, saving you the trouble of manual configuration and compatibility issues. And finally, how about building your own Databricks runtime? Interested? You'll see that in the last module. As part of the Databricks Runtime, there is Databricks I/O, or DBIO. DBIO is the module that brings additional optimizations on top of Spark, related to caching, disk read/write, file decoding, etc. You can control these optimizations, but that's outside the scope of this course. Because of this, workloads running on Databricks can perform 50 times faster than vanilla Spark deployments. Now even though you can create multiple clusters in Databricks, doing so adds to cost. So you would want to maximize the usage of clusters. This is where Databricks high concurrency comes in. Databricks high concurrency clusters have an automatically managed shared pool of resources that enables multiple users and workloads to use them simultaneously. But you might think, what if a large workload consumes a lot of resources and blocks the short and interactive queries of other users? Your question is very valid. That's why each user in the cluster gets a fair share of resources, complete isolation, and security from other processes without doing any manual configuration or tuning. This improves cluster utilization and provides another 10x performance improvement over native Spark deployments. You'll see how to configure it in the next module. Databricks also provides native support for various machine learning frameworks via Databricks Runtime ML. It is built on top of Databricks Runtime, so whenever you want to enable machine learning, you need to select Databricks Runtime ML while creating the cluster. The cluster then comes pre-installed with libraries like TensorFlow, PyTorch, Keras, GraphFrames, and more. And it also supports third-party libraries that you can install in the cluster, like scikit-learn, XGBoost, DataRobot, etc. And a very interesting component here is Delta Lake. Delta Lake is an open source storage layer. It brings features to a data lake that are very close to those of relational databases and tables, and much beyond that: ACID transaction support, where multiple users can work with the same files and get ACID guarantees; schema enforcement for the files; and full DML operations on files, like insert, update, delete, and merge. And using time travel, you can keep snapshots of the data, enabling audits and rollbacks. I would highly recommend you go check it out. The third layer in the Databricks architecture is the workspace. It includes two parts. The first one is an interactive workspace. Here you can explore and analyze the data interactively. Just like you open an Excel file, apply a formula, and see the results immediately, in the same way you can do complex operations and interactively see the results in the workspace.
You can also render and visualize the data in the form of charts. In the Databricks workspace, you get a collaborative environment. Multiple people can write code in the same notebook, track the changes to the code, and push them to source control when done. And you can build interactive dashboards for end users or use them to monitor the system. After you're done exploring the data, you can build end-to-end workflows by orchestrating the notebooks. These workflows can then be deployed as Spark jobs and can be scheduled using the job scheduler. And, of course, you can monitor these jobs, check the logs, and set up alerts. So in the same workspace, you can not only interactively explore the data, you can also take it to production with minimal effort. And finally, there is the security layer. Databricks provides enterprise-grade security, which is embedded across all the layers. The infrastructure security, which includes the VMs deployed to the cluster, the disks used to store the data, the Azure Storage used for DBFS, etc., is all secured by the underlying cloud provider, which, in this case, is Azure. Since Databricks is well integrated with Azure, user authentication is secured using Azure Active Directory single sign-on, and you don't have to manage Databricks users separately. And finally, there is authorization for Databricks assets, which means providing fine-grained access permissions for clusters, notebooks, jobs, etc. This is built in and secured via Databricks. So to summarize, Databricks securely runs an optimized version of Spark on a cloud platform. You can create multiple clusters and share resources between multiple workloads. It brings together data engineering and data science workloads, so you can quickly get started building your ETL pipelines, handling streaming data, doing machine learning, and much more. And it has an interactive environment for building solutions, sharing them with colleagues, and taking them to production, taking the game of data processing to a whole new level. And security is enabled across all the layers. Sounds exciting, right? Now let's have a quick look at the components of Databricks. First, there is a workspace to handle all the resources. Then there are clusters and pools, which you can use to run your applications; notebooks, which are used to write the code; and jobs for automated and periodic execution. If there are third-party libraries available, you can use them in Databricks, and you can also manage your data using databases and tables. And finally, you can build, store, and execute machine learning models on the Databricks platform. Just hold on, we'll discuss many of these in the upcoming modules.
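Before moving on, here is a tiny illustration of the DBFS abstraction described earlier. It assumes the commands run inside a Databricks notebook, where the dbutils object and the display function are available automatically; the path is made up.

```python
# Illustrating the DBFS abstraction from inside a Databricks notebook.
# dbutils and display() are provided by the notebook; the path is made up.

# Write a small text file to a DBFS path...
dbutils.fs.put("/tmp/pluralsight-demo/hello.txt", "Hello from DBFS", True)

# ...and list it back. Although the path looks local, the bytes are
# actually persisted in the workspace's Azure Blob Storage account.
display(dbutils.fs.ls("/tmp/pluralsight-demo/"))
```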

What Is Azure Databricks?

Alright, now that you know a lot about Databricks, let's look at the final piece of the puzzle. What is it that Azure brings to the table? Databricks is not a marketplace app on Azure. The teams at Azure and Databricks came together to make it a managed, first-party service on Azure. Databricks is natively integrated with Azure and its services. This also means the Azure SLA applies to Azure Databricks as well, which is 99.95% uptime. And you also get technical support for it, depending on your support plan. This is a big deal for organizations because the Databricks service is fully backed by Microsoft. Next, Azure transparently deploys the Databricks workspace, clusters, and most of the resources in your own subscription. Even though those resources are locked and you can't modify them, you can track them in terms of usage and billing. Being a native service, Azure Databricks gets enterprise-grade security. It is fully integrated with Azure Active Directory and provides role-based access control, so you don't have to manage the users and their access separately. Super awesome for administrators. And finally, you get unified billing. You pay for the usage of Databricks, for storage, and for the VMs and disks created as part of the cluster, all through a single bill. This will matter less to a developer, but for organizations, it's super important. Let us now understand how Databricks resources are deployed in Azure. There are two high-level components, the control plane and the data plane. The control plane resides in a Microsoft-managed subscription, while the data plane is in your own subscription. Whenever you create an Azure Databricks workspace, a Microsoft-managed virtual network, or VNet, is deployed in the control plane along with Databricks services, like the Databricks UI, the jobs service, the cluster manager, and notebooks. On the other hand, another Microsoft-managed VNet is deployed in the data plane. A network security group is attached to handle the inbound and outbound traffic, and an Azure Blob Storage account is provisioned that is used for the Databricks file system, or DBFS. The control plane VNet and the data plane VNet are securely connected to each other. Now when you want to work with Databricks, you will have to sign in using Azure Active Directory. Based on your permissions, you will get access to the workspace. Now when you set up a cluster, the cluster VMs and disks will be deployed in the data plane's VNet. This means the data is processed and stored in your own subscription. The important point to note here is that even though the data plane resources are in your own subscription, they are completely locked, and you can't make any changes to them. This is similar to how other Azure first-party services operate. The goal is to provide transparency by deploying them in your subscription, while making them easy to use and avoiding any unintended changes to these resources. So as you saw, you can use the control plane to launch a cluster, invoke jobs, run interactive commands, view the output, and do much more. Your code resides in the control plane, but it's encrypted and fully secure. And you can access control plane resources using Azure Active Directory single sign-on or even by using REST APIs. On the other hand, the actual data resides in the data plane and is processed there. This means that data ownership and control completely remain with you. Interesting, right?
And the cherry on top: Azure has several high-speed connectors to its services that you can use with Databricks, like Azure SQL Database, Data Lake Store, Blob Storage, Cosmos DB, Event Hubs, Synapse Analytics, Power BI, and much more. You'll see some of them in use in this course.
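As a rough illustration of using one of these connectors, the sketch below reads a CSV file from Azure Blob Storage inside a Databricks notebook (where the spark session is predefined). The storage account, container, path, and key are placeholders, and this is only one of several ways to authenticate.

```python
# Sketch: reading data from Azure Blob Storage in a Databricks notebook.
# The account name, container, path, and access key are placeholders.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<storage-account-access-key>")

df = (spark.read
      .option("header", "true")
      .csv("wasbs://<container>@<storage-account>.blob.core.windows.net/sales/"))

df.show(5)   # preview the first few rows
```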

Summary

Wow, we covered a lot in this module. We looked at how Spark can help us build unified batch and streaming pipelines. Along with learning the basics of Spark, you saw what the processing model for Structured Streaming is and how it executes batch-like queries on streaming data using micro-batches and triggers. You then saw how Databricks takes Spark to the masses with its unified platform and provides a bunch of features that enable collaboration as well as faster development. It frees us from managing the infrastructure, with built-in support for creating multiple clusters while it handles the physical hardware, security, and much more. It also allows quick production deployments via jobs, running them on a schedule, and monitoring them. It brings its own set of optimizations via DBIO, high concurrency clusters, etc., that allow workloads to run several times faster than traditional Spark deployments. And the Databricks runtime is a set of pre-installed components, like Spark, machine learning libraries, and GPU libraries, that helps you quickly set up the environment. And finally, Databricks runs on the Microsoft Azure platform as a native service and gets the full set of features expected of a first-party service. It allows for transparent deployments, full integration with Azure Active Directory, native high-speed connectors to other Azure services, SLA guarantees, technical support, and unified billing. This makes Azure the go-to platform for running Databricks. Now I'm pretty sure you must be waiting to see Azure Databricks in action. So see you in the next module, where we'll start the journey to build a streaming pipeline using Azure Databricks.

Setting up Databricks Environment

Module Overview

Hi, and welcome to this module on Setting up Databricks Environment. Now that you have a solid understanding of Azure Databricks, it's time to get our hands dirty. We'll start by setting up an Azure Databricks workspace. We'll then see the different types of clusters and how to create them. You will also see what a cluster pool is and how useful it is. And then, you will learn how autoscaling works in Databricks. We'll then go ahead and create a notebook, attach it to the cluster, and run it. You will also see how to use it, followed by some really cool features. We'll then go into the security aspect, how we can configure that, and what different options are available. And finally, we'll walk through the scenario that we'll be using throughout the course. So let's get going.

Setting up Workspace

Let's start by understanding what a workspace is. We'll then go ahead and set it up. A workspace is a fundamental unit of isolation in Databricks. Each workspace and its resources are completely isolated from other workspaces, even if they are in the same subscription. When you create an instance of the Azure Databricks service, one workspace is created. The created workspace is identified by a unique workspace ID. As part of a workspace, several resources are deployed in the control plane and the data plane. As you saw in the previous module, the data plane components are in your own subscription. These resources are locked and cannot be modified. Once you launch the workspace, you can organize various assets in a folder structure, be it notebooks, libraries, or dashboards, as well as clusters and data. And you can define fine-grained access control on all these assets, allowing users to use the same workspace but only giving them restricted access. Along with that, there are various global settings that you can configure, like how to handle storage, archive logs, set up version control, and much more. Let's see how to create a workspace through the Azure portal. You can sign in to the Azure portal by going to portal.azure.com. Here, I'm going to search for the Azure Databricks service and create a new instance. Select the subscription. Since all Azure resources reside in a resource group, let's create a resource group with the name PluralsightDemRG. Add a name for the workspace, let's keep it as PluralsightWorkspace, and select the location as East US 2. There are two pricing tiers that we can select from, standard and premium. You can even choose the premium tier trial. You'll learn about these tiers and their features in a later module. So let's select the premium tier and create the workspace. The workspace is now ready. Before launching the workspace, let's see an interesting thing here, and that is the managed resource group. This resource group has been created along with the workspace in our subscription and is locked against any changes. If you navigate to this resource group, you will see three resources here. These are the data plane resources: a virtual network deployed in the data plane, a network security group for managing the inbound and outbound traffic, and a storage account, which is the underlying storage for DBFS. Also notice the workspace URL. It's in the format adb-, followed by the workspace ID and a random number, and then .azuredatabricks.net. So you can bookmark it and next time log in to the workspace directly via this URL. Let us now launch the workspace. Carefully notice that it is using Azure Active Directory single sign-on to log in to the Databricks platform. What you see now is the workspace UI. On the left-hand side, you see options to organize all the assets in the Workspace tab, manage databases and tables through the Data tab, create and manage clusters via the Clusters tab, deploy jobs in the Jobs tab, and manage your machine learning models via the Models tab. Go to the Workspace tab, and here you can see the options for creating a notebook, library, folder, or MLflow experiment. You can import some code files, and you can even export all the files in the workspace. Let's create a folder, PluralsightDemo. We can create subfolders or any other asset in this folder, and you can perform common tasks, like cloning, renaming, moving, or deleting the folder, as well as importing or exporting it.

Creating Cluster

The second step in setting up the environment is to create a cluster. In a Spark cluster, there are two types of nodes: worker nodes, which actually perform the data processing tasks, and since data in Spark is processed in parallel, having more worker nodes may help in faster processing; and the driver node, which is responsible for taking the request, distributing the tasks to worker nodes, and coordinating the execution. Now, there are two types of clusters you can create in Databricks, an interactive cluster and an automated cluster. Let's see the difference between the two. An interactive cluster is mostly used to interactively analyze the data using notebooks. These clusters are created by users directly or by calling the cluster API, but remember, they do not automatically terminate. You'll be charged while they're running, even if you're not using them, but Databricks allows you to auto terminate these clusters if they are inactive for a certain period of time. This can provide huge cost savings. Now, because they keep running, any queries submitted are quickly executed, and you can autoscale out and scale in based on demand. They are costly as compared to the second type of cluster, the automated cluster. An automated cluster is used to run automated jobs. This means you specify the cluster configuration when setting up the job. As soon as the job starts, the cluster is created, and it terminates when the job ends. Since they terminate with the job, the auto-terminate option is not applicable here. Makes sense? Of course, there is an overhead involved for the job to start a cluster, but it provides high throughput because all the resources are dedicated to the job. It can also scale on demand and is much cheaper than an interactive cluster. Sounds good. Also, note that an interactive cluster supports two modes, standard mode and high concurrency mode. You'll see what the difference between the two is in just a moment. And it supports two types of autoscaling, standard autoscaling and optimized autoscaling. Again, you'll see how that works. On the other hand, the automated cluster only supports standard mode, and it's good enough, and it only uses optimized autoscaling, which helps to improve performance and save cost. Now, let's talk about cluster modes. Standard mode clusters are meant for single users. They do not provide any fault isolation. This means if multiple users are working on a standard cluster, a failure in the code execution of one user may affect other users as well. They also do not provide any task preemption, so one running workload may consume all the resources, thereby blocking queries from others. That's why it is recommended that each user work on a separate cluster. Lastly, standard mode supports all languages. On the other hand, high concurrency clusters support multiple users. They provide fault isolation by running each user's code in separate processes. Even if some users are running heavy workloads, the others get a fair share of resources that allows their jobs to complete on time. This helps in maximum utilization of the cluster, thus helping to save cost. On the downside, it only supports Python, SQL, and R, but does not support Scala. As you saw previously, an interactive cluster supports both modes, but an automated cluster only supports standard mode. This is because an automated cluster is dedicated to a job, and there is no requirement for fault isolation, task preemption, etc. Makes sense? Now, let's see how we can create a cluster. To start creating a cluster, go to the Clusters tab.
Here, you'll see the list of interactive and automated clusters. But as you know, you can only create an interactive one. Let's click on Create Cluster. Provide a name here; let's keep it as DemoCluster. You can select the Cluster Mode, Standard or High Concurrency, and by now, you very well know the difference between the two. I'm going to select the Standard mode since only I am going to use this. We'll leave the Pool option for now, but we'll come back to it later. Next, you need to select the Databricks Runtime version; we discussed it in detail in the last module. Databricks Runtime is the VM image that comes with preinstalled libraries and has a specific version of Spark, Scala, and other libraries. One thing you should notice is the different configurations of the runtime. For building a streaming pipeline, you can select version 6.6. If you want to enable machine learning, you can select 6.6 ML, which will preinstall ML libraries. If you want to use GPU-accelerated VMs with ML, you can select 6.6 ML with GPU, and it preinstalls GPU libraries. Or if you want to work on genomics, there is a separate runtime with genomics libraries. That's great. You'll also see how to build your own Databricks runtime image using a Docker container in the last module. Now, you can go ahead and select the configuration of a single worker node. These are different VM sizes provided by Azure. Depending on your requirements for memory, cores, and hard disk, you can select the configuration. Remember, all the runtime libraries will be installed on each worker node. Then you can select the number of worker nodes you need for your cluster. Let me select three here. After selecting the worker node configuration, you can now select the configuration of the driver node. You may have also noticed DBUs mentioned with each configuration. So what's a DBU? DBU stands for Databricks unit, and it is a unit of processing capability per hour. Each configuration tells you how many DBUs will be consumed if the VM runs for 1 hour, and you pay for each DBU consumed. You'll see more on this in the pricing module. In case you are not sure about the load and how many worker nodes you need, you can enable autoscaling, providing the minimum and maximum number of worker nodes, and let Databricks handle that. And finally, you can enable auto termination of the cluster by providing the number of minutes. Let's select 30 minutes, and if there is no activity for 30 minutes, the cluster will auto terminate. Hit the Create button to finish creating your cluster. Azure will now go ahead and provision the required VMs with the specified configuration and the libraries specified by the Databricks runtime. Once the cluster is up and ready, you can terminate, restart, or delete the cluster at any time. You can even edit the cluster by selecting it, clicking Edit, and changing the cluster configuration. Remember, changing the cluster configuration may require a restart of the cluster. The last two things you should note for now are the Event Log and the Driver Logs. The Event Log shows you all the events that have happened in the cluster, for example, when the cluster was created, when it was terminated, if it was edited, or if it's running fine. This helps to track the activity on a cluster. And in the Driver Logs, you'll get the logs generated within the cluster, notebooks, and libraries.
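If you prefer scripting over clicking through the UI, the same kind of cluster can also be created through the Databricks Clusters REST API, which the previous clip mentioned briefly. The sketch below is an assumption-heavy outline: the workspace URL and token are placeholders, and the runtime and VM identifiers are only examples of the naming format, not values taken from the demo.

```python
# Sketch: creating a cluster via the Databricks Clusters API instead of the UI.
# The workspace URL, token, and identifier values are placeholders/assumptions.
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                                      # placeholder

cluster_spec = {
    "cluster_name": "DemoCluster",
    "spark_version": "6.6.x-scala2.11",       # a Databricks Runtime version label
    "node_type_id": "Standard_DS3_v2",        # an Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 5},
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())   # contains the new cluster_id on success
```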

Understanding Cluster Pools and Autoscaling

Now let's see how cluster pools and autoscaling work. While creating the cluster, you saw another option, Pool. Let's see what that is. Now even though you can save cost by terminating a cluster when not in use, booting up a cluster might take a few minutes, and this can be annoying sometimes. So you can create a pool of idle, ready-to-use instances, and you can attach multiple clusters to a pool when creating them. So if cluster 1 needs two instances, the pool allocates idle instances to cluster 1. If a second cluster needs two instances, the pool creates a new instance and then allocates two instances to that cluster. After a cluster terminates, the instances are returned to the pool, from where they can be allocated further. If cluster 1 scales and needs one more instance, it can be allocated from the idle instances in the pool, and that's why a pool helps in reducing cluster start and autoscaling times. Now there are four important properties of a pool. Idle instance auto termination: this means you can define whether you want to terminate an idle instance after a certain number of minutes. Minimum idle instances: using this, a certain minimum number of instances will always be running as part of the pool, regardless of the auto termination minutes specified. Maximum capacity: this property can be used to define an upper limit on the instances in the pool. If clusters demand more than this capacity, it will result in a failure. And finally, you can define the instance type. If a cluster is attached to a pool, the worker and driver nodes have to use the same instance configuration specified in the pool. Let's see how you can create a pool and attach it to a cluster. In the Clusters tab, go to Pools and create a new one. Provide the values for the properties you just saw, name, minimum idle instances, max capacity, minutes for idle instance auto termination, and the instance type, and create the pool. And you can notice the used and idle instances. Let's create a new cluster. Let's name it DemoPoolCluster and select the pool DemoPool here. Notice that you no longer have the option to select the worker and driver node configuration. It's the same as the instance type of the pool. Fill out the rest of the properties and create the cluster. So whenever this cluster starts or autoscales, it will pick up driver and worker instances from the pool, reducing the cluster start and scale times. Very useful indeed, right? Let's see the last thing here, which is the two types of autoscaling. The first one is standard autoscaling, and the second one is optimized autoscaling. Let's see how that works. Let's assume you have a driver node and two worker nodes. There is a driver process running on the driver node and executor processes running on each worker node. Each executor has multiple slots for running tasks in parallel. Let's assume there are two per executor. In standard autoscaling, Databricks monitors the driver process. Now when you submit a job, the driver breaks it into stages and tasks. Let's assume it created five tasks. Since autoscaling is enabled, Databricks will scale out to add one more worker node, and all five tasks can now run in parallel in different slots. Let's assume that two of them have finished. Even though a worker node is idle, the cluster cannot scale in, since Databricks is only monitoring the driver. Only when all the tasks are finished and the job is complete can the cluster remove a node. That's the limitation of standard autoscaling. On the other hand, let's take the same scenario.
There is one driver and two worker nodes. But in optimized autoscaling, Databricks is not just monitoring the driver. It's also keeping information about the executors and the worker nodes. Taking the same example, if a job is submitted and divided into five tasks, Databricks will autoscale and add a new worker node. Tasks will start executing on different slots. Let's assume that two of them have finished. Since Databricks now has information about idle executors, it scales in more aggressively. Even though the job is still running, it will remove the idle worker node, therefore providing huge cost savings. Amazing, right? So autoscaling helps to run workloads faster compared to a fixed‑size cluster. It also helps to reduce the cost when a cluster is not in use. The standard autoscaling type takes slightly more time to scale out the worker nodes, and as you saw, it scales in only when the whole cluster is idle for 10 minutes. On the other hand, optimized autoscaling scales out faster, and it will scale in if the nodes are idle, even if the job is still running. In the case of an interactive cluster, scale‑in will happen if a worker node is idle for 150 seconds, and in an automated cluster, it will do the same if worker nodes are idle for just 40 seconds. Note that the optimized type is only available in the premium tier. You'll see that in the pricing module. All right, we have discussed a lot here. Let's summarize what you need when you want to set up a cluster. You need to specify the cluster mode, standard or high concurrency; the Databricks Runtime version; the configuration of driver and worker nodes, which can be the same or different; the minimum and maximum nodes for autoscaling, and based on the tier, you will have the standard or optimized type; the number of minutes for auto termination of the cluster; and you can specify a pool and its configuration for the fastest start and scale times. Sounds good.
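If you prefer to script this setup instead of clicking through the UI, here is a minimal sketch of what the same pool and autoscaling cluster could look like when created through the Databricks REST API. This is only an illustration: the workspace URL, token, node type, and runtime version are placeholders, and the exact request fields should be verified against the REST API documentation for your workspace.

```python
# Illustrative sketch: creating a pool and an autoscaling cluster via the
# Databricks REST API instead of the UI. Workspace URL, token, node type,
# and runtime version are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# Pool definition: mirrors the four pool properties discussed above.
pool_spec = {
    "instance_pool_name": "DemoPool",
    "node_type_id": "Standard_DS3_v2",           # instance type used by all attached clusters
    "min_idle_instances": 2,                     # always keep this many instances warm
    "max_capacity": 10,                          # hard upper limit for the pool
    "idle_instance_autotermination_minutes": 30  # terminate extra idle instances after this
}
pool = requests.post(f"{WORKSPACE_URL}/api/2.0/instance-pools/create",
                     headers=headers, json=pool_spec).json()

# Cluster definition: draws driver and workers from the pool and autoscales
# between 2 and 8 workers, with auto termination after 60 idle minutes.
cluster_spec = {
    "cluster_name": "DemoPoolCluster",
    "spark_version": "6.4.x-scala2.11",          # a Databricks Runtime 6 version, for example
    "instance_pool_id": pool["instance_pool_id"],
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60
}
requests.post(f"{WORKSPACE_URL}/api/2.0/clusters/create",
              headers=headers, json=cluster_spec)
```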

Working with Notebook

Now let's take a quick tour of the notebook and see how to work with it. A notebook is the place where you will actually write your code, and you can do that in any Spark‑supported language. So you can write code in Scala, Python, SQL, R, or Java. And the great part is that you can write code in multiple languages in one single notebook. We'll see that in the upcoming modules. Also, one notebook can invoke another one and pass data to it. This can help in building end‑to‑end workflows. Using an interactive cluster, you can run queries in the notebook, or you can run the complete notebook using jobs. These notebooks also support built‑in visualization. So if you have data in a tabular format, you can instantly visualize it using charts and graphs, or you can use the same dataset to quickly build a dashboard. And finally, it supports collaboration. Let's see how you can work with it. In the Databricks workspace, start by creating a notebook under any folder. Provide a name, and you can select any language, Python, Scala, SQL, or R, in which you want to do development. I'm going to select Python here. The cluster drop‑down will show a list of all running clusters, and you can decide on which cluster you want to execute your code. I'm going to select DemoCluster here, which we created earlier. You can see that the notebook is attached to the cluster. You can detach the notebook from the cluster and attach it to any other cluster. This allows you to run your code on different versions of Spark. Remember, it's not mandatory to have a running cluster for development. You can write code in the notebook and later attach it to a cluster to execute it. Now that you are all set, you can start writing the code in a cell. Let's check if everything is working fine by printing the string Hello World. To execute this, you can use the run drop‑down and select Run Cell. It went ahead and executed the cell and printed the Hello World value. You can also insert a new cell. Here, let's define a sum variable and print it. You can also execute this using the keyboard shortcut Shift+Enter. So, a notebook is basically a collection of runnable cells. You can execute a single cell or the whole notebook at once. The notebook also preserves its state, so you can use the same variable at any other place in the notebook. Now let's look at some of the cool features of notebooks and get used to them. The first one is Autocomplete. You can start writing and press Tab; this will bring up a drop‑down from which you can select commands, variables, etc. Interesting, right? Next, in the Revision History, you can see all the changes that have been made to the code, and you can select a previous version and restore your notebook to that version. The other important thing you can do is enable version control. You can either download the notebook and manually add it to your source control, or you can enable source control right inside your Azure Databricks workspace. Navigate to Account, User Settings, select Git Integration, and select either GitHub, Bitbucket, or Azure DevOps. Once enabled, you can link your notebook to your Git repo. Very useful, right? And as I said, notebooks enable collaboration. Multiple people can write code in the same notebook at the same time, and you'll see this icon if someone else is also working on the notebook. And you can select code, use this icon, and post comments. This is extremely helpful in a shared environment. Let's see one more thing. 
Let's write a command in a new cell, %md, followed by a triple hash and the title Exploration Notebook, and let's mention, testing new commands. Come out of the cell, and magic. Now you have a cell where you can write a detailed description. To edit the comment, double‑click on the cell. Any command starting with the % sign is called a magic command in Databricks. %md is used for documentation, and you can even include text, images, or links. Extremely helpful to document the notebook. You'll see more magic commands as we go forward.
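If you'd like to follow along, the cells from this demo are as simple as the sketch below; the variable name is just an example. And the documentation cell is any cell whose first line is %md, for example %md ### Exploration Notebook, followed by the text you want to show.

```python
# Cell 1: sanity check that the notebook is attached to a running cluster.
print("Hello World")

# Cell 2: notebooks preserve state across cells, so this variable can be
# reused anywhere else in the notebook (the variable name is illustrative).
sum_value = 2 + 3
print(sum_value)
```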

Configuring Security

Before you get down to writing any code, it's important that you understand and configure the security controls. I would categorize the security into three layers. The first one is the infrastructure that you are deploying. This is taken care of by Azure. We saw earlier how the workspace is secured by deploying the resources in control plane and data plane virtual networks, which are securely connected. Traffic is managed by security groups, and resources are locked to prevent any changes. The second one is identity control. Users are authenticated using Azure Active Directory single sign‑on. So to log in, a user must be part of Azure Active Directory and must be added as a user in the Databricks Admin console. You'll see that in a demo in just a minute. The third one is fine‑grained user permissions on Databricks assets like clusters, folders, notebooks, jobs, and data. Let's see how we can set it up. In the Databricks workspace, go to Account, Admin console. From here, you can add users who can access the workspace. Let's first add a user, user1@pluralsight.com. Because this user does not belong to our Azure Active Directory, this user will not be able to log in to the workspace. Let's add another one, which is part of our Active Directory, demo@pluralsight.com. This user is successfully added and can now log in to the workspace. Let me add a few more quickly. After you are done adding users, you can define which users have full permissions by making them admins, and who is allowed to create a cluster, by setting this permission, and you can even organize users in groups and give access at the group level. Let's now set up the permissions at the folder level. In the workspace, open the folder drop‑down and go to Permissions. Select a user and the permissions you want to assign to that user. These permissions are inherited down to the subfolders and all the notebooks present inside them. And you can even set up permissions at the individual notebook level. Select a notebook, and you can assign the permissions in the same way as folders. With Read permission, users can view the notebook and make comments. Run permission allows you to attach or detach a cluster to the notebook and run the commands. With Edit permission, you can make changes to the notebook, and of course Manage permission allows you to change the permissions at the notebook level. To give fine‑grained permissions at the cluster level, go to the Clusters tab, and select the Edit Permission option for a particular cluster. There are three types of permissions, Attach To, Restart, and Manage, which are self‑explanatory. Manage allows complete control over the cluster. Setting permissions on jobs is similar, but you will see that when we start creating jobs. So you have seen that Databricks comes with extremely simple, yet powerful security controls that allow you to manage fine‑grained access control, while it manages the infrastructure security and user authentication itself.

Scenario Walkthrough

Now that our environment is ready, let's walk through the scenario that we'll be addressing. Globomantics is an organization responsible for processing New York City taxi service data. They collect information related to rides, like a unique ride ID, pickup time, details of the pickup location, cab license as well as driver license, passenger count as reported by the driver, and whether it's a solo or a shared trip. The app in the taxis makes a call to an API, which stores the data in a database. This is the transactional process. At the end of every day, the analytical process starts, extracts the data from the database using a batch pipeline, processes it, and stores it in a data warehouse. The reports are then built on top of it. So to summarize, the ride events are currently being captured in a database. In their on‑prem setup, they extract this data using an ETL tool, make transformations, and build dimensions and facts. Then they store this data in a data warehouse. Following that, they build monthly aggregated reports and KPIs, like revenue collection by taxi type, or by pickup location, or total trips taken, or maximum trips per region, and much more. But this type of processing is no longer sufficient. Globomantics now wants to ingest the data in real time and process it as quickly as possible. They also want to combine this real‑time data with static data sets or with historical data for better analysis. And the team also wants to visualize the reports live so they can take quick actions. Also, it's an important requirement to store the raw data to continue doing batch processing and analysis, and they also want to store the processed data so it can be exposed to downstream applications. To build this capability, Globomantics has decided to ingest the ride data into a stream ingestion service, like Azure Event Hubs. Processing needs to be done using a stream processing service like Spark Structured Streaming running on Azure Databricks. The raw and processed data is to be made available in a file store, which is Azure Data Lake, and the live reporting and dashboards are to be built within Databricks itself.

Summary

So in this module, we went through many components of Databricks and saw how to set them up. This will act as a premise for the following modules. We saw how to create a workspace and organize all the assets like folders and notebooks. We then went ahead and created an interactive cluster and saw how it is different from a job cluster. It has two modes, Standard and High Concurrency. And we went through various configuration settings, Databricks Runtime, autoscaling, auto termination, etc. Then we saw pools, which contain a set of instances ready to be used by the attached clusters, and that helps in speeding up the cluster start and scale times. We also saw two types of autoscaling, Standard and Optimized, and how they can help us run workloads faster and reduce cost. We then created and attached a notebook to the cluster and saw some cool features, Autocomplete, revision history, Git integration, posting comments, and magic commands. We then went into configuring the security for all these assets. And finally, we walked through the scenario of the New York City taxi service that we'll be working with. So, let's get down to building the streaming pipeline by configuring sources and sinks in the next module.

Configuring Source and Sink Stores

Module Overview

Hi, and welcome to this module on configuring source and sink stores. Now that we have set up the Databricks environment, it's time to configure the stores. But before we do that, it's very important to understand how Structured Streaming provides fault tolerance. Then we'll see the various sources and sinks, Azure as well as non‑Azure, that are supported, and how to configure them. We'll then work on setting up Azure Event Hubs as our source using Databricks libraries. Next, you will learn about the Databricks file system in detail and how to mount Azure Data Lake Store as our sink. And finally, we'll prepare our sample application to simulate and send NYC taxi events. So let's get started.

Structured Streaming Fault Tolerance

Before we talk about the streaming sources and sinks supported by Spark, it's important to understand how it enables fault tolerance. One of the key goals of Spark Structured Streaming is to deliver end‑to‑end fault tolerance using exactly‑once semantics. What does this mean? Let's see. First, this means that a source event should be reliably processed exactly once. Given a set of inputs, the system should always return the same results. Second, the output should be delivered exactly once. It should ensure that after processing the inputs, all outputs are delivered to the sink only once, so that there are no duplicates. So to do that, the Spark engine tracks the exact progress of processing, and if there are any failures, it handles them by restarting the jobs and reprocessing the data, but still ensures that input is only used once and output is delivered only once. Make sense? Now, to enable fault tolerance, every component plays a key role. The streaming source should support offsets so the engine can track the read position in the stream, and it should be replayable, which means you can extract the same data from the source again. The Structured Streaming engine supports fault tolerance by using checkpointing and write‑ahead logs, and the streaming sink should be idempotent, which means that if you write the same output data multiple times, it does not duplicate the data in the sink. Let's see all these points with the help of an example. So for processing streaming data, you need a replayable source, the Spark Structured Streaming engine, which in our case is running on Azure Databricks, and an idempotent sink. The Spark engine uses a checkpoint directory to write logs so that it can keep track of progress. Let's assume there is streaming data arriving at the source. The source uses unique offsets to keep track of every event. In this case, it uses offsets 1 and 2 for the 2 events. Now, when you start the stream processing, the Spark engine first checks the logs in the checkpoint directory. In this case, it does not find anything, so it goes ahead and writes a log entry that it is going to process offsets 1 to 2. These are called write‑ahead logs. Once written to the log, it will then extract the data from the source using offsets 1 to 2 and start the processing. Meanwhile, there is more data coming into the source. Now, assume there is a failure during processing. When the stream job starts again, it will first check the logs for any incomplete process. Now it finds one, and because of that, it again extracts the data from the source for offsets 1 to 2. That's why sources should be replayable, which means they should be able to provide the same data when asked for. Sounds good. Now, once the processing is complete, the engine starts to write the data to the sink. In this case, we're assuming that the input is written as is to the sink. Assume that only partial output was written, which is the first event, and the processing fails. When it restarts, again, it will check the logs and again extract the data for offsets 1 to 2. But now, when it is writing to the sink, the sink sees that the data related to offset 1 is already committed, so it only writes for offset 2; therefore, the idempotent nature of the sink prevents duplicate data from being written. And finally, once the commit is complete, the Spark engine updates the log that the transaction is complete for offsets 1 to 2 and then picks up the next set of offsets at the next trigger. Sounds good. 
So to summarize, in order to achieve fault tolerance, the streaming source should have offsets and provide events for an offset range again when asked for; thus it should be a replayable data source. The Structured Streaming engine uses checkpointing and write‑ahead logs; it stores the offset range of the data to be processed for every trigger interval. And finally, the streaming sink should be idempotent, which means that it should support multiple writes of the same output, which is identified using offsets, without duplicating the final data set.

Source and Sink Options

Now that you have seen how Structured Streaming provides fault tolerance, let's see which sources and sinks are supported and how you can configure them. Now, some of the sources and sinks are built right into the Databricks runtime, so you can use them without setting them up, and there are many others that can be configured using Databricks libraries. But first, if you want to check whether a source is supported or configured, you can check for its driver. To do that, you can use Class.forName and pass the name of the driver. For example, if you want to check for Azure Event Hubs, you can pass the driver name and see if it is configured or not. Now, let's see some of the sources that can be used with Spark Structured Streaming. There are Azure services that can be used, for example, Azure Event Hubs, Azure IoT Hub, and Azure Cosmos DB. And then there are file‑based sources that can be used directly, like Azure Blob Storage, as well as Data Lake Store Gen1 and Gen2. On the other side, you can also use non‑Azure sources, like Apache Kafka, which is built into the Databricks runtime, and so is Amazon Kinesis. You can also use Amazon S3, which is a file‑based store, and then you can use the built‑in Delta Lake component, which is simply great. If you remember, we talked about Delta Lake briefly in the previous modules. And then there are many other replayable sources that can be configured using libraries. Now, let's see the sinks you can configure. On the Azure side, you can use all the sources that you just saw as sinks. Along with that, you can also use Azure Synapse Analytics, which was earlier called Azure SQL Data Warehouse, as a sink for your applications. And all the non‑Azure sources that we talked about can also be used as sinks. Additionally, Spark supports two types of sinks for debugging: the console sink, where you can write the output to the log, and the memory sink, where everything is written in memory and can be displayed on screen. You'll see the memory sink in the demo as well. And then comes the interesting part. If there are data stores for which there is no driver available and which can be used as idempotent sinks, you can still use them by using the foreach and foreachBatch sinks. For example, using this, you can configure SQL Server as a streaming sink. It's not tough to implement this, but note that it supports at‑least‑once semantics, not exactly‑once semantics; therefore, the output data could be duplicated on failures. Let's see how you can configure a source. First, you can use the variable sourceDF as a streaming data frame. On spark, use the readStream method, and provide the type of format, or the driver. For example, if you want to read from Azure Event Hubs, provide the format as eventhubs. Now, the schema is mandatory while defining the source. You can define the schema in a variable and pass it using the schema method. Some of the sources already have the schema defined, like Event Hubs; in those cases, you don't need to pass the schema. Then pass the various configurable options, like connection string, source name, etc., and finally, apply the load method. Remember, this statement is only building up the DAG, and it does not start execution until the sink is defined. And finally, let's see how you configure the sink. Use the transformed streaming data frame, and use the writeStream method. Again, provide the format type, for example, the parquet file format. Provide the checkpoint directory location, like /mnt/datalake/chkptlocation. You'll see later in the module how to set it up. Provide the name of the query. 
This can be very useful for query identification and even to query the streaming data. Then provide the sink information, like the location of the output file. Next, specify the trigger interval. For example, you can specify 5 seconds as your trigger interval. And finally, execute the job using the start method. Remember, the stream processing will only start when it encounters the start method. Easy, right?
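Putting those pieces together, a skeleton of the pattern just described could look roughly like this in PySpark. It assumes a Databricks notebook where spark is already defined; the connection settings, paths, query name, and trigger interval are placeholders, and the transformation step is left as a stub.

```python
# Skeleton of the source/sink pattern described above (PySpark).
# Connection settings, paths, and names are placeholders.
eventHubConfiguration = {
    "eventhubs.connectionString": "<encrypted-connection-string>"  # placeholder
}

# Source: build the streaming DataFrame. Nothing executes yet, this only
# defines the plan. Event Hubs provides its own schema, so .schema() is not
# needed here; for a file source you would pass one explicitly.
sourceDF = (spark.readStream
            .format("eventhubs")
            .options(**eventHubConfiguration)
            .load())

# ... transformations on sourceDF would go here, producing transformedDF ...
transformedDF = sourceDF

# Sink: execution starts only when start() is called.
query = (transformedDF.writeStream
         .queryName("TaxiQuery")                                    # placeholder name
         .format("parquet")
         .option("path", "/mnt/datalake/output")                    # placeholder path
         .option("checkpointLocation", "/mnt/datalake/chkptlocation")
         .trigger(processingTime="5 seconds")
         .start())
```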

Setup Azure Event Hubs and Get Maven Coordinates

All right, now it's time to set up our source, Azure Event Hubs, and get the library coordinates from Maven. In the previous module, we discussed the scenario we are going to implement. Let's focus on the first part, configuring the stream ingestion service. So, we are going to set up Azure Event Hubs as our streaming source. But first, let's see what Azure Event Hubs is. Azure Event Hubs is a fully‑managed, distributed streaming platform that can ingest, as well as process, millions of events per second. It provides very high throughput and low latency. Data can be ingested from a variety of sources; in our case, we want to ingest data from taxis. And it can reliably distribute it to multiple downstream consumers. Azure Event Hubs sits between event publishers and event consumers. So basically, it acts as a buffer so that producers can send the event stream at their own pace, and on the other side, consumers can consume it at their own pace. Now, to work with Azure Event Hubs in Databricks, we are going to use the library from an open‑source project on GitHub. It is the Event Hubs connector for Spark and Databricks. And here is the GitHub URL for the project. Now let's see what we are going to set up here. First, let's get the Maven coordinates for the Azure Event Hubs library. You can even download the library. Following this, let's create an Event Hubs namespace and an event hub within the namespace. We will then go ahead and copy the connection string. Optionally, you can also configure a new access policy or consumer group, but we'll not be covering that here. All right, let's start by visiting the GitHub page. On the GitHub page, you can see this is the Azure Event Hubs connector for Apache Spark. Let's scroll down to the Releases section. You can see there are separate versions available for Spark local deployment, as well as for Databricks. These are the latest versions available at the time of recording this course. Since we are going to use Databricks Runtime 6, let's use this version. You can see the library is available in the Maven Central repository. It is a repository provided by the Maven community, and it contains a large number of commonly‑used libraries. Let's click on the link, and it will take us to the library page. Now from this page, you can download the library and use it in your Databricks project, or you can copy the Maven coordinates, which consist of the groupId, artifactId, and the version. Let's copy the coordinates, so we can use them later. Next, let's switch over to the Azure portal. To create an Event Hubs namespace, let's search for event hubs, and create a new one. Select the resource group that we created earlier, PluralsightDemoRG. Provide a globally‑unique name for the namespace. Let me keep it as TaxiSourceEventHubNamespace. Select the region, select the pricing tier as Basic, and throughput units as 1. Click on Review+Create, and create the namespace. Once it is ready, navigate to the namespace, and create a new event hub. Provide a name; I'm going to keep it as TaxiSourceEventHub. Notice the number of partitions is set to 2. Let's go ahead and create the event hub. There are two things that we need to copy. Copy the name of the event hub that we just created. To copy the connection string, navigate to Shared Access Policies at the namespace level. Click on the default policy, and copy the primary or secondary connection string. That's it. We are now all set to configure Event Hubs in Databricks.
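For reference, Maven coordinates follow the groupId:artifactId:version pattern, so for this connector they look roughly like the line below. The Scala suffix and version are only an example; always pick the exact coordinates from the Releases page that match your Databricks Runtime.

```
com.microsoft.azure:azure-eventhubs-spark_2.11:<version>
```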

Source: Configure Azure Event Hubs Using Databricks Libraries

In the previous clip, we copied all the required entities. Let's now go ahead and configure Azure Event Hubs as the source, using Databricks libraries. But first, let's understand what Databricks libraries are. These are third‑party libraries that can be installed in the Databricks environment, and then they can be used either in notebooks or jobs running on the clusters. Now, libraries can be in any supported language ‑ Scala, Python, R, or Java. You can either upload them to the workspace, or point to their location in external repositories, like Maven, PyPI, and CRAN. In the previous clip, we copied the location of the Azure Event Hubs library, present in the Maven repository. There are three different library modes. First, a library can be scoped at the workspace level. In this case, it is stored in the workspace. Remember, a workspace‑scoped library can't be used directly. You have to install it on a cluster or a notebook to use it. Second, a library can be scoped at the cluster level. This means it exists in the context of the cluster, and it can be used by all the attached notebooks. And finally, there is the notebook‑scoped library. In this case, the library can only be used by a single notebook. Interestingly, this also means that you can use different versions of the same library on a single cluster, but with different notebooks. Sounds good? All right. Let's see how we can configure a library. The libraries are published at external repos, like Maven or PyPI. You can copy the library coordinates, or download the library. You saw this step in the previous clip. Next, you can add the library to the workspace using any of the options. This is called a workspace‑scoped library. Using the same options, you can install it on a cluster. This is the cluster‑scoped library. Once installed, it can be used by all the notebooks. A library can also be installed on the cluster from the workspace. Also, you can configure a notebook‑scoped library using the workspace library, and from there you can use it in the notebook. Notice that this option is only available in PySpark. Now let's see how we can configure Azure Event Hubs. Switch over to the Databricks workspace. From the Workspace tab, let's create a new folder, Libraries, to keep all our libraries in one place. In the Libraries folder, let's now create a new workspace‑scoped library. Click on Create Library. You can see that there are many options. You can upload a library in any of these formats ‑ Jar, Python Egg, or Python Wheel. You can refer to a library present in DBFS, or you can even point to an external library in any of these repositories ‑ PyPI, Maven, or CRAN. Since we have already copied the Maven coordinates for Azure Event Hubs, let's use that. From the Maven tab, paste the copied coordinates, and click on Create. Once imported, you can see all the dependent libraries have also been added. Great! Let's now go ahead and install it on a cluster. From the Clusters tab, go to the cluster DemoPoolCluster. Navigate to Libraries. From here, you can install a cluster‑scoped library. Click on Install New to install a new one. Again, you can see very similar options. Upload a library, refer to a library present in DBFS, or point to any external repository. But along with that, there is an option of Workspace. As you saw earlier, you can install a workspace library on the cluster. Let's select the workspace, the Libraries folder, and the Event Hubs library that we created. Let's click on Install. 
This will install the library on the cluster, and it'll be available to all the attached notebooks. Simple, right? Now let's create a new notebook, TaxiStreamingPipeline, and select the language as Python. First, let's check whether the Event Hubs library is now available on the cluster. Since the Class.forName method can only be used from Scala or Java, let's write a Scala command, and since we are in a Python notebook, we need to use a magic command ‑ %scala. Now we can check for any class in the library. Let's use the Class.forName method, and provide a class ‑ EventHubsSource. Let's execute this, and you can see that the class is now available on the cluster. To wrap up, let's add all the configuration that we copied earlier in a new cell. Let's create a variable, namespaceConnectionString, and paste the connection string that we copied in the previous clip. Add another variable, eventHubName, and provide a value, which is the name of your event hub. In this case, it is TaxiSourceEventHub. Let's combine these two to create the Event Hub connection string. Put this information in a dictionary, eventHubConfiguration. Add the connection string parameter. For the latest Spark versions, the connection string should be encrypted in the dictionary. Let's do that by using the EventHubsUtils.encrypt method. You can add more configuration here, like consumer group, offsets, etc. Execute this, and that's it. You have completed the Event Hub setup in Databricks. Easy, right?
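For reference, the configuration cell could look roughly like the sketch below. The connection string is a placeholder, and the option key and the encrypt helper follow the connector's documented PySpark usage, so double‑check them against the version of the library you installed.

```python
# Sketch of the Event Hubs configuration cell (PySpark). The connection
# string below is a placeholder for the value copied from the Azure portal.
namespaceConnectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."  # placeholder
eventHubName = "TaxiSourceEventHub"

# The entity path (event hub name) is appended to the namespace connection string.
eventHubConnectionString = namespaceConnectionString + ";EntityPath=" + eventHubName

# Recent versions of the connector expect the connection string to be
# encrypted; from PySpark the helper is reached through the JVM gateway.
eventHubConfiguration = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(eventHubConnectionString)
}
```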

Sink: Mount Azure Storage Services to DBFS

Now that we have set up the source, it's time to set up the sink, so you will see how to mount Azure Storage services to DBFS. Let's check the scenario again. In this clip, let's focus on setting up the last component. Let's configure Azure Data Lake Gen2 as our sink. To do that, let's mount it to the Databricks File System using service principal authentication, and then we can store raw streaming data, as well as processed data, in the form of Parquet files. In previous modules, we briefly talked about the Databricks File System, or DBFS. Let's discuss that in a bit more detail. Spark requires a distributed storage system to work with. Since it's just an engine, it does not have its own file system. That's why Databricks comes with DBFS, which is a distributed file system layer attached to the Databricks workspace. And as you saw earlier, it's backed by Azure Blob Storage. Now, the abstraction layer of DBFS is deployed on every cluster in the workspace, and the clusters can access everything stored in DBFS. Even after a cluster terminates, the data is safe in Azure Storage, and you can store anything: temporary files, Databricks tables, mounted external file stores and their credentials, and much more. Let's quickly recall the diagram that we saw earlier. Whenever you create a cluster in Databricks, it comes preinstalled with DBFS. Remember, it's just an abstraction layer, and it uses Azure Blob Storage at the back end to persist the data. So users can store files in DBFS, but those files will actually be persisted in Azure Storage. So even after the cluster is terminated, all the data is safe in Azure Storage. In fact, you can also mount external storage to DBFS. What does that mean? As you know, DBFS is the file system of Databricks, so mounting a file‑based storage to DBFS allows seamless access to data from the storage account without requiring credentials. Think of this as mounting a network drive to your computer. So if you mount an external storage, that's like a pointer to that storage. You only need to provide credentials the first time, when mounting the storage, and thereafter you can access it without credentials. And instead of using URLs for files, you can now use file semantics as if they were local files. But even though you are interacting with DBFS, the files are actually persisted to external storage, so they are safe, even if you delete the cluster or even the workspace. Azure Blob Storage can be mounted by using an access key or a restricted shared access signature. And Azure Data Lake, be it Gen1 or Gen2, can be mounted using a service principal, so you can mount as many additional storage accounts as you like and use them like local drives. In the demo, you will see how to mount a Data Lake Gen2 account, but for this, you need a service principal. So what's a service principal? Think of it like a service account. It's an identity that can be used by our applications to access Azure resources instead of using the Azure Active Directory credentials of a user. In order to use it, you need to create an Azure AD application and then create a secret key that acts like a password. Next, grant this newly created Azure AD app access to Azure resources, like the Data Lake store, and then use the app ID and the secret of this app in your applications to access the resources. Sounds good. For this course, I'm assuming that you already have an Azure Data Lake Gen2 account. 
If you want to learn how to create one, the detailed steps are available in the Setup Instructions document available in the exercise files. Now, let's create a service principal. In the Azure portal, navigate to Azure Active Directory, App registrations, and register a new application. Let's provide the name as PluralsightServicePrincipal and click Register. Once registered, open the application and notice two important attributes here, the Application ID and the Directory ID. Copy these values, because they'll be required for mounting. Next, go to Certificates & secrets and generate a new client secret. Copy and save the generated value, as you won't be able to retrieve it again. Now that the service principal is ready, let's give it access to the Data Lake Gen2 account. Switch over to the Data Lake account. In the account, go to Access control, click on Add role assignment, in the Role, select Storage Blob Data Contributor, then search for the Azure AD app PluralsightServicePrincipal, select it, and save the permissions. This gives the app access to the Data Lake. Finally, go to Storage Explorer, right‑click on CONTAINERS, and select Create file system, provide the name taxioutput, and create it. I'm also going to create two folders in the container, Raw and Processed. Now that the Data Lake is ready, let's switch back to the Azure Databricks workspace. Let's create a Setup notebook to keep all setup‑related information, including the mounts, in one place. To mount Azure Data Lake Gen2, let's put all the information we have collected so far in the config. In client.id, add the Application ID. In client.secret, add the secret value. Remember, we are putting the secret in plain text. This is not recommended, and you should use either Azure Key Vault or Databricks secrets to avoid this. Then add the directory ID in the URL. Now that the config is ready, let's use dbutils.fs.mount to mount the Data Lake. There are three things that you need to provide. First is the source. Here, add the name of the file system and the name of the Data Lake account. Second, select a mount point. Let's put it as /mnt/datalake. Remember, to access the files in the Data Lake, you will now be able to use this path directly without any credentials. And third, provide the configs. Let's execute the cell, and that's it. The Data Lake account has now been successfully mounted to DBFS, and it can now be accessed via any cluster or any user in the workspace. To check whether this is working, let's run the command dbutils.fs.ls and provide the mount path, /mnt/datalake, and you can see the two folders that we created. Awesome. This way, you can check all files and folders without the need to pass any credentials. One last thing. What is this dbutils? These are Databricks utilities that can perform powerful tasks inside notebooks. There are a lot of things that you can do with dbutils, and you will continue to see that during the course.
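For reference, the config and mount cells could look roughly like this sketch. The IDs, secret, account, and container names are placeholders, and as mentioned, in a real setup the secret should come from Azure Key Vault or Databricks secrets instead of being pasted as plain text.

```python
# Sketch of mounting an ADLS Gen2 file system with the service principal
# created above. All angle-bracket values are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",   # use Key Vault/Databricks secrets instead
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://taxioutput@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

# Quick check: list the mounted file system without passing any credentials.
display(dbutils.fs.ls("/mnt/datalake"))
```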

Setup Sample App to Send NYC Taxi Events

Now it's time to set up a sample application that will simulate New York City taxis for us and will send taxi ride events. Back to our scenario, let's focus on sending the events to Azure Event Hubs. Our sample application will simulate the taxi rides. Internally, it will read the events from a file. Note that I have taken the data from the NYC Taxi and Limousine Commission's open data release and have modified it for this demo. And of course, we also need to configure the app to send events to Azure Event Hubs. Sounds good. Here is the sample data for a single event when the ride starts. We have an Id that is unique for every ride; VendorId, indicating the provider associated with the taxi; PickupTime, the time at which the ride started; followed by the PickupLocationId. Location details corresponding to the IDs are available in a static file. Then there are the CabLicense and DriverLicense numbers, the number of passengers taking the ride, and the RateCodeId that indicates whether it's a shared ride, solo ride, ride to the airport, etc. So as you have seen, all the events are in JSON format, and they will be sent one by one to Azure Event Hubs by the sample application. One very important point to note here is that all the events with the same PickupTime are sent at once, so if there are five rides starting at the same time, those events will be sent together from the application. And finally, the delivery gap between two events is equal to the actual difference in their PickupTimes. So if one ride starts at the first second and the second one starts at the fourth second, the app will actually send the second event 3 seconds after the first one. These things help in simulating the actual environment. So for the demo, you will need .NET Core 3.1 installed on your machine. If you want to check or debug the code, you may use Visual Studio Code. To configure the application, you will need the connection string of the Event Hubs namespace and the name of the event hub. As I mentioned previously, the events are present in a data file. It contains rides for one day and has around 300,000 events. Now, let's go ahead and set up the application. You can download the application from the exercise files. Once downloaded, navigate to the app folder. The data file is present in the Data folder. Open the appsettings file in any editor. Let me open it in VS Code. In the configuration settings, replace the values of the Event Hubs namespace connection string, as well as the event hub name that we copied earlier, and save it. Now, open a terminal and navigate to the app folder, and build the project using the command dotnet build. And now that the build is complete, let's run the application using the command dotnet run. Once you hit Enter, it will start sending out events to the event hub. Let's do that, and you can see there are multiple events getting sent every second, each of which corresponds to the start of a taxi ride. You can also press the Escape key to pause the events at any time and press Escape again to resume. This will be helpful for analyzing streaming data later. Let's stop the application here. Easy, right?
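The course's simulator is a .NET Core app, but if it helps to see the mechanics, here is a rough, unofficial Python sketch of sending a single JSON ride event with the azure-eventhub package. The connection string is a placeholder and the field values are made up; this is not the actual sample application.

```python
# Unofficial sketch of sending one ride event to Azure Event Hubs in Python.
# Requires the azure-eventhub package; connection string and values are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<namespace-connection-string>",   # placeholder
    eventhub_name="TaxiSourceEventHub"
)

ride_event = {                                  # field names as described above; values are illustrative
    "Id": "1", "VendorId": 1, "PickupTime": "2021-01-01T00:00:01",
    "PickupLocationId": 41, "CabLicense": "ABC123", "DriverLicense": "XYZ789",
    "PassengerCount": 2, "RateCodeId": 1
}

batch = producer.create_batch()
batch.add(EventData(json.dumps(ride_event)))    # events are sent as JSON strings
producer.send_batch(batch)
producer.close()
```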

Summary

In this module, we focused on configuring source and sink stores. We started by looking at how Structured Streaming provides fault tolerance using replayable sources, idempotent sinks, and checkpointing. You then saw the various sources and sinks, Azure as well as non‑Azure. Some of them are built into the Databricks runtime, and others can be configured using Databricks libraries, and this is where you saw the mandatory and optional properties to configure a store. Then we checked out the open‑source Spark connector for Azure Event Hubs and copied the Maven coordinates. We then used those coordinates to install the library. This is going to act as the source for our pipeline, and you also now know the different library modes, scoped at the workspace, cluster, and notebook level. Next, you saw how to mount Azure Data Lake Store Gen2 to the Databricks file system, or DBFS, using service principal authentication. This is the sink of our pipeline. And finally, we set up the sample application to simulate and send taxi events. So let's get down to building the pipeline by extracting streaming data from Event Hubs, processing it, and loading it to the Data Lake in the next module.

Building Streaming Pipeline Using Structured Streaming

Module Overview

Hi, and welcome to this module on Building Streaming Pipeline Using Structured Streaming. Now that we have configured the source and sink stores, let's build the streaming ETL pipeline. We'll start here by extracting the data from Azure Event Hubs in our notebook. At first, we'll see how to use the memory sink and display the streaming data on screen. We will apply a schema to extract the taxi data and then apply transformations on top of it. You will also see how checkpointing and partitioning work. Following this, we'll load the raw data in CSV format, and the processed data in Parquet format, into our data lake. You will also see how to mix two languages ‑ Python and SQL ‑ and that's where we'll be using Spark SQL. And at the end, we'll visualize the streaming data in notebooks, and build dashboards on top of it. So let's get going.

Extract and Process Source Data

Let's first start by extracting and processing the source data. Back in the Databricks workspace, let's open the notebook TaxiStreamingPipeline. Previously we added the Event Hubs configuration here. To read the data from a streaming source, let's use spark.readStream. Since we are going to extract from Event Hubs, provide the format value as eventhubs. Next, pass in the configuration settings by using the options method, and then call the load method. Let's execute this. And you can see a streaming data frame, inputDF, has been created. And here is the schema for this data frame. There are a couple of points to note here. First, a schema must be provided for a streaming data frame. You can infer the schema for a bounded data frame, but not here. This is because the data size could be too small while streaming to infer the schema. Now, some of the data sources, like Event Hubs, provide the schema out of the box. But if you are using, say, a file source, the schema must be provided. You will see how to do that a bit later. The second point is that it has not yet started to read the data. All right. Now, can we check if this is a streaming data frame or not? Let's write a command to do that. On the data frame, use the isStreaming property and execute this. And you can see that yes, it's a streaming data frame. Sounds good? Next, let's push the data to a sink. If you remember, there are two sinks available for debugging, the memory sink and the console sink. Let's push the data to the memory sink. So on the input data frame, let's use the writeStream method. You can even specify the name of the query using the queryName method. Let's keep it as MemoryQuery. You will see how this name can be really useful. Provide the sink format. Since we are going to use the memory sink, specify the format as memory. And finally, use the start method. Only when you call the start method does the stream execution start. Also, let's assign this to a variable, streamingMemoryQuery. Remember that this is not a data frame; it's a streaming query object. Let's execute this. And you can see it has started executing the query. Also notice that while the query is running, there is no new data right now. Remember, since the memory sink is only for debugging, it only looks for the latest data. If we check the Raw Data tab, you can see the progress details of the query. It shows the query name, MemoryQuery. BatchId is 1. If there is no new data coming, the batchId will not go forward. numInputRows is 0. And you can see the source is eventhubs and the sink is MemorySink. You can also check the same data by using streamingMemoryQuery.lastProgress. Sounds good? All right, let's stop the query. Before we start to pass in any data, let's also add the trigger. If you don't specify the trigger, the query runs as soon as it can. As you saw previously, you need to add the trigger just before the start. Use the trigger method and provide the interval. Let's specify 10 seconds here and start the query again. This time also there is no data. Switch over to the sample app and run it. And you can notice that the events are now flowing. Back to the notebook, and now you can notice the events streaming in. Raw data will continue to get updated. Now you can notice the changing batchId and InputRows. Since we are using a 10‑second trigger, inputRowsPerSecond is InputRows divided by 10. processedRowsPerSecond is the number of rows it can process per second. And it also tells you the duration of each step in the micro‑batch execution. You can also notice that the source has two start and two end offsets. Why? 
Because, if you remember, we specified two partitions for the Azure event hub. It is reading the data in parallel from both partitions. Makes sense? Let's stop this query. Now let's check what is there in the inputDF. For this, we can use the display method. The display method is provided by Databricks, and it internally uses the memory sink. So we can provide the same properties as we did previously, streamName and trigger. Let's run the query. Along with the dashboard and raw data, it now also shows the actual data. This data is in the Event Hubs default format. The body contains the actual data, while the other properties represent the metadata. Let's stop the query and transform the body to get the actual data. First, import the pyspark.sql.functions. Next, let's create a new streaming data frame, rawDF. Using the inputDF, create a new derived column in the data frame using the withColumn method. Let's keep the name of this new column as rawdata and define the formula. Use the body column and cast it to a string. Next, specify which columns you want to see in the output using the select method. Here, we only want to see the taxi data, so let's specify the newly derived rawdata column. You can even view the data by running the display query. Now execute this. And this creates the rawDF data frame with only one column. And you can see now we have data in JSON format in the rawdata column. Simple, right? All right, now to extract the taxi data from this raw data string, let's first define the schema. To do that, you need to import the pyspark.sql.types. Then use the StructType method to start defining it. Whatever columns you need from the string, you can define them by adding the name and data type. Execute this, and the schema variable is ready. Let's now apply this to rawDF. On the rawDF, let's convert the rawdata column from string to JSON using the from_json method. Provide the column and then provide the schema that we just defined. You can also define a new name for this column with the JSON data using the alias method. Let's keep it as taxidata. And finally, let's select the columns individually. Use the select method and notice how you can select data from a JSON type by using the taxidata column name and the attributes in the JSON data. Once you execute this, the data frame has new column names. Now let's add transformations to rawDF and store the result in a new data frame, transformedDF. Here, we are going to use RateCodeId. If the value of RateCodeId is 6, then it's a SharedTrip; else it's a SoloTrip. Let's add a new derived column, TripType. Use the when clause to check if RateCodeId is 6. If it is 6, then mention SharedTrip, else mention SoloTrip. This is like an if/else being applied to a column and is similar to a case statement in SQL. And finally, drop the RateCodeId column since it's no longer needed. Execute this and you get a TripType field in the data frame. Notice that the execution is very quick here. This is because these are transformation operations, which only execute when the stream starts. This is what is called chaining of operations. And finally, let's add one more transformation. In case you only want to get data where PassengerCount is greater than 0, you can use the where or filter clause and provide the filter condition PassengerCount is greater than 0. Once executed, this operation will also be applied when the stream starts. Okay, let's execute this now. And you can see the processed data. This data will keep updating as new data arrives. Awesome, right? 
Let me stop the query here, and I'll also pause the sample application. Now, do we always need to apply these transformations individually? The answer is no. You can chain these operations together. So, we started reading from Event Hubs by using the spark.readStream method and provided the format as eventhubs. We went ahead and converted the binary body data to a string. Next, we transformed the string into JSON data and named it taxidata. Following this, we extracted the data and created separate columns from the JSON. Then we derived a new column by using the RateCodeId field to see whether it's a SharedTrip or a SoloTrip, and after that removed the RateCodeId. And finally, we applied the filter on the PassengerCount. Very quick and easy to build, right?
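As a reference, that whole chain could be written as a single expression, roughly like the sketch below. The schema and option values mirror the demo but may need adjusting to your own data, and the final select flattens the JSON struct in one step instead of listing each column individually.

```python
# Consolidated sketch of the chain of operations just described (PySpark).
# Assumes the eventHubConfiguration dictionary from the earlier setup cell.
from pyspark.sql.functions import col, from_json, when
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

taxiSchema = StructType([
    StructField("Id", StringType()),
    StructField("VendorId", IntegerType()),
    StructField("PickupTime", TimestampType()),
    StructField("PickupLocationId", IntegerType()),
    StructField("CabLicense", StringType()),
    StructField("DriverLicense", StringType()),
    StructField("PassengerCount", IntegerType()),
    StructField("RateCodeId", IntegerType())
])

transformedDF = (
    spark.readStream
         .format("eventhubs")
         .options(**eventHubConfiguration)                     # Event Hubs source
         .load()
         .withColumn("rawdata", col("body").cast("string"))    # binary body -> JSON string
         .select(from_json(col("rawdata"), taxiSchema)
                 .alias("taxidata"))                            # JSON string -> struct
         .select("taxidata.*")                                  # flatten into individual columns
         .withColumn("TripType", when(col("RateCodeId") == 6, "SharedTrip")
                                 .otherwise("SoloTrip"))        # derive the trip type
         .drop("RateCodeId")
         .where(col("PassengerCount") > 0)                      # keep valid rides only
)
```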

Load Data to Files

Now that we have the raw and processed data ready, it's time to load it into files. Back to the Databricks workspace. Remember, we have a rawDF and a transformedDF. First, let's load the raw data into the data lake. Again, it's a very similar query, just like the one we specified for the memory sink. Let's use the writeStream method on rawDF. Provide a query name ‑ RawTaxiQuery, but this time, provide the format as csv, since we will load the raw data in CSV format. Next, specify the path at which you want to store this data. Use the /mnt/datalake mounted location, followed by a folder name ‑ Raw. In the previous modules, you saw what a checkpoint directory is. It helps to store the offset and status information. Specify the checkpoint directory location using the option checkpointLocation, and then we have the trigger and start methods. Let's execute this. Since the sample application is paused, it's not receiving any data. Let me switch over to the app and resume the events. After a few events, let me pause it again. All right, now you can see it has processed one batch here. Now switch over to the data lake account, go to containers, and select the taxioutput container, and you can see there is the Raw folder for storing raw data and a checkpointRaw folder for storing checkpoint information. First, let's navigate to the checkpointRaw folder. In that, go to offsets, and there you have two files. File 0 has the start offsets, and file 1 is for the first batch that has been processed. Open file 1 and notice the offsets. These are the end offsets for batch one, and they will be used as the start offsets for batch two. Sounds good? All right, let's now go to the Raw folder and check the data. You can see there are three files here. Why? One of the files is an empty file for batch 0, and the other two are for batch 1. This is because, as you have seen, Azure Event Hubs has two partitions, and Spark Structured Streaming is processing both partitions in parallel. That's why there is one file for every partition. So if you run one more batch, there will be two more files here. Interesting, right? Let's try it out. Let's go back to the sample app, resume it, and pause it again, and a few events have now been generated. Switch over to the Azure portal. If you open the checkpoint folder, a new file for batch two has been generated. On the other hand, two new files have been generated in the Raw folder, corresponding to the two partitions. Makes sense, right? So to summarize, the checkpoint folder stores a separate file for every batch that is executed. Each batch file contains a set of offsets that are the ending offsets of the batch, and it takes the start offsets from the file of the previous batch. In case of a failure, it will read the offsets again from these files and reprocess the data. Now, Spark Structured Streaming processes every partition of Azure Event Hubs in parallel. That's why defining more partitions while setting up an event hub may help in faster processing of data. And finally, one file is written to the output folder for every partition. Now that we have stored the raw data in CSV format, let's store the processed data in Apache Parquet format, which is widely used and provides great performance. Unlike CSV or JSON, which are row‑based formats, Parquet stores the data in a columnar format. Just like JSON, it can store complex data structures, as well as nested data. The great thing about Parquet is that it also stores the schema of the data in the file itself. 
So you can read the data from Parquet without the need to define or infer the schema. It also supports efficient compression and encoding. That's why Parquet files are much smaller in size than CSV files. Data in Parquet is in binary format, and it takes more time to write to Parquet than to CSV files. But reading from Parquet is extremely fast, especially when you are accessing only a subset of the columns. Let's see how you can save the streaming data frame to the Parquet format. Back to the Databricks workspace. Let's add a query similar to the previous one. Notice carefully, we are going to use transformedDF, which has been created from rawDF, but that does not matter, because it's just a chain of operations that has been defined. Keep a different name for the query ‑ ProcessedTaxiQuery, and this time, specify the format as parquet. Specify a different output location and checkpoint directory. Let's keep a different trigger interval for this query. Let's execute this as well; and now you can see both the queries are executing in parallel. They both are extracting data separately from the event hub, based on their checkpoint information. Also, since they have separate trigger intervals, the number of micro‑batches and the volume of data in every batch are different. But ultimately, they are working with the same set of data, without interfering with each other. One is storing the raw data, and the other, the processed one. Interesting, right?
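Side by side, the two load queries could look roughly like this. The Raw and Processed folder names and the checkpointRaw folder follow the demo, while the processed query's checkpoint folder name and both trigger intervals are just illustrative choices.

```python
# Sketch of the two load queries: raw data to CSV and processed data to
# Parquet, each with its own checkpoint directory and trigger interval.
rawQuery = (rawDF.writeStream
            .queryName("RawTaxiQuery")
            .format("csv")
            .option("path", "/mnt/datalake/Raw")
            .option("checkpointLocation", "/mnt/datalake/checkpointRaw")
            .trigger(processingTime="10 seconds")               # illustrative interval
            .start())

processedQuery = (transformedDF.writeStream
                  .queryName("ProcessedTaxiQuery")
                  .format("parquet")
                  .option("path", "/mnt/datalake/Processed")
                  .option("checkpointLocation", "/mnt/datalake/checkpointProcessed")  # illustrative name
                  .trigger(processingTime="30 seconds")         # different interval on purpose
                  .start())
```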

Working with Spark SQL and Visualizing Data

So far, we have been working with Python, and by now you know very well that you have different language options to work with. Databricks provides an interactive way to work with multiple languages together. So let's see how to work with SQL along with Python. But before we do that, let's understand what an execution context is. An execution context is an isolated environment in which the code is executed and the state of all variables, objects, and functions is maintained. In Databricks, a new execution context is created on the cluster for every combination of language and notebook. This means if you're writing Python code in one notebook, it runs in an execution context, and if you write code in another notebook, it's a different execution context. Because they are isolated, objects in one execution context cannot be shared with another one. For example, a variable created in one notebook cannot be accessed in other notebooks. That also means if you use Python and SQL in the same notebook, the objects created in one language can't be used in the other language. So to work with multiple languages and notebooks, you need to pass the data around from one context to another. Let's see how that works. But first, go to Storage Explorer in the Data Lake account. I have created a new folder here, StaticData, and uploaded a file, TaxiZones.csv. We'll be using this file here. The detailed instructions to upload the file are available in the setup document. Back to our Databricks workspace. In our notebook, TaxiStreamingPipeline, we have already created a DataFrame, transformedDF. Since it is in Python's execution context, you can't use it in SQL. To use it in SQL, you can use the method createOrReplaceTempView on the DataFrame, and provide the name with which you want to refer to it in SQL. Let's keep it as ProcessedTaxiData. Execute this, and see how quick it is. This creates an in‑memory temporary view, which is in the execution context of SQL, and it's only valid in this notebook. Think of it like a pointer to the DataFrame. You can now go ahead and use the spark.sql method to write a SQL query, and the temporary view, ProcessedTaxiData, can be used as a streaming table. Execute this, and now you're running a full SQL query on a streaming DataFrame in Spark. Very useful if you're coming from a SQL development background. But now comes the interesting bit. Databricks goes one step further and allows you to use the magic command %sql. You no longer need to create the SQL as a string and pass it to spark.sql. You can directly write the same SQL query in a Python notebook. Execute this, and you get the same result. Wow! That opens up a lot of possibilities. Now let's combine the streaming data with some static data using SQL. Of course, you can do that in Python as well. Let's extract the data from the TaxiZones file that you saw. This is going to be a static DataFrame. To do that, use the spark.read method. I believe you have already noticed this: to read streaming data, you use the readStream method, and for static data, it is the read method. Next, specify an option that the file contains headers, and let it infer the schema as well. Use the csv method, and provide the path of the TaxiZones file. And let's use the display method to visualize the data. Execute this, and you can see the file has four columns, LocationId, Borough, Zone, and ServiceZone. Sounds good? All right, first, let's create a temp view called TaxiZones from taxiZonesDF, and finally, join the two DataFrames. 
Here, let's join on the LocationId, filter by the Manhattan borough, and group it by zone. The idea behind showing you this query is that you can easily use DataFrames like SQL tables. Let's execute this. As we discussed previously, Databricks supports built-in visualizations. You can convert this dataset into a chart quickly, and use the plot options to customize it according to your needs. Since it uses a streaming DataFrame, the chart will change with every execution. Great. Let me show you another interesting thing. You can even build dashboards here. Click on this link to add this streaming chart to a new dashboard. This opens up a new page. Provide the name of the dashboard, Taxi Pickups, and present the dashboard. Very neat. Every dashboard is attached to one notebook, and you can put charts and tables from that notebook on it. So every time the notebook executes, the data is updated, and if it's streaming data, it's continuously updated. Interesting, right?
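As a hedged sketch, the join from the demo might look like the code below; the column names LocationId and Borough are assumed from the file we just explored.

```python
# Sketch of joining the streaming view with the static zones view in SQL.
# Column names (LocationId, Borough, Zone) are assumptions based on the demo data.
taxiZonesDF.createOrReplaceTempView("TaxiZones")

pickups_by_zone = spark.sql("""
    SELECT z.Zone, COUNT(*) AS PickupCount
    FROM ProcessedTaxiData t
    JOIN TaxiZones z
      ON t.LocationId = z.LocationId
    WHERE z.Borough = 'Manhattan'
    GROUP BY z.Zone
""")

# In Databricks, display renders this as a table or chart that updates as the stream runs.
display(pickups_by_zone)
```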

Summary

All right. In this module, we started by extracting the data from Event Hub. We then converted the binary data to a JSON string, and the JSON string to taxi columns. We then saw two ways to use the memory sink: one by using the format as memory, and the second by using the display function in Databricks, and we also checked the status of streaming queries using the lastProgress method. Following this, we applied business transformations using common methods like select, withColumn, drop, where, etc. We then added the checkpoint directory, and saw that one file is created for every batch with offset information, and this helps in restarting the jobs after failure, enabling fault tolerance. Then we loaded the raw data in CSV, and the processed data in Parquet format, into the data lake, and as you know, Parquet provides higher read performance and great compression. Then we saw the great feature of using Spark SQL along with Python. We can use the spark.sql method, or use the %sql magic command to write SQL queries. To pass the data from Python to SQL, we registered the DataFrame as a temporary view, and not just this, we joined the streaming DataFrame with a static DataFrame, and then visualized the data in the form of charts, and we can even build dashboards on top of it. Sounds good? Now let's see how to make our pipeline production ready by parameterizing it, and scheduling it as a job, in the next module.

Making Streaming Pipeline Production Ready

Module Overview

Hi, and welcome to this module on making the streaming pipeline production ready. Now that we have built the streaming pipeline, it's time to make it ready for production. Since we wrote a lot of code to explore the data, we will start by cleaning up and organizing the code. We will then add parameters to the notebook so all the hard-coded values can be made configurable. After we are done with that, we will schedule the notebook using Databricks jobs and see what options are available. And finally, we will talk about some of the best practices for our streaming pipeline. Sounds good? So let's continue.

Parameterize Streaming Pipeline

To make our pipeline production ready, let's start by parameterizing the pipeline. Since we wrote a lot of code to understand the features of Databricks, let's clean up some of the code and add it to a new notebook, add relevant comments, and log information. Switching back to our Databricks workspace, I have created a new notebook, TaxiStreamingPipelineProduction. Then I have added some custom log information, using the print command, throughout the notebook. Once the notebook runs, this information will show up in the log. And then we have the same code that we wrote in previous modules. At the end, there are three streaming queries. The first one is for the raw data, with the raw folder path and the checkpoint location. The second one is for the processed taxi data, with the transformations applied. It has separate folders to store data and checkpoint information. And finally, there is a SQL query. The output of this will be in the form of charts. Now, instead of hard-coding the Event Hub information, let's pass it as parameters. To create a parameter for the notebook, we're going to use another Databricks feature called Databricks widgets. To create a widget, let's run a command, dbutils.widgets.text. This creates a text-based input. Provide the name of the widget, EventHubNamespaceConnectionString, a default value, and a display label value. Add one more for the name of the Event Hub. Let's execute this. And you can see two widgets have been added at the top of the notebook. These will act as parameters to the notebook. There are four types of widgets you can define: the text widget, which we just created, allowing text as an input; the drop-down widget, where you can select a value from the list of available values; the combobox widget, which acts as a combination of the text and drop-down widgets, where you can either select a value or type in a new one; and the multiselect widget, where you can select one or more values from the list. Let's add the Event Hub connection information in the widgets. To extract and assign the value of a widget, let's use another command, dbutils.widgets.get. Provide the widget name, EventHubNamespaceConnectionString. Do the same for the Event Hub name, and let's print the values as well. Once you execute this, you can see they have picked up the new values from the widgets. Easy, right? Finally, as an exercise, create a new drop-down parameter for Borough, use it in SQL, and see the effect.
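A minimal sketch of these widget commands is shown below; the widget names, labels, and the Borough values in the exercise are assumptions based on the demo.

```python
# Sketch of the notebook parameters using Databricks widgets.
# Widget names, default values, and labels are assumptions from the demo.
dbutils.widgets.text(
    "EventHubNamespaceConnectionString",       # widget name
    "",                                        # default value
    "Event Hub Namespace Connection String"    # display label
)
dbutils.widgets.text("EventHubName", "", "Event Hub Name")

# Read the widget values back into Python variables.
event_hub_connection_string = dbutils.widgets.get("EventHubNamespaceConnectionString")
event_hub_name = dbutils.widgets.get("EventHubName")
print(event_hub_name)

# Exercise: a drop-down widget for Borough might look like this (values assumed).
dbutils.widgets.dropdown(
    "Borough", "Manhattan",
    ["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"],
    "Borough"
)
```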

Scheduling with Databricks Jobs

Now that we are done cleaning up the notebook and setting up the parameters, it's time to run it using Databricks jobs. So, what are Databricks jobs? Jobs allow the execution of a notebook or any existing JAR file. A job can run immediately, or it can be scheduled. And by now you know that jobs can run on automated clusters. They are created and terminated with the job. But you can even use an interactive cluster to run them. Each job can also have a different cluster configuration. This allows you to use a small cluster for smaller jobs and a large cluster for the bigger ones. And finally, you can fully monitor these job runs, retry on failures, and set up alerts for notification. So, to configure a job, you need to provide the configuration for a new automated cluster or select an existing interactive cluster. You can specify a schedule for the job using CRON syntax. Next, you can configure alerts that can send out emails on job start, success, and failure. You can define the maximum number of concurrent runs you want and a timeout value for the job, and if you want the job to retry on failures, use the retry policy. And finally, you need to select the job task type. Let's see what that is. There are three task types, and you have to select one of them for the job. The first one is the notebook type. Here, you can select an existing notebook from the workspace that you want to execute. And, of course, you can define parameters and any dependent libraries. Second, you can use your own JAR files or any third-party ones. You can upload a JAR in the job and then provide the main class name and the arguments. This will help you run your existing applications on Databricks. Another way to run third-party JAR files, or even Python scripts, is by using the spark-submit command. Again, provide the file location and other arguments to run it as a job. Now let's see how we can run our notebook as a job. Back in the Databricks workspace, setting up a job is a very intuitive and straightforward process. On the left-hand side, there is a Jobs tab. Click on it to access the list of jobs. Click on Create Job to set up a new one. Let's fill in the properties here. Provide the name of the job. Let's specify Taxi Streaming Job. As you have seen, there are three ways to run the workflow or the task: by using a notebook, setting up a JAR file, or configuring parameters for the spark-submit command. Let's use the notebook task type. Select the notebook TaxiStreamingPipelineProduction. Remember, you can only kick off one notebook from here, but you can build a master notebook, which can internally call other notebooks. But doing this is not recommended in streaming cases. Next, add the parameters for the notebook in the key-value format. Let's add the EventHubNamespaceConnectionString and EventHubName parameters, provide their values, and confirm the information. Since our code is dependent on the Event Hubs library, let's add it here from the workspace. This will be installed on the cluster. Provide the cluster information. Click on Edit, and you can see that there are two options. Select the existing interactive cluster option and choose one from the list. Or you can select the New Automated Cluster option and provide the same information as you did while creating an interactive cluster: pool information, Databricks runtime, the autoscaling option, and the configuration of worker and driver nodes. The only difference here is that there is no auto-termination option. Why? Because automated clusters are created and terminated with the job.
Next, you can specify the schedule to run this job. Since we have streaming queries that are going to run continuously, we don't need to add a schedule here. There are a few more options available. In the Advanced section, you can specify email alerts. Let's specify alerts for job start, success, and failure. Next, keep the Maximum Concurrent Runs as 1, and don't specify a timeout value, since the streaming queries will keep running. If you want your streaming jobs to restart automatically on failure, specify the retry value. Finally, specify the job permissions, just like we specified permissions for clusters and notebooks. And that's it. Our job is now fully configured. Let's trigger the job manually by clicking on Run Now. And you can see a new job run has been triggered. You can monitor this job run by clicking on the Run ID. It will create an automated cluster and start running the streaming pipeline. Simple, right?
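If you prefer to script this instead of clicking through the UI, a hedged sketch using the Databricks Jobs REST API (version 2.0) might look like the code below. The workspace URL, personal access token, notebook path, library coordinates, and cluster settings are all placeholders or assumptions, not values from the demo.

```python
# Hedged sketch: creating a similar job through the Databricks Jobs REST API (2.0)
# instead of the UI. All URLs, tokens, paths, and versions below are placeholders.
import requests

job_config = {
    "name": "Taxi Streaming Job",
    "new_cluster": {
        "spark_version": "6.4.x-scala2.11",   # assumed Databricks runtime version
        "node_type_id": "Standard_DS3_v2",    # assumed worker/driver VM size
        "num_workers": 2
    },
    "libraries": [
        # assumed Maven coordinates for the Event Hubs connector
        {"maven": {"coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15"}}
    ],
    "notebook_task": {
        "notebook_path": "/Users/me@example.com/TaxiStreamingPipelineProduction",
        "base_parameters": {
            "EventHubNamespaceConnectionString": "<connection-string>",
            "EventHubName": "<event-hub-name>"
        }
    },
    "email_notifications": {"on_failure": ["me@example.com"]},
    "max_concurrent_runs": 1,
    "max_retries": -1   # -1 means retry indefinitely
}

response = requests.post(
    "https://<databricks-instance>/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_config
)
print(response.json())  # returns the new job_id on success
```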

Best Practices

Now let's look at some of the best practices. While doing development, always consider specifying a trigger interval. Of course, if you don't specify one, the batch will execute as quickly as possible. But specifying no interval, or one that is too small, causes unnecessary checks at the source. So avoid that by providing an appropriate interval. Next, while using the display function of Databricks, it's recommended to provide the optional properties as well. As you know, the display function uses the memory sink. Here, provide a stream name, or else the display function generates a new one. Again, if you don't provide a trigger interval, it runs as quickly as possible, so provide the trigger interval to avoid unnecessary checks at the source. And it's recommended to provide a checkpoint location as well. All right, let's now see some of the best practices for performance. When you are creating a cluster, plan to use a cluster pool. As you have seen previously, a pool helps in reducing cluster start and scale times. Then, enable autoscaling on the cluster. This helps in maximum utilization of the cluster and can also handle unexpected load. Next, try to align the cluster cores with the Event Hub partitions. What does this mean? There is a 1:1 match between an Event Hub partition and a DataFrame partition. So when you read the data from an Event Hub with two partitions, two DataFrame partitions are created, and each DataFrame partition is processed using one core. So either set an appropriate number of Event Hub partitions, or, if there are more partitions than cores, increase the number of cores to improve parallel processing. Sounds good? Next, use scheduler pools to improve performance. Let's see what that is. If you remember, we ran two streaming queries in our demo. Now, those queries were running in the same scheduler pool because we did not specify any pool there, and they were following first in, first out, or FIFO, order. This means if a micro-batch of one query is executing, the other query will be blocked and will have to wait. This causes delays in executing the query. To prevent that and run queries concurrently, you can use fair scheduler pools. Let's see how to do that. First, in a new cell, set a local property, spark.scheduler.pool, to a new pool, pool1, and then start the first query. Following this, in a different cell, set another pool, pool2, and then run the second query. This ensures that both queries can run concurrently. Awesome, right? Now let's see how we can improve the stability of our applications. As you have seen while building the pipeline, always enable checkpointing. First, this helps in ensuring exactly-once processing. This means that an event will be processed only once, and the output will not be duplicated. Second, it also helps in enabling fault tolerance. Next, when you are setting up a job, set the retries property to unlimited. This means that even if there is a failure, the job will automatically start again and run your streaming pipeline. And the great thing is, it will create a new automated cluster. Makes sense? And of course, always set up alerts while creating a job. So if there is a failure in the job, you will receive emails, and you can take action on that. And let's look at the cost part. Now, even though you can run jobs on an interactive cluster as well, always plan to use automated clusters. First of all, as you have seen, you get optimized autoscaling. The cluster scales in and scales out more aggressively, and this can help to save cost.
And second, you will be using the Data Engineering workload, which is cheaper than using an interactive cluster. You will see more on this in the next module. And finally, if you don't have a real-time processing requirement, you can still create streaming jobs and run them periodically using the RunOnce trigger. Let's see why you would do that. You can use it to provide recommendations to users. If you don't want to do that in real time, you can process the data every hour. Or, instead of processing logs instantly, you can do that, say, every 6 hours. But you might think, can we do that using a batch pipeline? Of course, you can. But event time is important in these use cases. And it always helps to process only the new data using checkpoints. It also provides an exactly-once guarantee and provides fault tolerance for processing out of the box. That's why it's better to use the streaming APIs. But if you don't want to run it continuously, running it periodically can provide huge cost savings. Now, to use the RunOnce trigger, use the trigger method and pass once=True. Great. But once you plan to put that in production, create a Databricks job, but this time define a schedule for execution. And lastly, set the retries to none. Very useful, right?
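Putting the scheduler pool and run-once ideas into code, a rough sketch might look like this; the DataFrames, paths, and pool names are assumptions carried over from the earlier demo.

```python
# Sketch of running two streaming queries in separate fair scheduler pools
# so their micro-batches don't block each other. rawDF, transformedDF, and
# all paths are assumed from the earlier demo.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
raw_query = (
    rawDF.writeStream
        .format("csv")
        .option("path", "/mnt/datalake/taxi/raw")
        .option("checkpointLocation", "/mnt/datalake/taxi/checkpoints/raw")
        .trigger(processingTime="60 seconds")
        .start()
)

# In a different cell, assign the second query to another pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
processed_query = (
    transformedDF.writeStream
        .format("parquet")
        .option("path", "/mnt/datalake/taxi/processed")
        .option("checkpointLocation", "/mnt/datalake/taxi/checkpoints/processed")
        .trigger(processingTime="30 seconds")
        .start()
)

# For periodic (non-continuous) runs, a run-once trigger processes whatever is
# new since the last checkpoint and then stops.
batch_style_query = (
    transformedDF.writeStream
        .format("parquet")
        .option("path", "/mnt/datalake/taxi/processed")
        .option("checkpointLocation", "/mnt/datalake/taxi/checkpoints/processed")
        .trigger(once=True)
        .start()
)
```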

Summary

In this module, we worked on making our pipeline production ready. We started by cleaning up all the exploratory work and organizing the multiple streaming queries in the notebook. We then went ahead and parameterized the notebook by using Databricks widgets. There are four types of widgets available: text, dropdown, combobox, and multiselect. And we can extract data from widgets using the dbutils.widgets.get method. You then saw how to schedule the notebook using Databricks jobs. You can either use an existing interactive cluster or a new automated cluster. The automated cluster starts and terminates with the job. You also saw various ways of submitting the job: using a notebook, using JAR files, and using the spark-submit option. So to configure a job, you have to provide the cluster configuration and the job task type. You can set up alerts, define a schedule, provide a timeout value, set a retry policy, etc. And finally, we talked about some of the best practices related to development, performance, stability, and cost, like adding a checkpoint directory, setting an appropriate trigger interval, aligning partitions, using the RunOnce trigger, configuring autoscaling and scheduler pools, etc. Sounds good? Let's see the different workloads and pricing, and how Structured Streaming compares to other services, in the next module.

Understanding Pricing, Workloads, and Competition

Module Overview

Hi, and welcome to this module on understanding pricing, workloads, and competition. In this module, we'll start by learning about the different Databricks workloads and tiers. Then we'll see what features are available for various combinations of workloads and tiers. Following this, we'll go into detail and see how pricing works for Azure Databricks. And lastly, we'll compare Spark Structured Streaming running on Azure Databricks with other streaming services hosted in the cloud. So let's get going.

Workloads, Tiers, and Pricing

Now let's look into the different workloads and tiers in Azure Databricks, the features they offer, and how they affect the pricing. There are three distinct workloads available. First is the Data Analytics workload. This workload is used to interactively explore and analyze the data using notebooks. In simple words, using an interactive cluster means that you are using the Data Analytics workload. When you are done with development and want to run your code as an automated job, you are going to use the Data Engineering workload. Simply put, when you are using automated clusters to run notebooks or libraries, you are using this workload. But remember, as we discussed earlier, you can also run a notebook as a job on an interactive cluster, but that would mean you're using the Analytics workload and not the Engineering workload. The last workload is Data Engineering Light. Here, you can only create jobs using libraries, and not notebooks. Remember, this workload does not provide the optimized Databricks environment. So if you have an existing library, you can use Databricks to provision a cluster, attach the library, and schedule it as a job. This helps organizations that have existing investments in Spark but want to leverage the infrastructure-related features offered by Databricks. We'll compare all three workloads in just a moment. If you remember, while creating the Azure Databricks service from the Azure portal, there were two options for tiers, a Standard tier and a Premium tier, so we select a tier while creating the service itself. Both tiers offer a set of features, and all three workloads are available across these tiers. Let's see what features are offered by the Standard tier across different workloads. In the Standard tier, the managed Spark environment is available for all workloads. The next one is job scheduling, but before that, let's recall that for a job there are three task types: the notebook task, where you can use an existing notebook present in the workspace; the JAR task, where you can use your own or a third-party library for the job; and the spark-submit type, which is a command-line option for running JAR files. Now, you can use any task type and any Databricks runtime, but when you are using an interactive cluster to run your code, you are using the Data Analytics workload. In the same way, you can use any task type and any Databricks runtime, but when you are using an automated cluster, this means you are using the Data Engineering workload. But as you saw previously, if you want to use the Data Engineering Light workload, you can only use the JAR or spark-submit types on an automated cluster, and while setting up the cluster, you have to select the Databricks Runtime Light environment. So as you saw, job scheduling with libraries is available across all the workloads, but scheduling with notebooks is only available in the Analytics and Engineering workloads. And the great features of Databricks, like auto-scaling and auto-terminating clusters, Databricks Runtime ML, Managed MLflow, and Delta Lake, are available in both the Analytics and Engineering workloads. And as we just talked about, interactive clusters are only available in the Analytics workload, and all the Databricks interactive features, like notebooks, data visualization, collaboration, etc., are only available here. And finally, you can do interactive development in the R language using RStudio, and every Databricks cluster also runs a JDBC and ODBC server, which BI tools can connect to in order to query the data present in Databricks tables. Now let's look at the Premium tier features.
The Premium tier includes all the features of the Standard tier, but the fine-grained, role-based access control on all the assets, which you saw while configuring the security, is only available in the Premium tier. Also, the optimized autoscaling that you saw earlier is only available here, but this is not applicable to the Engineering Light workload, since it does not support autoscaling. And finally, you can also capture a detailed audit trail of the activities performed in your workspace by using Audit Logs. These are also called Diagnostic Logs in Azure. Note that there are more features available in the Premium tier, but we haven't discussed them in the course. Alright, let's now understand what you pay for, of course through a unified bill, when you work with Azure Databricks. You pay for Azure resources, which include the virtual machines provisioned as part of the cluster, the disks associated with those virtual machines, the usage of the Azure storage account as part of DBFS, and the public IP addresses that are provisioned. You also pay for the Databricks service, based on the DBUs consumed. But what's a DBU? A DBU, or Databricks Unit, is a unit of processing capability per hour. You saw while creating the cluster that each VM configuration has an associated DBU value. It tells you how many DBUs will be consumed if the VM runs for one hour, and you pay for each DBU consumed. Now, the price of a DBU depends on the combination of workload and tier. Let's take an example. As you know, if you are using an interactive cluster, you are using the Data Analytics workload. In the Standard tier, you will need to pay $0.40 per hour for every DBU, and if you have created a Premium tier workspace, you will be paying $0.55 per hour per DBU. So you are paying extra per DBU for features like role-based access control, etc., and the price is significantly lower for the Engineering and Engineering Light workloads. But remember, these prices may change from time to time. Now you know the price of a DBU depends entirely on the combination of workload and tier. That's why a change in workload or tier only affects the DBU price; it does not affect the VM price. Makes sense? So let's see how to calculate the cluster price. First, you provision VMs for worker and driver nodes; multiply their number by the per-hour price of the VM, and this gives you the cost of the Azure VMs. Then, for the DBU cost, multiply the number of VMs by the number of DBUs per VM. And finally, multiply this by the per-hour price of a DBU, which is based on the workload and tier. The combination of the VM price and the DBU price gives you the final price of the cluster. Let's try to understand that with the help of an example. Let's say you have three worker nodes and one driver node. All these nodes, or VMs, are of DS4 v2 size, with a price per VM of $0.458 per hour. This VM configuration has 1.5 associated DBUs. This means the cost of all four VMs, multiplied by the per-hour charge of $0.458, is $1.83 per hour, and a total of 6 DBUs are consumed. On the other hand, depending on the workload and tier used, the DBU price is calculated. If you select the Data Analytics workload on the Standard tier, you pay the $0.40 per hour DBU charge multiplied by 6 DBUs, which is $2.40 per hour. But if you change the tier to Premium, you pay the $0.55 per hour DBU charge for a total of 6 DBUs, equaling $3.30 per hour. So in total, you pay the $1.83 VM cost plus the $3.30 DBU cost per hour. Sounds good?
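As a quick sanity check, here is the same calculation written out as a small script, using the prices quoted above (which, again, may change over time).

```python
# Back-of-the-envelope sketch of the cluster price calculation from the example.
# Prices and DBU values are the ones quoted above and may change over time.
num_vms = 4                 # 3 workers + 1 driver, all DS4 v2
vm_price_per_hour = 0.458   # USD per VM per hour
dbus_per_vm = 1.5           # DBUs associated with the DS4 v2 size
dbu_price_per_hour = 0.55   # Data Analytics workload on the Premium tier

vm_cost = num_vms * vm_price_per_hour        # 4 * 0.458 = $1.83 per hour
total_dbus = num_vms * dbus_per_vm           # 4 * 1.5   = 6 DBUs
dbu_cost = total_dbus * dbu_price_per_hour   # 6 * 0.55  = $3.30 per hour

print(f"Total cluster cost: ${vm_cost + dbu_cost:.2f} per hour")  # ~$5.13 per hour
```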

Comparison with Other Streaming Services

Now that you have seen Spark Structured Streaming running on Azure Databricks, let's compare it with other streaming services. But before that, let's see the features against which we're going to compare the different services. The first one is the true unification of batch and streaming APIs. Then we'll look into the end-to-end delivery guarantees of the services. Following this, we'll see if we can run interactive queries on streaming data. Next, the languages which are supported by each service, and then whether support is available to join with static data. And finally, whether they are hosted on a managed platform with any cloud provider. But remember, every service has its own great features; here we are going to compare these services against a few of them only. Also, new features keep being released in these services. Alright, the first one we are going to compare against is Apache Flink. It is hosted on AWS as Kinesis Data Analytics. The second one is Apache Storm. It is available as a hosted solution on Azure within HDInsight, where you can create a Storm cluster. The next one is Azure Stream Analytics, which is a proprietary service from Azure, and finally, there is Apache Beam, which is available on Google Cloud Platform as Dataflow. So let's compare these services. As you know, Spark Structured Streaming provides a unified set of batch and streaming APIs. It's not just that the DataFrame APIs are common; internally, they use the same runtime. On the other hand, Apache Flink's streaming API can handle bounded datasets as well, but the APIs and the underlying runtime are different for batch and streaming. Apache Storm and Stream Analytics do not provide unified APIs. Apache Beam does provide that, but internally it has different runtimes for batch and streaming. That's why it is not a true unification of APIs. Now let's compare these services for end-to-end delivery guarantees. The first type of delivery guarantee is exactly once. This means that the streaming service needs to ensure that each incoming event affects the final result exactly once. So when the job restarts after a failure, it guarantees that there is no duplicate data in the output and no data is left unprocessed. Makes sense? The second type is the at-least-once guarantee. Here, the streaming service ensures that each incoming event is processed at least once, but it does not provide any guarantee on the output. This means the output may receive duplicate data. As you have seen in previous modules, Structured Streaming provides an exactly-once delivery guarantee, and Flink also provides the same guarantee. But Storm provides an at-least-once guarantee, so output data may get duplicated. Stream Analytics, as well as Beam, also provides an exactly-once delivery guarantee. Sounds good? A very important feature for a streaming service is to allow running interactive queries on the streaming data. This is required to get the current status and see the progress and metrics of a running query. You can also use it to check what data was processed and the rate of data processing, check latencies, and much more. And you would also want to run ad-hoc queries against the current data in the stream. While building the streaming pipeline, you saw that Structured Streaming does allow you to run interactive queries on the streaming data, but none of the other services provide this functionality. This makes Spark a very useful platform to build streaming pipelines. The next one is the language support for these services.
As you already know, you can use multiple languages in Spark to build pipelines and query streaming data. So you can use Scala, Python, Java, R, and SQL. Interestingly, there is an open source project being built to support the .NET languages, C# and F#. Then you have Flink, which supports Java and Scala, and Java is the primary language for Storm, but it has a great feature called Bolts, using which multiple other languages are supported, like Ruby, Python, and Fancy. Then there is Stream Analytics, which has a SQL-like syntax. It's a very feature-rich language, and SQL developers can easily start using it. And finally, Beam has support for three languages: Java, Python, and Go. Great! Let's see if these services have support for static data, which is very important. You should be able to join the streaming data with static or reference data and enhance your final output. In Spark Structured Streaming, you can read any source data as a DataFrame, and then you can join it with streaming data. This is what you saw when we joined the taxi streaming data with the static zone data. So we joined a streaming DataFrame with a static DataFrame. This kind of support is not available in Flink, nor is it available in Storm. It is available in Stream Analytics, but it's limited to Azure Blob Storage and Azure SQL Database, so you can only join with static data present in those services. Similarly, Apache Beam on Google Cloud only supports joining with data present in Google BigQuery. So let's summarize all the features you have seen across the different streaming services. All the services are hosted on one cloud or the other. While many services provide batch and streaming APIs, only Spark provides truly unified APIs. While Apache Storm provides an at-least-once guarantee, the rest of the services do provide an exactly-once delivery guarantee. And Spark is the only one that allows you to run interactive queries on the streaming data. Then we saw the languages supported by all of them, and finally, most of the services have no or limited support for static data. But Structured Streaming allows you to use static data from a variety of data sources. Again, remember, there are many other factors that influence the decision to use these services, but we have just compared a few of them. But I believe this would have given you an idea that Spark Structured Streaming is a great service to use. Right?

Summary

So in this module, we looked at various features offered by Azure Databricks and how the pricing works. You first saw that there are three types of Databricks workloads: Data Analytics, when you are working with an interactive cluster; Data Engineering, when you are using an automated cluster; and the Engineering Light workload, when you use the non-optimized Databricks Runtime Light environment with an automated cluster. Then you saw tiers, which you select while setting up the workspace. There are the Standard and Premium tiers, and the features are available across combinations of workload and tier. Next, we looked into the pricing aspect. You pay for the Azure resources that have been created, like VMs, storage, etc., and for the total number of Databricks Units, or DBUs, consumed, but the price of a DBU again depends on the combination of workload and tier. Then we compared Spark Structured Streaming with other streaming services hosted on the cloud, like Apache Flink on AWS, Storm on Azure HDInsight, Azure Stream Analytics, and Beam on Google Cloud, and compared them against features like unified APIs, end-to-end delivery guarantees, static data support, etc. Let's continue and see how you can customize the cluster using initialization scripts and Docker containers in the next module.

Customizing the Cluster

Module Overview

Hi, and welcome to this module on customizing the cluster. Let's see different ways to control the libraries and settings on your cluster. We'll look at two different ways. We'll start by using initialization scripts. These are shell scripts that can run on a cluster before it becomes active, and we will also verify whether they have been deployed successfully or not. Then we'll see what Databricks Container Services are. Here, we'll do a quick walkthrough of the basics of Docker and how it can be used to customize the cluster. And finally, you will see how to create your own runtime using Docker, push it to a container registry, and deploy it to a Databricks cluster. So let's get started.

Working with Initialization Scripts

First, let's use initialization scripts to customize the libraries deployed on the cluster. An initialization script is a shell script that runs during the start-up process on each node in the cluster, before the driver or worker JVM starts. This means your script will execute before your cluster is ready. But the question is, why would you do that? You already know how to deploy a library on the cluster, but in case you have a lot of libraries and packages to deploy, you can use an init script. Now, the Databricks runtime comes with preinstalled libraries. If you want to upgrade the version of any specific library, you can do it here. You can also change Spark configuration properties, and you can even modify system properties and environment variables on the JVM. Sounds good? Now let's see the steps in order to use initialization scripts. First, of course, is the creation of the init script. Then you have to upload the script to DBFS. You can do it by running a command on a cluster, using the REST APIs, or using the Databricks CLI. Next, you must attach the script to the cluster. This runs for a specific cluster. You can also use global scripts that run on every cluster, but be very careful if you do that. And finally, start the cluster, which will execute the script before it's ready. Easy, right? To see init scripts in action, we are going to use an instance of Azure Application Insights. App Insights allows you to store the logged data. The instructions to create it are available in the setup document. In this demo, we'll first create a script that will add the Azure Application Insights library. Since scikit-learn is already available in the Databricks runtime, we'll upgrade it to a newer version, and then we'll add one more library with a specific version. Once the script is executed on the cluster, you will see how to use App Insights from Databricks. So let's see it in action. Back in the Databricks workspace, I have created a new notebook, InitScripts, to work with scripts. Let's first check the version of the scikit-learn library. You can see it is 0.20.3. Now, since we need to put the script in DBFS, let's first use the command dbutils.fs.put and provide the file path in DBFS where the script will be stored. We start the script by indicating that it's going to be a bash script. Next, add some log output using echo, and then use pip, which is a utility to work with Python packages. Here, let's use pip install for applicationinsights. This is a library present in the PyPI repository. To add a library with a specific version, provide the version after the library name. And finally, use pip install with the upgrade option and provide the version number to upgrade scikit-learn to a newer version. Set the overwrite parameter of fs.put to True to overwrite any existing file. Let's execute this, and it has saved the script in DBFS. Alright, now let's attach the script to the cluster. Go to the Clusters tab and select the cluster where you want to run the script. Let me select DemoPoolcluster, click on Edit, and at the end, go to the Advanced Options and select the Init Scripts tab. Here, provide the path of the init script. Make sure that the path starts with dbfs. Click on Add. Once you confirm it, the cluster will restart. Let's go to the Event Log tab, and you can notice the events for the start and finish of the init scripts. If there are any problems with the script, the cluster will fail to start. Now that the cluster is ready, let's go back to the notebook. First of all, let's check the library version of scikit-learn again.
Once you execute this, you can see that the version has now been upgraded to 0.22.2. This is what we specified in the script. Great. Let's now use the App Insights library, but before that, head over to the instance of App Insights in the Azure portal and copy the Instrumentation Key shown here. Alright, back in the workspace. Let's first import TelemetryClient from the App Insights library. Add the instrumentation key, and send the events. That's it. Switch back to App Insights in the Azure portal to see if this worked or not. It may take some time for the logs to show up. Click on Logs, type the query as search * to see all the logged events, and you can notice the logs are showing up. Awesome, right? So using initialization scripts can help you install multiple packages, upgrade versions, and change environment variables.
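A minimal sketch of the init script creation and the App Insights usage is shown below; the DBFS path, the pip path, the library versions, and the instrumentation key placeholder are assumptions based on the demo.

```python
# Sketch of creating the init script in DBFS and then using App Insights.
# The DBFS path, pip path, library versions, and instrumentation key are
# assumptions from the demo; substitute your own values.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-libraries.sh",
    """#!/bin/bash
echo "Installing extra Python libraries"
/databricks/python/bin/pip install applicationinsights
/databricks/python/bin/pip install scikit-learn==0.22.2 --upgrade
""",
    True  # overwrite the file if it already exists
)

# After the cluster restarts with the init script attached, the library is available.
from applicationinsights import TelemetryClient

tc = TelemetryClient("<instrumentation-key>")  # copied from the App Insights instance
tc.track_event("TaxiPipelineStarted")          # send a custom event
tc.flush()                                     # push the telemetry to App Insights
```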

Understand Databricks Container Services

The second way of customizing the cluster is by using Databricks Container Services. Databricks Container Services allows you to customize the Databricks Runtime environment by using a Docker container, and this means you can create a fully customized Databricks Runtime. This goes much deeper than initialization scripts. But the question is, why would you do that? You could use it in case you want to fully control all the system libraries, as well as the user libraries, that are installed on the cluster. Just like initialization scripts, you can deploy multiple packages and libraries. And, as you know, the Databricks Runtime comes with preinstalled libraries, but here you can start from scratch and install only the specific versions of libraries that you want. And the great part is, it's a fully locked-down environment that will never change. Once you create it, you can always work with it, irrespective of the newer versions that are coming out. Makes sense? Now, there are many things that you need to know before you start creating a custom runtime. Let's do a quick walkthrough of them. First, there is the container. A container is a unit of software that packages up code and all its dependencies. Let's assume you have an application. It has certain dependent libraries, like log4j, and it runs on the .NET Core framework. Then, your application container would include your application code, dependent libraries like log4j, the .NET Core framework, as well as system tools, libraries, and settings. So it includes everything required to run your application. This means once created, it can be run reliably in any environment. And Docker is a widely used container platform. That's why you hear the term Docker containers. Let's try to understand this by comparing the virtual machine and container models. In the case of the VM model, there is physical infrastructure; then there is the hypervisor, which virtualizes physical resources and allows you to create VMs on top of it. Each VM has its own operating system. Then you deploy the application runtime, and the application runs on top of it. And you can create multiple virtual machines on the same physical infrastructure. On the other hand, in the container model, there is again physical infrastructure. On top of that, you deploy an operating system. Then, you deploy the Docker runtime, which can help you run Docker containers. And finally, you can deploy the application containers that include all the dependencies. Sounds good? Now, there are a few more things that you must know. First is the Docker container image. When you create the isolated package with all the dependencies, that's called a Docker container image. As you saw previously, it includes your application code, runtime, system tools, system libraries, and settings. The next one is the container registry. It's a place where you store and manage your Docker container images, and from here they can be shared with others. Think of this like a storage account. There are many registries available, like Docker Hub and Azure Container Registry. Now, let's see how we can deploy a custom Docker image on a Databricks cluster. First, there is the Docker Hub registry, where many pre-built Docker images are available. When you are doing development on your local machine, you can download any public images, for example, a pre-built Databricks runtime image. To get an image from a registry, you need to use the docker pull command. Next, you're going to write your code, or a set of instructions you want to perform.
For example, what libraries you want to install in your Databricks cluster. Once ready, you can use the docker build command to create your own custom container image. This is the Databricks runtime that will be deployed on the cluster. Makes sense? Now, how can we deploy it? First, let's store this container image in a registry. Let's create an instance of Azure Container Registry. To store your image there, you need to use the docker push command. Now, while creating the Databricks cluster, you can provide details of the image, like the location of the image and the credentials to access it. When you start the cluster, Databricks will use the docker pull command to download the image and install it on the cluster. Amazing, right? Now, there are a few Databricks Runtime images available on Docker Hub. The first one is databricksruntime/minimal. This includes all the mandatory components required by Databricks, like the JDK, bash, iproute2, coreutils, procps, sudo, and Ubuntu or Alpine Linux. So, you can either use this or create a fresh one by adding all these components. Using this image will only allow you to work with Scala notebooks and JAR jobs, but nothing else. So you can't even run Python notebooks here. Then there is one more image, databricksruntime/standard. This is built on top of databricksruntime/minimal. Along with Scala notebooks and JAR jobs, it supports Python notebooks and jobs, spark-submit jobs, shell scripts, DBFS, and SSH. So you can see it includes a lot more components, but it does not include all the other packages, ML libraries, Delta Lake, et cetera, that come with the Databricks Runtime. So now you can build your own custom runtime by using either the minimal or the standard image, and then you can install your own libraries, Azure connectors, specific ML libraries, et cetera. Interesting, right? Now, some sample Docker files, which include the minimal and standard files as well, are available on GitHub, and you can use this link to study them.

Build and Deploy Custom Docker Image on Cluster

Now it's time to build and deploy a custom Docker image on the cluster. To do that, you must first have Docker Desktop installed on your machine. Since I'm using a Windows machine, I have installed Docker Desktop for Windows. This provides the Docker tooling for your machine, and you must be running Linux containers. Now, in this demo, we are going to create an Azure Container Registry to store our custom Docker image, and copy the details like the login server, username, and password. Let's see what steps we'll perform. We'll start by creating a Docker file using the Databricks runtime standard image. Then we'll add a few libraries and build the image. This will be the custom Databricks runtime. We'll then upload this Docker image to Azure Container Registry. Next, you will see how to enable Container Services in Databricks, and finally we'll use our custom Databricks runtime to create a new cluster. So let's see it in action. Let's start by creating the Azure Container Registry instance. In the Azure portal, search for Container registry, and add a new one. Fill in the properties: the resource group as PluralsightDemoRG, and a unique name for the registry; I'm keeping it as Pluralsight Container Registry. Specify the location, select the SKU as Basic, and then create the registry. Once it has deployed, open the container registry and navigate to Access keys. Here, make sure Admin user is set to Enabled, and then copy and store the values: Login server, Username, and Password. And that's it. Before you create the Docker file, ensure that Docker Desktop is running. It should show an icon like this in the system tray. If we right-click on the icon, it should show Switch to Windows containers. This means you're currently running Linux containers. Great! Now open the command prompt. Because we are going to use the Databricks runtime standard image, let's run a command to download it from Docker Hub, docker pull databricksruntime/standard, and you can notice it is downloading a big file onto the machine. Once downloaded, you can verify that by using the docker images command, and here you can see the standard image. Alright, now let's create the Docker file. Let's create a folder, CustomRuntime, and navigate to it. Let's create a new file, DockerFile, and open it in VS Code. Of course, you can use any IDE to create this file. You can use the command code .\DockerFile. This opens up VS Code. Now, in the Docker file, start by using the Databricks runtime standard image. Next, add the libraries that you want to install on the cluster. Let's take a pause here and see what we are doing. Databricks already has a Conda setup. Conda is an open source package management system that can help you install, run, and update packages, as well as their dependencies, for any language. There is already an environment, dcs-minimal, that you can use. Here, let's use pip install commands to install certain libraries. I'm going to install numpy, urllib, and the Azure applicationinsights library. Finally, let's clean up the environment. Remember, the standard image only has limited libraries, and here we are adding only a few more. Of course, you can do a lot more, but I believe you got the idea of how you can add your own custom libraries. Once done, save the file and switch back to the command prompt. To build the Docker image, use the command docker build, and let's name our image mydatabricksruntime:v1. Once you hit Enter, you can see that it is running the same instructions that we specified in the Docker file. Sounds good?
Let's see if the image has been created successfully. Let's use the same command, docker images, and you can see our image is ready. Awesome! Let me clean up the prompt. Now, to store the image, let's first log in to Azure Container Registry. Use the command docker login, provide the username and the login server name that you copied earlier, and then add the password. Next, there are two commands we are going to use. First, use docker tag with the local image name, followed by the remote image name. Notice the latter carefully; this is your full Docker image URL. And finally, use the docker push command and provide the remote image name, and you can see it has started to upload our image to Azure Container Registry. Simple, right? Now, back in the Databricks workspace, let's make sure Databricks Container Services are enabled. Go to the workspace Admin console, navigate to the Advanced tab, and verify that Container Services are set to Enabled. And finally, let's create a new cluster. From the Clusters tab, create a new cluster. Let's name it MyCustomCluster and fill out the rest of the properties as you have done before, and then select the option Use your own Docker container. Here, add the Docker image URL, select the authentication option as username and password, and provide the Azure Container Registry credentials that you copied earlier. And create the cluster. Wow! You have just created a cluster with your own Databricks runtime image. So now you have a locked-down environment with only the libraries you want. Just like you verified the libraries in the initialization script section, you can do it here in the same way, and by this time, I believe you already know how to do that. Am I right?

Summary

In this module, we saw various ways of customizing the cluster. We started by using initialization scripts. These are shell scripts that can be executed before the cluster starts. You saw how to create a script and deploy it to DBFS, which is required for init scripts. We then attached it to a cluster, where it was executed as part of the cluster start process. And then we verified the changes we deployed via the script, and also saw how to use the Azure App Insights library. Next, we looked into customizing the cluster using Databricks Container Services. First, we did a quick walkthrough of some concepts, like containers, Docker, images, and container registries. Then we talked about Databricks Runtime images and used one of them, databricksruntime/standard, to create our own custom runtime. We then pushed the image to Azure Container Registry and used it in Databricks to create a cluster. This brings us to the end of this course, where we started by learning about the basics of Spark Structured Streaming and its processing model. Then we talked about Azure Databricks, its background, features, components, and architecture. We then saw how to set up the environment, and then we worked on the extract, transform, and load steps of our streaming pipeline. We then made it production ready and saw how to schedule it using jobs. You also saw how pricing works in Azure, then compared Spark Structured Streaming with other streaming services, and saw some best practices. And finally, you learned how to customize your cluster to suit your own requirements. I hope you had a good learning experience. Thanks for watching, and continue learning!
