Connect Azure Databricks to Microsoft Purview

by Jan 16, 2023

Connect and Manage Azure Databricks in Microsoft Purview

This week the Purview team released a new feature, you’re now able to Connect and manage Azure Databricks in Microsoft Purview.

This new functionality is almost the same as the Hive Metastore connector which you could use earlier to scan an Azure Databricks Workspace. This new connector is an easier way to setup scanning for your Azure Databricks Workspace.

Note that this feature is currently in Public Preview.

The connector supports or will support:

  • Extracting technical metadata including:
    • Azure Databricks workspace.
    • Hive server.
    • Databases.
    • Tables including the columns, foreign keys, unique constraints, and storage description.
    • Views including the columns and storage description.
  • Fetching relationship between external tables and Azure Data Lake Storage Gen2/Azure Blob assets.
  • Fetching static lineage on assets relationships among tables and views.

Let’s have a look how to setup this connector, before you can start make sure you have the following Prerequisites in place:

  • Microsoft Purview account with Data Source Administrator and Data Reader permissions.
  • Self-Hosted Integration Runtime.
  • Personal access token in Azure Data Bricks.
  • Cluster in Azure Data Bricks.

Register the Azure Databricks Workspace

  • Select Data Map on the left pane and select Sources.
  • Select Register.
  • In Register sources, select Azure Databricks and click on  Continue.
  • On the Register sources (Azure Databricks) screen, do the following:
    • Enter a name that Microsoft Purview will list as the data source.
    • Select the subscription and workspace that you want to scan from the dropdown list.
  • Select a collection. 
    • Azure Databricks setup in Microsoft Purview

 

 Setup the Integration Runtime

  • Select Data Map on the left pane and select Integration Runtime.
  • Click on New.
  • Select the Self-Hosted.

Self-Hosted IR Setup in Microsoft Purview

  • Enter a name and description, click on create.

SHIR configuration in Microsoft Purview

  • Copy the authentication key.

SHIR Authentication Key

Configure the Self-Hosted Integration Runtime

On an Virtual Machine in Azure:

After rebooting, Select Data Map on the left pane and select Integration Runtime and check if the SHIR is running.

Databricks-shir-running

Setup the Scan

The last step to configure is the scan.

  • Select Data Map on the left pane and select Sources and select the Azure Databricks you just created.
  • Select New Scan.
    • Name, create a logical name for your scan. Weekly, Monthly, Once or a different name. TIP, add your clustername or id to the scanname. You need to create a scan for every cluster in an Azure Databricks workspace. This way you can see the difference between the clusters.
    • Connect via IR, select the SHIR you just created.
    • Credential, select the Personal Acces token, which is stored in de Azure KeyVault.
    • Cluster ID, Specify the cluster ID that Microsoft Purview need to connect to, to perform the scan.
    • Mount Point, if you have external storage manually mounted to Databricks, you provide the locations here. Use the following format /mnt/<path>=abfss://<container>@<adls_gen2_storage_account>.dfs.core.windows.net/.
    • Maximum memory available: Specify the maximum memory available in GB to be used by scanning processes. If the field is left blank, 1 GB will be considered as a default value.

Setup Databricks scan

The default location of the cache in your VM is C:\Windows\ServiceProfiles\DIAHostService\AppData\Local\Microsoft\AzureDataCatalog\Cache. Unselect the checkbox if you want cache to be stored in a different location.

Click on continue.

Select the trigger you want. Click on save and run.

Check if the scan starts, be aware that the scan will trigger your Azure Databricks cluster to start.

Browse and search assets

Once the data is scanned you can browse and search the Metadata.

  • Select Data Catalog on the left pane and select Browse Assets.

Data Catalog with Databricks overview

From the Databricks workspace asset, you can find the associated Hive Metastore.

Select the Azure Databricks and click on edit details on the right side.

Databricks details

Click on Hive Metastore, on the Related tab you can see the Hive DB and the assets. Click on one of the assets to see the lineage when applicable.

databricks lineage

Conclusion

The first steps towards a Native integration of Azure Databricks is now available in Microsoft Purview, but we’re not there yet.
If you want to have a more extensive lineage and can read more details from the Notebooks execution including Delta Lake than, I advise you to use the
Azure Databricks to Purview Lineage Connector.

In the notes of this Solution Accelerators, is noted “With native models in Microsoft Purview for Azure Databricks, customers will get enriched experiences in lineage such as detailed transformations.” So hopefully we can expect more in the future.

Be aware that lineage is available at the asset level not at column level, hopefully that will arrive soon.

In the notes of the above Solution Accelerators, is noted “With native models in Microsoft Purview for Azure Databricks, customers will get enriched experiences in lineage such as detailed transformations.” So hopefully we can expect more in the future.

Like always in case you have questions, do not hesitate to contact me.

More details on above topic can be found here:

Connect to and manage Azure Databricks

Microsoft Purview Data Map supported data sources and file types

Microsoft Purview data governance documentation

Feel free to leave a comment

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *

12 − nine =

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Azure Synapse Analytics Power BI Integration

Creating a Linked Service for Power BI Open your Synapse Studio and select the Management Hub. Add a new Linked Service If you haven't connect to Power BI before, you will see the screen above. If you want to add another Power BI Linked Service(Workspace). Search for...

Exploring Azure Synapse Analytics Studio

Azure Synapse Workspace Settings In my previous article, I walked you through "how to create your Azure Synapse Analytics Workspace". It's now time to explore the brand new Synapse Studio. Most configuration and settings can be done through the Synapse Studio. In your...

SSMS 18.xx: Creating your Azure Data Factory SSIS IR directly in SSMS

Creating your Azure Data Factory(ADF) SSIS IR in SSMS Since  version 18.0 we could see our Integration Catalog on Azure Instances directly. Yesterday I wrote an article how to Schedule your SSIS Packages in ADF, during writing that article I found out that you can...

SSMS 18.1: Schedule your SSIS Packages in Azure Data Factory

Schedule your SSIS Packages with SSMS in Azure Data Factory(ADF) This week SQL Server Management Studio version 18.1 was released, which can be downloaded from here. In version 18.1 the Database diagrams are back and from now on we can also schedule SSIS Packages in...

Azure Data Factory Let’s get started

Creating an Azure Data Factory Instance, let's get started Many blogs nowadays are about which functionalities we can use within Azure Data Factory. But how do we create an Azure Data Factory instance in Azure for the first time and what should you take into account? ...

Create Virtual Machines with Azure DevTest Lab

A while ago I had to give a training. Normally I would roll out a number of virtual machines in Azure. Until someone brought my attention to an Azure Service, Azure DevTest Labs. With this Azure service you can easily create a basic image and use this image to roll...

Scale your SQL Pool dynamically in Azure Synapse

Scale your Dedicated SQL Pool in Azure Synapse Analytics In my previous article, I explained how you can Pause and Resume your Dedicated SQL Pool with a Pipeline in Azure Synapse Analytics. In this article I will explain how to scale up and down a SQL Pool via a...

Change your Action Group in Azure Monitoring

Change a Action GroupPrevious Article In my previous artcile I wrote about how to create Service Helath Alerts. In this article you will learn how to change the Action Group to add, change or Remove members(Action Group Type Email/SMS/Push/Voice) Azure Portal In the...

Azure Purview Pricing example

Azure Purview pricing? Note: Billing for Azure Purview will commence November 1, 2021. Updated October 31st, 2021 Pricing for Elastic Data Map and Scanning for Other Sources are changed and updated in the blog below. Since my last post on Azure Purview announcements...

How to setup Code Repository in Azure Data Factory

Why activate a Git Configuration? The main reasons are: Source Control: Ensures that all your changes are saved and traceable, but also that you can easily go back to a previous version in case of a bug. Continuous Integration and Continuous Delivery (CI/CD): Allows...