Connect Azure Databricks to Microsoft Purview

Jan 16, 2023

Connect and Manage Azure Databricks in Microsoft Purview

This week the Purview team released a new feature: you are now able to connect and manage Azure Databricks in Microsoft Purview.

This new functionality is almost the same as the Hive Metastore connector, which you could already use to scan an Azure Databricks workspace. The new connector is an easier way to set up scanning for your Azure Databricks workspace.

Note that this feature is currently in Public Preview.

The connector supports, or will support, the following:

  • Extracting technical metadata including:
    • Azure Databricks workspace.
    • Hive server.
    • Databases.
    • Tables including the columns, foreign keys, unique constraints, and storage description.
    • Views including the columns and storage description.
  • Fetching the relationships between external tables and Azure Data Lake Storage Gen2/Azure Blob assets.
  • Fetching static lineage on asset relationships among tables and views.

Let’s have a look at how to set up this connector. Before you start, make sure you have the following prerequisites in place:

  • A Microsoft Purview account with Data Source Administrator and Data Reader permissions.
  • A self-hosted integration runtime.
  • A personal access token in Azure Databricks.
  • A cluster in Azure Databricks.
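
If you prefer to create the personal access token programmatically instead of through the Databricks UI, a minimal sketch could use the Databricks Token API and store the result in Azure Key Vault, so the Purview scan credential can reference it later. The workspace URL, existing token, Key Vault URL, and secret name below are placeholders.

    # Sketch: create a Databricks personal access token (PAT) and store it in
    # Azure Key Vault. Requires: requests, azure-identity, azure-keyvault-secrets.
    # All URLs, tokens, and names below are placeholders.
    import requests
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
    EXISTING_TOKEN = "<existing-pat>"  # an existing token to call the Token API with

    # Create a new PAT for Purview to use when scanning (90-day lifetime here).
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
        json={"comment": "purview-scan", "lifetime_seconds": 90 * 24 * 3600},
    )
    resp.raise_for_status()
    pat = resp.json()["token_value"]

    # Store the token in Key Vault; the scan credential will point at this secret.
    kv = SecretClient(
        vault_url="https://<your-keyvault>.vault.azure.net",
        credential=DefaultAzureCredential(),
    )
    kv.set_secret("databricks-pat", pat)  # secret name is a placeholder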

Register the Azure Databricks Workspace

  • Select Data Map on the left pane and select Sources.
  • Select Register.
  • In Register sources, select Azure Databricks and click on Continue.
  • On the Register sources (Azure Databricks) screen, do the following:
    • Enter a name that Microsoft Purview will list as the data source.
    • Select the subscription and workspace that you want to scan from the dropdown list.
  • Select a collection.

Azure Databricks setup in Microsoft Purview

Set up the Integration Runtime

  • Select Data Map on the left pane and select Integration Runtime.
  • Click on New.
  • Select Self-Hosted.

Self-Hosted IR Setup in Microsoft Purview

  • Enter a name and a description, and click on Create.

SHIR configuration in Microsoft Purview

  • Copy the authentication key.

SHIR Authentication Key

Configure the Self-Hosted Integration Runtime

On a Virtual Machine in Azure, download and install the self-hosted integration runtime, and register it with the authentication key you copied in the previous step. You can register the node through the Microsoft Integration Runtime Configuration Manager that opens after installation, or from the command line, as in the sketch below.
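
A minimal sketch of the command-line route, assuming the integration runtime installs its dmgcmd.exe utility under the default path shown (the version folder, here 5.0, may differ on your VM); the authentication key is a placeholder:

    # Sketch: register the freshly installed integration runtime node with the
    # authentication key copied from the Purview portal.
    # Assumption: dmgcmd.exe lives under the default install path; the version
    # folder (here 5.0) may differ on your VM.
    import subprocess

    AUTH_KEY = "<authentication-key-from-the-portal>"  # placeholder
    DMGCMD = r"C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe"

    subprocess.run([DMGCMD, "-RegisterNewNode", AUTH_KEY], check=True)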

After rebooting, select Data Map on the left pane, select Integration Runtime, and check whether the SHIR is running.

Databricks SHIR running
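
You can also check locally on the VM that the integration runtime Windows service is up. A small sketch, assuming the service name DIAHostService (consistent with the cache path mentioned further below):

    # Sketch: check whether the self-hosted integration runtime service is running.
    # Assumption: the Windows service is named DIAHostService.
    import subprocess

    result = subprocess.run(
        ["sc", "query", "DIAHostService"],
        capture_output=True, text=True, check=False,
    )
    print(result.stdout)
    if "RUNNING" in result.stdout:
        print("Integration runtime service is running.")
    else:
        print("Integration runtime service is NOT running - check the installation.")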

Set up the Scan

The last step is to configure the scan.

  • Select Data Map on the left pane, select Sources, and select the Azure Databricks source you just registered.
  • Select New Scan.
    • Name: create a logical name for your scan, for example Weekly, Monthly or Once. Tip: add your cluster name or ID to the scan name. You need to create a scan for every cluster in an Azure Databricks workspace, and this way you can tell the clusters apart.
    • Connect via integration runtime: select the SHIR you just created.
    • Credential: select the personal access token, which is stored in the Azure Key Vault.
    • Cluster ID: specify the ID of the cluster that Microsoft Purview needs to connect to in order to perform the scan.
    • Mount points: if you have external storage manually mounted to Databricks, provide the locations here, using the format /mnt/<path>=abfss://<container>@<adls_gen2_storage_account>.dfs.core.windows.net/ (see the sketch after this list to look up your current mounts).
    • Maximum memory available: specify the maximum memory (in GB) available to the scanning processes. If the field is left blank, 1 GB is used as the default.
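
To find the values for the mount points field, you can list the mounts that currently exist in your workspace. A minimal sketch to run in a Databricks notebook (where dbutils is available implicitly), printing each external mount in the expected format:

    # Sketch: run in a Databricks notebook to print existing mounts in the
    # /mnt/<path>=abfss://... format that the scan setup expects.
    for m in dbutils.fs.mounts():
        # Only external ADLS Gen2 (abfss) mounts are relevant here.
        if m.source.startswith("abfss://"):
            print(f"{m.mountPoint}={m.source}")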

Setup Databricks scan

The default location of the cache on your VM is C:\Windows\ServiceProfiles\DIAHostService\AppData\Local\Microsoft\AzureDataCatalog\Cache. Clear the checkbox if you want the cache to be stored in a different location.

Click on Continue.

Select the trigger you want, then click on Save and run.

Check whether the scan starts. Be aware that the scan will trigger your Azure Databricks cluster to start.
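
If you want to confirm that the scan has indeed started your cluster, a hedged sketch using the Databricks Clusters API; the workspace URL, token, and cluster ID below are placeholders:

    # Sketch: check the state of the Databricks cluster used by the scan.
    # All values below are placeholders.
    import requests

    WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
    TOKEN = "<personal-access-token>"
    CLUSTER_ID = "<cluster-id>"  # the same ID entered in the scan settings

    resp = requests.get(
        f"{WORKSPACE_URL}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()
    print(resp.json()["state"])  # e.g. PENDING while starting, RUNNING once up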

Browse and search assets

Once the data is scanned, you can browse and search the metadata.

  • Select Data Catalog on the left pane and select Browse Assets.

Data Catalog with Databricks overview
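
Besides browsing in the portal, you can query the catalog programmatically. A minimal sketch using the Purview search REST API with azure-identity; the account name, keywords, and API version below are assumptions on my side:

    # Sketch: search the Purview data catalog for scanned assets via REST.
    # Assumptions: the search endpoint and API version shown below; the
    # account name and keywords are placeholders.
    import requests
    from azure.identity import DefaultAzureCredential

    ACCOUNT = "<your-purview-account>"  # placeholder
    token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

    resp = requests.post(
        f"https://{ACCOUNT}.purview.azure.com/catalog/api/search/query"
        "?api-version=2022-03-01-preview",
        headers={"Authorization": f"Bearer {token}"},
        json={"keywords": "hive_table", "limit": 10},
    )
    resp.raise_for_status()
    for item in resp.json().get("value", []):
        print(item.get("name"), "-", item.get("qualifiedName"))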

From the Databricks workspace asset, you can find the associated Hive Metastore.

Select the Azure Databricks workspace and click on Edit details on the right side.

Databricks details

Click on Hive Metastore; on the Related tab you can see the Hive DB and its assets. Click on one of the assets to see the lineage, where applicable.

Databricks lineage

Conclusion

The first steps towards a native integration of Azure Databricks are now available in Microsoft Purview, but we’re not there yet.
If you want more extensive lineage and want to read more details from notebook executions, including Delta Lake, I advise you to use the
Azure Databricks to Purview Lineage Connector.

The notes of this Solution Accelerator state: “With native models in Microsoft Purview for Azure Databricks, customers will get enriched experiences in lineage such as detailed transformations.” So hopefully we can expect more in the future.

Be aware that lineage is available at the asset level, not at column level; hopefully that will arrive soon.


As always, in case you have questions, do not hesitate to contact me.

More details on the above topic can be found here:

Connect to and manage Azure Databricks

Microsoft Purview Data Map supported data sources and file types

Microsoft Purview data governance documentation

Feel free to leave a comment

2 Comments

  1. Diego Poggioli

    Hi, very useful guide. Do you know if managed tables are supported and visible as scanned assets in Purview?
    Thanks
    Diego

    • Erwin

      Hi Diego,

      Looks like managed tables are currently not supported. The following items are currently supported; possibly more will become available once Unity Catalog is supported:

      Extracting technical metadata including:

      Azure Databricks workspace
      Hive server
      Databases
      Tables including the columns, foreign keys, unique constraints, and storage description
      Views including the columns and storage description
      Fetching relationship between external tables and Azure Data Lake Storage Gen2/Azure Blob assets (external locations).

      Fetching static lineage between tables and views based on the view definition.


