Create an Azure Synapse Analytics Apache Spark Pool

by Erwin | Jun 17, 2020 | Azure, Azure Synapse Analytics

Adding a new Apache Spark Pool

There are two ways to create an Apache Spark Pool.
Go to your Azure Synapse Analytics Workspace in the Azure Portal and add a new Apache Spark Pool.

Or go to the Management Tab in your Azure Synapse Analytics Workspace and add a new Apache Spark Pool.

Create an Apache Spark Pool

Apache Spark pool name

Note that Apache Spark Pool names are restricted. A name must contain only letters or numbers, be 15 characters or fewer, start with a letter, not contain reserved words, and be unique in the workspace.
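
The naming rules above can be checked before you open the portal. The snippet below is only a sketch: the function name `validate_pool_name` is made up for this example, the list of reserved words is not given in this article (so the caller has to supply it), and treating uniqueness as case-insensitive is an assumption.

```python
import re

# Starts with a letter, then only letters or digits, 15 characters at most.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,14}")


def validate_pool_name(name, existing_names=(), reserved_words=()):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not _NAME_PATTERN.fullmatch(name):
        problems.append(
            "must start with a letter, use only letters or numbers, "
            "and be 15 characters or fewer"
        )
    lowered = name.lower()
    # Assumption: reserved words are supplied by the caller.
    if any(word and word.lower() in lowered for word in reserved_words):
        problems.append("contains a reserved word")
    # Assumption: uniqueness is compared case-insensitively.
    if lowered in {existing.lower() for existing in existing_names}:
        problems.append("already used in this workspace")
    return problems


print(validate_pool_name("sparkpool01"))      # passes all checks
print(validate_pool_name("my-spark-pool-1"))  # hyphens are not allowed
```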

Node size

Small(4vCPU)

Medium(8vCPU)

Large(16vCPU)

Autoscale

Enabled: Based on your workloads the Spark Pool will scale up or down.

Disabled: You have to define a fixed number of nodes.

Number of nodes

You can select from 3 up to 200 nodes.
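
To get a feeling for what a sizing choice means, the options above can be written down as a small data structure. This is a rough sketch using only the numbers mentioned in this article (4, 8 or 16 vCPUs per node, 3 to 200 nodes); the class `PoolSizing` and its field names are invented for the example and are not part of any Azure SDK.

```python
from dataclasses import dataclass

# vCPUs per node size, as listed above.
NODE_VCPUS = {"Small": 4, "Medium": 8, "Large": 16}
MIN_NODES, MAX_NODES = 3, 200


@dataclass(frozen=True)
class PoolSizing:
    node_size: str
    min_nodes: int
    max_nodes: int  # set equal to min_nodes when Autoscale is disabled

    def __post_init__(self):
        if self.node_size not in NODE_VCPUS:
            raise ValueError(f"unknown node size: {self.node_size}")
        if not (MIN_NODES <= self.min_nodes <= self.max_nodes <= MAX_NODES):
            raise ValueError("node counts must satisfy 3 <= min <= max <= 200")

    def vcpu_range(self):
        """Return (lowest, highest) total vCPUs the pool can run with."""
        per_node = NODE_VCPUS[self.node_size]
        return self.min_nodes * per_node, self.max_nodes * per_node


# Autoscale enabled between 3 and 10 Medium nodes.
print(PoolSizing("Medium", 3, 10).vcpu_range())
```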

Make sure that

Contact an Owner of the storage account, and verify that the following role assignments have been made:

Assign the workspace MSI to the Storage Blob Data Contributor role on the storage account
Assign yourself and other users to the Storage Blob Data Contributor role on the storage account

Once those assignments are made, the following Spark features can be used: (1) Spark Library Management, (2) Read and Write data to SQL pool databases via the Spark SQL connector, and (3) Create Spark databases and tables.
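
In this article the role assignments are made through the portal. If the storage account Owner prefers to script them, one possible route is the Azure CLI command `az role assignment create`. The sketch below only builds and prints the command, it does not run it; the subscription ID, resource group, storage account name and object ID are placeholders, and the helper function names are made up for this example.

```python
import shlex

ROLE = "Storage Blob Data Contributor"


def storage_scope(subscription_id, resource_group, storage_account):
    """Build the resource ID of the storage account, used as the role scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )


def role_assignment_command(assignee, scope):
    """Return the Azure CLI arguments that assign ROLE to `assignee` on `scope`."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", assignee,
        "--role", ROLE,
        "--scope", scope,
    ]


# Placeholder values, replace them with your own.
scope = storage_scope("00000000-0000-0000-0000-000000000000", "rg-demo", "demostorage")
cmd = role_assignment_command("11111111-1111-1111-1111-111111111111", scope)
print(shlex.join(cmd))
```

Run the printed command once for the workspace managed identity and once for each user who needs access.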

If you haven’t assigned the Storage Blob Data Contributor role to your user, you will get the following error when you want to browse the data in your Linked Workspace.

Currently you can only select Apache Spark version 2.4.

Make sure you enable the Auto Pause setting. It will save you a lot of money: your cluster will turn off after the configured number of idle minutes.

Python packages can be added at the Spark pool level and .jar based packages can be added at the Spark job definition level.

If the package you are installing is large or takes a long time to install, this affects the Spark instance start up time.
Packages which require compiler support at install time, such as GCC, are not supported.
Packages cannot be downgraded, only added or upgraded.

How to install these packages can be found here.

Review your settings and your Apache Spark Pool will be created.

You have now created your Apache Spark Pool.

Thanks for reading. In my next article I will explain how to create a SQL Pool, formerly the Azure SQL DW instance.

Exploring Azure Synapse Analytics Studio

by Erwin | Jun 16, 2020 | Azure Synapse Analytics

Azure Synapse Workspace Settings

In my previous article, I walked you through “how to create your Azure Synapse Analytics Workspace”. It’s now time to explore the brand new Synapse Studio.

Most configuration and settings can be done through Synapse Studio. In your Workspace you need to set the SQL Active Directory Admin, just as you have to do for a Logical Server.

SQL Active Directory Admin

Firewall

Change the IP address to your own IP address, or to your employer’s if you work from the office. Make sure that the option “Allow Azure services and resources to access this Workspace” is enabled, so that every trusted Azure service or resource can connect to this Workspace. Not all Public Preview Azure services or resources are trusted yet.
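
A firewall rule of this kind boils down to a start and an end IP address. If you are unsure whether your address falls inside a rule, you can check it locally with the standard `ipaddress` module. This is a small sketch that assumes IPv4 addresses only; the function name `ip_allowed` and the example addresses are made up for illustration.

```python
import ipaddress


def ip_allowed(client_ip, rules):
    """Return True if client_ip lies inside any (start, end) range, bounds included."""
    ip = ipaddress.ip_address(client_ip)
    for start, end in rules:
        if ipaddress.ip_address(start) <= ip <= ipaddress.ip_address(end):
            return True
    return False


# Example rules: a single office range versus a rule that allows every address.
office_only = [("203.0.113.10", "203.0.113.20")]
allow_all = [("0.0.0.0", "255.255.255.255")]

print(ip_allowed("203.0.113.15", office_only))
print(ip_allowed("198.51.100.7", office_only))
print(ip_allowed("198.51.100.7", allow_all))
```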

Private Endpoint Connections

Define Private Endpoint Connections so that your services use a private IP address from your Virtual Network. More details on how to set up a Private Endpoint Connection can be found here.

Launch Azure Synapse Analytics Studio

After you have opened the portal, the following screen will appear. Personally, I am very charmed by this brand new portal: you now get one place where you can access all your data, along with an integration with your Power BI reporting. More about that later. Let’s walk through each tab below.

Data

The Data tab is divided into two parts: Linked and Workspace.

Linked

All Datasets, like the ones you are used to creating in Azure Data Factory, are stored here.

You can now directly browse files within your Azure Data Lake Storage.

But you can also connect to External Data.

Workspace

Here you will find all the databases you have created with Spark, SQL on Demand or SQL Pools. I will explain how to create these databases in another article.

Develop

The Develop tab is where your SQL Scripts, Notebooks, Dataflows, Spark Job Definitions and Power BI Reports are stored. At a later stage you can commit your work to Azure DevOps or GitHub.

Orchestrate

What do we see here? Nothing more than you were used to seeing in Azure Data Factory, except for the addition of the Synapse Activities.

Monitor

In the Monitor tab you will find much the same things as in ADF, with the addition of the SQL requests and the Apache Spark applications.

Orchestration:	Pipeline runs	Overview of all Executed Pipelines
	Trigger Runs	Overview of all Executed Trigger Runs
	Integration runtimes	Overview of all created Integration Runtimes
Activities:	Apache Spark applications	Monitor your Apache Spark Executions
	SQL requests	Monitor your SQL on Demand or SQL Pool queries

SQL requests

All currently running SQL requests for your SQL on Demand and your SQL Pools.

Apache Spark Applications

All Apache Spark requests.

A detailed explanation of the Apache Spark Application monitor can be found here.

Management HUB

The Synapse Analytics Management HUB offers the following options:

Analytic Pools:	SQL pools	Here you can manage (scale up or down) your previously created SQL pools or create new ones.
	Apache Spark pools	Create multiple instances of Spark pools depending on the workload requirements. Once you have created your instance, you can also change Auto Scaling, Node Size and Number of Nodes from here.
External Connections:	Linked Services	Create and manage connections to different services, same as in Azure Data Factory
Orchestration:	Triggers	Create and manage Triggers for your Pipelines, same as in Azure Data Factory
	Integration runtimes	Create and manage your different types of Integration Runtime: Azure: execute workloads between Azure services or Azure Data Factory Mapping Data Flows. Self-Hosted: execute workloads between on-premises environments and Azure. Azure-SSIS: execute SQL Server Integration Services packages in Azure Data Factory.
Security:	Access control	Azure Synapse Analytics comes with role-based access control. The available roles: Workspace admin, Apache Spark admin, SQL admin.
	Managed Private Endpoint	Private link enables you to access Azure services from your Azure VNet securely. More details on Managed Private Endpoints can be found here.

At the beginning of this article, I indicated that I was very charmed by this new portal. Microsoft has ensured that we can now approach almost all our services seamlessly from one portal, and there will be much more to come in the future. As you can see, you don’t have to use the SQL Pools, which are a reasonably expensive solution for customers who use a lot of data, so you can use Azure Synapse Analytics for almost every customer. If you haven’t started trying Azure Synapse Analytics yet, I would say start today and see how you can help your customers with it. If you have any questions, I would like to hear them.

Thank you for reading.

How to create an Azure Synapse Analytics Workspace

by Erwin | Jun 15, 2020 | Azure, Azure Synapse Analytics

Creating your Azure Synapse Analytics Workspace

In the article below I would like to take you through how to configure an Azure Synapse Workspace, as opposed to the already existing Azure Synapse Analytics SQL Pool (formerly Azure SQL DW):

In the Azure Portal, search for Azure Synapse Analytics. Make sure you select Workspaces Preview.

Click on Create to start the configuration of your Workspace.

Make sure you select the correct:

Subscription

Resource Group

The second part of this configuration is to set up your Workspace:

Workspace Name	I’m using <customername><environment><wsas><department>, where wsas = WorkSpace Azure Synapse, for example prevwsdvlmwsasoxgn01
Region	The desired Region, for example West Europe
Data Lake Storage Gen 2	Select an existing Data Lake Storage Gen 2
Data Lake Storage Gen 2 File System	I’m creating a new container here named temp. This directory is used to store temporary files and workspace settings, and I don’t want to mix this temporary data into one of my other containers.

The above option will assign the managed identity of the workspace the Storage Blob Data Contributor role, with full access on the selected Data Lake Storage Gen2 file system. Leave the option on unless you want to grant access manually.

You can set the sqladmin user now, but it can also be set at a later stage.

Leave this option empty; I will explain how it works in a later article.

Normally you do not want to allow all IP addresses, but for our initial setup we leave it as is. We can adjust these settings at a later stage.

Review all your settings and click on Create, and your deployment is underway.

After the deployment is finalized, the workspace will be available in your selected Resource Group.

You have now created your Azure Synapse Analytics Workspace and you can start using the new functionalities, which are currently in Public Preview.

Public Preview Features

Azure Synapse studio
Unified Security Model
Private endpoints
Power BI integration
Azure Machine Learning integration
Data lake exploration
Apache Spark integration
Data Movement
Pipeline Orchestration
On-demand query
Notebooks
SQL Script editor

In my next article I will walk you through the new Azure Synapse Studio. Stay tuned!

Azure Synapse Analytics

by Erwin | Jun 15, 2020 | Azure, Azure Synapse Analytics

Azure Synapse Analytics

Insights for all

Azure Synapse provides a breathtaking view of your data across data warehouses and big data analytics systems. Bringing these two worlds together into a single service is challenging as it requires unifying similar concepts that operate differently in each world such as security, privacy, and performance. With Azure Synapse, this seamless unification of data warehousing and big data not only simplifies a business’s analytics platform, but also breaks down silos that exist today because of teams, data, and skills. (source Azure blog)

Azure Synapse Analytics Workspace

During Ignite 2019 we already saw the first announcement about Azure Synapse Analytics. The first Public Preview was announced during Build 2020.

Immediately after Build 2020, I started playing with and exploring the Azure Synapse Analytics Workspace.
Fortunately, I was off for a few days and was able to use this free time to dive a little bit into Azure Synapse.

A few days later during the Analytics in a Day workshops that I gave for my employer InSpark in collaboration with Microsoft, I immediately took the time to give a Live demo. I found the inspiration for this Live demo during a YouTube session presented by Simon Whiteley.

For many participants, walking through the product live is more engaging than being told a story via PowerPoint slides.

Upcoming Articles

In the coming days I will try to write a number of articles so that you become more familiar with the various possibilities of Azure Synapse Analytics.

I have the following articles in mind:

✅ Creating your Azure Synapse Analytics Workspace

✅ Exploring the new Azure Synapse Analytics Studio

✅ Creating an Apache Spark Pool

✅ Creating a SQL Pool

✅ Integration with Power BI

And if you have more subjects that need to be explained, feel free to leave them in the comments.

Happy reading!

Azure Data Factory: New functionalities and features

by Erwin | May 22, 2020 | Azure Data Factory

New functionalities and features

Last week, a number of great new functionalities and features were added to Azure Data Factory. I would like to walk you through them in some detail in the blog below:

Customer key

With this new functionality you can add extra security to your Azure Data Factory environment. Where the data was previously encrypted with a randomly generated key from Microsoft, you can now use the customer-managed key feature, also known as Bring Your Own Key (BYOK). If you use the customer-managed key functionality, the data will be encrypted with your key in combination with the ADF system key. You can create your own key or have it generated by the Azure Key Vault API.

You can read more in this Article which I wrote.

Pipeline Consumption Report

Last week, Azure Data Factory added the Pipeline Consumption Report.

The report can be used for your Triggered runs: just go to your Triggered runs and click on the new icon.

The consumption of the selected Pipeline will be displayed. The data shown is only from this Pipeline and not from other Pipelines fired by this Pipeline. It would be a nice addition if the report showed the aggregation of the complete Triggered Run.
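
Until such an aggregation exists, you can add up the numbers yourself once you have noted the consumption per Pipeline. The sketch below assumes you have collected those numbers by hand into plain dictionaries; the meter names (`activity_runs`, `diu_hours`) and pipeline names are made up for this example and are not the labels ADF uses, and the function assumes Pipelines do not call each other in a loop.

```python
def total_consumption(pipeline, consumption, children):
    """Sum the units of `pipeline` and of every Pipeline it fires, directly or indirectly."""
    total = dict(consumption.get(pipeline, {}))
    for child in children.get(pipeline, ()):
        child_total = total_consumption(child, consumption, children)
        for meter, units in child_total.items():
            total[meter] = total.get(meter, 0) + units
    return total


# Hypothetical numbers, copied by hand from the report of each Pipeline.
consumption = {
    "pl_master": {"activity_runs": 2},
    "pl_load_sales": {"activity_runs": 3, "diu_hours": 1},
    "pl_load_stock": {"activity_runs": 1, "diu_hours": 2},
}
# Which Pipelines are fired by which Pipeline.
children = {"pl_master": ["pl_load_sales", "pl_load_stock"]}

print(total_consumption("pl_master", consumption, children))
```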

For your debug run, click on the right side of your Output pane:

The ADF consumption report is only surfacing Azure Data Factory related units. There may be additional units billed from other services that you are using and accessing which are not accounted for here, including Azure SQL Database, Synapse Analytics, CosmosDB, ADLS, etc. More details can be found here.

Parameters from Execute Pipeline Activity

When calling a Pipeline you previously had to add the parameters yourself; now they are automatically taken over from the Pipeline you select. This is very handy and saves time if you use a lot of parameters.

Define a Parameter in one of your Pipelines:

Create another Pipeline and add the Execute Pipeline activity. On the Settings tab, where you select the Pipeline you want to execute, you will discover that the option to add the parameters manually is no longer there. Instead, all the Parameters you defined in your Pipeline are shown directly. Very handy, and it reduces errors.
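
Behind the canvas, the Execute Pipeline activity is stored as JSON in which the parameters are passed by name. The sketch below builds such a definition in Python to show roughly what that looks like; the pipeline name `pl_load_sales` and the parameter `LoadDate` are hypothetical, and the helper function is written just for this example.

```python
import json


def execute_pipeline_activity(name, pipeline_name, parameters):
    """Build the JSON structure of an Execute Pipeline activity."""
    return {
        "name": name,
        "type": "ExecutePipeline",
        "typeProperties": {
            "pipeline": {
                "referenceName": pipeline_name,
                "type": "PipelineReference",
            },
            # Keys must match the parameter names defined on the called Pipeline.
            "parameters": dict(parameters),
            "waitOnCompletion": True,
        },
    }


activity = execute_pipeline_activity(
    "Execute Load Sales",
    "pl_load_sales",               # hypothetical Pipeline name
    {"LoadDate": "2020-05-22"},    # hypothetical parameter
)
print(json.dumps(activity, indent=2))
```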

Old Situation:

New Situation:

General Tab moved to new Properties Pane

Your General tab has now moved to the right side of the canvas.

To edit your properties, click on the pane icon located in the top-right corner of the canvas.

So these were some nice and useful additions to Azure Data Factory. I am very happy with them. What do you think?
