How to use concurrency in Azure Synapse pipelines?

How to use concurrency in Azure Synapse pipelines?

How to prevent concurrent pipeline execution?

Concurrency

This week I had a discussion with a colleague about how we can now make sure that a Pipeline does not start when it's already started.

He then indicated, have you ever thought of the concurrency option?  I've seen this option before but never paid attention to it.

How does the concurrency work?

If you read the Microsoft documentation it says the following:
The maximum number of concurrent runs the pipeline can have. By default, there is no maximum. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete.

The concurrency option is working in Azure Synapse Analytics and in Azure Data Factory.

I started to test this functionality and there are certainly some nice use cases for that:

  • If the Pipeline was started via a Schedule and someone else triggers this Pipeline Manually, the Pipeline is placed in a queue.
  • Sometimes it happens that there is a delay in the processing of data or that more data is delivered. If you process this data every 30 minutes and the 1st run is not yet ready and the 2nd starts again, this could result in incorrect data. Also in this case the to be executed run is placed in a queue and only starts when the previous one is ready.

It is a fairly simple process but can be quite useful especially in the case of short loading windows.

Azure-Synapse-Concurrency

Please pay attention, running the pipeline in a Debug modus has no effect on this and will run directly.
Check the monitoring regularly to check if this situation is not happening all the time, if so,  you better change the recurrence ​of your Triggered Pipeline. You still have the option to cancelled a queued pipeline.

How to enable concurrency?

 

To enable concurrency in an Azure Synapse pipeline, you can use the Concurrency property in the pipeline settings. The default value is 1, which means that only one copy of the pipeline will run at a time. By default, there is no maximum. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete. Setting the concurrency level to a higher value will cause multiple copies of the pipeline to run concurrently, which can improve performance if the pipeline is CPU-bound or if the data source can handle the increased load. If you leave the property blank the pipeline will not be queued. 

Enable-concurrency-Azure-Synapse

When you have any questions regarding concurrency, please let me know.

Azure Data Factory and Azure Synapse Analytics Naming Conventions

Naming ConventionsAzure Naming Conventions

More and more projects are using Azure Data Factory and Azure Synapse Analytics, the more important it is to apply a correct and standard naming convention. When using standard naming conventions you create recognizable results across different projects, but you also create clarity for your colleagues. In addition to that, it is easier to add these projects to other services such as Managed Services, Azure DevOps,  etc etc,  because standards are used.

To start with these naming conventions, I have made a list of suggestions with most common Linked Services. The list is not exhaustive, but it does provide guidance for new Linked Services.

There are a few standard naming conventions that apply to all elements in Azure Data Factory and in Azure Synapse Analytics.

  • *Names are case insensitive (not case sensitive).  For that reason I’m only using CAPITALS.
  • *Maximum number of characters in a table name: 260.
  • All object names must begin with a letter, number or underscore (_).
  • Following characters are not allowed: “.”, “+”, “?”, “/”, “<”, ”>”,”*”,”%”,”&”,”:”,””

These rules are also defined on the following link

This post has been updated on Feb 2nd, 2023 with the latest connectors.

Azure

Abbreviation Linked Service Dataset
Azure Blob Storage ABLB_ LS_ABLB_ DS_ABLB_
Azure Cosmos DB SQL API ACSA_ LS_ACSA_ DS_ACSA_
Azure Cosmos DB MongDB API ACMA_ LS_ACMA_ DS_ACMA_
Azure Data Explorer ADEX_ LS_ADEX_ DS_ADEX_
Azure Data Lake Storage Gen1 ADLS_ LS_ADLS_ DS_ADLS_
Azure Data Lake Storage Gen2 ADLS_ LS_ADLS_ DS_ADLS_
Azure Database for MariaDB AMDB_ LS_AMDB_ DS_AMDB_
Azure Database for MySQL AMYS_ LS_AMYS_ DS_AMYS_
Azure Database for PostgreSQL APOS_ LS_APOS_ DS_APOS_
Azure File Storage AFIL_ LS_AFIL_ DS_AFIL_
Azure Search ASER_ LS_ASER_ DS_ASER_
Azure SQL Database ASQL_ LS_ASQL_ DS_ASQL_
Azure SQL Database Managed Instance ASQM_ LS_ASQM_ DS_ASQM_
Azure Synapse Analytics (formerly Azure SQL DW) ASDW_ LS_ASDW_ DS_ASDW_
Azure Table Storage ATBL_ LS_ATBL_ DS_ATBL_
Azure DataBricks ADBR_ LS_ADBR_ DS_ADBR_
Azure Cognitive Search ACGS_ LS_ACGS DS_ACGS_
Azure Synapse Analytics  ASA_ LS_ASA DS_ASA
Azure Cognitive Service ACG_ LS_ACG_ N/A

 

Database

     
  Abbreviation Linked Service Dataset
SQL Server  MSQL_ LS_SQL_ DS_SQL_
Oracle ORAC_ LS_ORAC_ DS_ORAC_
Oracle Eloqua ORAE_ LS_ORAE_ DS_ORAE_
Oracle Responsys ORAR_ LS_ORAR_ DS_ORAR_
Oracle Service Cloud ORSC_ LS_ORSC_ DS_ORSC_
MySQL MYSQ_ LS_MYSQ_ DS_MYSQ_
DB2 DB2_ LS_DB2_ DS_DB2_
Teradata  TDAT_ LS_TDAT_ DS_TDAT_
PostgreSQL POST_ LS_POST_ DS_POST_
Sybase SYBA_ LS_SYBA_ DS_SYBA_
Cassandra CASS_ LS_CASS_ DS_CASS_
MongoDB MONG_ LS_MONG_ DS_MONG_
Amazon Redshift ARED_ LS_ARED_ DS_ARED_
SAP Business Warehouse SAPW_ LS_SAPW_ DS_SAPW_
SAP Cloud for Customer (C4C) SAPC_ LS_SAPC_ DS_SAPC_
SAP Table SAPT_ LS_SAPT DS_SAPT_
SAP HANA HANA_ LS_HANA_ DS_HANA_
Drill DRILL_ LS_DRILL_ DS_DRILL_
Google BigQuery GBQ_ LS_GBQ_ DS_GBQ_
Greenplum GRPL_ LS_GRPL_ DS_GRPL_
HBase HBAS_ LS_HBAS_ DS_HBAS_
Hive HIVE_ LS_HIVE_ DS_HIVE_
Apache Impala IMPA_ LS_IMPA_ DS_IMPA_
Informix INMI_ LS_INMI_ DS_INMI_
MariaDB MDB_ LS_MDB_ DS_MDB_
Microsoft Access MACS_ LS_MACS_ DS_MACS_
Netezza NETZ_ LS_NETZ_ DS_NETZ_
Phoenix PHNX_ LS_PHNX_ DS_PHNX_
Presto (Preview) PRST_ LS_PRST_ DS_PRST_
Spark SPRK_ LS_SPRK_ DS_SPRK_
Vertica VERT_ LS_VERT_ DS_VERT_
Snowflake SNWF_ LS_SNWF_ DS_SNWF_
MongoDB Atlas MONG_ATLAS_ LS_MONG_ATLAS_ DS_MONG_ATLAS_
Amazon RDS for Oracle RDSORAC_ LS_RDSORAC_ DS_RDSORAC_
Amazon RDS for SQL Server RDSSQL_ LS_RDSSQL_ DS_RDSSQL_

 

Files

     
  Abbreviation Linked Service Dataset
File System FILE_ LS_FILE_ DS_FILE_
HDFS HDFS_ LS_HDFS_ DS_HDFS_
Amazon S3  AMS3_ LS_AMS3_ DS_AMS3_
FTP FTP_ LS_FTP_ DS_FTP_
SFTP SFTP_ LS_SFTP_ DS_SFTP_
Google Cloud Storage GCS_ LS_GCS_ DS_GCS_
Oracle Cloud Storage OCS_ LS_OCS_ DS_OCS_
Amazon S3 Compatible Storage SMS3C_ LS_SMS3C_ DS_SMS3C_

 

Generic

     
  Abbreviation Linked Service Dataset
Generic ODBC ODBC_ LS_ODBC_ DS_ODBC_
Generic OData  ODAT_ LS_ODAT_ DS_ODAT_
Generic REST REST_ LS_REST_ DS_REST_
Generic HTTP HTTP_ LS_HTTP_ DS_HTTP_

 

Services and Apps

Abbreviation Linked Service Dataset
Salesforce SAFC_ LS_SAFC_ DS_SAFC_
Salesforce Service Cloud SAFCSC_ LS_SAFCSC_ DS_SAFCSC_
Salesforce Marketing Cloud SAFOMC_ LS_SAFOMC_ DS_SAFOMC_
GitHub GITH_ LS_GITH_ DS_GITH_
Jira JIRA_ LS_JIRA_ DS_JIRA_
Web Table (table from HTML)  WEBT_ LS_WEBT_ DS_WEBT_
Amazon Marketplace Web Service AMSMWS_ LS_AMSMWS_ DS_AMSMWS_
Xero XERO_ LS_XERO_ DS_XERO_
SharePoint Online List SHAREPOINT_ LS_SHAREPOINT_ DS_SHAREPOINT_
ServiceNow SERVICENOW_ LS_SERVICENOW_ DS_SERVICENOW_
Dynamics (Microsoft Dataverse) DATAVERSE_ LS_DATAVERSE_ DS_DATAVERSE__
Dynamics 365 D365_ LS_D365_ DS_D365_
Dynamics AX DAX_ LS_DAX_ DS_DAX_
Dynamics CRM DCRM_ LS_DCRM_ DS_DCRM_
Microsoft 365 M365_ LS_M365_ Ds_M365__
SAP Cloud for Customer (C4C) SAPC4C_ LS_SAPC4C_ DS_LS_SAPC4C_
SAP ECC SAPE_ LS_SAPE_ DS_SAPE_

If your connector is not described(mostly connectors which are in Preview), please let me know. For more details for all the different connectors, check the connector overview

Pipeline

Even for Pipeline you can define naming conventions. I think the most important thing is that you always start your pipeline with PL_ followed by a Logic Name for you. You can for example use:

TRANS: Pipeline with transformations

SSIS: Pipeline with SSIS Packages

DATA: Pipeline with DataMovements

COPY: Pipeline with Copy Activities

Divers

NB: Notebook 

DF: Mapping Dataflows

SQL: SQL Scripts

KQL: KQL Scripts

JOB: Spark job definition

Once again these naming conventions are just suggestions. The most important thing is that you start using naming conventions and that you use the folder structure within the Pipelines (categories). Like the picture below as an example.

FolderStructure

If you have suggestions just let me know by leaving a comment below.

Azure Data Factory: How to assign a Customer Managed Key

Azure Data Factory: How to assign a Customer Managed Key

Customer key

With this new functionality you can add extra security to your Azure Data Factory environment. Where the data was first encrypted with a randomly generated key from Microsoft, you can now use the customer-managed key feature. With this Bring Your Own Key (BYOK) you can add extra security to your Azure Data Factory environment. If you use the customer-managed key functionality, the data will be encrypted in combination with the ADF system key. You can create your own key or have it generated by the Azure Key Vault API

Be careful,  this new feature can only be enabled on an empty Azure Data Factory environment.  Make sure your Azure Active Directory, Azure Data Factory and Azure KeyVault are all in the same region. If you use an Azure Landing Zone consisting of different subscriptions, this is also possible, as long as the services exist in the same region.

Please follow the steps below how to enable this new feature:

I assume that you already have an existing Azure KeyVault. If not, you will have to create one first. You can read how to do that here.
With an existing Azure KeyVault, it is important that you enable the options Soft Deletes and Purge protection.

Enable Soft Deletes and Purge protection

Purge option

If you want to enable this via Powershell use the following command:

[code lang="ps"] ($resource = Get-AzResource -ResourceId (Get-AzKeyVault -VaultName 'YOURKEYVAULTNAME').ResourceId).Properties | Add-Member -MemberType 'NoteProperty' -Name 'enableSoftDelete' -Value 'true'

Set-AzResource -resourceid $resource.ResourceId -Properties $resource.Properties

($resource = Get-AzResource -ResourceId (Get-AzKeyVault -VaultName 'YOURKEYVAULTNAME').ResourceId).Properties | Add-Member -MemberType 'NoteProperty' -Name 'enablePurgeProtection' -Value 'true'

Set-AzResource -resourceid $resource.ResourceId -Properties $resource.Properties

Define Access policy

The next step is to enable your Grant Data Factory access to Azure Key Vault, you have to enable  the following permissions: Get, Unwrap Key, and Wrap Key

ADF Policy

Search for Data Factory Instance and Select the correct one:

ADF Principal

Create KEY

Once you have done that it’s time to create your Keys. Keep in mind that only RSA 2048-bit keys are supported by Azure Data Factory encryption.

Create Keys

Very important step your key name must be in only letters. KEYADFNAMECUSTOMER will work, but KEY-ADFNAME-CUSTOMER isn’t and you will get an error in your Azure Data Factory Instance. It took me a while to figure this out. So it can saves you a lot of time.

After your KEY is created, copy the Key Identifier.

Assign Customer Key

The last step in this article is to assign the key to your Azure Data Factory Instance.

CMK

Customer key ADF

Paste the selected key in your Azure Data Factory Instance and save.

Errors

If your get an error “Invalid key Vault URL”

-Check if the Soft Deletes and Purge protection on your Key Vault is set.

-Check if your Key consists only of letters.

-Check if you enabled your Grant Data Factory access to Azure Key Vault.

-Check if Azure DataFactory, Azure KeyVault and your Azure Active Directory are in the same region.

 

If you still have errors, please send me a message and I will try to help you out.

Hopefully, this article has helped you to secure your environment.