How to setup Code Repository in Azure Data Factory

How to setup Code Repository in Azure Data Factory

ADF

by Erwin | Nov 5, 2020

Why activate a Git Configuration?

The main reasons are:

  1. Source Control: Ensures that all your changes are saved and traceable, but also that you can easily go back to a previous version in case of a bug.
  2. Continuous Integration and Continuous Delivery (CI/CD): Allows you to Create build and release pipelines for easy release to other Data Factory instance, manually or triggered(DTAP).
  3. Collaboration: You have the ability to easily collaborate in the same Data Factory with different colleagues.
  4. Performance: Your Data Factory from Git is 10 times faster then loading directly from the Data Factory Service.

So enough reasons to start enabling your Git Configuration.

How to setup your Code Repository in Azure Data Factory!

During the configuration/set up of your Data Factory you have the possibility to select either Azure DevOps or GitHub as your Git Configuration. If you haven't done that, you can still configure this integration in Azure Data Factory. The procedure for both options are the same.
Create ADF Version Control
In my previous article, Creating an Azure Data Factory Instance, I skipped the Git Configuration. In this article I will explain how to do this in an already created Data Factory.

Azure Data Factory Source Control

On the right of your splash screen when opening your Data Factory select the Setup Code Repository. Other options to start configuring your Code Repository are through the Management Hub or in the UX on the top left in the authoring canvas. If you don't see the option, Code Repository is already configured. You can check this in the Management Hub or UX.

We have the option to configure Azure DevOps or GitHub.

Azure DevOps integration

First I will take you through the configuration of Azure DevOps and then also create a similar configuration in GitHub. If you want to start directly in GitHub, click here.

Select Azure DevOps Git:

Azure Dev Ops Config

  1. Azure Active Directory: Select the AAD where your Azure DevOps environment is located. If you use another AAD, make sure that this account has rights to that environment.
  2. Azure DevOps Account: Select your Account.
  3. Project Name: Select the Project Name where you want to store your repository in.
  4. Git Repository: Create a new Project.
  5. Collaboration Branch: Change this to Main.
  6. Publish Branch: Leave this on adf_publish.
  7. Root folder: If you want to create a complete project with SQL,Azure Analysis Service, Azure DataBricks etc etc, you define a root folder and create your repository into that folder.
  8. Import: When this is a blank Data Factory, you can disable this option. When you have create already resources in your Data Factory, you should enable this so already created resources are committed to the repository.

Click on apply and you will see that you repository is connected.

Repo Connected

When you log in to your Azure Dev Ops Environment, you will see that a new Repository is created Main Branch. Azure Dev Ops Main

Go back to your Data Factory and click on Publish.

Data Factory Publish

In Azure DevOps the adf_publish Branch is now also created.Azure Dev Ops Publish

GitHub Integration

In the repository screen, select GitHub:

Github login

The first time you connect with your Data Factory you need to login in GitHub.

Github authorize

Once connect you to need to Authorize your Data Factory.

Github Configuration

All the settings are almost the same as in Azure DevOps:

  1. Use GitHub Server Enterprise: If enabled fill the The GitHub Enterprise root URL.
  2. GitHub Account: Select your Account.
  3. Project Name: Select the Project Name where you want to store your repository in.
  4. Git Repository: Create a new Project.
  5. Collaboration Branch: Leave this on Main.
  6. Publish Branch: Leave this on adf_publish.
  7. Root folder: If you want to create a complete project with SQL, Azure Analysis Service, Azure DataBricks etc etc, you define a root folder and create your repository into that folder.
  8. Import: When this is a blank Data Factory, you can disable this option. When you have create already resources in your Data Factory, you should enable this so already created resources are committed to the repository.

Click on apply and you will see that you repository is connected.

Repo Connected

Log in to your GitHub, a new Repository is created Main Branch. If you go back to your Data Factory and click on Publish.

Data Factory Publish

In GitHub the adf_publish Branch is now also created.

GitHub Publish

As you can see the Setup for Azure Dev Ops and GitHub are mostly the same. You have now learned how to connect your Data Factory to a Code Repository. You're now ready to start building your Release and build pipeline's.

Thanks for reading and in case you have some questions, please leave them in the comments below.

Feel free to leave a comment

Azure Data Factory Let’s get started

Azure Data Factory Let’s get started

Creating an Azure Data Factory Instance, let’s get started

Many blogs nowadays are about which functionalities we can use within Azure Data Factory. 
But how do we create an Azure Data Factory instance in Azure for the first time and what should you take into account?  In this article I will take you step by step on how to get started.

First we have to login in the Azure Portal.

Azure Data Factory

Search for Data Factories and select the Data Factory  service.

Create ADF

Secondly we have to create a Data Factory Instance.

Create ADF names

Fill in the required fields:

  1. Subscription => Select your Azure subscription in which you want to create the Data Factory.
  2. Resource Group =>Select Use existing, and select an existing resource group from the list or click on Create new, and enter the name of a resource group(a new Resource Group will be created)
  3. Region => Select the desired Region/Location, this is where your Azure Data Factory meta data will be stored and has nothing to do where you create your compute or store your Data Stores.
  4. Name = > Create a unique name in Azure.
  5. Version => Always select V2 here, this contains the very latest developments and functionalities. V1 is only used for migration from another V1 instance.

Select Next: Git configuration

Azure Data Factory Git Configuration

Enable the option to configure Git later,  we will configure this later in Azure Data Factory.

Select Next: Networking:

Create Azure Data Factory Networking

Leave the options as is. I will explain the Connectivity Method in one of my next articles.

Select Next: Review + Create:

Create Azure Data Factory Validation

Your Azure Data Factory Instance will be created. Once you have created your Azure Data Factory, it is ready to use and you can open it from selected Resource Groups above:

Open Azure Data Factory

Select Author & Monitor:

Azure Data Factory let's get started

Encrypt your Azure Data Factory with customer-managed keys

Azure Data Factory encrypts data at rest, including entity definitions and any data cached while runs are in progress. By default, data is encrypted with a randomly generated Microsoft-managed key that is uniquely assigned to your data factory. But you also Bring Your Own Key (BYOK) more details can be find in my previous written article “Azure Data Factory: How to assign a Customer Managed Key

Please be aware that you have to assign this key on an empty Azure Data Factory Instance.

Roles for Azure Data Factory

Data Factory Contributor role:

Assign the built-in Data Factory Contributor role, must be set on Resource Group Level if you want the user to create a new Data Factory on Resource Group Level otherwise you need to set it on Subscription Level.

User can:

  1. Create, edit, and delete data factories and child resources including datasets, linked services, pipelines, triggers, and integration runtimes.
  2. Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by Data Factory in the Azure portal.
  3. Manage App Insights alerts for a Data Factory.
  4. Create support tickets.

Reader Role:

Assign the built-in reader role on the Data Factory resource for the user.

User can:

  1. View and monitor the selected Data Factory, but user can not edit or change it.

More on how to assign roles and permissions can be found here.

Thanks for reading, I my next blog I will describe how to Set up your Code Repository.

Parameterize Linked Services in ADF

Parameterize Linked Services in ADF

ADF

by Erwin | Jul 9, 2020

Parameterize Linked Services 

For my Azure Data Factory solution I wanted to Parameterize properties in my Linked Services. Not all properties are Parameterized by default through the UI. But there's another way to achieve this.

 

 

Linked Service

Open your existing Linked Services.

Linked Services ADF

In this situation I want to Parameterize my FTP connection so that I can change the Host name based on a Azure Key Vault Secret.

By default is this not possible through the UI but in the Bottom of your Linked Service there is a Advanced box

Linked Services Advanced

 

If you enable this box you can start building your own connection, but also create your own Parameters for this connection.

How to start:

As a base we will use the default code or our connection

Linked Services Code

{
    "name": "LS_FTP_SOURCE",
    "properties": {
        "annotations": [
            "stage: none",
            "scenario: demo",
            "environments: dvlm"
        ],
        "type": "FtpServer",
        "typeProperties": {
            "host": "ftp.erwindekreuk.com",
            "port": 21,
            "enableSsl": true,
            "enableServerCertificateValidation": true,
            "authenticationType": "Basic",
            "userName": "ftp_down",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AKV_OXGN",
                    "type": "LinkedServiceReference"
                },
                "secretName": "secretname"
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/linkedservices"
}
Copy this code to your advanced box and enable the option Specify dynamic contents in JSON format/

Now you can start adding new parameters.

Linked Services Advanced

If you want to Parameterize your HOST name  connection you have to add in the top of the code a new Parameter, under the type of your connection

    "properties": {
        "type": "FtpServer",
        "parameters": {
            "ConnectionKeyvaultSecret": {
                "type": "string"
            }

After you have done this, you need to specify for which properties you want to use this parameter. In my case I want to read the parameter form my Azure Key Vault for my HOST propertie.

The JSON code below will now use above parameter as an input.

        "typeProperties": {
            "host": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AKV_OXGN",
                    "type": "LinkedServiceReference"
                },
                "secretName": {
                    "value": "@linkedService().ConnectionKeyvaultSecret",
                    "type": "Expression"
                }
            }

Save your connection and you will see that your UI is changed and that you have to define all your setting through the Advanced Editor.

If you test your connection you will now see that you have to fill in a parameter.

Linked Service Test Connection

And now you can create parameters of every TypeProperties within your connection.

The code below will create Parameters for your Host, Username and Password entries with Azure Key Vault enabled. For the authenticationType you have to choose between Basic and Anonymous. But can also at this to your Azure Key Vault.

{
    "name": "LS_FTP_SOURCE",
    "properties": {
        "type": "FtpServer",
        "parameters": {
            "ConnectionKeyvaultSecret": {
                "type": "string"
            },
            "UsernameKeyvaultSecret": {
                "type": "string"
            },
            "PasswordKeyvaultSecret": {
                "type": "string"
            },
            "authenticationType": {
                "type": "string"
            }
        },
        "annotations": [ ],
        "typeProperties": {
            "host": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AKV_OXGN",
                    "type": "LinkedServiceReference"
                },
                "secretName": {
                    "value": "@linkedService().ConnectionKeyvaultSecret",
                    "type": "Expression"
                }
            },
            "port": 21,
            "enableSsl": false,
            "enableServerCertificateValidation": false,
            "authenticationType": "@linkedService().authenticationType",
            "userName": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AKV_OXGN",
                    "type": "LinkedServiceReference"
                },
                "secretName": {
                    "value": "@linkedService().UsernameKeyvaultSecret",
                    "type": "Expression"
                }
            },
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AKV_OXGN",
                    "type": "LinkedServiceReference"
                },
                "secretName": {
                    "value": "@linkedService().PasswordKeyvaultSecret",
                    "type": "Expression"
                }
            }
        }
    }
}

Thanks for reading my blog post and have fun with Parameterization of your Linked Services in ADF.

Feel free to leave a comment

Azure Data Factory and Azure Synapse Analytics Naming Conventions

Naming ConventionsAzure Naming Conventions

More and more projects are using Azure Data Factory and Azure Synapse Analytics, the more important it is to apply a correct and standard naming convention. When using standard naming conventions you create recognizable results across different projects, but you also create clarity for your colleagues. In addition to that, it is easier to add these projects to other services such as Managed Services, Azure DevOps,  etc etc,  because standards are used.

To start with these naming conventions, I have made a list of suggestions with most common Linked Services. The list is not exhaustive, but it does provide guidance for new Linked Services.

There are a few standard naming conventions that apply to all elements in Azure Data Factory and in Azure Synapse Analytics.

  • *Names are case insensitive (not case sensitive).  For that reason I’m only using CAPITALS.
  • *Maximum number of characters in a table name: 260.
  • All object names must begin with a letter, number or underscore (_).
  • Following characters are not allowed: “.”, “+”, “?”, “/”, “<”, ”>”,”*”,”%”,”&”,”:”,””

These rules are also defined on the following link

This post has been updated on Feb 2nd, 2023 with the latest connectors.

Azure

Abbreviation Linked Service Dataset
Azure Blob Storage ABLB_ LS_ABLB_ DS_ABLB_
Azure Cosmos DB SQL API ACSA_ LS_ACSA_ DS_ACSA_
Azure Cosmos DB MongDB API ACMA_ LS_ACMA_ DS_ACMA_
Azure Data Explorer ADEX_ LS_ADEX_ DS_ADEX_
Azure Data Lake Storage Gen1 ADLS_ LS_ADLS_ DS_ADLS_
Azure Data Lake Storage Gen2 ADLS_ LS_ADLS_ DS_ADLS_
Azure Database for MariaDB AMDB_ LS_AMDB_ DS_AMDB_
Azure Database for MySQL AMYS_ LS_AMYS_ DS_AMYS_
Azure Database for PostgreSQL APOS_ LS_APOS_ DS_APOS_
Azure File Storage AFIL_ LS_AFIL_ DS_AFIL_
Azure Search ASER_ LS_ASER_ DS_ASER_
Azure SQL Database ASQL_ LS_ASQL_ DS_ASQL_
Azure SQL Database Managed Instance ASQM_ LS_ASQM_ DS_ASQM_
Azure Synapse Analytics (formerly Azure SQL DW) ASDW_ LS_ASDW_ DS_ASDW_
Azure Table Storage ATBL_ LS_ATBL_ DS_ATBL_
Azure DataBricks ADBR_ LS_ADBR_ DS_ADBR_
Azure Cognitive Search ACGS_ LS_ACGS DS_ACGS_
Azure Synapse Analytics  ASA_ LS_ASA DS_ASA
Azure Cognitive Service ACG_ LS_ACG_ N/A

 

Database

     
  Abbreviation Linked Service Dataset
SQL Server  MSQL_ LS_SQL_ DS_SQL_
Oracle ORAC_ LS_ORAC_ DS_ORAC_
Oracle Eloqua ORAE_ LS_ORAE_ DS_ORAE_
Oracle Responsys ORAR_ LS_ORAR_ DS_ORAR_
Oracle Service Cloud ORSC_ LS_ORSC_ DS_ORSC_
MySQL MYSQ_ LS_MYSQ_ DS_MYSQ_
DB2 DB2_ LS_DB2_ DS_DB2_
Teradata  TDAT_ LS_TDAT_ DS_TDAT_
PostgreSQL POST_ LS_POST_ DS_POST_
Sybase SYBA_ LS_SYBA_ DS_SYBA_
Cassandra CASS_ LS_CASS_ DS_CASS_
MongoDB MONG_ LS_MONG_ DS_MONG_
Amazon Redshift ARED_ LS_ARED_ DS_ARED_
SAP Business Warehouse SAPW_ LS_SAPW_ DS_SAPW_
SAP Cloud for Customer (C4C) SAPC_ LS_SAPC_ DS_SAPC_
SAP Table SAPT_ LS_SAPT DS_SAPT_
SAP HANA HANA_ LS_HANA_ DS_HANA_
Drill DRILL_ LS_DRILL_ DS_DRILL_
Google BigQuery GBQ_ LS_GBQ_ DS_GBQ_
Greenplum GRPL_ LS_GRPL_ DS_GRPL_
HBase HBAS_ LS_HBAS_ DS_HBAS_
Hive HIVE_ LS_HIVE_ DS_HIVE_
Apache Impala IMPA_ LS_IMPA_ DS_IMPA_
Informix INMI_ LS_INMI_ DS_INMI_
MariaDB MDB_ LS_MDB_ DS_MDB_
Microsoft Access MACS_ LS_MACS_ DS_MACS_
Netezza NETZ_ LS_NETZ_ DS_NETZ_
Phoenix PHNX_ LS_PHNX_ DS_PHNX_
Presto (Preview) PRST_ LS_PRST_ DS_PRST_
Spark SPRK_ LS_SPRK_ DS_SPRK_
Vertica VERT_ LS_VERT_ DS_VERT_
Snowflake SNWF_ LS_SNWF_ DS_SNWF_
MongoDB Atlas MONG_ATLAS_ LS_MONG_ATLAS_ DS_MONG_ATLAS_
Amazon RDS for Oracle RDSORAC_ LS_RDSORAC_ DS_RDSORAC_
Amazon RDS for SQL Server RDSSQL_ LS_RDSSQL_ DS_RDSSQL_

 

Files

     
  Abbreviation Linked Service Dataset
File System FILE_ LS_FILE_ DS_FILE_
HDFS HDFS_ LS_HDFS_ DS_HDFS_
Amazon S3  AMS3_ LS_AMS3_ DS_AMS3_
FTP FTP_ LS_FTP_ DS_FTP_
SFTP SFTP_ LS_SFTP_ DS_SFTP_
Google Cloud Storage GCS_ LS_GCS_ DS_GCS_
Oracle Cloud Storage OCS_ LS_OCS_ DS_OCS_
Amazon S3 Compatible Storage SMS3C_ LS_SMS3C_ DS_SMS3C_

 

Generic

     
  Abbreviation Linked Service Dataset
Generic ODBC ODBC_ LS_ODBC_ DS_ODBC_
Generic OData  ODAT_ LS_ODAT_ DS_ODAT_
Generic REST REST_ LS_REST_ DS_REST_
Generic HTTP HTTP_ LS_HTTP_ DS_HTTP_

 

Services and Apps

Abbreviation Linked Service Dataset
Salesforce SAFC_ LS_SAFC_ DS_SAFC_
Salesforce Service Cloud SAFCSC_ LS_SAFCSC_ DS_SAFCSC_
Salesforce Marketing Cloud SAFOMC_ LS_SAFOMC_ DS_SAFOMC_
GitHub GITH_ LS_GITH_ DS_GITH_
Jira JIRA_ LS_JIRA_ DS_JIRA_
Web Table (table from HTML)  WEBT_ LS_WEBT_ DS_WEBT_
Amazon Marketplace Web Service AMSMWS_ LS_AMSMWS_ DS_AMSMWS_
Xero XERO_ LS_XERO_ DS_XERO_
SharePoint Online List SHAREPOINT_ LS_SHAREPOINT_ DS_SHAREPOINT_
ServiceNow SERVICENOW_ LS_SERVICENOW_ DS_SERVICENOW_
Dynamics (Microsoft Dataverse) DATAVERSE_ LS_DATAVERSE_ DS_DATAVERSE__
Dynamics 365 D365_ LS_D365_ DS_D365_
Dynamics AX DAX_ LS_DAX_ DS_DAX_
Dynamics CRM DCRM_ LS_DCRM_ DS_DCRM_
Microsoft 365 M365_ LS_M365_ Ds_M365__
SAP Cloud for Customer (C4C) SAPC4C_ LS_SAPC4C_ DS_LS_SAPC4C_
SAP ECC SAPE_ LS_SAPE_ DS_SAPE_

If your connector is not described(mostly connectors which are in Preview), please let me know. For more details for all the different connectors, check the connector overview

Pipeline

Even for Pipeline you can define naming conventions. I think the most important thing is that you always start your pipeline with PL_ followed by a Logic Name for you. You can for example use:

TRANS: Pipeline with transformations

SSIS: Pipeline with SSIS Packages

DATA: Pipeline with DataMovements

COPY: Pipeline with Copy Activities

Divers

NB: Notebook 

DF: Mapping Dataflows

SQL: SQL Scripts

KQL: KQL Scripts

JOB: Spark job definition

Once again these naming conventions are just suggestions. The most important thing is that you start using naming conventions and that you use the folder structure within the Pipelines (categories). Like the picture below as an example.

FolderStructure

If you have suggestions just let me know by leaving a comment below.