How to use concurrency in Azure Synapse pipelines?
How to prevent concurrent pipeline execution?
This week I discussed with a colleague how we can make sure that a pipeline does not start while a previous run of it is still in progress.
He asked whether I had ever considered the concurrency option. I had seen this option before but never paid attention to it.
How does the concurrency work?
If you read the Microsoft documentation it says the following:
The maximum number of concurrent runs the pipeline can have. By default, there is no maximum. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete.
The concurrency option works in both Azure Synapse Analytics and Azure Data Factory.
I started testing this functionality with the concurrency set to 1, and there are certainly some nice use cases for it:
- If the pipeline was started by a schedule trigger and someone else triggers the same pipeline manually, the manual run is placed in a queue.
- Sometimes data processing is delayed or more data than usual is delivered. If you process this data every 30 minutes and the second run starts before the first has finished, this could result in incorrect data. In this case too, the new run is placed in a queue and only starts when the previous one has finished.
It is a fairly simple process but can be quite useful especially in the case of short loading windows.
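To make the queueing behaviour concrete, here is a minimal sketch in Python. It is not the Synapse engine, only a simulation that assumes queued runs start in the order they were triggered (an assumption on my side, the documentation quote above does not state the order). The function name `simulate_runs` and the sample timings are made up for illustration.

```python
import heapq


def simulate_runs(requests, concurrency):
    """Simulate queueing for a pipeline with a concurrency limit.

    requests: list of (trigger_minute, duration_minutes) tuples.
    Returns a list of (trigger_minute, start_minute, end_minute) tuples.
    Assumes queued runs start in trigger order (illustrative assumption).
    """
    running = []  # min-heap with the end times of runs that occupy a slot
    results = []
    for trigger, duration in sorted(requests):
        # Release the slots of runs that finished before this trigger fired.
        while running and running[0] <= trigger:
            heapq.heappop(running)
        if len(running) < concurrency:
            start = trigger  # a slot is free, the run starts immediately
        else:
            start = heapq.heappop(running)  # queued until the earliest run ends
        end = start + duration
        heapq.heappush(running, end)
        results.append((trigger, start, end))
    return results


# Trigger every 30 minutes; the first run takes 45 minutes instead of 20.
requests = [(0, 45), (30, 20), (60, 20)]
for trigger, start, end in simulate_runs(requests, concurrency=1):
    print(f"triggered at {trigger:>2}, started at {start:>2}, finished at {end:>2}")
```

With a concurrency of 1 the second and third run wait for the previous one, so the runs never overlap. With a concurrency of 2 the same input would let the second run start at minute 30, while the first is still busy.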
Please note that this setting has no effect on a run in Debug mode; a debug run starts immediately.
Check the monitoring regularly to see whether runs are being queued all the time. If so, you are better off changing the recurrence of the trigger. You still have the option to cancel a queued pipeline run.
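If you export the run history from the monitoring, a check like the one below can show which pipelines are queued most often. This is only a sketch: the records and the field names `pipeline_name` and `status` are hypothetical, and I assume that a waiting run is reported with the status `Queued`.

```python
from collections import Counter


def count_queued_runs(runs):
    """Count, per pipeline, the runs that are currently waiting in the queue."""
    return Counter(
        run["pipeline_name"] for run in runs if run["status"] == "Queued"
    )


# Hypothetical run records, for example exported from the monitoring view.
sample_runs = [
    {"pipeline_name": "PL_Load_Sales", "status": "Succeeded"},
    {"pipeline_name": "PL_Load_Sales", "status": "Queued"},
    {"pipeline_name": "PL_Load_Sales", "status": "Queued"},
    {"pipeline_name": "PL_Load_Stock", "status": "InProgress"},
]

queued = count_queued_runs(sample_runs)
for name, count in queued.items():
    print(f"{name}: {count} queued run(s)")
```

A pipeline that shows up here again and again is a candidate for a longer trigger interval.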
How to enable concurrency?
To enable concurrency in an Azure Synapse pipeline, use the Concurrency property in the pipeline settings. By default the property is blank, which means there is no maximum and runs are not queued. Setting the value to 1 means that only one run of the pipeline is active at a time. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete. Setting the concurrency to a higher value allows that many runs of the pipeline to execute at the same time, which can increase throughput, provided the data source and the compute can handle the extra load.
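The setting also ends up in the JSON definition of the pipeline, as a `concurrency` value under `properties`. The sketch below builds such a definition in Python just to show where the value lives. The pipeline name `PL_Load_Sales` is made up and the activities list is left empty; a real definition contains your own activities.

```python
import json

# Minimal pipeline definition; the name is hypothetical, activities omitted.
pipeline_definition = {
    "name": "PL_Load_Sales",
    "properties": {
        "activities": [],
        # Allow one run at a time; additional runs are queued.
        "concurrency": 1,
    },
}

pipeline_json = json.dumps(pipeline_definition, indent=2)
print(pipeline_json)
```

Leaving the `concurrency` key out corresponds to leaving the property blank in the pipeline settings.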
If you have any questions about concurrency, please let me know.