Bulk Delete Documents in Azure Cosmos DB Container using Azure Data Factory

If you are building an Azure data engineering solution that uses Azure Data Factory as the orchestration tool and Azure Cosmos DB as a data store, and you need to delete documents from a particular SQL container programmatically, you have probably already figured out that there is no easy way to do it.

  • Azure Data Factory, out of the box, provides only data movement activities to/from Cosmos DB
  • Data Factory does not (yet) have the activities to execute Cosmos DB stored procedures or delete documents within a SQL container.

This post is an attempt to help all the Azure data engineers who love to play with data in Cosmos DB using ADF.

Before we dive into the how-to, I am assuming that you already know how to provision Azure Data Factory, Azure Cosmos DB (SQL API) and Azure Logic Apps. If not, it’s pretty easy to get started.

High-level workflow to get this done

  1. Create a stored procedure in the Cosmos DB container that runs a select query and deletes the matching documents. There’s an excellent GitHub sample (here) that bulk deletes documents in a container, and it only makes sense to reuse that code and modify it to your needs.
  2. Configure an Azure Logic App to execute the stored procedure when an HTTP request is received.
  3. Configure a Data Factory pipeline that uses the Web activity to invoke the Logic App.

Create and set up the sproc in Cosmos DB

  • We need this stored procedure to bulk delete the documents returned by a given query
  • As an example, I will use some sample data in the container with ‘ProductID’ as the partition key
  • Create a stored procedure and name it ‘BulkDelete’ (a trimmed sketch of the sproc body is included below). We will try to delete the data for the partition key value {“ProductID” : “715”}
  • When executed, the sproc returns a response similar to the one below. We will use this response to determine whether all the documents were deleted for the given query.
{
"deleted": 0,
"continuation": false
}

  • deleted - contains a count of documents deleted
  • continuation - a boolean indicating whether you should execute the sproc again
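
For reference, below is a trimmed sketch of what the sproc body can look like, adapted from the GitHub sample mentioned above (the full sample has more validation, so prefer it for real workloads). It takes the delete query as a parameter, for example SELECT * FROM c WHERE c.ProductID = "715", and returns the deleted/continuation response shown above.

// Trimmed sketch of a bulk delete sproc, adapted from the GitHub bulk delete sample.
// It deletes the documents returned by the query passed in as a parameter and reports
// how many were deleted and whether another execution is needed.
function bulkDelete(query) {
    var container = getContext().getCollection();
    var containerLink = container.getSelfLink();
    var response = getContext().getResponse();
    var result = { deleted: 0, continuation: true };

    if (!query) throw new Error("The query is undefined or null.");

    tryQueryAndDelete();

    // Query a page of documents and delete them; recurse until nothing is returned.
    function tryQueryAndDelete(continuation) {
        var accepted = container.queryDocuments(containerLink, query, { continuation: continuation },
            function (err, docs, responseOptions) {
                if (err) throw err;
                if (docs.length > 0) {
                    tryDelete(docs);                                 // delete this page of results
                } else if (responseOptions.continuation) {
                    tryQueryAndDelete(responseOptions.continuation); // fetch the next page
                } else {
                    result.continuation = false;                     // nothing left to delete
                    response.setBody(result);
                }
            });
        // Not accepted means the sproc is near its execution time limit; return progress so far.
        if (!accepted) response.setBody(result);
    }

    // Delete the documents one by one, then query again for any remaining ones.
    function tryDelete(docs) {
        if (docs.length === 0) return tryQueryAndDelete();
        var accepted = container.deleteDocument(docs[0]._self, {}, function (err) {
            if (err) throw err;
            result.deleted++;
            docs.shift();
            tryDelete(docs);
        });
        if (!accepted) response.setBody(result);
    }
}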

Configure Azure Logic App to invoke the stored procedure

  • Open the Logic Apps Designer and select the Blank logic app template
  • Add the trigger “When a HTTP request is received”
  • Add a new step and choose the “Execute stored procedure” action
  • Browse the current subscription and connect to the Cosmos DB account. Enter the ‘Database ID’, ‘Collection ID’, ‘Sproc ID’, ‘Partition key value’ and ‘Parameters’
  • Add a new step with the “Response” action and add the sproc’s output as dynamic content in the Body. This returns the sproc’s response to the caller (refer to the response sample above); a code-view sketch of this action follows this list
  • Now save the app. After saving, an HTTP POST URL will be generated in the trigger; copy that URL for use in ADF
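
If you open the Logic App’s code view after these steps, the Response action ends up looking roughly like the snippet below. The referenced action name (‘Execute_stored_procedure’) is the designer default and may differ in your app, so treat this as a sketch rather than something to paste in as-is.

"Response": {
    "type": "Response",
    "kind": "Http",
    "inputs": {
        "statusCode": 200,
        "body": "@body('Execute_stored_procedure')"
    },
    "runAfter": {
        "Execute_stored_procedure": [ "Succeeded" ]
    }
}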

Build Azure Data Factory Pipeline

  • Provision an Azure Data Factory Service and create a pipeline
  • To run the Logic App from Azure Data Factory, we have to send an HTTP request to the app, and this is possible using the Web activity.
  • You may need to execute the BulkDelete sproc multiple times (depending on whether the sproc is able to delete every document within its execution timeout limit)
  • To ensure we delete all the documents, we will run this Web activity inside an Until activity until the condition (the sproc’s response shows “deleted”: 0) is met
  • Add an Until activity to the pipeline and name it “UntilAllDocsDeleted”
  • Add a Web activity inside the Until activity and name it “Delete”
  • Paste the URL copied from the Logic App into the settings of the Web activity, use the POST method, and set the body to {}
  • Now let’s add the condition to the Until activity so that the Web activity executes repeatedly until the condition is met (a JSON sketch of this Until/Web activity pair follows this list):
@equals(0,activity('Delete').output.deleted)
  • Now you can publish the pipeline and run it.
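
In the pipeline JSON, the Until/Web activity pair described above comes out roughly like the sketch below. The URL placeholder stands in for the HTTP POST URL copied from the Logic App, and only the relevant properties are shown, so treat it as a sketch rather than a complete pipeline definition.

{
    "name": "UntilAllDocsDeleted",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@equals(0,activity('Delete').output.deleted)",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Delete",
                "type": "WebActivity",
                "typeProperties": {
                    "url": "<HTTP POST URL copied from the Logic App trigger>",
                    "method": "POST",
                    "headers": { "Content-Type": "application/json" },
                    "body": "{}"
                }
            }
        ]
    }
}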

Validate the pipeline

  • Before we run the pipeline, let’s get the count of the documents in the container for ProductID = 715 (for example, with the count query shown after this list)
  • Start the pipeline and observe the execution to see whether all the documents are deleted within a single execution or the sproc needs to run multiple times to delete them all
  • As you can see, the Delete activity executes multiple times because the first execution was able to delete only 624 documents within the execution time. It will continue to execute until the response shows “deleted”: 0
  • To be sure, check the count in the Cosmos DB Data Explorer after successful execution of the pipeline
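
A quick way to get that count is to run a query like the one below in Data Explorer before and after the pipeline run (adjust the property name if your partition key differs):

SELECT VALUE COUNT(1) FROM c WHERE c.ProductID = "715"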

While there are other ways to delete documents through Azure Data Factory, such as using Azure Functions or custom activities, I found using a Logic App the simplest of all.
