Hello Readers!!

My recent blog, Querying Cybersecurity Data Stored in Amazon S3, generated questions from some of you looking for an equivalent approach with Blob Storage, Azure’s object storage service. Your inquiries are excellent inspiration. So, here we are…

SOC teams in companies that use Microsoft Azure as their primary cloud provider are starting to store their cybersecurity data in Azure Blob Storage because of its cost efficiency for long-term data retention. They send the more critical or urgent alerts into Microsoft Sentinel, while high-volume, lower-fidelity log data gets dumped into Azure Blob Storage. The Blob data is not in continuous use, but when a security incident requires deeper investigation, this data immediately becomes a priority for the analyst. See the official documentation at Azure Blob Storage and pricing at Azure Blob Storage pricing.

In this blog, we will cover an example scenario of querying cybersecurity data in Azure Blob Storage for a malware investigation use-case. For the purposes of this blog, we will limit ourselves to approaches that don't require any specific third-party commercial solution. This means I am not talking about my company Query, which offers a federated search solution for security data, but I will talk generally about open federated search.

Walkthrough Scenario – Investigating Cuba Ransomware

On January 5th, the FBI and CISA updated their guidance on IOCs for Cuba ransomware, which has impacted 100+ organizations and extorted $60M+ in ransom payments. A detailed analysis of the ransomware is beyond the scope of this blog, but we will do the basic work of searching for the IOCs (Indicators of Compromise) – the file hashes and IPs taken from Table 1 and Table 4 respectively in the CISA advisory at https://www.cisa.gov/uscert/ncas/alerts/aa22-335a

For our example scenario, our analyst needs to investigate whether any of these IOCs have been observed in the last year of EDR and firewall log data stored in Azure Blob Storage. We will search for file hashes in the EDR data and IPs in the firewall data. For our example, we assume the EDR data source to be Microsoft Defender for Endpoint and the firewall to be Azure Firewall, though the article is relevant for any other cybersecurity data source.

NOTE: If your organization doesn't store data in Azure Blob Storage yet, continue with Section-1 below to understand how to plan for and store data in it. If you are already storing in Azure Blob, you can skip to Section-2 for the data querying and investigation part of our example scenario.

Section-1: Collecting and storing cybersecurity data in Azure Blob Storage

Before we talk about querying the above EDR and firewall data, let’s first talk about the platform engineering one would have done to store the above high-volume data sources in Azure Blob Storage.

1A: Create Azure Blob Storage Account and Containers

First, you would create a storage account in Azure and then create a container within that storage account to store the log data. The storage account provides a namespace for your data and includes settings for data redundancy, access control, and network access. A container is a logical grouping of blobs, and it can be used to organize your data.

To create a storage account, follow these steps:

  1. Log in to the Azure portal at https://portal.azure.com
  2. Click on the “Create a resource” button.
  3. Search for “Storage account” and select it from the list of options.
  4. Click on the “Create” button and enter a name for the storage account.
  5. Select the subscription, resource group, and location for the storage account.
  6. Choose the performance tier and replication options for the storage account.
  7. Click on the “Review + create” to review the settings and then click “Create”.

Once the storage account has been created, we can create containers within it to store the log data. To create a container in Azure Blob Storage, repeat these steps for each data source:

  1. Navigate to the storage account in the Azure portal.
  2. Click on the “Containers” option in the left-hand menu.
  3. Click on the “+ Container” button and enter a name for the container.
  4. Set the access level for the container to private.
  5. Click on the Create button.
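
If you prefer scripting these steps, here is a minimal sketch using the azure-storage-blob Python SDK. It assumes the storage account's connection string is exported as an environment variable; the container names are hypothetical examples for our two data sources:

# pip install azure-storage-blob
import os

from azure.storage.blob import BlobServiceClient

# Connection string of the storage account created above (assumed to be
# available as an environment variable).
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# One container per data source; containers are private by default.
for name in ("edr-logs", "firewall-logs"):
    service.create_container(name)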

1B: Intermediaries for forwarding from the data source to the above Blob Storage container

Depending upon your EDR and firewall vendors, you would follow their instructions to forward data into the above Blob Storage container. However, only a few cybersecurity data sources – like our example, Microsoft Defender for Endpoint – support this kind of direct forwarding. So what do you do for other vendors?

We will have to do some devops/platform engineering to set up and maintain forwarding using one or more of these intermediary tools:

  1. Azure Event Hubs (optionally with Apache Kafka)
  2. Azure Log Analytics workspace via Azure Monitor
  3. Azure CLI or Blob Storage REST APIs
  4. Azure Storage Explorer

1. Using Azure Event Hubs (optionally in combination with Apache Kafka):

You can use Azure Event Hubs to receive data streams and ingest them into the containers created above. Azure Event Hubs is a real-time data streaming/message bus service that can receive and process millions of events per second. Client libraries exist for different programming languages to integrate and send to Event Hubs, though there aren't many out-of-the-box cybersecurity vendor integrations. But since Event Hubs also exposes an Apache Kafka-compatible endpoint, existing Kafka producers can send straight to it – which really opens up the possibility of moving data into the above Blob Storage from anywhere!
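
As a quick illustration, here is a minimal Python sketch that sends a log record to an Event Hub using the azure-eventhub SDK. The connection string and hub name are hypothetical, and it assumes Event Hubs Capture is enabled on the hub so that received events are automatically written to the Blob Storage container (note that Capture writes Avro files by default):

# pip install azure-eventhub
import json
import os

from azure.eventhub import EventData, EventHubProducerClient

# Hypothetical Event Hub; Capture (assumed enabled) lands events in Blob Storage.
producer = EventHubProducerClient.from_connection_string(
    os.environ["EVENTHUB_CONNECTION_STRING"], eventhub_name="firewall-logs"
)

log_record = {"Timestamp": "2023-03-01T10:00:00Z", "SourceIP": "10.1.2.3", "Action": "Deny"}

# Send the record as a single-event batch.
batch = producer.create_batch()
batch.add(EventData(json.dumps(log_record)))
producer.send_batch(batch)
producer.close()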

Learn more about Event Hubs at What is Azure Event Hubs? – a Big Data ingestion service, and its pricing at Event Hubs pricing.

2. Cloud Infrastructure Data – Using Azure Monitor to store in Azure Log Analytics workspace:

If your infrastructure is in Azure or you are using a service like Azure Firewall, you can collect logs directly via Azure Monitor, which stores them in a Log Analytics workspace. You can then configure the Data Export option in the Log Analytics workspace, choosing the Azure Blob storage account, container, and data format, to store the data long-term.

NOTE: While we are talking about Azure Monitor from a collection perspective here, we will talk about its Log Analytics workspace in the query section later in this blog (Section 2B Item 3), including pros and cons and pricing.

3. Developer Options – Using Azure CLI or Azure Blob Storage REST APIs:

If you have developer resources, you can script the Azure CLI or use the Blob Storage REST APIs as well. Discussing them in depth probably necessitates a couple more blogs, so we will skip their details for now!
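
That said, just to give a taste, here is a minimal sketch of a programmatic upload with the azure-storage-blob Python SDK (the container and blob paths are hypothetical, and the connection string is assumed as before):

import os

from azure.storage.blob import BlobClient

# Upload a local log file into a (hypothetical) dated path in the container.
blob = BlobClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="firewall-logs",
    blob_name="2023/03/01/logs.json",
)
with open("logs.json", "rb") as f:
    blob.upload_blob(f, overwrite=True)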

NOTE: While we are talking about these APIs from a collection perspective in this section, we will talk about them again later (Section 2B Item 6), where we will discuss pros and cons and pricing.

4. Manual Upload/Validate option with Azure Storage Explorer:

If you want to manually upload your cybersecurity log data to Azure Blob Storage for testing or one-off scenarios, you can use Azure Storage Explorer. It is a free, standalone, cross-platform GUI app that enables you to manage your Azure Blob Storage accounts; you can drag and drop files or folders from your local machine, or use the Upload button, to move them into the Azure Blob Storage container.

NOTE: We will talk about Azure Storage Explorer again (in Section 2A) to validate the collected data, irrespective of which collection method(s) we used.

1C: Organize Data into Premium, Hot, Cool, and Archive

Azure Blob Storage lets you organize your blobs into Premium, Hot, Cool, and Archive tiers, which trade storage cost against access cost and performance: the colder the tier, the cheaper the storage but the more expensive and slower the reads. Based upon your needs, you can do “pay-as-you-go” or buy “reserved capacity” at a cheaper price. These options are detailed at the Azure Blob Storage pricing page.
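
For illustration, here is a sketch (assuming the same connection string and hypothetical container as before) that moves blobs older than 90 days to the Archive tier. In practice, you would more likely use Blob Storage lifecycle management policies, which do this automatically:

import os
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], container_name="firewall-logs"
)

# Hypothetical retention policy: archive anything older than 90 days.
cutoff = datetime.now(timezone.utc) - timedelta(days=90)
for props in container.list_blobs():
    if props.last_modified < cutoff:
        container.get_blob_client(props.name).set_standard_blob_tier("Archive")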

Section-2: Querying cybersecurity data from Azure Blob Storage

Azure Blob Storage is vanilla, low-cost object storage. The data is not indexed for fast searching, and there is no enforced object schema for query consistency. These limitations make it difficult to query directly. And yet, its low cost makes it relevant for storing high-volume, low-fidelity data that doesn't require constant low-latency searching.

2A: View and Validate your Cybersecurity Data in Azure Blob Storage

We covered in Section-1 how to collect cybersecurity data and store it in Azure Blob Storage. Assuming you or your platform engineering team have been successfully storing data, it is time to validate that the data is actually there.

Azure Storage Explorer, as we discussed earlier, is a free, standalone, cross-platform GUI app that enables you to manage and view your Azure Blob Storage accounts and their files and folders.

We would expect the data forwarded by your EDR and firewall vendors to be in JSON files, though the JSON structure will be vendor specific. You can download a file using Azure Storage Explorer and use a suitable text/JSON editor to view and validate whether your data is landing correctly.

Reference: Azure Storage Explorer – cloud storage management | Microsoft Azure
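
If you would rather validate programmatically than eyeball files in an editor, here is a minimal sketch, assuming newline-delimited JSON and the hypothetical container names from Section-1:

import json
import os

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], container_name="firewall-logs"
)

# Spot-check that every line of each recent blob parses as JSON.
for props in container.list_blobs(name_starts_with="2023/03/"):
    data = container.download_blob(props.name).readall()
    for line in data.splitlines():
        json.loads(line)  # raises ValueError on a malformed record
    print(f"{props.name}: OK")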

2B: Exploring possible options for querying the data

Below are some broad approaches we could explore to query the data:

  1. Azure Data Lake Analytics
  2. Azure Cognitive Search (formerly Azure Search)
  3. Azure Log Analytics (via Microsoft Sentinel or Azure Monitor)
  4. Azure Data Explorer (ADX with Kusto database)
  5. Azure Data Factory (ETL)
  6. Azure Blob Storage REST API
  7. Cybersecurity Open Federated Search solutions

Let’s discuss these approaches at a high level, and understand their suitability based upon our above use-case scenario.

1. Azure Data Lake Analytics

In my previous blog on Querying Cybersecurity Data Stored in Amazon S3, we discussed Amazon Athena at length. Well, the closest Azure equivalent would be Data Lake Analytics | Microsoft Azure. Azure Data Lake Analytics is a serverless analytics service that lets you run big data batch processing jobs written in U-SQL, a SQL-like language where the U stands for Universal. For those familiar with Microsoft SQL Server, U-SQL extends SQL Server's T-SQL language.

While Azure Data Lake Analytics can query Blob Storage directly, there is another storage service built on top of Blob called Azure Data Lake Storage (ADLS). ADLS is optimized for hierarchical storage and faster big data analytics queries, and is now in its second generation (Gen2). For faster and more scalable big data analytics queries, I would recommend storing in ADLS Gen2 over base Blob Storage. The storage pricing is the same between the two, so the difference really comes down to the pricing of the specific big data operations you perform. See Pricing—Data Lake Analytics | Microsoft Azure and Azure Data Lake Storage pricing for pricing options.

What would the U-SQL look like for our scenario? Here is a condensed version for querying the IP IOCs from the Blob Storage files containing Azure Firewall logs. We are skipping a lot of setup and execution details in the interest of space, but I would encourage you to review Get started with U-SQL language in Azure Data Lake Analytics | Microsoft Learn.

Here is the simplified query that searches our Azure Firewall logs from the past 3 months for the IP addresses associated with Cuba Ransomware (taken from Table 4 referenced in the Walkthrough section). Disclaimer: I assumed CSV instead of JSON here, since U-SQL has built-in extractors and outputters for CSV. Making this work with JSON is more work!

@firewalllog =
    EXTRACT Timestamp       DateTime,
            SourceIP        string,
            SourcePort      int,
            Target          string,
            TargetPort      int,
            Action          string,
            Policy          string
    FROM "/SOC/Firewall/logs.csv"
    USING Extractors.Csv();
@results =
    SELECT Timestamp, SourceIP, Target, TargetPort
    FROM @firewalllog
    WHERE SourceIP IN ("193.23.244.244", "144.172.83.13", "216.45.55.30") // … remaining IPs from Table 4
          AND Timestamp >= DateTime.Parse("2023/01/01") AND Timestamp <= DateTime.Parse("2023/04/01");
OUTPUT @results
    TO "/output/cuba-ransomware-hits.csv"
    USING Outputters.Csv();

As we can see, the Azure Data Lake Analytics option is relevant if your team has database and SQL query expertise. However, you don't get an “end-user” interface that is directly usable by the cybersecurity analyst. Let's continue our quest to other alternatives.

2. Azure Cognitive Search (formerly Azure Search)

Azure Cognitive Search is a managed search service that allows users to search through data that has been indexed into the service. It has various search features that analysts like, such as full-text search and faceted navigation. We can draw parallels between it and Elasticsearch.

You can consider Cognitive Search, but be aware that you will have to take your Azure Blob Storage data, or a relevant subset of it, and index it in Azure Cognitive Search. In my mind, that's yet another data-moving effort. Further, Azure Cognitive Search can treat our JSON log data as generic documents and index those logs, but it won't have any semantic understanding of a cybersecurity data schema. Nevertheless, detailed steps for indexing from Blob Storage are at Azure Blob indexer - Cognitive Search, and pricing information is at Azure Cognitive Search pricing.
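
Once a Blob indexer has populated an index, searching it for an IOC is straightforward. Here is a minimal sketch with the azure-search-documents Python SDK; the endpoint, index name, and key variable are hypothetical:

# pip install azure-search-documents
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical search service endpoint and index populated by the Blob indexer.
client = SearchClient(
    endpoint="https://soc-search.search.windows.net",
    index_name="firewall-logs",
    credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"]),
)

# Full-text search for one of the Cuba ransomware IPs across indexed log documents.
for doc in client.search(search_text='"193.23.244.244"'):
    print(doc)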

3. Azure Log Analytics (via Microsoft Sentinel or Azure Monitor)

The above SQL-expertise issue could be addressed if you stored the data in an Azure Log Analytics workspace instead. Log Analytics lets you write queries in Kusto Query Language (KQL), which is built for easier, query-only use-cases. Note that a Log Analytics workspace is not standalone: it is available only as the underlying log storage platform of Azure Monitor or Microsoft Sentinel, both of which store into a Log Analytics workspace.

Here is the query that can search for the above-mentioned IP addresses associated with Cuba Ransomware (taken from Table 4 referenced in the Walkthrough section) and seen by our Azure Firewall in the past 3 months – ago(90d), since an m suffix in KQL means minutes, not months:

AzureDiagnostics
| where Category == "AzureFirewallNetworkRule" or Category == "AzureFirewallApplicationRule"
| where TimeGenerated >= ago(90d)
| where SourceIP in ("193.23.244.244", "144.172.83.13", "216.45.55.30", …)

Here is the query that can search for that ransomware's file hashes (taken from Table 1 referenced in the Walkthrough section) across our endpoint security event data in the past 3 months:


DeviceFileEvents
| where Timestamp >= ago(90d)
| where SHA256 in ("f1103e627311e73d5f29e877243e7ca203292f9419303c661aec57745eb4f26c", "a7c207b9b83648f69d6387780b1168e2f1eabd23ae6e162dd700ae8112f8b96c", "141b2190f51397dbd0dfde0e3904b264c91b6f81febc823ff0c33da980b69944", …)

The above approach is relevant only if you move data into a Log Analytics workspace. That could come with a huge price tag when compared with the base storage costs of Blob Storage. See the Log Analytics pricing at Azure Monitor pricing or at Microsoft Sentinel Pricing. Note that enabling the Microsoft Sentinel app adds further cost on top of the Log Analytics storage part. The obvious advantage of adding Sentinel is that it gives you a cybersecurity analyst interface, which makes writing the above KQL easier.

4. Azure Data Explorer (ADX with Kusto database)

Azure Data Explorer (ADX) is a service that lets you explore big data, with clustering for fast performance and without having to index data upfront. You can create a just-in-time Kusto database and query it. You can take a subset of data blobs from Blob Storage, create the database, and then query it via KQL, the Kusto Query Language we discussed earlier. A more detailed example is at Query exported data from Azure Monitor by using Azure Data Explorer.

This option is suitable for one-off, ad-hoc needs, and the pricing listed at Azure Data Explorer pricing is based upon the temporary cluster's VM costs. I like this option better than option 1 above, Azure Data Lake Analytics, because ADX is more focused on real-time querying while Data Lake Analytics is more focused on batch querying. Though your platform engineer could script the ADX service – via PowerShell, for example, or as sketched below – that is still not a cyber analyst interface, and there isn't any implicit recognition of a cybersecurity schema.
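
To sketch what that scripting could look like (here in Python rather than PowerShell, using the azure-kusto-data package; the cluster URL, database, and table names are hypothetical):

# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Authenticates via a logged-in Azure CLI session; cluster URL is hypothetical.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://socadx.eastus.kusto.windows.net"
)
client = KustoClient(kcsb)

# Same IOC hunt as before, against a hypothetical FirewallLogs table.
query = """
FirewallLogs
| where TimeGenerated >= ago(90d)
| where SourceIP in ("193.23.244.244", "144.172.83.13", "216.45.55.30")
"""
response = client.execute("SocInvestigations", query)
for row in response.primary_results[0]:
    print(row)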

5. Azure Data Factory (ETL)

To get to a solution that implicitly understands a cybersecurity schema and provides an analyst interface, you could move the data into a third-party security vendor tool, like a SIEM that you might already be using and paying for. You will then need an ETL solution if the vendor didn't provide a connector.

Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and transformation. It comes with an interface for authoring, monitoring, and managing your ETL jobs. You could do the ETL on a subset of data on an as-needed basis, as in our ransomware investigation use-case above.

The issue, though, is still that bulk big-data moves are cumbersome and expensive. See Data Factory pricing at Data Pipeline Pricing and FAQ – Data Factory | Microsoft Azure.

6. Azure Blob Storage REST API

By this time you have probably realized that moving data over and over is not what you want to do. Your executive management, too, may not be supportive of the associated infrastructure costs. One thought would be to access the Blob Storage data in place via APIs, as you need it. Not surprisingly, Azure Blob Storage has its own REST API, and its Query Blob Contents operation can apply a simple SQL-like filter to a blob's contents. Please refer to Query Blob Contents (REST API) - Azure Storage | Microsoft Learn for detailed steps on how to query.
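
The Python azure-storage-blob SDK wraps this operation as query_blob. Here is a minimal sketch against the CSV layout from the U-SQL example earlier; note that query acceleration has to be enabled on the storage account, and the container and blob names below are hypothetical:

import os

from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="firewall-logs",
    blob_name="SOC/Firewall/logs.csv",
)

# CSV with a header row, so columns can be referenced by name in the filter.
dialect = DelimitedTextDialect(
    delimiter=",", quotechar='"', lineterminator="\n", has_header=True
)

# One IOC shown; loop over the Table 4 list for the full hunt.
reader = blob.query_blob(
    "SELECT Timestamp, SourceIP, Target, TargetPort FROM BlobStorage "
    "WHERE SourceIP = '193.23.244.244'",
    blob_format=dialect,
    output_format=dialect,
)
print(reader.readall().decode())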

The REST API option is suitable for scenarios where you have developer resources and a home-grown interface or portal where you can expose your specific use-cases. This would need upfront development and ongoing engineering spend.

7. Cybersecurity Open Federated Search Solutions

What if the above approaches don’t directly fit into our scenario because:

  1. they require bulk data move, or
  2. they require programming, or
  3. they are end-user unfriendly SQL interfaces, or
  4. they are generic platforms that don't recognize a cybersecurity schema?

Open federated search addresses these four challenges. 

Open federated search solutions can query information from data silos and investigate across toolsets that reside in cloud, SaaS, and on-premises environments. Such solutions use API integrations with the data source – Azure Blob Storage in our case – to run multiple parallel queries, normalize and correlate the results, do dependency lookups transparently, and finally present the findings in a cybersecurity use-case focused view. The open federated search solution does not move data out of the data sources – the multiple Blob Storage containers in our case.

It is important to use an open federated search vendor that transparently understands and applies the OCSF (Open Cybersecurity Schema Framework) cybersecurity schema. Please refer to my previous blog Need to model Cybersecurity Data? Let’s walk through OCSF! to understand OCSF.

Summary and Where to Go from Here

In this blog, we looked at several techniques and options for collecting and storing cybersecurity data in Azure Blobs, and for subsequently querying that data. We did a high-level overview of storing cybersecurity data in Azure Blob Storage for its cost benefits with long-term big data storage. Then we used EDR and firewall logs as examples of data we would want to store. Finally, we looked at use-cases like the Cuba Ransomware investigation that require scanning the data for IOCs like malicious file hashes and IPs. It was dizzying to see so many options. Hopefully, this article helped you understand your choices, their relevance, and their pros and cons.

CISO organizations typically want to avoid approaches that require heavy internal engineering and maintenance projects. With this in mind, and guided by factors like licensing and data costs, team skill-sets, and team productivity, an open federated search approach seems the most appropriate for our use-case.

We only covered a couple of basic IOC query examples. Go any further and the complexity increases quickly, with advanced queries running to pages of query text. Analysts need to follow the chain in any investigation: figuring out which devices were impacted, what the business use of those devices was, which users owned them, and so on. Direct SQL-based approaches make that kind of investigation impractical, so a purpose-built open cybersecurity federated search interface is needed. Did I mention my company Query?

Happy Querying…