Amazon S3 has become a common place for organizations to store their cybersecurity data, often because of S3's cost efficiency for the long-term retention that compliance requires. In the best case, analysts don't need to interact with this data regularly. But when a security incident requires investigating, say, one year of data, querying it immediately becomes a priority for the analyst. In this blog, we will cover an example scenario of how to query long-term archived cybersecurity data in S3 to investigate ransomware. For the purposes of this blog, we will limit ourselves to non-commercial query solutions, which means I am not talking about my company Query here!

Example Scenario: Cuba Ransomware CISA Update (December 2022)

The FBI and CISA recently released updated IOCs for Cuba ransomware, which has impacted 100+ organizations that have paid $60M+. A detailed analysis of the advisory is beyond the scope of this blog; instead, we will take the associated files and hashes provided in its Table 1 of IOCs. The analyst needs to investigate whether any of these malicious file hashes have been observed in the last year of EDR, alert, and other security-related data that has been archived in AWS S3.


The Home-Grown S3 Cybersecurity Data

Most cybersecurity teams using AWS have been storing data in Amazon S3. This happened organically, as their vendors’ platforms or even their SIEMs were only retaining data for 1-3 months. (Side note: Everyone is aware of the cost, infrastructure, and performance reasons for why this is happening, so I am not going to talk about it here.)

Also, yes, I am aware of the newly released Amazon Security Lake product announced at re:Invent 2022. That holds great promise, but for now, most security teams have existing home-grown cybersecurity data buckets in S3, so that’s what we assume for the purposes of this blog. 

For our scenario, we assume that the S3 buckets hold common sources such as EDR telemetry (e.g., CrowdStrike or CarbonBlack events), SIEM alerts, and equivalent security data.

I intend to write a later blog on how to configure schemas, partitions, and indexes for optimally storing the above in S3 buckets. For now, we assume that your platform team followed the vendors' / integrators' documentation and relevant AWS technology (such as AWS Glue) to replicate from the source platforms, and that the data is already in place in your S3 buckets as gzipped JSON files in daily batches. They would have followed your retention needs: 'warm' and 'cold' data would typically live in S3, with 'hot' data staying in the source EDR, SIEM, or other relevant platform, and 'cold' data offlined to S3 Glacier.
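As a hedged illustration of the warm-to-cold tiering described above (the rule ID and day counts are examples, not recommendations), such tiering is typically implemented with an S3 lifecycle configuration along these lines:

```json
{
  "Rules": [
    {
      "ID": "archive-security-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Your platform team would tune the day thresholds to match the retention policy, keeping in mind that Glacier-tiered objects must be restored before Athena can query them.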

You can navigate the S3 console to see how your buckets are already structured. For example, depending on your environment, you might find CrowdStrike data files like:

s3://secops-crowdstrike/2022/12/13.json.gzip, i.e., organized into gzipped JSON files in a directory hierarchy by date.
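To make that layout concrete, here is a minimal local sketch of what one daily gzipped JSON-lines batch could look like and how to read it back. The record fields are illustrative only, not CrowdStrike's actual schema:

```python
import gzip
import json

# Illustrative event records -- field names are hypothetical, not a vendor schema
events = [
    {"computer_name": "host-01", "md5": "236f5de8620a6255f9003d054f08574b"},
    {"computer_name": "host-02", "md5": "d41d8cd98f00b204e9800998ecf8427e"},
]

# Write one day's batch the way it would land as .../2022/12/13.json.gzip
path = "13.json.gzip"
with gzip.open(path, "wt", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Reading it back: one JSON object per line
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]

print(restored[0]["computer_name"])  # host-01
```

This "one JSON object per line, gzipped, one file per day" shape is what the Athena table definition later in this post has to describe.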

Querying for Malicious File Hash

From the Cuba Ransomware IOCs discussed above, let's pick the kavsa32.exe executable that has the MD5 hash 236f5de8620a6255f9003d054f08574b.

In this long-term archive, we don't have the luxury of a SIEM or Kibana UI to simply search, since the data is raw and not indexed anywhere. The analyst is left with the AWS CLI or Console, with the relevant IAM user/role permissions to access these S3 buckets. Within that scope, the way to query is Amazon Athena, a common way to query data in S3. Athena is serverless, so we don't need to create or manage infrastructure, and you pay only for the SQL queries that you run. We will use the Athena console.

STEP 1: Registering Dataset in Amazon Athena

If this is the first time you are querying this data, we have to do the one-time task of running DDL to create the table schema with the desired columns, i.e., to make Athena SQL understand the S3 files and data schema. The table definition also needs to understand the time-based data partitioning via a feature called partition projection. Taking CarbonBlack data as an example, this would look like:

CREATE EXTERNAL TABLE secops.carbonblack (
    computer_name STRING,
    event_type STRING,
    md5 STRING,
    path STRING,
    pid STRING,
    type STRING,
    username STRING,
    logon_type INT,
    `timestamp` TIMESTAMP
    -- add additional columns you care about
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1'
)
LOCATION 's3://secops-carbonblack/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.dt.type' = 'date',
    'projection.dt.format' = 'yyyy/MM/dd',
    'projection.dt.range' = '2022/01/01,NOW',
    'storage.location.template' = 's3://secops-carbonblack/${dt}'
)

Speaking of the data schema: most home-grown S3 cybersecurity buckets will have stayed with the native schemas in the vendors' formats discussed above. A future improvement would be to adopt OCSF, the community-driven cybersecurity data schema that Amazon Security Lake adopted. I've written a previous blog on OCSF that you can view here. For pre-existing environments today, we need to configure the vendor's native schema in Athena, as above, to be able to query it.
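If you want a taste of OCSF-style normalization without re-ingesting anything, one lightweight option is an Athena view over the native table. The column mapping below is a loose illustration only, not the official OCSF mapping; consult the OCSF schema before adopting names:

```sql
-- Loose illustration: rename native columns toward OCSF-style names via a view.
-- Not the official OCSF mapping; check the OCSF schema before relying on this.
CREATE VIEW secops.carbonblack_normalized AS
SELECT
    computer_name AS device_hostname,
    username      AS actor_user_name,
    path          AS file_path,
    md5           AS file_hash_md5,
    "timestamp"   AS event_time
FROM secops.carbonblack
```

A view like this costs nothing to maintain and lets later queries use consistent names across sources.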

STEP 2: Query to Validate Setup and Data Access

Let’s query to see the most recent rows of data we have available. We do this to ensure that we have access, all configuration is correct, and the Athena table definition worked: 

SELECT * FROM secops.carbonblack
WHERE "timestamp" BETWEEN DATE_ADD('day', -7, NOW()) AND NOW()
ORDER BY "timestamp" DESC
LIMIT 100
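Beyond eyeballing a few rows, a quick per-day count is a useful sanity check that every daily batch actually landed (column and table names follow the definition above):

```sql
SELECT date_trunc('day', "timestamp") AS day,
       COUNT(*) AS events
FROM secops.carbonblack
WHERE "timestamp" > DATE_ADD('day', -7, NOW())
GROUP BY 1
ORDER BY 1 DESC
```

A missing day, or a day with a suspiciously low count, usually points at a replication gap rather than a schema problem.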

If you got any results out, great 👍! 

If something goes wrong, it is likely because of an incorrect table schema. You will have to redo the table schema configuration step after deleting the schema above. Note that the following statement doesn't delete the underlying S3 data, as the syntax might suggest; it only drops the table definition:

DROP TABLE secops.carbonblack

STEP 3: Query for our IOC

Congratulations on reaching the stage where you can actually do the IOC analysis work you wanted to do in the first place! Let's query for the file MD5 we decided to investigate from the Cuba Ransomware IOCs table above, selecting only the fields we are currently interested in:

SELECT
    computer_name,
    event_type,
    md5,
    path,
    type,
    username,
    "timestamp"
FROM secops.carbonblack
WHERE md5 = '236f5de8620a6255f9003d054f08574b'
  AND "timestamp" BETWEEN from_iso8601_timestamp('2022-01-01T12:00:00')
                      AND from_iso8601_timestamp('2022-12-13T12:00:00')
ORDER BY "timestamp" DESC

This query could be long-running depending on your data size, so be ready to take your coffee break. We hope it finishes within 30 minutes, which is the default timeout.
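One way to shrink both runtime and cost: assuming you configured a projected date partition column (here called dt, in yyyy/MM/dd format) in your table definition, constraining it lets Athena skip every S3 object outside the window instead of scanning the whole bucket:

```sql
-- Partition filter on dt prunes S3 objects; the md5 filter then runs on far less data
SELECT computer_name, md5, path, "timestamp"
FROM secops.carbonblack
WHERE dt BETWEEN '2022/01/01' AND '2022/12/13'
  AND md5 = '236f5de8620a6255f9003d054f08574b'
ORDER BY "timestamp" DESC
```

Since Athena bills by data scanned, the partition filter saves money as well as time.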

Looks like you got a hit on the IOC…ouch 🥹.


We see that the file was observed on computer_name DHIRAJDESKTOP-WIN81 on March 5th.

STEP 4: Deeper Investigation

We really only queried one table here, and for just one IOC. Some follow-up work would be to query other tables to investigate this host's alerts, users, tickets, etc. With some SQL study (see the Athena SQL reference), we can write more complex WHERE clauses, do JOINs for lookups, script the sequence, and more. I'll leave that as an exercise for the analysts who want to get their hands dirty with Athena SQL.
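As one hedged sketch of such a follow-up, suppose your SIEM alerts were also replicated into a table; here called secops.alerts with a hostname column, both of which are hypothetical names for illustration. A JOIN then surfaces any alerts on the machine where the hash was observed:

```sql
-- Hypothetical follow-up: correlate the EDR hit with alerts on the same host.
-- The secops.alerts table and its hostname/alert_name columns are illustrative.
SELECT cb.computer_name, cb."timestamp", a.alert_name
FROM secops.carbonblack cb
JOIN secops.alerts a
  ON a.hostname = cb.computer_name
WHERE cb.md5 = '236f5de8620a6255f9003d054f08574b'
ORDER BY cb."timestamp" DESC
```

The same pattern extends to user directories, ticketing exports, or any other source you've archived alongside the EDR data.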

Further from here

To summarize: we gave a high-level overview of storing cybersecurity data in S3 buckets and querying it for IOCs from alerts we need to investigate.

To understand and use Athena SQL in greater detail, please refer to the AWS documentation.

While we only ran a couple of example IOC queries above, advanced users can query for the other hashes, emails, and IP addresses in the IOC tables associated with Cuba ransomware. You can run a sequence of queries to follow the chain: figuring out which devices were impacted, what the business use of those devices was, which users owned them, etc. If you have engineering skills and want to build further, you can parameterize and save your query sequence to run later. With AWS development resources, you can create and schedule workflows that do conditional query execution with AWS Step Functions and send you a notification on completion using Amazon SNS.
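If you do want to script a sequence, a small helper that expands a list of IOC hashes into one Athena query each is a reasonable starting point. The table and column names follow this post's examples; actually submitting the queries (via the AWS SDK or CLI) is left out of this sketch:

```python
# Sketch: expand a list of IOC MD5 hashes into one Athena query per hash.
# Table/column names follow the blog's examples; query submission is omitted.
import string

QUERY_TEMPLATE = (
    "SELECT computer_name, md5, path, \"timestamp\" "
    "FROM secops.carbonblack "
    "WHERE md5 = '{md5}' "
    "ORDER BY \"timestamp\" DESC"
)

def is_md5(value: str) -> bool:
    """Basic sanity check: exactly 32 hex characters."""
    return len(value) == 32 and all(c in string.hexdigits for c in value)

def build_ioc_queries(md5_hashes):
    """Return one SQL string per valid MD5 hash, skipping malformed entries."""
    return [QUERY_TEMPLATE.format(md5=h) for h in md5_hashes if is_md5(h)]

queries = build_ioc_queries([
    "236f5de8620a6255f9003d054f08574b",  # kavsa32.exe from the advisory
    "not-a-hash",                        # malformed entries are skipped
])
print(len(queries))  # 1
```

Validating the hashes before interpolating them into SQL also keeps a copy-pasted IOC list from producing malformed (or injectable) queries.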

Querying and consuming raw results from the Athena SQL command line can quickly become tedious, complex, and a time-suck. Anything further from there turns into an engineering project, but hey, no one said threat hunting was easy! For security analysts who need to do the above frequently, I would suggest using a commercial solution that can directly query or hunt from your S3 buckets without duplicating/replicating your data. Did I mention my company Query 😉?

Happy Querying…