Introduction

Delta Lake is one of the three popular open table formats for data lakehouses and an ideal choice for managing security telemetry at scale. It brings ACID compliance, schema enforcement, and performance optimizations to cloud storage, making it a strong fit for security data operations. When writing data to Amazon S3 using Delta Lake, optimizing performance is crucial to ensure cost efficiency and query speed.

In this blog, we will explore techniques for efficiently writing Delta Lake tables to S3 using a variety of mechanisms. You will learn about data types, partitioning, bloom filters, compression, write modes, and normalization. Writing Delta Lake tables efficiently ensures that ad-hoc and scheduled queries and analytics against those tables run faster, and potentially cheaper.

Previously, we explored how Delta Lake can help SecOps and SecDataOps teams manage security data by decoupling storage from compute, reducing SIEM costs, and enabling faster detections. If you haven’t yet, check out our previous Delta Lake blog to understand the fundamentals of why open table formats, and Delta Lake specifically, are an attractive option for a security data lakehouse.

Why Performance Optimization Matters

When handling security logs at scale, performance tuning is essential; that much is obvious. Writing data inefficiently to S3 can lead to excessive storage costs, slow queries, and increased processing times. Common causes of slow queries include nonexistent or inefficient partitioning, excessive small file writes leading to metadata bloat, lack of indexing or bloom filters, suboptimal compression settings, and over-reliance on full-table scans.

Optimizing writes ensures:

  • Faster ingestion speeds for high-velocity security data, reducing processing bottlenecks in security data pipelines.
  • Reduced S3 API calls, helping to mitigate throttling and manage AWS API cost overhead.
  • Improved query performance for security investigations, allowing analysts to retrieve relevant data quickly without scanning unnecessary files.
  • Efficient data layout for analytical workloads, enabling detection engineers to iterate on rule creation and federated detections with minimal computational overhead.

By addressing these bottlenecks, security teams can ensure that their data pipelines remain performant, cost-efficient, and scalable. Effective table design in Delta Lake directly impacts security workflows, from real-time anomaly detection to forensic investigations. Efficiently written tables enable detection engineers to iterate on rules quickly, analysts to run investigations without query delays, and response teams to access full-fidelity security logs without excessive cost or compute overhead.

The next section explores specific techniques such as optimal data formats, partitioning strategies, and indexing mechanisms to maximize these benefits for SecDataOps use cases.

Key Performance Tweaks for Writing to S3 Delta Tables

To maximize efficiency, security teams can implement the following optimizations when writing to Delta Lake tables on S3. Again, it’s worth mentioning that while the focus here is on delta-rs, Amazon S3, and Delta Lake, these principles are interchangeable in many ways with other data engineering tools, object storage services, and open table formats.

Data Types and Formats

Security telemetry comes in many forms, including structured JSON logs, CSV exports, and raw text data. Delta Lake uses Apache Parquet as its underlying file format, offering efficient columnar storage that reduces I/O and accelerates analytical queries.

Post-extraction and transformation, several tools are commonly used to process and convert data into structured formats before writing to Delta Lake. DuckDB, PyArrow, PySpark, Polars, and Pandas enable efficient data wrangling and transformation, typically into DataFrames or Arrow Tables. These tools allow for in-memory processing, batch transformations, and optimized conversions into Parquet tables, ensuring that the data is well-structured and query-ready before being written into Delta.

Ultimately, the choice of transformation tool depends on use case requirements. Arrow Tables and Polars DataFrames are highly efficient for in-memory analytics and small to medium-sized datasets, offering quick transformations with minimal overhead. For large-scale datasets, PySpark DataFrames provide distributed processing and parallelization, making them ideal for handling massive security log volumes before ingestion into Delta Lake.
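For example, here is a minimal sketch, assuming a hypothetical NDJSON export of flow logs and a hypothetical bytes column, of wrangling raw logs with Polars into an Arrow Table that delta-rs can write directly:

import polars as pl

# Hypothetical path to newline-delimited JSON (NDJSON) security logs
LOG_PATH = "vpc_flow_logs.ndjson"

# Read the raw logs into a Polars DataFrame for in-memory transformation
df = pl.read_ndjson(LOG_PATH)

# Example wrangling: drop rows with missing values and cast a numeric field stored as a string
df = df.drop_nulls().with_columns(
    pl.col("bytes").cast(pl.Int64, strict=False)
)

# Convert to an Arrow Table, which can be handed straight to deltalake.write_deltalake
arrow_table = df.to_arrow()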

Partitioning for Faster Queries

Partitioning is critical for reducing query scan time. Security data is often partitioned by fields such as:

  • Timestamp (year, month, day, hour): Ideal for nearly any high-volume events or alerts that carry a timestamp, though the granularity of the partitions should be weighed against your access patterns.
  • Log source: Helps isolate logs from different security platforms. You can also use an Open Cybersecurity Schema Framework (OCSF) event class as a top-level partition; for more information on OCSF, check out our definitive guide.
  • Account or region: Useful in multi-cloud or multi-tenant environments. This can be expanded to domains, tenant IDs, compartments, and so on.

While partitioning is a fundamental optimization in all three major open table formats—Delta Lake, Apache Hudi, and Apache Iceberg—there are notable differences in how each handles partitioning:

  • Delta Lake: Partitioning is static, meaning that partitions must be explicitly defined at write time. Delta relies on Parquet partition pruning and supports OPTIMIZE commands to compact small files.
  • Apache Hudi: Supports both static and dynamic partitioning. Unlike Delta Lake, Hudi allows partition columns to be dynamically extracted from the data during writes, meaning they do not always need to be explicitly defined beforehand.
  • Apache Iceberg: Uses hidden partitioning, where the partition column does not need to be explicitly included in queries. Iceberg also avoids partition evolution issues by automatically handling partition changes over time.

Proper partitioning ensures that queries only scan relevant data, as recorded in the _delta_log, rather than the entire dataset. Partition columns must already exist in the DataFrame or table before writing to Delta Lake. For instance, consider the following snippet from our previous blog: the event_time field is converted to a proper timestamp, and then year, month, and day columns are derived from it.

df = pl.DataFrame({
    "event_time": timestamps,
    # ...truncated
})

# Parse the ISO 8601 string into a proper Polars Datetime
df = df.with_columns(
    pl.col("event_time").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%S.%f", strict=False).alias("event_time")
)

# Derive the partition columns from the parsed timestamp
df = df.with_columns(
    pl.col("event_time").dt.year().cast(pl.Int32).alias("year"),
    pl.col("event_time").dt.month().cast(pl.Int32).alias("month"),
    pl.col("event_time").dt.day().cast(pl.Int32).alias("day")
)

return df

When writing the Delta table to your security data lakehouse built atop S3, you must define the partitions using the partition_by parameter of deltalake.write_deltalake. As mentioned above, if the partition columns are not already present in the DataFrame, this write will fail.

def writeSyntheticNetworkLogsToDelta(totalRecords: int):
    """Sends synth network logs to Delta Lake in S3 using year/month/day partitions"""
    df = generateSyntheticNetworkLogs(totalRecords)

    # The partition_by columns must already exist in df (see the year/month/day derivation above)
    write_deltalake(
        S3_DELTA_PATH,
        df,
        mode="append",
        partition_by=["year", "month", "day"],
        storage_options={"AWS_REGION": S3_BUCKET_LOCATION}
    )
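Once the table is partitioned, readers can prune on those columns. As a rough sketch (the exact reader API may vary by deltalake version), the partitions argument of DeltaTable.to_pyarrow_table accepts partition filters so that only files under the matching year/month/day prefixes are read:

from deltalake import DeltaTable

dt = DeltaTable(S3_DELTA_PATH, storage_options={"AWS_REGION": S3_BUCKET_LOCATION})

# Only files in the matching partition directories are scanned
january_events = dt.to_pyarrow_table(
    partitions=[("year", "=", "2025"), ("month", "=", "1")]
)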

While partitioning reduces the volume of scanned data, it does not always guarantee the most efficient query performance, particularly when searching for specific values within large partitions. This is where additional indexing techniques, such as bloom filters, become valuable by further narrowing down the subset of files that need to be scanned for relevant records.

Bloom Filters for Selective Reads

Bloom filters and partitions serve different purposes in query optimization. While partitions help reduce the number of files scanned by grouping data based on predefined categories, bloom filters operate as probabilistic data structures that quickly determine whether a value might exist within a dataset. This distinction makes bloom filters especially useful when searching for values that do not align well with partitioning strategies.

Should you choose partitions or bloom filters? Yes. (Okay, seriously, consider using both). 

Partitioning is effective for broad categorization and efficient data pruning, whereas bloom filters provide a fine-grained mechanism to reduce unnecessary file scans within partitions. When combined, these two techniques significantly improve query performance by limiting the search space as much as possible. Better query performance leads to faster answers which leads to more effective decision support for your defenders and detection engineers.

Bloom filters work best with high-cardinality columns where values are highly unique, such as user_id, ip_address, or host_id. These columns are difficult to partition effectively due to their large number of distinct values, but bloom filters help accelerate lookups by identifying which files likely contain relevant data, reducing the need for full-table scans. 

This style of bloom filter can be particularly useful in threat hunting scenarios where analysts search for specific indicators of compromise (IOCs). Additionally, audit teams, GRC engineers, and vulnerability management teams can benefit from bloom filters over asset identification information. It is important to note that, as of the time of this blog being published, delta-rs does not support writing bloom filters; you’ll need to use another library such as PySpark instead, as demonstrated in the code snippet below.

from pyspark.sql import SparkSession

DELTA_PATH = "s3://your-bucket/delta-table/"

spark = SparkSession.builder \
    .appName("DeltaLakeBloomFilter") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

data = [
    (f"host-{i}", f"ip-10-0-{i % 255}-1.ec2.internal", i % 100) 
    for i in range(1, 100001)  # Simulating 100K rows
]

df = spark.createDataFrame(data, ["host_id", "private_dns_address", "cpu_usage"])

# Enable bloom filter indexes on the high-cardinality lookup columns at write time
(
    df.write.format("delta")
    .mode("overwrite")
    .option("delta.columnMapping.mode", "name")
    .option("delta.bloomFilter.enabled", "true")
    .option("delta.bloomFilter.columns", "host_id, private_dns_address")
    .option("delta.bloomFilter.fpp", "0.05")
    .save(DELTA_PATH)
)
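Continuing the session above, the payoff is on point lookups against those high-cardinality columns; the host value below is a hypothetical IOC:

# Point lookup on a bloom-filtered, high-cardinality column
events = spark.read.format("delta").load(DELTA_PATH)
suspect = events.where("host_id = 'host-31337'")
suspect.show()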

Compression Strategies for Storage Efficiency

Delta Lake supports various compression codecs to minimize storage costs and improve query performance. Choosing the right compression format depends on the specific SecDataOps use case. For example, if the dataset is queried infrequently by Triage Analysts to provide historical context, a higher compression ratio format such as Gzip or BZIP2 may be acceptable, as decompression speed is less of a concern. Conversely, if the data is being used for real-time anomaly detection or detection engineering, a faster decompression format like Snappy or LZ4 would be more appropriate to ensure minimal query latency.

  • Snappy: A balanced choice for fast decompression and good compression ratios.
  • ZSTD: Provides higher compression efficiency, reducing storage costs further.
  • Gzip: Best for maximum compression but at the cost of slower read speeds.
  • LZ4: Optimized for fast compression and decompression, making it ideal for real-time analytics and high-throughput workloads.
  • BZIP2: Offers high compression ratios but significantly slower decompression speeds, making it less suitable for frequent querying.

Choosing the right compression strategy depends on the trade-off between storage savings and query speed requirements. Additionally, certain query engines may not support all compression codecs. For example, Amazon Athena does not support BZIP2, and Apache Spark may require additional configuration to handle specific codecs like LZ4 or ZSTD. When selecting a compression format, it’s essential to consider both query engine compatibility and the performance characteristics required for your workload. By default, Snappy is used when you do not specify a compression codec while writing your Delta Lake tables to your Security Data Lakehouse.

from deltalake import write_deltalake
import pandas as pd

DELTA_PATH = "s3://your-bucket/delta-table/"

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

write_deltalake(
    DELTA_PATH,
    df,
    mode="append",
    storage_options={"AWS_REGION": "us-east-1"}
)

To specify another compression codec, such as ZStandard (ZSTD), you must define the compression codec in the delta_write_options parameter, as shown in the snippet below.

write_deltalake(
    DELTA_PATH,
    df,
    mode="append",
    storage_options={"AWS_REGION": "us-east-1"},
    delta_write_options={"compression": "zstd"}
)

Additionally, compression may serve as a cost lever, especially in terabyte-scale or petabyte-scale data lakehouses. You could save significant costs at that scale by reducing file sizes by an additional 5%, 10%, or even 20%. You could recover that Operational Expense (OpEx) to deploy towards more compute or to purchase a SecDataOps Workshop from the Query team!
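To put rough numbers on that, here is a back-of-the-envelope sketch; the S3 Standard price below is an assumption based on us-east-1 list pricing and should be verified for your region and storage class:

# Back-of-the-envelope monthly S3 storage savings from better compression
S3_STANDARD_PER_GB_MONTH = 0.023   # assumed price, verify for your region/storage class
LAKEHOUSE_SIZE_GB = 1_000_000      # roughly 1 PB

for reduction in (0.05, 0.10, 0.20):
    monthly_savings = LAKEHOUSE_SIZE_GB * reduction * S3_STANDARD_PER_GB_MONTH
    print(f"{int(reduction * 100)}% smaller files ~ ${monthly_savings:,.0f}/month saved")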

Delta Lake Write Modes

Delta Lake supports four different write modes for writing tables to your target Security Data Lakehouse: error, append, overwrite, and ignore. The write mode determines how existing data is handled. By default, the write_deltalake method uses error mode if not otherwise overridden, which raises an exception if the table already exists.

The append mode will add new data, literally appending it, to the table. This is useful for continuously ingesting logs or other incremental data updates without affecting previously written records. However, excessive appends without periodic compaction can lead to metadata bloat and degraded read performance. Running the OPTIMIZE command periodically can merge small files and improve query efficiency.
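As a minimal sketch, recent versions of the deltalake (delta-rs) package expose compaction through the table’s optimize API; the table path below is hypothetical:

from deltalake import DeltaTable

# Hypothetical table path; compaction merges the small files created by frequent appends
dt = DeltaTable("s3://your-bucket/delta-table/", storage_options={"AWS_REGION": "us-east-1"})
dt.optimize.compact()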

The overwrite mode will completely replace the existing data in the table with new data. This is ideal for scenarios where a complete dataset refresh is required, such as periodic snapshots or batch updates. However, overwriting large tables can be costly, as it results in full rewrites and increased S3 operations, impacting both performance and cost.

The ignore mode behaves similarly to error but simply skips the write operation if the table already exists. This is useful in workflows where duplicate writes should be avoided, but it does not address scenarios where data should be updated or merged.
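To make the differences concrete, here is a minimal sketch of the same write issued with each mode against a hypothetical table path:

from deltalake import write_deltalake
import pandas as pd

DELTA_PATH = "s3://your-bucket/delta-table/"   # hypothetical table path
opts = {"AWS_REGION": "us-east-1"}
df = pd.DataFrame({"host_id": ["host-1"], "event_count": [42]})

write_deltalake(DELTA_PATH, df, mode="error", storage_options=opts)      # default: fails if the table already exists
write_deltalake(DELTA_PATH, df, mode="append", storage_options=opts)     # adds new rows to the existing table
write_deltalake(DELTA_PATH, df, mode="overwrite", storage_options=opts)  # replaces the table's current contents
write_deltalake(DELTA_PATH, df, mode="ignore", storage_options=opts)     # silently skips the write if the table exists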

Choosing the right write mode has direct performance implications:

  • Append Mode: Useful for streaming and incremental writes but may lead to excessive small file creation, requiring compaction for efficiency.
  • Overwrite Mode: Ensures clean datasets but incurs high costs and processing time due to full rewrites.
  • Error Mode: Prevents accidental overwrites but may cause failures in automated workflows if not handled properly.
  • Ignore Mode: Helps avoid redundant writes but does not support data updates or modifications.

For optimal performance, security teams should balance the choice of write mode with periodic maintenance tasks such as optimizing small files and leveraging Delta Lake’s time travel features to retain historical records while ensuring fast query performance.
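As a rough sketch of that time travel capability with delta-rs (method availability may vary by version), a table can be inspected and loaded as of an earlier version:

from deltalake import DeltaTable

dt = DeltaTable("s3://your-bucket/delta-table/", storage_options={"AWS_REGION": "us-east-1"})

print(dt.version())         # current table version
print(dt.history(limit=5))  # recent commits recorded in the _delta_log

# Load the table as it existed at an earlier version for comparison or forensics
dt_v0 = DeltaTable(
    "s3://your-bucket/delta-table/",
    version=0,
    storage_options={"AWS_REGION": "us-east-1"}
)
older_snapshot = dt_v0.to_pyarrow_table()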

Data Normalization for Improved Querying

While not a technical performance improvement per se, another important way to abide by the SecDataOps principle of improving security outcomes through the use of data is normalization and standardization. For SecDataOps use cases, or anything observability- or security-relevant, look no further than the Open Cybersecurity Schema Framework (OCSF) data model.

If this is your first time reading about the OCSF, or if you are coming back to it after an absence, consider reading our beginner and executive-friendly blog: Query Absolute Beginner’s Guide to OCSF. For a more detailed explanation of OCSF, see our other blog: Definitive Guide to Open Cybersecurity Schema Framework (OCSF) Mapping.

When used in conjunction with Delta Lake optimizations, OCSF can contribute to faster, more effective querying and a more scalable security data lakehouse architecture. OCSF provides several Categories of Event Classes, which are schemas for normalizing different types of telemetry in a standardized way. For instance, your EDR, NDR, or XDR alerts can be normalized into the Detection Finding event class, while your WAF, IDS/IPS, and Layer 7 traffic can be normalized into the HTTP Activity event class.

By using OCSF Event Classes, you gain consistent standardization of common attributes within your data models. For instance, the `src_endpoint.ip` attribute is the standard place for the IP address of a source device, endpoint, or server when it is serving traffic or opening an outbound connection. This gives your detection engineers and other defenders a consistent field to look at and query against. That consistency pays dividends when writing detection content, playbooks, or even when defining bloom filters or deriving partitions during the ETL phase of your pipeline.

OCSF normalization also provides consistent data types, which helps avoid costly operations such as casting (i.e., converting on the fly) columns within your Delta Lake tables from one data type to another. For instance, the `traffic.bytes_in` attribute is where all inbound bytes are written as an integer. If you had a detection or query that relied on running sums or averages over data transfers, having to cast a string to an integer across tens of millions of rows would slow performance and consume more compute.
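As a minimal sketch, assuming a flattened column name such as traffic_bytes_in during a Polars ETL step, casting once at write time means every downstream aggregation runs on integers:

import polars as pl

# Hypothetical raw records where inbound byte counts arrived as strings
df = pl.DataFrame({"traffic_bytes_in": ["1024", "2048", "4096"]})

# Cast once during ETL so sums and averages never have to cast at query time
df = df.with_columns(pl.col("traffic_bytes_in").cast(pl.Int64, strict=False))

print(df.select(pl.col("traffic_bytes_in").sum()))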

Again, while this isn’t compulsory, a consistent data model will help improve performance as well as security outcomes and can apply to other open table formats for your own Security Data Lakehouse.

Conclusion

Ok, so no one does this for fun. The most important step to scaling security analytics efficiently is writing efficiently. Without a structured and well-optimized data lakehouse, queries will be super slow, storage costs will keep climbing, and operational snafus will kill effective threat detection and response. The higher the volumes and the more detections you need to service, the more noticeable this will be.

By implementing key optimizations such as partitioning, bloom filters, and compression tuning, security teams can get much better query performance without cost going crazy. Leveraging append versus overwrite modes is the ticket to ingesting high-velocity security logs efficiently without data duplication or full-table rewrites.

Furthermore, standardizing security data through the Open Cybersecurity Schema Framework (OCSF) increases interoperability across tools and makes for more effective federated search and detection engineering. While OCSF is not directly tied to Delta Lake or storage performance, it is key for streamlining analytics and accelerating investigations by providing a consistent schema across all of your different security data sources.

Ultimately, the goal of an optimized security data lakehouse is to give security analysts and detection engineers faster access to relevant investigation data. The ability to iterate on detection rules, conduct investigations quickly, and retain full-fidelity logs without overwhelming costs is what separates a well-architected security data lake from a SIEM or log management approach.

If worrying about the query performance of your Security Data Lakehouse, whether using Delta Lake or not, has got you down, look no further than the Query Federated Search platform. We handle all of the read optimization, query translation, and data normalization into OCSF at search time without ever duplicating or storing your data. 

We seamlessly integrate with Delta Lake tables in Amazon S3 that are indexed by the AWS Glue Catalog using our Amazon Athena Connector.  If you would like us to support Delta Lake for another cloud provider, please contact product@query.ai! If you have any questions about how federated search can help you seamlessly parallelize queries, handle federated analytics, federated detections, and normalization contact our sales team here!

Until next time…

Stay Dangerous