Introduction

How do all of these self-congratulating posts start, again? Oh right, “in the ever-changing security threat bad guy landscape, data is the new oil or diamond pickaxe!” Cynicism aside, I will continue to shout from the rooftops: the most important asset a security organization owns, and the most important skill set it needs to develop, is data and the ability to work with it. Data engineering is its own career path, filled with all of the dogma and pitfalls that befall security engineering teams, but these two seemingly disparate disciplines are more intertwined than ever before.

Whether it’s because of the advent and (ridiculously fast-paced) evolution of AI and its use in security, the desire to build pretty dashboards, the urge to use good old (kidding!) machine learning models to detect anomalies, or the need for a technology catalog and asset inventory that is actually complete: you need the data.

However, getting access to the data is one thing; writing it in a way that is easy for machines (and ultimately humans) to consume is another thing entirely. The so-called Security Data Pipeline Platform (SDPP) category was supposed to be the antidote here, but I disagree with that premise. It’s not enough to write data, and it’s not even enough to provide some transformations: if you’re going to archive anywhere from a mere 10GB up to petabytes of data, you need to do so efficiently!

So that is the real issue. Not just writing data, but writing data properly. If we are ever going to “kill” the SIEM and move on to lakes, we had better build the storage component of that lake (or lakehouse) correctly. Better yet if we can do it without hundreds of extra microservices, transformers, and orchestrators to run on our own compute. So, we did that. We built the veritable “easy button” to send data to destinations you control, in exactly the right way, all without the large SIEM bill.

The Pipeline Problem

Whether you call it an SDPP or something else, the “pipeline tools” wanted to lift yet another task away from security teams: managing data. However, they didn’t take data governance and operations seriously; what they really provided was a cost lever for CISOs. And that cost? SIEM cost.

SIEMs are notorious for their expense, and to be fair, they’re expensive for a lot of good reasons. One cannot exactly conjure performant schema-on-read search across tens of millions of rows of data stuffed into random indexes. With those millions of records come hundreds or thousands of gigabytes of data, and that is an absolute back breaker under an ingest-based cost model.

Some SIEMs wanted to be a bit coy and started offering compute-based models where you are charged “credits” (Denarii? Atropian dancing pennies!?) based on how much you search. How the heck would you even know how much you will search if your data keeps growing?

For a small fee, you could move GBs of data off of your SIEM to your favorite cloud-based object storage and life was great! “Truly I tell you, we have a security data lake!” your CISO likely proclaimed to your board or management team. 

Well, you had half of one, and it was all good until you started to run Glue Crawlers or write Data Definition Language (DDL) statements in Trino against that data. Turns out that miles of JSONL with nested tuples, arrays, and all sorts of nastiness isn’t even remotely performant to query, if you could even define the structure of the data in the first place. What. A. Disaster!
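If you have lived this, you know the pain. As a small, hedged illustration (the nested layout below is invented, not any vendor’s actual schema), even an engine as forgiving as DuckDB makes you reach through structs and unnest arrays before you can filter on anything, and every query re-scans and re-parses full rows:

```python
import duckdb

# Hypothetical nested JSONL firewall logs: the schema is inferred on read,
# and every predicate has to dig through structs and arrays first.
duckdb.sql("""
    SELECT
        detail.network.dst_ip AS dst_ip,
        unnest(detail.tags)   AS tag
    FROM read_json_auto('firewall/*.jsonl')
""").show()
```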

[Meme: tanks on the battlefield]

No matter what anyone tells you: row-based, “human-readable” data formats (JSON variants, CSV, TSV, text files, XML) are horrible for any scale of data analytics. They’re horribly inefficient, and thus expensive, to query at any meaningful scale. They eat up your budget because the file sizes are larger by default, they don’t compress efficiently, and they don’t support column-wise access without forcing their ingestion into a big data system. They don’t support enough data type variety to let your query engine’s own built-in efficiencies be brought to bear: you cannot even write a datetime into JSON without converting it to a string!
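To make that concrete, here is a minimal sketch of the difference using PyArrow; the record layout is made up purely for illustration:

```python
import json
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq

# A couple of illustrative events (not any particular vendor's schema).
events = [
    {"time": datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc), "src_ip": "10.0.0.5", "action": "blocked"},
    {"time": datetime(2025, 1, 1, 12, 1, tzinfo=timezone.utc), "src_ip": "10.0.0.9", "action": "allowed"},
]

# JSON Lines: the timestamp has to be flattened into a string, the schema is
# implicit, and every reader re-parses whole rows on every query.
with open("events.jsonl", "w") as f:
    for event in events:
        f.write(json.dumps({**event, "time": event["time"].isoformat()}) + "\n")

# Parquet: the timestamp stays a real timestamp, the schema travels with the
# file, and each column is stored and compressed independently, so engines
# only read the columns a query actually touches.
table = pa.Table.from_pylist(events)
pq.write_table(table, "events.parquet", compression="zstd")
print(table.schema)
```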

Yes, you cannot open up an Apache Parquet file and read it with grep or awk like you can with a CSV file. However, it’s not 2003 anymore, it’s not 2012, nor is it 2021. If we, as in security professionals, want to drive risk-reducing outcomes with empirical evidence, we need to get with the program. We must adopt these modern data formats and tools. I don’t care if you’re searching for a single IOC in logs or you’re building thousands of security AI agents: it all begins with properly governed and maintained data.
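And to be clear, giving up grep does not mean giving up ad hoc inspection. A quick sketch with DuckDB against the hypothetical events.parquet file from above:

```python
import duckdb

# Peek at the schema without standing up any infrastructure.
duckdb.sql("DESCRIBE SELECT * FROM 'events.parquet'").show()

# "grep", but typed and column-aware: only the referenced columns are read.
duckdb.sql("""
    SELECT time, src_ip
    FROM 'events.parquet'
    WHERE action = 'blocked'
""").show()
```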

The next wave of pipeliners happily came along preaching my favorite concept: Minimum Necessary. Do you really need to keep all of your blocked WAF logs? Do you really need every DHCP ACK event? Why do you have immutable EDR findings in your lake?! Do you need seven different timestamps and a bunch of “dead” fields about the upstream tool’s release version that hasn’t changed in 12 years? Of course, I cannot answer that for you, but the point stands: only store exactly what you want where you want.
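If you were rolling this by hand, “Minimum Necessary” at write time might look something like the sketch below; the field names and the filter are purely illustrative, not any vendor’s actual schema:

```python
import duckdb

# Keep only the events and fields you will actually hunt in, and write the
# result as compressed Parquet instead of archiving raw JSONL forever.
duckdb.sql("""
    COPY (
        SELECT time, src_ip, dst_ip, http_request_url AS url
        FROM read_json_auto('waf_logs/*.jsonl')
        WHERE action <> 'blocked'   -- drop the blocked-WAF noise, for example
    )
    TO 'waf_minimal.parquet' (FORMAT parquet, COMPRESSION zstd)
""")
```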

They even gave you “no-code” workflows to accomplish this and to transform the data. Sometimes it was into their own convoluted data model. Some better tools used Splunk’s Common Information Model (CIM), or better yet, the Open Cybersecurity Schema Framework (OCSF). The worst of them simply handed you raw data! Life was good though: they provided some out-of-the-box parsing, so instead of hectares of nested arrays you may have gotten flattened data, and some even wrote in Parquet.

[Meme: “it’s raw”]

Why am I walking through a brief history of the SDPP space? Simple: to tell you IT IS A HOT MESS!

There is a right way to do things, and that is what we are doing with our Security Data Pipelines, powered by the security data mesh that Query creates. While we do take some agency away from you when it comes to data formatting (you are only getting OCSF-transformed data written as Parquet, for now), we do that to give your team an easy button for data mobility. It’s simple and it works.

No more unnormalized data. No more JSON, CSV, XML, or other awful formats. No more writing your data partitioned by seconds, or compressed with GZIP or into tarballs. No more miles of arrays in arrays in tuples. 

You get your data delivered to your own S3 bucket, Azure Blob (ADLSv2), or Google Cloud Storage without bursty credit-based pricing, written in the best-mapped OCSF you can get, in ZSTD- or Snappy-compressed Parquet, in hourly Hive-compatible partitions.

It’s ready to crawl with AWS Glue Crawlers and query in Athena or Redshift Spectrum, visualize in QuickSight, or query with your own StarRocks or Trino cluster. It’s ready to read into Polars DataFrames or DuckDB tables for exploratory data analysis, or to vectorize into LanceDB or pgvector for your RAG and agentic workflows. In addition to cloud storage, you also have the option to land the data in Splunk.
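As a hedged example of what that looks like in practice, here is a minimal DuckDB sketch over the delivered Parquet; the local path, glob, and partition layout are assumptions, and the column names assume the OCSF mapping:

```python
import duckdb

con = duckdb.connect()

# Query a synced local copy of the lake; with the httpfs extension loaded and
# AWS credentials configured, the same glob can point at s3://your-bucket/...
df = con.sql("""
    SELECT class_name, activity_name, count(*) AS events
    FROM read_parquet('security-lake/**/*.parquet', hive_partitioning = true)
    GROUP BY ALL
    ORDER BY events DESC
""").df()

print(df.head())
```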

The security pipeline nonsense? It’s over.

Pipelines, with a side of Federation

Query, at its core, has always been about getting answers from your data when you want it. We took the opinionated approach of turning OCSF into our own two-way data model that becomes both a schema and a lingua franca to express your search intent. From there we handle all of the query planning and execution: parallelization, pruning, collation, translation, transpilation, and do it in-situ.

That’s great for threat hunters. Great for detection engineers. Great for every tier and skill level of your SecOps team, and even for AppSec and GRC teams who want to ask questions of your data wherever it lives: in your EDR APIs, in your Entra ID tenant, or stored across dozens or hundreds of tables and indexes in your own Athena, Snowflake, Splunk, and a dozen other “dynamic” sources.

However, there are still times where federation doesn’t solve your issues. There are APIs that were never meant for searching, like bulk-data firewall or identity log APIs. There are data retention requirements stipulated by your own GRC and ERM policies and/or regulations, or cases where the upstream tool simply doesn’t retain data past a certain date, like Entra ID authentication logs, JumpCloud SSO logs, or any number of security tools with poor APIs.

Our original answer to this quandary was our Federated Search Query Language (FSQL) REST API: you can author your own FSQL and pipeline the data yourself. That is an advanced use case, though, and most teams just want an easy button to move and store data. We could write all of the treatises on best practices for data lakes and lakehouses we like, but that doesn’t account for the financial and operational burden a security organization would take on to do it right. So, again, we built this service not only to do that, but to do it correctly.

Query Security Data Pipelines gives you that. In Medallion Architecture terms, we’re enabling you to write straight to the Gold layer.

We support all of our Static Schema sources, including Carbon Black, Push Security, GitHub, CrowdStrike, and dozens of others. We support writing to Amazon S3, Google Cloud Storage, Azure Blob (ADLSv2), and Splunk, using best practices to squeeze out all of the performance possible when querying and using data stored in these destinations.

In the future, we will expand the sources to include our Dynamic Schema sources such as Snowflake, Splunk, Databricks, Athena, Google SecOps, and a dozen others. If you need to back up or migrate from one of those platforms to cloud storage, that is the intent. Of course, we are looking at other destinations as well.

Further along, the ability to prune down which events from the Sources you want, as well as which fields you don’t need, will be supported. We’re even looking at letting you get at the raw data if you don’t care one iota about OCSF.

So whether you need to threat hunt across 50 sources, author complex multi-condition detections normalized across 25 sources, or move 10 sources into your S3-based data lake: Query is the place to do it.

Conclusion

In this blog you learned what the wrong way to do security data pipelining looks like, and what the right way looks like. We are committed to doing security data pipelines the right way, because none of us can afford to have it done the wrong way.

Want to free yourself from the tyranny of SIEM billing? Tired of the four-layered rat turd burrito your current pipeline tool serves up? Want to have a true “easy button” to move and store data in the most performant way without blowing your budget?

Hit us up today. SecDataOps Savages are standing by.

Stay Dangerous.