Log and security event data normalization makes it possible to analyze data from multiple vendors. Commonly applied by SIEM and log management solutions, normalization transforms data arriving in disparate formats from different sources into a single common format that can then be used for analytics, visualization, reporting, and more.

There are challenges, though. In particular, SIEM engineers know that normalizing log data is abnormally difficult and error-prone, and that vendors are always playing catch-up. I have faced this regularly over my 20 years in the SIEM ecosystem. The journey of normalizing cybersecurity data is littered with vendor acronyms and half-standards that have had short lifespans.

[Image: Security analyst normalizing data from multiple systems]

Let’s explore the evolution through the years, the challenges that remain, and how to progress from here.

The Evolution Journey through Cybersecurity Event Data Models

The first standard I came across, 20 years back, was IDMEF (Intrusion Detection Message Exchange Format), the then-prevailing standard in the IDS/IPS world. However, it was painful to apply to log data. Its complexity, network focus, poor readability, and XML verbosity soon made us realize it was impractical for representing cybersecurity events at large.

I was part of the team at ArcSight that then defined and proposed CEF (Common Event Format). CEF was widely adopted by the industry as a standard because of its simplicity, readability, log categorization, and easy transferability over syslog. It improved on IDMEF by fitting the data into a single log line or syslog message with easy-to-map key-value pairs of defined fields, along with a categorization and “additionalData” extension mechanism.
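To make that concrete, here is a minimal sketch of what a CEF line looks like and how it might be parsed. The sample line and its values are invented for illustration, and the toy parser deliberately ignores CEF's escaping rules for '|' and '=' to keep things short.

```python
# A minimal sketch of parsing a CEF line into a Python dict.
# Sample values are invented; real CEF escaping of '|' and '=' is ignored.
import re

SAMPLE = (
    "CEF:0|ExampleVendor|ExampleProduct|1.0|100|Login succeeded|3|"
    "src=10.0.0.5 dst=10.0.0.9 suser=alice"
)

def parse_cef(line: str) -> dict:
    # The first 7 pipe-delimited values form the header; the rest is the extension.
    parts = line.split("|", 7)
    version, vendor, product, dev_version, sig_id, name, severity = parts[:7]
    event = {
        "cef_version": version.removeprefix("CEF:"),
        "device_vendor": vendor,
        "device_product": product,
        "device_version": dev_version,
        "signature_id": sig_id,
        "name": name,
        "severity": severity,
    }
    # The extension is space-separated key=value pairs; values may contain
    # spaces, so split on "key=" boundaries instead of plain whitespace.
    extension = parts[7] if len(parts) > 7 else ""
    for key, value in re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", extension):
        event[key] = value
    return event

print(parse_cef(SAMPLE))
```

The appeal is obvious: one readable line, a fixed header, and flat key-value pairs that any collector can pick apart.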

Formats from key vendors cropped up next. IBM came out with LEEF (Log Event Extended Format) and McAfee with SEF (Standard Event Format), both inspired by CEF. However, the problem with CEF and the like was that the schema was network-security centric – source and destination IP, port, and similar sets of fields – and the extension mechanism for non-network data was a force-fit. Also, the serialized representation was still focused on single-line syslog, whereas the rest of the world was moving to JSON. Nevertheless, CEF remains prevalent even today as an easily understood and desired format for log data.

The next set of standards tried to address these key limitations, so that different kinds of data could be represented via object models reflecting each kind. Splunk’s (or should we say Cisco’s!) CIM (Common Information Model), Elastic’s ECS (Elastic Common Schema), Chronicle’s UDM (Unified Data Model), and Microsoft’s ASIM (Advanced Security Information Model) were individual vendors’ efforts in that direction.
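To see why this fragmentation hurts, here is a rough illustration of how one and the same concept, a source IP address, ends up spelled differently across these models. The field names below are approximate and from memory; check each vendor's schema documentation before relying on them.

```python
# Illustrative only: roughly how "source IP address" is named across common
# data models. These names are approximate; verify against each vendor's docs.
SOURCE_IP_FIELD = {
    "CEF": "src",
    "Splunk CIM": "src",
    "Elastic ECS": "source.ip",
    "Chronicle UDM": "principal.ip",
    "Microsoft ASIM": "SrcIpAddr",
    "OCSF": "src_endpoint.ip",
}

def extract_source_ip(event: dict, model: str):
    """Walk a dotted path such as 'source.ip' through a nested event dict."""
    value = event
    for part in SOURCE_IP_FIELD[model].split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

print(extract_source_ip({"source": {"ip": "10.0.0.5"}}, "Elastic ECS"))  # 10.0.0.5
```

Every detection rule, dashboard, and parser a customer writes bakes in one of these spellings, which is exactly what makes switching painful.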

Normalization and standardization have been hard to achieve, often to incumbent vendors’ advantage

While the above vendor standards brought good improvements, each remained limited to its vendor’s platform and was not adopted by the industry as a whole. SIEM platforms became sticky, and merely tolerated, because the entire pipeline and workflow was based on that vendor’s data model. Customers become hostages, as it is hard to rip such an ingrained pipeline and its workflows out of the incumbent vendor.

Community projects have tried to translate across vendors’ models. Sigma is an open-source initiative that tries to translate fields across the above common vendor standards, and OSSEM (Open Source Security Events Metadata) enables sharing information about security event logs.

How difficult could it be to have one normalized industry standard? After all, that has been possible in related areas: STIX is a fairly successful standard for sharing threat intelligence. But unfortunately, log and event normalization has remained stuck in the status quo – every vendor lives in its own ecosystem, and we rely on connectors to map data into the desired formats.


How do we solve this problem? (without the EU’s help!)

Let’s start by framing responsibilities in the data pipeline, and then talk about a new standard that holds promise – OCSF.

Separating Data Producers, Data Brokers, and Data Consumers

Security analysts need data not just from security products, but also from IT/infrastructure products, business applications, and HR/administrative applications. That wide spectrum – let’s call them Data Producers – touches a huge band of software, which makes it very hard to preach adoption of a security-specific standard. Security tools are not the only consumer of this data, and often not even the primary one – business integrations end up taking precedence.

Data Brokers are intermediary pipeline and collection products like Splunk, the Elastic (ELK) stack, Amazon Security Lake, and so on. The responsibility lies with them to normalize, categorize, and, most importantly, standardize producers’ data to give it a vendor-neutral, industry-standard structure for downstream consumption. More often than not, these data brokers have applied their proprietary data model to all of the data at collection and storage time, making it difficult for customers to switch.

Some brokers instead apply community data models at access time, and only on the subset of results needed by the Data Consumers. By Data Consumers, I mean SOC analysts, SIEM and detection engineers, and, increasingly, AI and analytics applications that all benefit from normalized and standardized data. This gives customers the freedom to store data in any format and location that suits them, since they, not the broker vendors, are the true owners of their data! Query falls into this category.
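As a hedged sketch of that "normalize on read" approach (the source names and field mappings below are hypothetical), the idea is to leave raw events untouched where they live and apply a per-source mapping only to the results a query returns:

```python
# A hedged sketch of "normalize on read": raw events stay in the shape the
# producer emitted, and a per-source field mapping is applied only to the
# slice of results a query returns. Source names and mappings are hypothetical.
FIELD_MAPS = {
    "firewall_x": {"srcip": "src_ip", "dstip": "dst_ip", "user": "user_name"},
    "idp_y": {"actor": "user_name", "client_ip": "src_ip"},
}

def normalize_results(source: str, raw_results: list) -> list:
    """Rename vendor-specific keys to a common schema, keeping everything
    else under 'unmapped' so no information is silently dropped."""
    mapping = FIELD_MAPS.get(source, {})
    normalized = []
    for record in raw_results:
        out, unmapped = {}, {}
        for key, value in record.items():
            if key in mapping:
                out[mapping[key]] = value
            else:
                unmapped[key] = value
        if unmapped:
            out["unmapped"] = unmapped
        normalized.append(out)
    return normalized

print(normalize_results("firewall_x", [{"srcip": "10.0.0.5", "action": "deny"}]))
# [{'src_ip': '10.0.0.5', 'unmapped': {'action': 'deny'}}]
```

Because the mapping is applied on the way out rather than on the way in, the stored data never has to be re-ingested when the target schema evolves.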

The journey ahead – OCSF?

OCSF (Open Cybersecurity Schema Framework) is a recent effort by the industry to normalize and standardize, and it is based on the above framing. OCSF was developed as a community standard over the last two years, and its 1.0 version was released at Black Hat 2023. It is meant to address many of the challenges discussed earlier. OCSF models security data as Objects such as User, Device, and File, plus Event Classes that represent activity related to those objects. I took the example scenario of modeling Windows authentication EventCode 4624 into OCSF in this blog, which further explains OCSF. OCSF helps, but the journey is still difficult.
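For a rough sense of the shape, here is a hedged sketch of a simplified Windows EventCode 4624 record mapped into an OCSF-style Authentication event. The class, category, and activity identifiers and the field names reflect my reading of the OCSF schema and should be verified against the published spec at schema.ocsf.io.

```python
# A hedged sketch: mapping a simplified Windows Security EventCode 4624 record
# into an OCSF-style Authentication event. Identifiers and field names reflect
# my reading of the OCSF schema; verify against https://schema.ocsf.io.
from datetime import datetime, timezone

def win_4624_to_ocsf(evt: dict) -> dict:
    return {
        "class_uid": 3002,      # Authentication event class
        "category_uid": 3,      # Identity & Access Management
        "activity_id": 1,       # Logon
        "status_id": 1,         # Success (4624 records successful logons)
        # Placeholder timestamp; in practice map the event's TimeCreated value.
        "time": int(datetime.now(timezone.utc).timestamp() * 1000),
        "user": {
            "name": evt.get("TargetUserName"),
            "domain": evt.get("TargetDomainName"),
        },
        "src_endpoint": {
            "ip": evt.get("IpAddress"),
            "hostname": evt.get("WorkstationName"),
        },
        "logon_type_id": int(evt.get("LogonType", 0)),
    }

sample = {"TargetUserName": "alice", "TargetDomainName": "CORP",
          "IpAddress": "10.0.0.5", "WorkstationName": "WS01", "LogonType": "3"}
print(win_4624_to_ocsf(sample))
```

Even in this tiny example, most of the work is deciding which source field belongs in which OCSF attribute, which is exactly where the weekly questions come from.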

OCSF will not be the end-all, be-all. We have adopted it at Query, but every week we run across questions about which data should be mapped to which field, how to represent specific object and event relationships, and so on. That is not a criticism of OCSF, rather a reflection of the nature and complexity of cyber data. For example, how do you identify a unique device, given that IPs, hostnames, and endpoint agent IDs can all change over time? In OCSF, objects represent the current state, and events represent activity over time. It is difficult to recreate or derive the past states of referenced objects when going back and analyzing a particular time window, as analysts often need to do. Time is a weird dimension, and not just in astrophysics. 😅


The OCSF Object hierarchy is good, but it still brings modeling challenges. Data is custom – the richer the data, the more of it goes unmapped. Don’t be surprised if an analyst goes digging into un-normalized fields rather than the common normalized fields, because that’s where the piece of information they needed was buried.

To reiterate, this is not a criticism of OCSF, but a reflection of the nature and complexity of representing cybersecurity data in an understandable format. While the journey will continue, I do hope OCSF gets the traction and adoption it deserves.

OCSF with Open Federated Search

Even if standardization is achieved, a lot is still left to the analyst. They need a solution that lets them easily query normalized cybersecurity data remotely, irrespective of where it is produced and stored. The broker running these queries needs to navigate the above complexities as transparently as possible, so users can focus on writing powerful and effective queries across all the data they have access to. That is why we created Query Federated Search. Check it out.