May 2, 2023

Implementing Security Observability on AWS-native SaaS

Do Security Analysts have to become Cloud Platform Engineers?

AWS-hosted SaaS has been widely adopted, but securing it is a tricky beast. Traditional on-prem security observability processes are not directly applicable in a microservices-based SaaS environment. So, let’s take a look at a typical AWS-native SaaS application environment from a security observability and investigation perspective. We will cover how organizations can plan their security observability and how analysts can run their security incident response process over such an application.

“As the types of cloud services available grow and organizations begin to deploy large PaaS and IaaS environments that employ numerous interconnected services, the range of cloud security controls needed, and surface to protect also gets larger.”
– 2020 SANS whitepaper How to Protect All Surfaces and Services in the AWS Cloud

– Apr 4, 2023 AWS Security Blog: Logging strategies for security incident response

Typical AWS-native SaaS Environment

There are thousands, maybe tens of thousands, of cloud-native applications running in AWS. These applications mostly use AWS-native services, as that’s often the path of least resistance. With a microservices-based application architecture and the myriad building blocks offered in AWS, it is unsurprising that a typical company might be using 10-20 different AWS services to build its SaaS applications.

What constitutes an AWS-native SaaS environment can change application by application, depending on what use-cases it serves, who it serves, how deeply tied it is to the AWS-native stack, etc. We assume the application was designed to fully exploit AWS’s platform services and is not simply using bare-bones EC2 VMs for hosting. Here are five popular companies whose tech stacks are built on AWS-native environments – Reddit, Atlassian, Netflix, Slack and Airbnb. They may not fully disclose the architectural details of which AWS services they use, but some digging can give you a rough idea:

Reddit: We’re Reddit’s Infrastructure team, ask us anything! : r/aws
Atlassian: Atlassian Cloud architecture and operational practices
Netflix: Netflix Case Study and AWS Innovator: Netflix | Case Studies, Videos and Customer Stories
Slack: Building the Next Evolution of Cloud Networks at Slack and Slack Case Study
Airbnb: AWS Innovator: Airbnb | Case Studies, Videos and Customer Stories

While these examples are massive internet-scale applications used by millions of users, we should also consider small applications built and run by internal IT teams for a limited set of internal users in any organization. (The beauty of building cloud-native is that you can use the same set of services whether you serve a few hundred users or millions.) Also, when discussing full visibility for observability, we should not forget the organization’s vendors’ software that may be deployed in the organization’s cloud accounts. For the purposes of our security observability, we will go beyond the strict definition of SaaS and consider these three application types running on an AWS-native stack:

Monitoring and incident response (IR) over the organization’s customer-facing SaaS, if any.
Monitoring and IR over the organization’s home-grown internal-facing small cloud applications.
Monitoring and IR over any vendor software deployed in the organization’s cloud accounts.

Ok, so which AWS services are typically used in the above large and small varieties of cloud-native applications? Based upon the various architectural layers of your cloud infrastructure, you would need to create/use a stack of microservices for eight layers: Network, Compute, Storage, Caching, Streaming and Messaging, API and Application Delivery, Identity and Access, and Monitoring, Security and Compliance. Here are the relevant building-block services for each layer; very commonly, you would be using multiple services from each:

Network

Amazon VPC: (Virtual Private Cloud) to segment the application’s microservices
Amazon Route 53: for DNS

Compute

AWS Lambda: for serverless computing
Amazon ECS: (Elastic Container Service) for hosting your application’s containers
Amazon EC2: (Elastic Compute Cloud) for any cloud VM needs

Storage

Amazon S3: (Simple Storage Service) for raw data or file blob storage
Amazon S3 Glacier: for archiving/long term storage
Amazon Athena: for running queries against your S3 storage
Amazon Aurora: for your cloud database
Amazon RDS: (Relational Database Service) if you are using another database
Amazon DynamoDB: for your application’s NoSQL document store

Caching

Amazon ElastiCache: for in-memory caching

Streaming and Messaging

Amazon Kinesis: for real-time stream data processing
Amazon SQS: (Simple Queue Service) for message queuing
Amazon SNS: (Simple Notification Service) for messaging and notifications

API and Application Delivery

Amazon API Gateway: to serve your APIs
Amazon Lightsail: to serve your application server
Amazon ELB: (Elastic Load Balancing) for load balancing
Amazon CloudFront: to deliver your web content

Identity and Access

AWS IAM: (Identity and Access Management) to manage access to your AWS resources
Amazon Cognito: for authentication, authorization, and user management in your application
AWS Secrets Manager: for storing tokens, keys, and other credentials

Monitoring, Security, and Compliance

AWS WAF: (Web Application Firewall) for protection from web-based attacks
Amazon Inspector: to scan your application for vulnerabilities
AWS X-Ray: for tracing, debugging and analysis of your application
AWS Config: for troubleshooting, audit, and compliance
AWS CloudTrail: for auditing your AWS account’s resources
Amazon CloudWatch: for monitoring and observability on your logs
Amazon GuardDuty: for threat detection in your AWS account
AWS Security Hub: for centralized viewing of security alerts from all sources
Amazon Security Lake: for creating a security data lake

Who has control?

When we look at all of the above services, plus the SANS whitepaper and the AWS Security Blog post mentioned above, we have some guidance on the relevant AWS services, along with their logging configuration and how to query them for effective incident response. However, the security team now needs AWS Platform Engineering resources to get into the nitty-gritty of each service. The cloud platform engineer may not have the Security Analyst perspective, which is typically a different skillset.

If security analysts attempt the above AWS plumbing and investigation directly, the roadblocks they face are around access and efficiency. Analysts typically do not have access to individual AWS services; their access is limited to their monitoring and IR consoles. They often need to raise tickets for individual investigative data queries, which are then run by the DevOps/platform engineering team. This back and forth, with multiple teams involved, is inefficient and greatly limits the analyst’s ability to pivot during an investigation.

Even if Security Analysts become part of DevSecOps and get complete AWS access, going to individual services’ consoles, running SQL queries manually, cross-correlating data in Notepad/Excel, etc. is painfully inefficient. Not only will you be in tab hell – see Eric Parker’s blog on Why So Many Tabs? – but piecing and linking information together is also complex, error-prone, and not an easily transferable skill.

So, looking at the content references, while we have great AWS engineering guidance on how and what to investigate, the practical limitations around access, skill-sets, and efficiency make it difficult to adopt it as a SOC process.

Is SIEM the answer?

SIEM has historically been the designated tool for centralizing security data so analysts are not going back and forth with DevOps. Yes, organizations could move the AWS-native logs into a SIEM, but, of course, there are practical challenges in achieving the desired outcomes. The biggest is cost. SIEM licensing and upkeep costs skyrocket with the high log volume across a number of AWS services.

If the organization can withstand the cost, the next hurdle is that SIEMs do a poor job of understanding and correlating IDs and objects across AWS services. They continue to be very traditionally network-centric without a good understanding of cloud resources, microservices, containers, lambdas, elastic-IPs, etc. This limits their effectiveness to a basic text search tool, which means while analysts get the benefit of a single search interface, they still need to run multiple individual searches across multiple SIEM console tabs, and then piece together information manually. So, all we did by moving the data to an expensive SIEM is replace the browser-level tabs with SIEM-console level tabs. Back to TAB HELL!

SIEM search performance also degrades with the high volume of data ingested. While it may be possible to improve search performance with advanced indexing strategies available in some SIEMs, ultimately there is an impact with increased infrastructure costs to support fast searching.

The practical compromise, given the above challenges, is for organizations to send only a subset of events with clear security significance to the SIEM. The vast majority of AWS services’ logs are either not captured at all or, at best, archived in Amazon S3 for a future “just in case” need. See Overcoming Cloud Data (In)visibility without Centralization for more information regarding the cloud visibility challenge.

Another challenge in making SIEM your primary tool for security observability of your application is that logs and events in a SIEM are historical information and often do not represent the current state, which can only come from the source of truth – the original service that holds the information. For example, when investigating a user account, the analyst would go to AWS IAM to check whether that account is enabled or not, which groups it belongs to, what policies are applicable, etc. The inability to interrogate/introspect the AWS service via its APIs is a big limitation of SIEMs.
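As a rough illustration of what interrogating the source of truth could look like, here is a minimal Python sketch that asks IAM for a user’s current groups, attached policies, and active access keys. The function name summarize_iam_user is hypothetical; the three client methods are standard Boto3 IAM calls. The sketch ignores pagination, inline policies, and group-inherited policies, so treat it as a starting point rather than a complete check.

```python
from typing import Any, Dict


def summarize_iam_user(iam_client: Any, user_name: str) -> Dict[str, Any]:
    """Return a snapshot of the current IAM state for one user.

    `iam_client` is expected to behave like boto3.client("iam").
    Hypothetical helper for illustration; pagination is not handled.
    """
    groups = iam_client.list_groups_for_user(UserName=user_name)["Groups"]
    attached = iam_client.list_attached_user_policies(UserName=user_name)[
        "AttachedPolicies"
    ]
    keys = iam_client.list_access_keys(UserName=user_name)["AccessKeyMetadata"]
    return {
        "user": user_name,
        "groups": sorted(g["GroupName"] for g in groups),
        "attached_policies": sorted(p["PolicyName"] for p in attached),
        # Only keys that can currently be used to sign requests.
        "active_access_keys": [
            k["AccessKeyId"] for k in keys if k["Status"] == "Active"
        ],
    }
```

With Boto3 installed and read-only IAM credentials configured, the call would look like summarize_iam_user(boto3.client("iam"), "alice"), where "alice" is a placeholder user name. Passing the client in as an argument keeps the function easy to exercise against a stub.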

SIEM vendors are incorporating SOAR functionality quickly, but there too the capabilities are limited to network-centric use-cases vs. AWS-native observability use-cases.

Are standalone Cloud Security Products the answer?

There is the emerging “hot” category of Cloud Security products that I will briefly touch upon as well. They are, by definition, expected to have awareness of cloud objects, accounts, and services, which is a positive. Because of that awareness, they could produce alerts that are a more focused starting point. However, every single cloud alert still needs native logs to support the investigation.

Ultimately, cloud security products are another source of alerts that analysts then need to investigate. The analyst still needs to pivot across different services’ consoles to investigate further.

Cybersecurity is becoming more and more of a collaboration need than a single-product solution.

Investigate entities and their interactions across services

The crux of any investigation is that analysts need a way to understand the relevant entities and their interactions. Their entire effort is really around piecing that information together from multiple sources. Please review these use-cases and their specific queries, as described in the above-referenced AWS Security Blog (see the “Sample Queries” section of Logging strategies for security incident response):

Unauthorized attempts
Rejected TCP connections
Connections over older TLS versions
Filter connections from an IP
Investigate user actions

Upon review, you will notice that the above IR processes are really about the analyst viewing the relevant entities and their interactions. Ideally, a UI could be constructed to automatically run the relevant queries, extract relevant entities, and visualize the interaction data that would make the analysts directly productive. Let’s take an example:

To investigate alerts about unauthorized attempts, the analyst would first view the IAM principals, understand their groups and policies, look at the user’s authentication events, then the user’s other key activity, and any further relevant security alerts on them. Ideally, the analyst would want to visualize the user’s interactions in an entity-interactions graph constructed from the raw data queried from the relevant services – but they rarely have the means to do that. They would next like to drill down into activity-based facets to look deeper at particular unauthorized attempts.
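To make the entity-interactions idea concrete, here is a minimal Python sketch that folds simplified event records into a graph of (principal, action, outcome) edges and then facets it down to one principal’s denied actions. The record shape (keys principal, eventSource, eventName, errorCode) is an assumption loosely modeled on CloudTrail fields, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_interaction_graph(events: Iterable[Dict[str, str]]) -> Counter:
    """Count (principal, action, outcome) edges from simplified event records.

    Assumed record keys: principal, eventSource, eventName, and an optional
    errorCode that marks the call as denied. This shape is illustrative only.
    """
    graph: Counter = Counter()
    for ev in events:
        outcome = "denied" if ev.get("errorCode") else "allowed"
        action = f'{ev["eventSource"]}:{ev["eventName"]}'
        graph[(ev["principal"], action, outcome)] += 1
    return graph


def denied_edges(graph: Counter, principal: str) -> List[Tuple[str, int]]:
    """Facet the graph: denied actions for one principal, most frequent first."""
    rows = [
        (action, count)
        for (who, action, outcome), count in graph.items()
        if who == principal and outcome == "denied"
    ]
    # Sort by descending count, then by action name for a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

A real console would render these edges visually; the point of the sketch is only that the graph and its facets are cheap to derive once the raw events are in a common shape.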

For the above interactive investigation, context, and visualization, the analyst console would have to call the relevant AWS services’ APIs directly using clients like Boto3, the AWS SDK for Python. It would have to query relevant AWS resources like IAM, EC2, and S3, and transparently generate and run Athena queries. Using each service’s APIs, it would also have to live-query and show the CloudTrail activity for the user’s resources, along with relevant alerts from GuardDuty and Security Hub.
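Generating an Athena query on the analyst’s behalf might look like the Python sketch below. It is not the query from the AWS blog; it assumes a CloudTrail table named cloudtrail_logs with the column names commonly used when CloudTrail logs are queried from Athena, and it treats two error codes as examples of unauthorized attempts. The function names, table name, database, and output location are all hypothetical. start_query_execution is the standard Boto3 Athena call.

```python
import re
from typing import Any

# Conservative allow-list patterns so analyst input is never pasted raw into SQL.
_ARN_RE = re.compile(r"^arn:aws[a-z-]*:(iam|sts)::\d{12}:[\w+=,.@/-]+$")
_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def unauthorized_attempts_sql(
    principal_arn: str, table: str = "cloudtrail_logs", limit: int = 100
) -> str:
    """Build an Athena query for denied API calls made by one principal.

    The table name and the chosen error codes are assumptions for illustration.
    """
    if not _ARN_RE.match(principal_arn):
        raise ValueError("principal_arn does not look like an IAM/STS ARN")
    if not _IDENT_RE.match(table):
        raise ValueError("table must be a plain identifier")
    return (
        "SELECT eventtime, eventsource, eventname, errorcode, sourceipaddress "
        f"FROM {table} "
        f"WHERE useridentity.arn = '{principal_arn}' "
        "AND errorcode IN ('AccessDenied', 'Client.UnauthorizedOperation') "
        "ORDER BY eventtime DESC "
        f"LIMIT {int(limit)}"
    )


def start_query(athena_client: Any, sql: str, database: str, output_s3: str) -> str:
    """Submit the query; `athena_client` should behave like boto3.client("athena")."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]
```

The returned execution ID would then be polled for completion and the results fetched, which is omitted here. Validating the ARN and table name before building the string is a deliberate choice, since the console would be assembling SQL from values an analyst clicked on or typed.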

The analyst console would have to map and resolve entities like User, Device, File, Process, etc. across different services and use a common data model to correlate that data. The common data model for such cross-correlation is of paramount importance, so let’s talk about it next.

Cross-correlate entity “objects” and their interaction “events”

Data and telemetry obtained via live-querying from different AWS services would have to be extracted, normalized, and mapped into entities and their interactions so that we get a baseline common data model for further investigation. What better way to do it than the Open Cybersecurity Schema Framework (OCSF)?

OCSF is a common, open cybersecurity schema that came out of a collaboration between vendors like AWS, Splunk, CrowdStrike, IBM, Okta, Palo Alto Networks, Zscaler, etc. Amazon Security Lake normalizes security events from supported sources, including AWS Security Hub findings, into OCSF and stores them in that format. (See my previous blog Need to model Cybersecurity Data? Let’s walk through OCSF! for more information.)

The OCSF schema lets you model security-relevant objects and events. It has standardized object definitions with suitable attributes for key entities like User, Device, File, Process, Network Endpoint, Domain Info, and several more. The actual event itself is modeled via Event Classes that reference the above objects.
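As a simplified illustration of objects referenced from an event class, the Python sketch below maps one CloudTrail-style JSON record into an OCSF-inspired dictionary. The input keys (eventTime, eventName, eventSource, userIdentity, sourceIPAddress, errorCode) follow the CloudTrail record format. The output field names are approximations of the OCSF API Activity shape and should be checked against the published schema; the function name is hypothetical, and required OCSF attributes such as class and category IDs are deliberately left out.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def cloudtrail_to_ocsf_like(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a CloudTrail record to a simplified, OCSF-inspired event.

    Illustrative only: field names approximate OCSF and are not validated
    against the schema.
    """
    event_time = datetime.strptime(
        record["eventTime"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "class_name": "API Activity",
        # Epoch milliseconds, so events from different sources sort together.
        "time": int(event_time.timestamp() * 1000),
        # Objects: the User (inside an actor) and the Network Endpoint.
        "actor": {"user": {"uid": record["userIdentity"]["arn"]}},
        "src_endpoint": {"ip": record.get("sourceIPAddress")},
        # The interaction itself: which operation on which service.
        "api": {
            "operation": record["eventName"],
            "service": {"name": record["eventSource"]},
        },
        "status": "Failure" if record.get("errorCode") else "Success",
    }
```

Once every source is mapped into the same shape, correlating a user across services reduces to comparing the same field (here actor.user.uid) rather than learning each service’s own log format.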

Cybersecurity federated search solutions query and transform native data formats seamlessly into OCSF and provide analysts with the above-mentioned desired console interface. Such a console seamlessly lets you run parallel federated search over vendors’ APIs, giving you combined data in OCSF format.

Now that you have an overview of the common data model, let’s understand cybersecurity federated search better…

Security Observability with Open Federated Search

Federated search allows the user to search multiple data repositories from a single place, without needing to move or centralize the data. A federated search-based approach is a good architectural fit for security investigations in an AWS-native SaaS environment, since it moves your search to where your big data lives rather than moving your big data to your search tool. While federated search as a technology has existed for years, it is only now that solutions built with it are emerging in the cybersecurity industry. Of course, there are challenges in adopting generic federated search technology.
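The core mechanic of federated search, fanning one query out to several sources in parallel and merging the answers, can be sketched in a few lines of Python. Everything here is hypothetical: a connector is just a callable that takes a search term and returns a list of dictionaries, standing in for whatever API call a real source would need. A failing source is reported alongside the results instead of aborting the whole search.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A connector takes a search term and returns that source's hits.
Connector = Callable[[str], List[dict]]


def federated_search(
    term: str, connectors: Dict[str, Connector], timeout_s: float = 30.0
) -> List[dict]:
    """Query every source in parallel and tag each hit with its source name."""
    results: List[dict] = []
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in connectors.items()}
        # Collect in registration order so the merged output is predictable.
        for name, future in futures.items():
            try:
                hits = future.result(timeout=timeout_s)
            except Exception as exc:  # one bad source should not sink the search
                results.append({"source": name, "error": str(exc)})
                continue
            results.extend({"source": name, **hit} for hit in hits)
    return results
```

In practice each connector would wrap a live query against one service and return records already normalized to a common schema, which is what lets the merged list be correlated rather than just concatenated.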

It’s recommended to use an open federated search solution, which gives you vendor-neutrality to search native AWS services, rather than a closed federated search that is tied to a particular vendor’s solution stack. Further, only solutions that are purpose-built for “cybersecurity open federated search” would address the problem, because a text-based open federated search tool will again leave it to the analyst to manually correlate cybersecurity information across data sources.

Solutions that can transparently apply a cybersecurity data schema to search results, automatically run follow-up searches to reconstruct the entity interactions, and then assemble the answer for the specific analyst use-case are the ones that will truly empower the analyst. Only then will the cost, skill-set, and efficiency issues we discussed above be truly addressed.

Summary

We covered SaaS security observability, monitoring, and investigation aspects across a common stack of services used in an AWS-native application. We understood the challenges analysts face in such an environment, the biggest being that they almost have to be AWS platform engineers. We looked at how SIEM plays a limited role and falls short in monitoring such an environment. More focused cloud security products, too, end up generating more alerts that still require investigation support.

We proposed the strategy that analysts should focus on investigating interactions between relevant entities across the array of AWS services. To make that work, a common cybersecurity data model, OCSF, fits the bill. Finally, we looked at cybersecurity open federated search as the class of solution that can broker the common schema and let the analyst investigate across different services via a console that leverages AWS APIs. Federated search can solve that data relationship and visibility problem across a plethora of services. So, the answer is no, security analysts do not need to be cloud platform engineers to implement security observability in AWS-native SaaS.

Contributed by:

Dhiraj Sharan

Chief Scientist & Founder, Query