January 11, 2023

Querying for Malware Variants

Hello readers! In my last blog we talked about Querying Cybersecurity Data Stored in S3. In that blog we looked at file hashes from Cuba Ransomware. Querying for malware hashes is useful if you already have their checksums from your threat intelligence feed or other sources (like the CISA Alert in the last blog). But have you ever wondered how researchers actually analyze a binary and identify it as malware in the first place? Also, is it enough to just search for malware checksums in our threat hunts?!

Today we will step things up a notch and experiment with querying executable files to understand them better: whether they look similar to malware, whether they are variants of a previously known malware family, and so on. We will limit ourselves to open-source tools rather than commercial ones. We could, of course, just rely on malware research done by our AV/EDR vendors and threat intelligence providers, but as cybersecurity analysts we do need to understand what goes on behind vendors’ curtains. And, god forbid, if malware slips through and a breach happens in your environment, you will be better prepared to understand it and to support incident response (IR).

Cuba Ransomware — can’t we just query for hashes?

Cuba Ransomware continues to be active and has impacted 100+ organizations that have paid $60M+ by now. In the last blog (link here), I gave steps to search for its hashes. The problem with relying on hash matching is that the malware writer can easily change the binary’s checksum and render our efforts useless (more on these techniques below). No wonder that the Cuba Ransomware “family” has 100 or so filenames and 50 or so hashes, as listed in its December 2022 CISA Alert (scroll to Table-1, IOC section).

Sorry if I made your day worse 🙏

Common Evasion Techniques

How do malware developers pull off these cheap tricks to make each new version look different from the last?

To know that, let’s understand binary file structure a bit. Any executable has “sections” in it. You can see these using a basic file archiving tool like 7-Zip, or a more advanced disassembler. Common sections include:

header: metadata like OS and architecture
text: actual executable code
rsrc: linked resource files (images, sounds, etc.)
reloc: relocation table
data: static data such as global variables and constants
rodata: read-only data such as strings and constants
bss: uninitialized data
symbol table: names of variables and functions
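To make the list above a bit more concrete, here is a minimal Python sketch that reads section names straight out of a Windows PE file’s headers. It assumes the standard PE/COFF layout (a 4-byte pointer to the PE header at offset 0x3C, a 20-byte COFF header, and 40-byte section table entries). The function names list_pe_sections and build_toy_pe are hypothetical, chosen just for this illustration, and the toy file is synthetic so that no real executable (let alone malware) is needed to try it.

```python
import struct

def list_pe_sections(data):
    """Return the section names found in the section table of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE header at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_header_size,) = struct.unpack_from("<H", data, coff + 16)
    # The section table follows the COFF header and the optional header.
    table = coff + 20 + opt_header_size
    names = []
    for i in range(num_sections):
        entry = table + 40 * i
        raw_name = data[entry:entry + 8]  # 8-byte, NUL-padded name
        names.append(raw_name.rstrip(b"\x00").decode("ascii", errors="replace"))
    return names

def build_toy_pe(section_names):
    """Build a tiny synthetic header-only PE image (illustration only)."""
    dos_header = bytearray(64)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 64)  # PE header right after
    coff_header = struct.pack("<HHIIIHH", 0x14C, len(section_names), 0, 0, 0, 0, 0)
    section_table = b"".join(
        name.encode("ascii").ljust(8, b"\x00") + bytes(32)
        for name in section_names
    )
    return bytes(dos_header) + b"PE\x00\x00" + coff_header + section_table

toy = build_toy_pe([".text", ".rdata", ".rsrc"])
print(list_pe_sections(toy))  # ['.text', '.rdata', '.rsrc']
```

To try it on something real, pass in the bytes of one of your OS’s system executables, for example open(path, "rb").read(), rather than an actual malware sample.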

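The malware developer can insert no-op instructions, pad a section with junk bytes, tweak an embedded string (or, for script-based malware, change comments and whitespace), or make any other functionally irrelevant change to the sections above, and voilà, their new family member is born with a new checksum!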
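Here is a small Python sketch of why hash matching is so brittle. The “binary” below is just a stand-in byte string, not a real executable, and known_bad is a hypothetical one-entry IOC list. The point is only that a single appended byte produces a completely different SHA-256.

```python
import hashlib

def sha256_hex(data):
    """Hex-encoded SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for a malware sample (assumption: any bytes will do for the demo).
original = b"MZ" + bytes(62) + b"stand-in for a real executable image"
# The "new family member": identical except for one trailing padding byte.
variant = original + b"\x00"

# Hypothetical IOC list built from the hash of the original sample.
known_bad = {sha256_hex(original)}

print(sha256_hex(original) in known_bad)  # True
print(sha256_hex(variant) in known_bad)   # False
```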

Further, the malware developer can write advanced malware that is programmed to change its own code. Why should that developer not benefit from automation? They have a demanding job! 🤔

Homework for you: Install 7-Zip and Notepad++. Add the HEX-Editor plugin from the Plugins section of Notepad++. Set 7-Zip’s Tools=>Options=>Editor to the Notepad++ path and then open any exe file using 7-Zip. See which sections you can locate from the list above. Select a section and view it using Plugins=>Hex-Editor=>View in HEX. This is what OpenSSH’s executable code looks like:

Querying for Malware Variants

If you have read my earlier blogs, you know that we generally crawl and walk to get familiar with the topic. Given the space and complexity, a detailed ‘run’ analysis is beyond the scope of this post. So, for today we will only scratch the surface of how to do static analysis. And don’t worry, you don’t have to be a developer to understand the process!

Also, note that beyond static analysis there is dynamic analysis that involves running and observing the malware in a sandbox. As cool as it may sound, dynamic analysis is not a replacement for static analysis—they often help uncover different aspects. Dynamic analysis is a topic for another day.

⚠️ Warning: Fiddling with actual malware is dangerous, even if you don’t plan to run it. Just downloading it can trigger alarms. It is best to test with some of your OS’s system files rather than actual malware. If you do download real malware, make sure to do it in a sandbox, and only after documented approval from your boss and your cybersecurity team. I can’t stress enough the importance of those two approvals.

Static Analysis with YARA

Ideally we would want to use a community-built and trusted open-source tool to statically query/analyze the binary. Enter YARA…

The malware research community primarily uses YARA, developed at VirusTotal. It can help us query textual and binary patterns inside files, executables, emails, and memory regions. As described in YARA’s open-source GitHub repository, “YARA is a tool aimed at (but not limited to) helping malware researchers to identify and classify malware samples.”

And in case you are wondering what YARA stands for, it is some plainly geeky acronym coolness rather than anything meaningful. Its author offers two expansions to pick from: “YARA: Another Recursive Acronym” or “Yet Another Ridiculous Acronym”.

🙄 My head recursively spinning at that!

YARA Rules for Cuba Ransomware Family

Most YARA rules run by cybersecurity vendors are closed-source, but I did find some vendors’ rules for Cuba Ransomware in their open-source repositories. Let’s use these two YARA rule files not only to familiarize ourselves with how to query the malware, but also to compare two different approaches to querying for the same malware family:

Windows_Ransomware_Cuba.yar (Elastic Security open-source repo)
Win32.Ransomware.Cuba.yara (ReversingLabs open-source repo)

Querying malware samples for matching fingerprint conditions

You can follow the installation instructions from here and then run YARA against the sample binaries we discussed earlier (the -r flag makes YARA search the target directory recursively):

C:\>yara32.exe -r c:\myrules\Windows_Ransomware_Cuba.yar c:\malware-samples

Here is the rule from the 1st link above:

Luckily the above rule is simple and mostly self-explanatory, but let’s review it a bit. Also, do look at the 2nd rule link from above. I am not pasting it here for brevity, but you should definitely check out both and compare them. After all, they are meant to detect the same family!

Understanding the syntax from above

In general, any YARA rule has these sections:

rule name

You would start with the rule keyword and a relevant name as above.

meta section

The meta section above is for human consumption and plays no part in matching. It is used to classify the malware family with fields like category and description.

strings section

The strings section defines the strings to search for in the file. The benefit here is that you are not relying on the string being present at a specific offset, and therefore you have a better shot at catching the changing malware. You can define multiple named strings, each starting with $. In the above rule, three text strings are defined with the ascii, wide, and fullword modifiers (plain text, in other words). In most other YARA rules you will actually see long hexadecimal sequences representing the binary pattern to search for, with ?? standing in for any byte. Case in point, see one of the hex strings in the 2nd rule link above:

$find_files_p1 = {
     51 50 8D 4D ?? E8 ?? ?? ?? ?? 8D 85 ?? ?? ?? ?? C6 45 ?? ?? .....
}
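To get a feel for what a hex string with ?? wildcards actually matches, here is a rough Python sketch that translates such a pattern into a byte-level regular expression. It is only an approximation of what YARA does. It handles whole-byte ?? wildcards and nothing else (no jumps, alternatives, or nibble wildcards), and the helper name hex_pattern_to_regex is hypothetical. The pattern is just the visible prefix of the snippet above; the elided remainder is not reconstructed. The sample bytes are made up for the demo.

```python
import re

def hex_pattern_to_regex(pattern):
    """Compile a space-separated hex pattern with ?? wildcards to a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")  # any single byte
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that "." also matches the newline byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)

# Visible prefix of $find_files_p1 from the rule snippet above.
find_files_prefix = hex_pattern_to_regex("51 50 8D 4D ?? E8 ?? ?? ?? ??")

# Made-up sample bytes: the pattern starts at offset 1, wherever that may be.
sample = bytes.fromhex("90" "51508D4D" "10" "E8" "AABBCCDD" "90")
match = find_files_prefix.search(sample)
print(match.start())  # 1

# Changing a fixed byte (4D -> 4E) breaks the match; wildcards would not care.
mutated = bytes.fromhex("51508D4E10E8AABBCCDD")
print(find_files_prefix.search(mutated))  # None
```

Notice that the search does not care where in the file the bytes appear, which is the offset independence described above.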

condition section

Beyond the above strings, you can add more complex conditions via the condition section. Here is the condition from the 2nd rule link:

condition:
    uint16(0) == 0x5A4D and
    (
        $enum_resources
    ) and
    (
        all of ($find_files_p*)
    ) and
    (
        all of ($encrypt_files_p*)
    )

Here we are looking for a Windows executable, which is expressed via a file header condition on the first 16 bits. uint16(0) reads a little-endian 16-bit integer at offset 0, so 0x5A4D corresponds to the bytes 4D 5A, the ASCII letters “MZ” that begin Windows executables. Note the and used to create compound conditions. You can also use operators like or and not.
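If the condition syntax feels abstract, the following Python sketch mimics its logic under some loud simplifications: plain byte strings stand in for the rule’s real patterns (which are not reproduced here), and the helper names uint16_at, looks_like_pe, and rule_matches are hypothetical. It shows the little-endian header check and how “all of” combines with and.

```python
import struct

def uint16_at(data, offset):
    """Little-endian unsigned 16-bit read, analogous to YARA's uint16(offset)."""
    return struct.unpack_from("<H", data, offset)[0]

def looks_like_pe(data):
    """True when the file starts with the bytes 4D 5A ('MZ')."""
    return len(data) >= 2 and uint16_at(data, 0) == 0x5A4D

def rule_matches(data, enum_resources, find_files, encrypt_files):
    """Header check AND one pattern AND 'all of' two pattern groups."""
    return (
        looks_like_pe(data)
        and enum_resources in data
        and all(p in data for p in find_files)
        and all(p in data for p in encrypt_files)
    )

# Placeholder patterns (assumptions for the demo, not the rule's real bytes).
ENUM = b"\x01\x02\x03"
FIND = [b"\x11\x12", b"\x13\x14"]
ENCRYPT = [b"\x21\x22"]

sample = b"MZ" + bytes(8) + ENUM + FIND[0] + FIND[1] + ENCRYPT[0]
print(rule_matches(sample, ENUM, FIND, ENCRYPT))      # True

# Drop one of the find_files patterns: 'all of' now fails.
partial = b"MZ" + bytes(8) + ENUM + FIND[0] + ENCRYPT[0]
print(rule_matches(partial, ENUM, FIND, ENCRYPT))     # False
```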

Can’t I just query for new attack patterns?

Sadly, no. 😞

While you can query for known malware patterns to look for similar malware, YARA is ultimately rules-based where you define what you are looking for. Humans write YARA rules, and this is always based on existing attack patterns.

If you have to do IR/forensics against a new attack, even if you didn’t prevent the attack, you can use a library of YARA rules to get the malware’s characteristics and possibly come up with new rules to classify and detect/prevent similar attacks from happening to you or others that you share your rules with.

So what can I query for? Any examples?

Most software around us is in binary form, and YARA gives us the ability to examine it by using textual and binary patterns. Beyond executable binaries, you could query for malware lying in other common file types, like HTML, JavaScript, CSS, JAR, and PDF.

Here is a good curated repository of community-built YARA rules that you could adopt:

https://github.com/InQuest/awesome-yara

And here is the most comprehensive community repository:

https://github.com/Yara-Rules/rules

Like most things in the world, YARA rules also need to be kept up-to-date with the latest ways to query for the particular malware and its variants.

That seems like a lot of manual and complex work!

Yes, you are right if you are thinking so!

Also, even if you wanted to incorporate malware analysis in your process, you might be hard-pressed to find the time or the right talent. Only large organizations with big budgets and mandates even attempt to do so.

Does open-sourcing help?

We saw with the links earlier that good open-source repositories of YARA rules exist, but be aware of the challenges. Collaboration over malware detection is tricky. You should not expect vendors to open-source their YARA files (though some have, which I applaud!). The challenge with open-sourcing is that you have just given the malware writer a direct understanding of how to evade your rule! They are probably already testing against the above public repositories! Side note: For this reason, I doubt whether the two YARA rules we looked at above would actually catch the latest Cuba Ransomware variants. Those two vendors are probably keeping a more up-to-date private repo!

So what do we do?

The industry needs more collaboration, interoperable data models, and open vendor APIs

Most organizations have to rely on their EDR, threat intelligence, and other commercial cybersecurity vendors to do effective malware detection. No single vendor can solve this reliably. We only looked at one scenario above, in which two vendors identified the same ransomware family differently, each bringing its own approach. There are 75 vendors listed as using YARA; scroll down the page at https://virustotal.github.io/yara/ to see the list. You might realize that you are already using several of these vendors’ software (my bet is at least five from this list if you work for even a mid-sized company). If only they all could collaborate with you and each other in your environment!

Is that possible? APIs? Standard data models?

The path forward is to more effectively leverage your collective investments in your cybersecurity software so every piece of software you use benefits from the knowledge and context coming from your other investments. Collective cross-functional APIs that share context in a standardized cybersecurity schema would aid decision-making. Irrespective of whether you use commercial or open-source tools, API-based information exchange with your environmental context is the need of the hour. If vendors adopt an industry-standard community schema like OCSF (see my blog on the OCSF data model), that would help as well.

Companies like mine (Query) are relevant in this context—acting as the glue between your existing tools, and seamlessly letting you run parallel federated search over vendors’ APIs, giving you combined data in OCSF format.

Summary, and further from here

We looked at the challenge of identifying malware and malware families when malware is constantly changing. We looked at binary file structure to understand how malware writers easily change checksums to evade detection. We went through a basic understanding of how to do static analysis to detect malware using YARA. We compared open-source rules that were written to catch Cuba Ransomware.

Writing a good rule means we must catch malware without causing too many false positives, and at the same time, not miss any malicious adaptations done by attackers. To do that effectively means one has to know the malware file’s characteristics well, look for patterns, and combine logic statements with and, or, not to get to an effective match.

To know more about YARA, you can continue further with:

Official documentation at https://yara.readthedocs.io/en/stable/
GitHub at https://github.com/VirusTotal/yara
Community GitHub repositories here and here

You are most likely already using multiple software products that use YARA and other advanced malware detection capabilities. One recommendation I have for you is to look for federated search solutions that can give you a collective answer that combines your current tool stack. The industry has been moving towards more APIs and has now also started to collaborate on and adopt OCSF, a common cybersecurity data model. Federated search solutions that tie these together are being created now, like the one from my company Query (https://query.ai).

Please reach out to me or contact Query (contact@query.ai) if you would like to discuss any of the above further.

Happy Querying…

Contributed by:

Dhiraj Sharan

Chief Scientist & Founder, Query