Combing through the fuzz: Using fuzzy hashing and deep learning to counter malware detection evasion techniques

July 27th, 2021

Today’s cybersecurity threats continue to find ways to fly and stay under the radar. Cybercriminals use polymorphic malware because a slight change in the binary code or script could allow the said threats to avoid detection by traditional antivirus software. Threat actors customize their wares specific to their target organizations to increase their chances of breaking into and moving laterally through an entire corporate network, exfiltrating data, and leaving with little or no trace. The underground economy is rife with malware builders, Trojanized versions of legitimate applications, and other tools and services that allow malware operators to deploy highly evasive malware.

As the number of threats seen in the wild continues to increase exponentially, the continued evolution and innovation of their evasion tactics create a scenario where most malware is seen only once. Therefore, in today’s threat landscape, security solutions should no longer be just about the number of unique malware they can detect. Instead, they should deliver durable solutions that can defend against existing as well as future attacks. This requires comprehensive visibility into threats, coupled with the ability to process vast amounts of data. Microsoft 365 Defender provides such a capability using its cross-domain optics and the transformation of data into actionable security information through innovative applications of AI and machine learning methodologies.

We have previously discussed how we apply deep learning in detecting malicious PowerShell, exploring new approaches to classify malware, and in detecting threats via the fusion of behavior signals. In this blog post, we discuss a new approach that combines deep learning with fuzzy hashing. This approach utilizes fuzzy hashes as input to identify similarities among files and to determine if a sample is malicious or not. Then, a deep learning methodology inspired by natural language processing (NLP) better identifies similarities that actually matter, thus improving detection quality and scale of deployment.

This model aims to improve the overall accuracy of classifying malware and continue closing the gap between malware release and eventual detection. It can detect and block malware at first sight, a critical capability in defending against a wide range of threats, including sophisticated cyberattacks.

Case study: New NOBELIUM-related malware blocked at first sight

In March this year, Microsoft 365 Defender successfully blocked a file that would later be confirmed as a variant of the GoldMax malware. GoldMax, a command-and-control backdoor that persists on networks as a scheduled task impersonating systems management software, is part of the tools, tactics, and procedures (TTPs) of NOBELIUM, the threat actor behind the attacks against SolarWinds in December 2020.

Microsoft was able to proactively defend its customers from this newly discovered GoldMax variant because it leveraged two main technologies: fuzzy hashing, which serves as the input, and deep learning techniques inspired by NLP and computer vision, among others.

The earliest GoldMax sample, which Microsoft detects as Trojan:Win64/GoldMax.A!dha, was first submitted on VirusTotal in September 2020. While the new file was confirmed to be a GoldMax variant in June 2021, three months after Microsoft first blocked it, we started defending customers as soon as we saw it. As seen in the screenshots below, the new file’s TLSH and SSDEEP hashes (the fuzzy hashes exposed on VirusTotal) are observably similar to those of the first GoldMax variant. Both files also have exactly the same ImpHash and file size, further supporting our initial conclusion that the second file is also part of the GoldMax family.

Screenshots showing file properties of the original GoldMax malware and the new variant

Figure 1. File properties of the first GoldMax variant (top) and the new file detected in March (bottom) (from VirusTotal)

In the next sections, we discuss fuzzy hashes and how we use them in conjunction with deep learning to detect new and unknown threats.

Understanding fuzzy hashes

Hashing has become an essential technique in malware research literature and beyond because its output, the hash, is commonly used as a checksum or unique identifier. For example, it is common practice to use a SHA-256 cryptographic hash to query a knowledge database like VirusTotal to determine whether a file is malicious or not. The first antivirus products operated this way before antivirus signatures existed.

However, to identify or detect similar malware, traditional cryptographic hashing poses a challenge because of an inherent property called cryptographic diffusion, whose purpose is to hide the relationship between the original entity and its hash so that the hash function remains one-way. With this property, even a minimal change in the original entity (in this case, a file) yields a radically different hash that existing detections will not match.

Below are screenshots that illustrate this principle. The word change in the text file and the resulting change in the MD5 hash represent the effect of changes in binary content of other files:

Screenshots of two text files opened in Notepad showing a minor difference in text and comparing their MD5 hashes

Figure 2. Example of cryptographic hashing
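
The effect is easy to reproduce. The following minimal Python sketch uses SHA-256 from the standard library (rather than the MD5 shown in the screenshots) on two illustrative strings that are not taken from the figure. The strings differ by a single character, and the helper counts how many of the 256 output bits differ between the two digests:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Illustrative inputs (an assumption, not the text from Figure 2).
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

# With good diffusion, roughly half of the 256 bits are expected to flip.
changed_bits = bit_difference(digest_a, digest_b)
print(digest_a.hex())
print(digest_b.hex())
print(changed_bits, "of 256 bits differ")
```

Because the two digests share no visible structure, a lookup keyed on the first hash says nothing about the second file.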

Fuzzy hashing breaks the aforementioned cryptographic diffusion while still hiding the relationship between entity and hash. In doing so, this method provides similar resulting hashes when given similar inputs. Fuzzy hashing is the key to finding new malware that looks like something we have seen previously.

As with cryptographic hashes, there are several algorithms to calculate a fuzzy hash. Some examples are Nilsimsa, TLSH, SSDEEP, and sdhash. Using the previous text files example, below is a screenshot of their SSDEEP hashes. Note how observably similar these hashes are because there is only a one-word difference in the text:

Screenshot of Windows PowerShell showing fuzzy hashing for two text files

Figure 3. Example of fuzzy hashing
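
To show why a local edit only changes a small part of a fuzzy hash, here is a deliberately simplified toy in Python. It is not SSDEEP, TLSH, or any production algorithm: it assumes plain text input, treats each whitespace-delimited word as a "piece," and maps every piece to one character. The function names, alphabet, and sample sentences are all hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

# 64-character alphabet used to print one character per piece.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def toy_fuzzy_hash(text):
    """Hash each whitespace-delimited piece to one character and join them."""
    signature = []
    for piece in text.split():
        first_byte = hashlib.sha256(piece.encode("utf-8")).digest()[0]
        signature.append(ALPHABET[first_byte % len(ALPHABET)])
    return "".join(signature)

def similarity(sig_a, sig_b):
    """Return a similarity ratio in [0, 1] between two signatures."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()

# Two sample sentences that differ in exactly one word.
text_a = "fuzzy hashing helps find new malware that looks like known malware"
text_b = "fuzzy hashing helps find new malware that looks like older malware"

sig_a = toy_fuzzy_hash(text_a)
sig_b = toy_fuzzy_hash(text_b)
score = similarity(sig_a, sig_b)
print(sig_a, sig_b, round(score, 3))
```

Since each piece is hashed independently, changing one word can alter at most one character of the signature. Real fuzzy hashing algorithms choose piece boundaries from the byte content itself, so they also work on binaries.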

The main benefit of fuzzy hashes is similarity. Since these hashes can be calculated on several parts or the entirety of a file, we can focus on hash sequences that are like one another. This is important in determining the maliciousness of a previously undetected file and in categorizing malware according to type, family, malicious behavior, or even related threat actor.

Fuzzy hashes as “natural language” for deep learning

Deep learning in its many applications has recently been remarkable at modeling natural human language. For example, convolutional architectures, recurrent architectures like Gated Recurrent Units (GRUs) or Long Short-Term Memory networks (LSTMs), and most recently attention-based networks like all the variants of Transformers have been proven to be state-of-the-art in tackling human language tasks like sentiment analysis, question answering, or machine translation. As such, we explored if similar techniques can be applied to computer languages like binary code, with fuzzy hashing as an intermediate step to reduce sequence complexity and length of the original space. We discovered that segments of fuzzy hashes could be treated as “words,” and some sequences of such words could indicate maliciousness.
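
How the hashes are segmented is not spelled out here, so the following Python sketch assumes one simple option: overlapping character n-grams. The hash-like strings, the n-gram length, and the function names are hypothetical and serve only to show how a hash can become a sequence of integer word indices.

```python
from typing import Dict, List

def hash_to_words(fuzzy_hash: str, n: int = 3) -> List[str]:
    """Split a fuzzy hash string into overlapping character n-grams ("words")."""
    if len(fuzzy_hash) < n:
        return [fuzzy_hash] if fuzzy_hash else []
    return [fuzzy_hash[i:i + n] for i in range(len(fuzzy_hash) - n + 1)]

def build_vocabulary(corpus: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word an index starting at 1; 0 is kept for unknown words."""
    vocab: Dict[str, int] = {}
    for words in corpus:
        for word in words:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map words to indices, using 0 for words missing from the vocabulary."""
    return [vocab.get(word, 0) for word in words]

# Made-up hash-like strings, not real SSDEEP or TLSH output.
known = hash_to_words("AbCdEfGh")
unseen = hash_to_words("AbCdXfGh")

vocab = build_vocabulary([known])
encoded = encode(unseen, vocab)
print(encoded)
```

In this toy, the words untouched by the one-character change keep their known indices, while the three n-grams that overlap the change fall back to the unknown index.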

Architecture overview and deployment at scale

A common deep learning approach in dealing with words is to use word embeddings. However, because fuzzy hashes are not exactly natural language, we could not simply use pre-trained models. Instead, we needed to train our embeddings from scratch to identify malicious indicators.

With these embeddings in hand, we tried most of the approaches one would use with a deep neural network for language. We explored different architectures using standard techniques from the literature: convolutions over the embeddings, multilayer perceptrons, traditional sequential models (like the previously mentioned LSTM and GRU), and attention-based networks (Transformers).

Diagram showing architecture of fuzzy hashing model

Figure 4. Architecture overview of the deep learning model using fuzzy hashes

We got fairly good results with most techniques. However, to deploy this model in Microsoft 365 Defender, we also had to consider factors like inference time and the number of parameters in the network. Inference time ruled out the sequential models: even though they were the best in terms of precision or recall, they were the slowest to run inference on. Meanwhile, the Transformers we experimented with also yielded excellent results but had several million parameters, which would be too costly to deploy at scale.

That left us with the convolutional approach and the multilayer perceptron. Of these two, the perceptron yielded slightly better results because the spatial adjacency intrinsically provided by the convolutional filters does not properly capture the relationship among the embeddings.
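
As a rough picture of what an embedding layer feeding a multilayer perceptron computes, consider the pure-Python forward pass below. It is a sketch under several assumptions that do not come from the deployed model: the weights are random and untrained, the layer sizes are arbitrary, and the word embeddings are combined by simple averaging.

```python
import math
import random

random.seed(0)  # fixed seed so the random weights are reproducible

VOCAB_SIZE = 50   # hypothetical number of distinct hash "words"
EMBED_DIM = 8     # hypothetical embedding width
HIDDEN_DIM = 4    # hypothetical hidden layer width

def random_matrix(rows, cols):
    """Create a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(VOCAB_SIZE, EMBED_DIM)  # one row per word index
w_hidden = random_matrix(EMBED_DIM, HIDDEN_DIM)
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]

def forward(word_ids):
    """Return a score in (0, 1) for a sequence of word indices."""
    # Look up each word's embedding and average them (an assumed pooling choice).
    vectors = [embeddings[i] for i in word_ids]
    pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Hidden layer with ReLU activation.
    hidden = [
        max(0.0, sum(p * w_hidden[j][k] for j, p in enumerate(pooled)))
        for k in range(HIDDEN_DIM)
    ]
    # Sigmoid output layer.
    logit = sum(h * w for h, w in zip(hidden, w_out))
    return 1.0 / (1.0 + math.exp(-logit))

score = forward([3, 17, 42, 7])
print(round(score, 4))
```

A network like this has no recurrence, so every input is scored in a single pass, which is one reason perceptrons tend to be cheap at inference time.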

Once we had landed on a viable architecture, we used modern tools available to us that Microsoft continues to extend. We used Azure Machine Learning GPU capabilities to train these models at scale, then exported them to Open Neural Network Exchange (ONNX), which gave us the extra performance we needed to operationalize this at scale on Microsoft Defender Cloud.

Deep learning fuzzy hashes: Looking for the similarities that matter

A question that arises from an approach like this is: why use deep learning at all?

Adding machine learning allows us to learn which similarities on fuzzy hashes matter and which ones don’t. Additionally, adding deep learning and training on vast amounts of data increases the accuracy of malware classification and allows us to understand the minor nuances that differentiate legitimate software from its malware or Trojanized versions.

A deep learning approach also has its inherent benefits, one of which is the ability to pre-train large models on massive amounts of data. One can then reuse such a model for different classification, clustering, and other scenarios through transfer learning. This is similar to how modern NLP approaches language tasks, like how OpenAI’s GPT-3 solves question answering.

Another inherent benefit of deep learning is that one does not have to retrain the model from scratch. Since new data is constantly flowing into the Microsoft Defender Cloud, we can fine-tune the model with these incoming data to adapt and quickly respond to an ever-changing threat landscape.

Conclusion: Continuing to harness the immense potential of deep learning in security

Deep learning continues to provide opportunities to improve threat detection significantly. The deep learning approach discussed in this blog entry is just one of the ways we at Microsoft apply deep learning in our protection technologies to detect and block evasive threats. Data scientists, threat experts, and product teams work together to build AI-driven solutions and investigation experiences.

By treating fuzzy hashes as “words” and not mere codes, we proved that natural language techniques in deep learning are viable methods to solve the current challenges in the threat landscape. This change in perspective presents different possibilities in cybersecurity innovation that we are looking forward to exploring further.

Numerous AI-driven technologies like this allow Microsoft 365 Defender to automatically analyze massive amounts of data and quickly identify malware and other threats. As the GoldMax case study showed, the ability to identify new and unknown malware is a critical aspect of the coordinated defense that Microsoft 365 Defender delivers to protect customers against the most sophisticated threats.

Learn how you can stop attacks through automated, cross-domain security and built-in AI with Microsoft 365 Defender.


Edir Garcia Lazo

Microsoft 365 Defender Research Team

The post Combing through the fuzz: Using fuzzy hashing and deep learning to counter malware detection evasion techniques appeared first on Microsoft Security Blog.

Seeing the big picture: Deep learning-based fusion of behavior signals for threat detection

July 23rd, 2020

The application of deep learning and other machine learning methods to threat detection on endpoints, email and docs, apps, and identities drives a significant piece of the coordinated defense delivered by Microsoft Threat Protection. Within each domain as well as across domains, machine learning plays a critical role in analyzing and correlating massive amounts of data to detect increasingly evasive threats and build a complete picture of attacks.

On endpoints, Microsoft Defender Advanced Threat Protection (Microsoft Defender ATP) detects malware and malicious activities using various types of signals that span endpoint and network behaviors. Signals are aggregated and processed by heuristics and machine learning models in the cloud. In many cases, the detection of a particular type of behavior, such as registry modification or a PowerShell command, by a single heuristic or machine learning model is sufficient to create an alert.

Detecting more sophisticated threats and malicious behaviors requires a broader view and is significantly enhanced by the fusion of signals occurring at different times. For example, an isolated event of file creation is generally not a very good indication of malicious activity, but when augmented with an observation that a scheduled task is created with the same dropped file, and combined with other signals, the file creation event becomes a significant indicator of malicious activity. To build a layer for these kinds of abstractions, Microsoft researchers instrumented new types of signals that aggregate individual signals and create behavior-based detections that can expose more advanced malicious behavior.

In this blog, we describe an application of deep learning, a category of machine learning algorithms, to the fusion of various behavior detections into a decision-making model. Since its deployment, this deep learning model has contributed to the detection of many sophisticated attacks and malware campaigns. As an example, the model uncovered a new variant of the Bondat worm that attempts to turn affected machines into zombies for a botnet. Bondat is known for using its network of zombie machines to hack websites or even perform cryptocurrency mining. This new version spreads using USB devices and then, once on a machine, achieves a fileless persistence. We share more technical details about this attack in later sections, but first we describe the detection technology that caught it.

Powerful, high-precision classification model for wide-ranging data

Identifying and detecting malicious activities within massive amounts of data processed by Microsoft Defender ATP require smart automation methods and AI. Machine learning classifiers digest large volumes of historical data and apply automatically extracted insights to score each new data point as malicious or benign. Machine learning-based models may look at, for example, registry activity and produce a probability score, which indicates the probability of the registry write being associated with malicious activity. To tie everything together, behaviors are structured into virtual process trees, and all signals associated with each process tree are aggregated and used for detecting malicious activity.

With virtual process trees and signals of different types associated with these trees, there are still large amounts of data and noisy signals to sift through. Since each signal occurs in the context of a process tree, it’s necessary to fuse these signals in the chronological order of execution within the process tree. Data ordered this way requires a powerful model to classify malicious vs. benign trees.

Our solution comprises several deep learning building blocks such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN). The neural network can take behavior signals that occur chronologically in the process tree and treat each batch of signals as a sequence of events. These sequences can be collected and classified by the neural network with high precision and detection coverage.

Behavior-based and machine learning-based signals

Microsoft Defender ATP researchers instrument a wide range of behavior-based signals. For example, a signal can be for creating an entry in the following registry key:


A folder and executable file name added to this location automatically runs after the machine starts. This generates persistence on the machine and hence can be considered an indicator of compromise (IoC). Nevertheless, this IoC is generally not enough to generate detection because legitimate programs also use this mechanism.

Another example of a behavior-based signal is service start activity. A program that starts a service through the command line using legitimate tools like net.exe is not considered a suspicious activity. However, starting a service created earlier by the same process tree to obtain persistence is an IoC.

On the other hand, machine learning-based models look at and produce signals on different pivots of a possible attack vector. For example, a machine learning model trained on historical data to discern between benign and malicious command lines will produce a score for each processed command line.

Consider the following command line:

 cmd /c taskkill /f /im someprocess.exe

This line implies that taskkill.exe is invoked by cmd.exe to terminate a process with a particular name. While the command itself is not necessarily malicious, the machine learning model may be able to recognize suspicious patterns in the name of the process being terminated, and provide a maliciousness probability, which is aggregated with other signals in the process tree. The result is a sequence of events during a certain period of time for each virtual process tree.

The next step is to use a machine learning model to classify this sequence of events.

Data modeling

The sequences of events described in the previous sections can be represented in several different ways to then be fed into machine learning models.

The first and simplest way is to construct a “dictionary” of all possible events, and to assign a unique identifier (index) to each event in the dictionary. This way, a sequence of events is represented by a vector, where each slot holds the number of occurrences (or another related measure) for an event type in the sequence.

For example, if all possible events in the system are X, Y, and Z, a sequence of events “X,Z,X,X” is represented by the vector [3, 0, 1], implying that it contains three events of type X, no events of type Y, and a single event of type Z. This representation scheme, widely known as “bag-of-words”, is suitable for traditional machine learning models and has been used for a long time by machine learning practitioners. A limitation of the bag-of-words representation is that any information about the order of events in the sequence is lost.
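
The worked example can be reproduced in a few lines of Python. The function name is our own, and the second call uses a reordered sequence to show the limitation: both orderings collapse to the same vector.

```python
from collections import Counter

def bag_of_words(sequence, dictionary):
    """Count how often each dictionary event occurs in the sequence."""
    counts = Counter(sequence)
    return [counts[event] for event in dictionary]

dictionary = ["X", "Y", "Z"]

vector = bag_of_words(["X", "Z", "X", "X"], dictionary)
reordered = bag_of_words(["Z", "X", "X", "X"], dictionary)

print(vector)               # counts for X, Y, Z
print(vector == reordered)  # the order of events is not preserved
```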

The second representation scheme is chronological. Figure 1 shows a typical process tree: Process A raises an event X at time t1, Process B raises an event Z at time t2, D raises X at time t3, and E raises X at time t4. Now the entire sequence “X,Z,X,X”  (or [1,3,1,1] replacing events by their dictionary indices) is given to the machine learning model.

Diagram showing process tree

Figure 1. Sample process tree
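
A minimal Python sketch of the chronological scheme follows. The integer timestamps stand in for t1 through t4, the tuple layout and function name are assumptions for illustration, and the events are listed out of order on purpose to show the sorting step.

```python
# Dictionary indices as used in the example: X -> 1, Y -> 2, Z -> 3.
DICTIONARY = {"X": 1, "Y": 2, "Z": 3}

# (timestamp, process, event) records collected from one process tree.
raised = [
    (3, "D", "X"),
    (1, "A", "X"),
    (4, "E", "X"),
    (2, "B", "Z"),
]

def chronological_sequence(events, dictionary):
    """Order events by timestamp and replace each one by its dictionary index."""
    ordered = sorted(events, key=lambda record: record[0])
    return [dictionary[event] for _, _, event in ordered]

sequence = chronological_sequence(raised, DICTIONARY)
print(sequence)
```

Unlike the bag-of-words vector, this output changes when the same events occur in a different order.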

In threat detection, the order of occurrence of different events is important information for the accurate detection of malicious activity. Therefore, it’s desirable to employ a representation scheme that preserves the order of events, as well as machine learning models that are capable of consuming such ordered data. This capability can be found in the deep learning models described in the next section.


Deep learning has shown great promise in sequential tasks in natural language processing like sentiment analysis and speech recognition. Microsoft Defender ATP uses deep learning for detecting various attacker techniques, including malicious PowerShell.

For the classification of signal sequences, we use a Deep Neural Network that combines two types of building blocks (layers): Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory Recurrent Neural Networks (BiLSTM-RNN).

CNNs are used in many tasks relating to spatial inputs such as images, audio, and natural language. A key property of CNNs is the ability to compress a wide-field view of the input into high-level features.  When using CNNs in image classification, high-level features mean parts of or entire objects that the network can recognize. In our use case, we want to model long sequences of signals within the process tree to create high-level and localized features for the next layer of the network. These features could represent sequences of signals that appear together within the data, for example, create and run a file, or save a file and create a registry entry to run the file the next time the machine starts. Features created by the CNN layers are easier to digest for the ensuing LSTM layer because of this compression and featurization.
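
To make the idea of a localized feature concrete, the Python sketch below slides a hand-set filter of width two over a one-hot encoded sequence. The event names and the filter are hypothetical. In a trained network the filters are learned from data and operate on embeddings rather than one-hot vectors.

```python
# Hypothetical signal names used only for this illustration.
EVENTS = ["file_create", "file_run", "registry_write", "network_connect"]

def one_hot(event):
    """Encode an event as a vector with a single 1.0 at the event's position."""
    return [1.0 if event == name else 0.0 for name in EVENTS]

def conv1d(sequence, kernel):
    """Slide a (width x channels) kernel over the sequence without padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        outputs.append(sum(
            x * w
            for row, kernel_row in zip(window, kernel)
            for x, w in zip(row, kernel_row)
        ))
    return outputs

# Hand-set filter that responds most strongly to "file_create" followed by "file_run".
kernel = [one_hot("file_create"), one_hot("file_run")]

signals = ["network_connect", "file_create", "file_run", "registry_write"]
activations = conv1d([one_hot(event) for event in signals], kernel)

# Max pooling keeps only the strongest response, wherever it occurred.
strongest = max(activations)
print(activations, strongest)
```

The filter fires only at the position where the two-event pattern begins, and pooling summarizes that into a single feature for the next layer.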

LSTM deep learning layers are famous for results in sentence classification, translation, speech recognition, sentiment analysis, and other sequence modeling tasks. Bidirectional LSTMs combine two layers of LSTMs that process the sequence in opposite directions.

The combination of the two types of neural networks stacked one on top of the other has been shown to be very effective and can classify long sequences of hundreds of items and more. The final model is a combination of several layers: one embedding layer, two CNNs, and a single BiLSTM. The input to this model is a sequence of hundreds of integers representing the signals associated with a single process tree during a unit of time. Figure 2 shows the architecture of our model.

Diagram showing layers of the CNN BiLSTM model

Figure 2. CNN-BiLSTM model

Since the number of possible signals in the system is very high, input sequences are passed through an embedding layer that compresses high-dimensional inputs into low-dimensional vectors that can be processed by the network. In addition, similar signals get a similar vector in lower dimensional space, which helps with the final classification.

Initial layers of the network create increasingly high-level features, and the final layer performs sequence classification. The output of the final layer is a score between 0 and 1 that indicates the probability of the sequence of signals being malicious. This score is used in combination with other models to predict if the process tree is malicious.

Catching real-world threats

Microsoft Defender ATP’s endpoint detection and response capabilities use this Deep CNN-BiLSTM model to catch and raise alerts on real-world threats. As mentioned, one notable attack that this model uncovered is a new variant of the Bondat worm, which was seen propagating in several organizations through USB devices.

Diagram showing the Bondat attack chain

Figure 3. Bondat malware attack chain

Even with an arguably inefficient propagation method, the malware could persist in an organization as users continue to use infected USB devices. For example, the malware was observed in hundreds of machines in one organization. Although we detected the attack during the infection period, it continued spreading until all malicious USB drives were collected. Figure 4 shows the infection timeline.

Column chart showing daily encounters of the Bondat malware in one organization

Figure 4. Timeline of encounters within a single organization within a period of 5 months showing reinfection through USB devices

The attack drops a JavaScript payload, which it runs directly in memory using wscript.exe. The JavaScript payload uses a randomly generated filename as a way to evade detections. However, Antimalware Scan Interface (AMSI) exposes malicious script behaviors.

To spread via USB devices, the malware leverages WMI to query the machine’s disks by calling “SELECT * FROM Win32_DiskDrive”. When it finds a match for “/usb” (see Figure 5), it copies the JavaScript payload to the USB device and creates a batch file on the USB device’s root folder. The said batch file contains the execution command for the payload. As part of its social engineering technique to trick users into running the malware in the removable device, it creates a LNK file on the USB pointing to the batch file.

Screenshot of malware code showing infection technique

Figure 5. Infection technique

The malware terminates processes related to antivirus software or debugging tools. For Microsoft Defender ATP customers, tamper protection prevents the malware from doing this. Notably, after terminating a process, the malware pops up a window that imitates a Windows error message to make it appear like the process crashed (see Figure 6).

Screenshot of malware code showing infection technique

Figure 6. Evasion technique

The malware communicates with a remote command-and-control (C2) server by implementing a web client (MSXML). Each request is encrypted with RC4 using a randomly generated key, which is sent within the “PHPSESSID” cookie value to allow attackers to decrypt the payload within the POST body.
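
RC4 is a symmetric stream cipher, so the same routine both encrypts and decrypts, which is what lets anyone holding the key recover the POST body. The Python sketch below is a textbook RC4 implementation of the kind an analyst might use on captured traffic. The key and payload are placeholders, and how the key is encoded inside the cookie value is not described here, so that step is left out.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Apply the RC4 keystream to data; calling it twice with the same key restores the input."""
    # Key-scheduling algorithm: permute the state array based on the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Pseudo-random generation: XOR each input byte with one keystream byte.
    i = j = 0
    output = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        output.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(output)

# Placeholder key and body, not values observed in the campaign.
session_key = b"example-key"
post_body = b"machine=WORKSTATION01&state=idle"

encrypted = rc4(session_key, post_body)
decrypted = rc4(session_key, encrypted)
print(encrypted.hex())
print(decrypted)
```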

Every request sends information about the machine and its state following the output of the previously executed command. The response is saved to disk and then parsed to extract commands within an HTML comment tag. The first five characters from the payload are used as key to decrypt the data, and the commands are executed using the eval() method. Figures 7 and 8 show the C2 communication and HTML comment eval technique.

Once the command is parsed and evaluated by the JavaScript engine, any code can be executed on an affected machine, for example, download other payloads, steal sensitive info, and exfiltrate stolen data. For this Bondat campaign, the malware runs coin mining or coordinated distributed denial of service (DDoS) attacks.

Figure 7. C2 communication

Figure 8. Eval technique (parsing commands from html comment)

The malware’s activities triggered several signals throughout the attack chain. The deep learning model inspected these signals and the sequence with which they occurred, and determined that the process tree was malicious, raising an alert:

  1. Persistence – The malware copies itself into the Startup folder and drops a .lnk file pointing to the malware copy that opens when the computer starts
  2. Renaming a known operating system tool – The malware renames exe into a random filename
  3. Dropping a file with the same filename as legitimate tools – The malware impersonates legitimate system tools by dropping a file with a similar name to a known tool.
  4. Suspicious command line – The malware tries to delete itself from previous location using a command line executed by a process spawned by exe
  5. Suspicious script content – Obfuscated JavaScript payload used to hide the attacker’s intentions
  6. Suspicious network communication – The malware connects to the domain legitville[.]com


Modeling a process tree, given different signals that happen at different times, is a complex task. It requires powerful models that can remember long sequences and still be able to generalize well enough to churn out high-quality detections. The Deep CNN-BiLSTM model we discussed in this blog is a powerful technology that helps Microsoft Defender ATP achieve this task. Today, this deep learning-based solution contributes to Microsoft Defender ATP’s capability to detect evolving threats like Bondat.

Microsoft Defender ATP raises alerts for these deep learning-driven detections, enabling security operations teams to respond to attacks using Microsoft Defender ATP’s other capabilities, like threat and vulnerability management, attack surface reduction, next-generation protection, automated investigation and response, and Microsoft Threat Experts. Notably, these alerts inform behavioral blocking and containment capabilities, which add another layer of protection by blocking threats if they somehow manage to start running on machines.

The impact of deep learning-based protections on endpoints accrues to the broader Microsoft Threat Protection (MTP), which combines endpoint signals with threat data from email and docs, identities, and apps to provide cross-domain visibility. MTP harnesses the power of Microsoft 365 security products to deliver unparalleled coordinated defense that detects, blocks, remediates, and prevents attacks across an organization’s Microsoft 365 environment. Through machine learning and AI technologies like the deep-learning model we discussed in this blog, MTP automatically analyzes cross-domain data to build a complete picture of each attack, eliminating the need for security operations centers (SOC) to manually build and track the end-to-end attack chain and relevant details. MTP correlates and consolidates attack evidence into incidents, so SOCs can save time and focus on critical tasks like expanding investigations and proactive threat hunting.


Arie Agranonik, Shay Kels, Guy Arazi

Microsoft Defender ATP Research Team



The post Seeing the big picture: Deep learning-based fusion of behavior signals for threat detection appeared first on Microsoft Security.

Microsoft researchers work with Intel Labs to explore new deep learning approaches for malware classification

May 8th, 2020

The opportunities for innovative approaches to threat detection through deep learning, a category of algorithms within the larger framework of machine learning, are vast. Microsoft Threat Protection today uses multiple deep learning-based classifiers that detect advanced threats, for example, evasive malicious PowerShell.

In continued exploration of novel detection techniques, researchers from Microsoft Threat Protection Intelligence Team and Intel Labs are collaborating to study new applications of deep learning for malware classification, specifically:

  • Leveraging a deep transfer learning technique from computer vision for static malware classification
  • Optimizing deep learning techniques in terms of model size and leveraging platform hardware capabilities to improve execution of deep-learning malware detection approaches

For the first part of the collaboration, the researchers built on Intel’s prior work on deep transfer learning for static malware classification and used a real-world dataset from Microsoft to ascertain the practical value of approaching the malware classification problem as a computer vision task. The basis for this study is the observation that if malware binaries are plotted as grayscale images, the textural and structural patterns can be used to effectively classify binaries as either benign or malicious, as well as cluster malicious binaries into respective threat families.

The researchers used an approach that they called static malware-as-image network analysis (STAMINA). Using the dataset from Microsoft, the study showed that the STAMINA approach achieves high accuracy in detecting malware with low false positives.

The results and further technical details of the research are listed in the paper STAMINA: Scalable deep learning approach for malware classification and set the stage for further collaborative exploration.

The role of static analysis in deep learning-based malware classification

While static analysis is typically associated with traditional detection methods, it remains an important building block for AI-driven detection of malware. It is especially useful for pre-execution detection engines: static analysis disassembles code without having to run applications or monitor runtime behavior.

Static analysis produces metadata about a file. Machine learning classifiers on the client and in the cloud then analyze the metadata and determine whether a file is malicious. Through static analysis, most threats are caught before they can even run.

For more complex threats, dynamic analysis and behavior analysis build on static analysis to provide more features and build more comprehensive detection. Finding ways to perform static analysis at scale and with high effectiveness benefits overall malware detection methodologies.

To this end, the research borrowed knowledge from the computer vision domain to build an enhanced static malware detection framework that leverages deep transfer learning to train directly on portable executable (PE) binaries represented as images.

Analyzing malware represented as images

To establish the practicality of the STAMINA approach, which posits that malware can be classified at scale by performing static analysis on malware code represented as images, the study covered three main steps: image conversion, transfer learning, and evaluation.

Diagram showing the steps for the STAMINA approach: pre-processing, transfer learning, and evaluation

First, the researchers prepared the binaries by converting them into two-dimensional images. This step involved pixel conversion, reshaping, and resizing. The binaries were converted into a one-dimensional pixel stream by assigning each byte a value between 0 and 255, corresponding to pixel intensity. Each pixel stream was then transformed into a two-dimensional image by using the file size to determine the width and height of the image.
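The conversion described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the researchers' code: the function names, the power-of-two width heuristic, the zero padding of the last row, and the nearest-neighbor resize are all assumptions made for the example. The source only states that each byte becomes a pixel intensity and that file size determines the image dimensions.

```python
import math
from typing import List, Optional


def bytes_to_image(data: bytes, width: Optional[int] = None) -> List[List[int]]:
    """Turn raw file bytes into a 2-D grid of grayscale intensities (0-255)."""
    if not data:
        raise ValueError("cannot convert an empty file")
    if width is None:
        # Assumption: derive the width from the file size by picking a
        # power of two close to sqrt(size). The study's exact mapping
        # from file size to width is not given in this post.
        width = 2 ** round(math.log2(math.sqrt(len(data))))
    height = math.ceil(len(data) / width)
    # Assumption: pad the final row with zeros so every row has `width` pixels.
    padded = data + bytes(height * width - len(data))
    # Each byte is already an integer in 0..255, so it maps directly to a pixel.
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def resize_nearest(image: List[List[int]], out_h: int, out_w: int) -> List[List[int]]:
    """Nearest-neighbor resize to the fixed input size a vision model expects."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Usage: a 10-byte "file" laid out 4 pixels wide, then shrunk to 2x2.
pixels = bytes_to_image(bytes(range(10)), width=4)
thumbnail = resize_nearest(pixels, 2, 2)
```

A production pipeline would more likely use an imaging library with proper interpolation. The sketch only shows the order of operations: byte stream, reshape, then resize.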

The second step was to use transfer learning, a technique for overcoming the isolated learning paradigm and utilizing knowledge acquired for one task to solve related ones. Transfer learning has enjoyed tremendous success within several different computer vision applications. It accelerates training time by bypassing the need to search for optimized hyperparameters and different architectures—all this while maintaining high classification performance. For this study, the researchers used Inception-v1 as the base model.
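To make the idea concrete, here is a deliberately tiny, standard-library sketch of the core pattern in transfer learning: keep a pre-trained base frozen and train only a new classification head on top of it. It is not Inception-v1 and not the study's training code. The `FROZEN_BASE` weights, the function names, and the toy inputs are all hypothetical stand-ins chosen so that the example runs anywhere.

```python
import math

# Hypothetical stand-in for a pre-trained base network: fixed weights that
# are never updated during fine-tuning (2 output features from 3 inputs).
FROZEN_BASE = [[0.5, -0.25, 0.1], [-0.3, 0.8, 0.2]]


def base_features(x):
    """Frozen layer: linear map followed by ReLU. It is only ever read."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_BASE]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(samples, labels, epochs=500, lr=0.5):
    """Train a new logistic-regression head on top of the frozen features."""
    feats = [base_features(x) for x in samples]  # computed once; base is fixed
    weights, bias = [0.0] * len(FROZEN_BASE), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            # Gradient step on the head parameters only.
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    """Probability of the positive (malicious) class for one input."""
    f = base_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)


# Usage with toy data: label 1 = malicious, 0 = benign.
toy_samples = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
toy_labels = [1, 1, 0, 0]
head_weights, head_bias = fine_tune_head(toy_samples, toy_labels)
```

Because the base never changes, its features can be computed once and reused across epochs, which is one reason this style of training tends to be cheaper than training a full network from scratch.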

The study was performed on a dataset of 2.2 million PE file hashes provided by Microsoft. This dataset was temporally split into 60:20:20 segments for training, validation, and test sets, respectively.
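A temporal split orders samples by time before cutting, so that the model is validated and tested only on files newer than those it was trained on. The sketch below shows one way to do this. The record fields `sha256` and `first_seen` are hypothetical, since the post does not say which timestamp the study used to order the dataset.

```python
def temporal_split(records, ratios=(60, 20, 20)):
    """Sort by time, then cut into train/validation/test in that order."""
    ordered = sorted(records, key=lambda r: r["first_seen"])
    n, total = len(ordered), sum(ratios)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * ratios[0] // total
    val_end = train_end + n * ratios[1] // total
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Usage: ten hypothetical records with ISO dates, deliberately out of order.
days = [9, 3, 7, 1, 10, 5, 2, 8, 4, 6]
records = [
    {"sha256": f"hash{i}", "first_seen": f"2020-01-{d:02d}"}
    for i, d in enumerate(days)
]
train_set, val_set, test_set = temporal_split(records)
```

Unlike a random split, this keeps the newest samples out of training, which gives a more realistic estimate of how a classifier behaves on files it has not seen yet.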

Diagram showing a DNN with pre-trained weights on natural images, and the last portion fine-tuned with new data

Finally, the performance of the system was measured and reported on the holdout test set. The metrics captured include recall at specific false positive rates, along with accuracy, F1 score, and area under the receiver operating characteristic (ROC) curve.
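For readers unfamiliar with "recall at a false positive rate," the following sketch shows how that figure, along with accuracy and F1, can be computed from classifier scores. It is a plain-Python illustration with hypothetical function names and toy data, not the evaluation code used in the study. The area under the ROC curve is left out for brevity.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Best recall reachable by a score threshold whose FPR is <= max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one benign sample")
    # Walk candidate thresholds from strictest (highest score) to loosest.
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score fall on the same side of any threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        if fp / negatives > max_fpr:
            break  # FPR only grows as the threshold loosens
        best_recall = tp / positives
    return best_recall


def accuracy_and_f1(scores, labels, threshold):
    """Accuracy and F1 when every score >= threshold is flagged as malicious."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = len(labels) - tp - fp - fn
    accuracy = (tp + tn) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Usage with toy scores: label 1 = malicious, 0 = benign.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
strict_recall = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
```

Reporting recall at a fixed, low false positive rate reflects how detection systems are tuned in practice: the tolerable rate of false alarms is decided first, and the fraction of malware caught under that constraint is measured afterward.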


The joint research showed that applying STAMINA to the real-world holdout test set achieved a recall of 87.05% at a 0.1% false positive rate, and 99.66% recall and 99.07% accuracy at a 2.58% false positive rate overall. These results encourage the use of deep transfer learning for malware classification. Transfer learning helps accelerate training by bypassing the search for optimal hyperparameters and architectures, saving time and compute resources in the process.

The study also highlights the pros and cons of sample-based methods like STAMINA and metadata-based classification methods. For example, STAMINA can go in-depth into samples and extract additional signals that might not be captured in the metadata. However, for larger applications, STAMINA becomes less effective due to limitations in converting billions of pixels into JPEG images and then resizing them. In such cases, metadata-based methods show advantages over the STAMINA approach.

Conclusion and future work

The use of deep learning methods for detecting threats drives a lot of innovation across Microsoft. The collaboration with Intel Labs researchers is just one of the ways in which Microsoft researchers and data scientists continue to explore novel ways to improve security overall.

This joint research is a good starting point for more collaborative work. For example, the researchers plan to collaborate further on platform acceleration optimizations that can allow deep learning models to be deployed on client machines with minimal performance impact. Stay tuned.


Jugal Parikh, Marc Marino

Microsoft Threat Protection Intelligence Team

