Unleashing the Power of a Security Data Lake

By Alex Vakulov

The origins of Security Data Lake

The idea of a Security Data Lake (SDL) is rooted in the traditional idea of a Data Lake. Its inception was driven by the exponential growth of data and declining storage costs. Gartner no longer views Data Lake as a game-changing technology but more as a trend in the advancement of storage solutions (such as Cloud Data Warehouse).

A Data Lake is a storage repository that keeps vast amounts of data in its original format. According to Gartner: "A Data Lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores."

The data in the "Lake" can be structured, semi-structured, or unstructured, and can include tables, text files, system logs, and much more.

Data Lakes and data stores have some distinct differences. In particular, data stores are usually designed for structured, pre-prepared data. A Data Lake, by contrast, does not force the original material into one or another "convenient" structure. This approach allows the same data to be processed in several different ways, which is the primary value of Data Lakes.
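
As a rough illustration of processing the same data in more than one way, the sketch below keeps a handful of raw log lines untouched and derives two different views from them at read time. The sample lines, their field layout, and the function names are invented for this example and are not taken from any particular product.

```python
from collections import Counter

# Raw records are stored exactly as received; nothing is dropped or reshaped.
# (Hypothetical sample data using documentation-only IP ranges.)
RAW_LINES = [
    "2023-01-10T08:00:01Z sshd Failed password for root from 203.0.113.7",
    "2023-01-10T08:00:05Z sshd Accepted password for alice from 198.51.100.4",
    "2023-01-10T08:01:12Z nginx GET /login 200",
    "2023-01-10T08:02:30Z sshd Failed password for admin from 203.0.113.7",
]


def events_per_service(lines):
    """View 1: count events by the service that produced them."""
    return Counter(line.split()[1] for line in lines)


def failed_login_sources(lines):
    """View 2: count failed SSH logins by source address."""
    sources = Counter()
    for line in lines:
        if " sshd Failed password " in line:
            # The source address is the last whitespace-separated token.
            sources[line.rsplit(" ", 1)[-1]] += 1
    return sources


per_service = events_per_service(RAW_LINES)
failed = failed_login_sources(RAW_LINES)
print(per_service)
print(failed)
```

Neither view alters the stored lines, so a third question asked later can still be answered from the same raw material.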

The concept's popularity grew when data scientists noticed that traditional data stores made it hard to solve novel problems. Pre-processing discards material that may at first be considered "junk" but may later turn out to have value. This problem becomes even more pronounced when dealing with vast amounts of data.

The difference between Security Data Lake and Data Lake

Corporate Data Lakes usually store unstructured data, including details about the company's products, financial metrics, customer data, marketing materials, etc. They are easily expandable and secure, relying on established security measures like other conventional storage systems.

Security Data Lake encompasses more than just security logs and alerts. It includes a range of other security-related information, such as:

Open-Source Intelligence (OSINT) information collected from various sources
Assessment of the security of the system or device
Information about attack countermeasures
Information on detected data leaks
Parameters for identifying existing threats
Sources and vectors of cyberattacks
Analysis of cyber incidents and attributes of attacks
Malware databases
IP address reputation information
Activity logs
Information about activity on the Dark Web

So, the Security Data Lake is a centralized repository intended to handle logs and other information directly related to information security. The data collected from various sources is then analyzed using various tools. The purpose of creating SDL is to make it easier to access the original logs, thereby enhancing the efficiency of security operations.

By centralizing all data in SDL, the investigation process is streamlined, the effort required to gather logs from multiple systems is decreased, and the completeness of the data is guaranteed.

A security officer can theoretically access and examine all log sources without relying on SDL. But in practice, this seemingly simple task becomes difficult due to the presence of hundreds of different solutions for storing security logs and numerous types of network devices. SDL simplifies such processes as automated data retrieval through APIs or other means, data parsing, and information accumulation.
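
To make the parsing step more concrete, here is a minimal sketch that turns one BSD-syslog-style line into named fields. The regular expression and the field names are assumptions chosen for illustration; real devices emit many format variants, so a production parser would need to handle far more cases.

```python
import re

# BSD-syslog-style line: "<PRI>Mmm dd hh:mm:ss host tag: message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>[^:\s]+): "
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return a dict of fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    pri = int(record.pop("pri"))
    # The priority value encodes facility * 8 + severity.
    record["facility"], record["severity"] = divmod(pri, 8)
    # Keep the original line alongside the parsed fields.
    record["raw"] = line
    return record


sample = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lu on /dev/pts/8"
parsed = parse_syslog(sample)
unparsed = parse_syslog("not a syslog line")
print(parsed)
```

Storing the raw line next to the parsed fields follows the Data Lake idea described earlier: the original record stays available even if the parser is later found to be incomplete.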

With large amounts of security data being generated, traditional security information and event management systems (SIEMs) can fail, struggling to gather the data effectively. To extract valuable information, the information security team must collect data across on-premises, cloud, and SaaS environments and then conduct analysis. Many tasks in this process are often manual and time-consuming. The implementation of SDL can address these challenges through automation.

Main features of SDL

There are five key features that SDL should have:

Automation of data collection and parsing. This is the key component of SDL. Organizations may use a wide variety of security systems, computers, mobile devices, and networks. Logs must be collected from each of these sources automatically to keep the collected data current and to allow for real-time analysis. When evaluating the required performance of SDL, you must consider the numbers and statistics coming from your own systems. For example, a typical information security system registers up to a million events per day that enter the SIEM for processing. One hundred thousand of them have a "red" level. These are usually clear signs of an intrusion requiring immediate assessment and response. Viewing this data manually is unrealistic. Automated data collection is feasible when the work is done through an API or over specific protocols (NetFlow, Syslog, Cisco eStreamer, etc.).
Automation of adding context for security logs. The collected data requires supplementary information for analysis. Therefore, the SDL also includes the function of data enrichment by adding context. For example, if a connection to a corporate system originates from an unfamiliar computer or remote location, InfoSec tools may block this operation. But when employees work remotely, deciding what to block becomes difficult without context. The same thing happens when a device connects to a Wi-Fi router. When collecting logs from the router, SDL must add context: the type of device being connected, its location, the role or position of the employee who authorized the connection, etc.
Markup of IP addresses. IP addresses on a corporate network are often assigned dynamically, meaning a single host may have varying IP addresses over time. If SDL does not have a mechanism for controlling IP addresses, monitoring potential intruders becomes very difficult. In this case, even the redundancy of the collected logs will not help.

Information security data analysis and reporting. It is essential to understand that IS and IT departments use different tools. Therefore, it is necessary to address the issue of the tools' compatibility.
Scalable architecture. As attacks grow more sophisticated, more and more data has to be collected from information security tools over time. Therefore, SDL must be able to scale. The need for scaling also arises when regulatory authorities are involved. New regulations on data retention time frames significantly affect the required storage capacity.
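
The context and IP address features above can be sketched together. The example below assumes a hypothetical DHCP lease history and asset inventory, resolves which host held an address at the time of an event, and attaches whatever context is known about that host. All names and data are invented, and the sketch makes the simplifying assumption that a lease stays valid until the next lease for the same address begins.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical DHCP lease history: (lease start, IP address, hostname).
LEASES = [
    (datetime(2023, 1, 10, 8, 0), "10.0.0.15", "laptop-alice"),
    (datetime(2023, 1, 10, 12, 0), "10.0.0.15", "laptop-bob"),
    (datetime(2023, 1, 10, 9, 0), "10.0.0.20", "printer-2f"),
]

# Hypothetical asset inventory used to add context to events.
ASSETS = {
    "laptop-alice": {"owner": "alice", "role": "finance", "location": "remote"},
    "laptop-bob": {"owner": "bob", "role": "engineering", "location": "HQ"},
}


def build_index(leases):
    """Group lease start times by IP, sorted so they can be bisected."""
    index = {}
    for start, ip, host in sorted(leases):
        starts, hosts = index.setdefault(ip, ([], []))
        starts.append(start)
        hosts.append(host)
    return index


def host_at(index, ip, when):
    """Return the host that held `ip` at time `when`, or None if unknown."""
    if ip not in index:
        return None
    starts, hosts = index[ip]
    # Find the most recent lease that started at or before `when`.
    pos = bisect_right(starts, when)
    return hosts[pos - 1] if pos else None


def enrich(event, index):
    """Return a copy of the event with host and asset context attached."""
    host = host_at(index, event["src_ip"], event["time"])
    return {**event, "host": host, "context": ASSETS.get(host, {})}


index = build_index(LEASES)
morning = enrich({"src_ip": "10.0.0.15", "time": datetime(2023, 1, 10, 9, 30)}, index)
evening = enrich({"src_ip": "10.0.0.15", "time": datetime(2023, 1, 10, 18, 0)}, index)
print(morning["host"], evening["host"])
```

The same address resolves to two different hosts depending on the event time, which is the situation the IP address markup feature is meant to handle.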

Additional features of SDL

Vendors are trying to expand the tooling of their products and offer features that can improve SDL. Here are some of them:

A set of services (Security-Analytics-as-a-Service) that allow machine learning to be applied to security research.
Graphic tools for analysis.
Possibility to install the platform on top of storage systems offered by other manufacturers, resulting in a hybrid SDL.
Wide range (200+) of customizable detection algorithms.
Ability to deploy on AWS, Snowflake, and other platforms.
Support for various types of backups, including forever incremental hypervisor backups and the full synthetic mode of data storage.
Ability to cache data and use the best lookup process to speed up search queries.

What is preventing the rapid adoption of SDL?

Unlike traditional IT systems, where the collection of logs is an auxiliary function, in information security tools, logs are part of their functional apparatus. At the same time, the licensing models for information security tools are often tied to the volume of data they process. Collecting more logs therefore leads to a steep hike in license costs. As a result, security teams sometimes deliberately do not collect all available data that could be useful in protecting against cyberattacks. If logs are absent, the attack may go undetected.

Additionally, the increasing amount of unstructured data, which some experts predict will make up 80% of the world's information by 2025, poses a significant challenge for search and analysis.

Finally, the implementation of SDL is often hampered by a lack of qualified personnel. Companies may not have staff with the skills and expertise needed to implement and manage an SDL effectively.

SDL or SIEM?

The main distinction between SDL and SIEM lies in their approach to proactive threat detection.

SIEM is an information system that allows you to identify the causes of alerts and serves to eliminate them. SDL is viewed more as a standalone system. It collects data about protected objects and stores information about possible cyber-attack vectors. Due to this, it can be used for machine learning.

SIEM helps analyze notifications and flag certain events for further investigation. But all further operations are carried out outside of this tool.

SDL can be used to search for threats and identify their signs by querying the accumulated information. The task of SDL is to identify possible threats, provide context for them, and predict the signals that should be expected in the event of a cyber-attack.
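
As one possible illustration of such a search, the sketch below loads a few authentication events into an in-memory SQLite table and looks for source addresses with several failed logins that were later followed by a successful one. The table layout, the threshold of three failures, and the sample data are all assumptions made for the example; an actual SDL would expose its own query interface and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_events (ts TEXT, src_ip TEXT, username TEXT, outcome TEXT)"
)
# Hypothetical sample events; ISO timestamps compare correctly as text.
conn.executemany(
    "INSERT INTO auth_events VALUES (?, ?, ?, ?)",
    [
        ("2023-01-10T08:00:01", "203.0.113.7", "root", "failure"),
        ("2023-01-10T08:00:03", "203.0.113.7", "root", "failure"),
        ("2023-01-10T08:00:05", "203.0.113.7", "admin", "failure"),
        ("2023-01-10T08:00:09", "203.0.113.7", "admin", "success"),
        ("2023-01-10T09:00:00", "198.51.100.4", "alice", "failure"),
        ("2023-01-10T09:00:10", "198.51.100.4", "alice", "success"),
        ("2023-01-10T10:00:00", "192.0.2.9", "guest", "failure"),
        ("2023-01-10T10:00:02", "192.0.2.9", "guest", "failure"),
        ("2023-01-10T10:00:04", "192.0.2.9", "guest", "failure"),
    ],
)

# Count failures per address, keeping only failures that were later
# followed by a success from the same address.
HUNT_QUERY = """
SELECT f.src_ip, COUNT(*) AS failures
FROM auth_events AS f
WHERE f.outcome = 'failure'
  AND EXISTS (
      SELECT 1 FROM auth_events AS s
      WHERE s.src_ip = f.src_ip
        AND s.outcome = 'success'
        AND s.ts > f.ts
  )
GROUP BY f.src_ip
HAVING COUNT(*) >= ?
ORDER BY failures DESC
"""

suspects = conn.execute(HUNT_QUERY, (3,)).fetchall()
print(suspects)
```

A query like this only works if the failed attempts were collected in the first place, which is why the completeness of the stored logs matters for threat hunting.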

Conclusion

Security Data Lakes, a specialized type of Data Lake designed for information security, are still in their early stages of development. Still, some products are already available to enhance organizational security. SDLs can be a valuable tool for security officers to detect attackers within an organization quickly.

Tags: Big Data, Data Security
