Every now and then in the cyber security industry, a news headline becomes explosive. “The Biggest DDoS in History”, “Massive Data Leak from Company X”, or “Researchers Discover a New Bleeding-Edge Tool on the Dark Web” garners a lot of attention for a short while, leaving CISOs and SOC teams scratching their heads over whether it affects them and how they can better prepare for this new threat. A few days ago, such a headline caught everyone’s attention – a massive compilation of 16 billion records stolen by infostealers had been discovered. Some of our customers have already reached out to us to better understand this recent discovery, and there are TV news reports on the finding, indicating that the story has captured the attention of the wider community. However, as news articles have already pointed out, those 16 billion credentials are not a new data breach, but a compilation of existing infostealer log data.
However, this claim tells only part of the story when it comes to infostealer logs. When assessing the risk of a finding that involves such a massive amount of data, it is important to consider the quality of the source. As we’ve pointed out in a previous article, infostealer logs are inherently “noisy” – the vast majority of records are invalid, whether because they are a compilation of previous data (as in this case), with some records being almost a decade old, or because they are flat-out fake, generated by LLMs and other tools. Even in cases where the data is valid and legitimate, many infostealer logs contain duplicates intended to artificially inflate the sample size. Most samples also contain “legitimate” duplicates: since the purpose of an infostealer is to capture any credentials it can, samples usually contain multiple versions of the same record captured on different webpages of the same website. For example, a sample may contain a record captured from https://website.com/login/, as well as from https://website.com/account/, https://website.com/update/ and so on. All of this contributes to a huge number of credentials in each infostealer log sample, while in reality only very few of them are actually relevant. This “noisy” characteristic of infostealer logs has been observed both in free samples provided by threat actors and in “private” files that are considered premium quality (for a more in-depth look into how the freemium model is used by the infostealer logs community, check out our other article here).
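To illustrate why raw record counts overstate real exposure, here is a minimal deduplication sketch. It is purely illustrative: the records, URLs and credentials are made up, and normalizing on host plus username plus password is just one simple assumption, not the actual pipeline any vendor uses. Under that assumption, four “credentials” collapse into two.

```python
from urllib.parse import urlparse

# Hypothetical raw infostealer-log records: (captured URL, username, password).
# All values below are invented placeholders for illustration only.
raw_records = [
    ("https://website.com/login/",   "alice@example.com", "hunter2"),
    ("https://website.com/account/", "alice@example.com", "hunter2"),
    ("https://website.com/update/",  "alice@example.com", "hunter2"),
    ("https://othersite.com/login",  "bob@example.com",   "p@ssw0rd"),
]

def dedupe(records):
    """Collapse records that differ only by the page path on the same host."""
    seen = set()
    unique = []
    for url, user, password in records:
        host = urlparse(url).netloc.lower()  # ignore /login/, /account/, /update/ ...
        key = (host, user, password)
        if key not in seen:
            seen.add(key)
            unique.append((url, user, password))
    return unique

print(len(raw_records), "->", len(dedupe(raw_records)))  # prints: 4 -> 2
```

A real pipeline would of course need more than this (handling subdomains, old breach compilations, fabricated records and so on), but even this toy example shows how quickly an inflated record count shrinks once obvious duplicates are folded together.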
In the infostealer logs scene, the majority of samples provided by threat actors are of relatively low quality. As a service that collects and reports infostealer logs relevant to our clients, we often avoid adding certain samples to our repository because the data appears unreliable. We also purge records (sometimes millions of them) if we learn about the poor quality of a sample after the fact. This is necessary in order to maintain a decent level of quality for our customers. To reach a compilation of 16 billion records, one cannot be as picky as we are (we “only” have 1 billion records in our repository).
This compilation is not the first massive sample to cause ripples across the industry. A previous massive leak, ALIEN TEXTBASE, also made headlines, only to turn out to contain a lot of generated fake information sprinkled with older data. That is not to say that the recent finding poses no risk to organizations, but considering it was compiled from other data sources, it adds to the already existing risk posed by infostealers rather than introducing a new one. Compromised credentials from infostealer logs should be analyzed and investigated, as they do present a real danger to organizations. However, SOC teams should be aware that they will have to sift through mud to find the gold nuggets of intelligence (which are indeed valuable), and, as the saying goes, “size doesn’t matter” (when it comes to samples of infostealer logs).