VESoft raises $8M to meet China’s growing need for graph databases

Sherman Ye founded VESoft in 2018 when he saw a growing demand for graph databases in China. Forerunners like Neo4j and TigerGraph had already been growing aggressively in the West for a few years, while China was just getting to know the technology, which leverages graph structures to store data sets and map their relationships for uses such as social media analysis, e-commerce recommendations and financial risk management.

VESoft is ready for further growth after closing an $8 million funding round led by Redpoint China Ventures, an investment firm launched by Silicon Valley-based Redpoint Ventures in 2005. Existing investor Matrix Partners China also participated in the pre-Series A round. The new capital will allow the startup to develop products and expand to markets in North America, Europe and other parts of Asia.

The 30-person team comprises former employees of Alibaba, Facebook, Huawei and IBM. It’s based in Hangzhou, a scenic city known for its rich history and for housing Alibaba and its financial affiliate Ant Financial, where Ye previously worked as a senior engineer after a four-year stint with Facebook in California. From 2017 to 2018, the entrepreneur noticed that Ant Financial’s customers were increasingly interested in adopting graph databases as an alternative to relational databases, a model that had been popular since the 1980s and normally organizes data into tables.

“While relational databases are capable of achieving many functions carried out by graph databases… they deteriorate in performance as the quantity of data grows,” Ye told TechCrunch during an interview. “We didn’t use to have so much data.”

The information explosion is one reason why Chinese companies are turning to graph databases, which can handle millions of transactions to discover patterns within scattered data. The technology’s rise is also a response to new forms of online businesses that depend more on relationships.

“Take recommendations for example. The old model recommends content based purely on user profiles, but the problem of relying on personal browsing history is it fails to recommend new things. That was fine for a long time as the Chinese [internet] market was big enough to accommodate many players. But as the industry becomes saturated and crowded… companies need to ponder how to retain existing users, lengthen their time spent, and win users from rivals.”

The key lies in serving people content and products they find appealing. Graph databases come in handy, suggested Ye, when services try to predict users’ interest or behavior as the model uncovers what their friends or people within their social circles like. “That’s a lot more effective than feeding them what’s trending.”

Neo4j compares relational and graph databases.

The company has made its software open source, which the founder believes can help cultivate a community of graph database users and educate the market in China. It will also allow VESoft to reach more engineers in the English-speaking world who are well-acquainted with open-source culture.

“There is no such thing as being ‘international’ or ‘domestic’ for a technology-driven company. There are no boundaries between countries in the open-source world,” reckoned Ye.

When it comes to generating income, the startup plans to launch a paid version for enterprises, which will come with customized plug-ins and hosting services.

Nebula Graph, VESoft’s database product, now serves 20 enterprise clients across social media, e-commerce and finance, including big names like food delivery giant Meituan, popular social commerce app Xiaohongshu and e-commerce leader JD.com. A number of overseas companies are also trialing Nebula.

The time is ripe for enterprise-facing startups with a technological moat in China, as the consumer market has been carved up by incumbents like Tencent and Alibaba. That has made fundraising relatively easy for VESoft. The founder is confident that Chinese companies are rapidly catching up with their Western counterparts in the space, because the gargantuan amount of data and the myriad ways it is used in the country “will propel the technology forward.”

How Liberty Mutual shifted 44,000 workers from office to home

In a typical month, an IT department might deal with a small percentage of employees working remotely, but tracking a few thousand employees is one thing — moving an entire company offsite requires next-level planning.

To learn more about how large organizations are adapting to the rapid shift to working from home, we spoke to Liberty Mutual CIO James McGlennon, who orchestrated his company’s move, about the challenges he faced as he shifted more than 44,000 employees in a variety of jobs, locations, cultures and living situations from office to home in short order.

Laying the groundwork

Insurance company Liberty Mutual is headquartered in the heart of Boston, but the company has offices in 29 countries. While some staffers in parts of Asia and Europe were sent home earlier in the year, by mid-March the company had closed all of its offices in the U.S. and Canada, eventually sending every employee home.

McGlennon said he never imagined such a situation, but the company had seen certain networking issues in recent years that gave it an inkling of what one might look like. That included an unexpected incident in which two points on a network ring around one of its main data centers went down in quick succession, first because a backhoe hit a line and then because someone stole fiber-optic cable at another point on the ring.

That got the CIO and his team thinking about how to respond to worst cases. “We certainly hadn’t contemplated needing to get 44,000 people working from home or working remotely so quickly, but there have been a few things that have happened over the last few years that made me think,” he said.

Privnotes.com Is Phishing Bitcoin from Users of Private Messaging Service Privnote.com

For the past year, a site called Privnotes.com has been impersonating Privnote.com, a legitimate, free service that offers private, encrypted messages which self-destruct automatically after they are read. Until recently, I couldn’t quite work out what Privnotes was up to, but today it became crystal clear: Any messages containing bitcoin addresses will be automatically altered to include a different bitcoin address, as long as the Internet addresses of the sender and receiver of the message are not the same.

Earlier this year, KrebsOnSecurity heard from the owners of Privnote.com, who complained that someone had set up a fake clone of their site that was fooling quite a few regular users of the service.

And it’s not hard to see why: Privnotes.com is confusingly similar in name and appearance to the real thing, and comes up second in Google search results for the term “privnote.” Also, anyone who mistakenly types “privnotes” into Google search may see at the top of the results a misleading paid ad for “Privnote” that actually leads to privnotes.com.

A Google search for the term “privnotes” brings up a misleading paid ad for the phishing site privnotes.com, which is listed above the legitimate site — privnote.com.

Privnote.com (the legit service) employs technology that encrypts all messages so that even Privnote itself cannot read their contents. And it doesn’t send or receive messages. Creating a message merely generates a link. When that link is clicked or visited, the service warns that the message will be gone forever after it is read.

But according to the owners of Privnote.com, the phishing site Privnotes.com does not fully implement encryption, and can read and/or modify all messages sent by users.

“It is very simple to check that the note in privnoteS is sent unencrypted in plain text,” Privnote.com explained in a February 2020 message, responding to inquiries from KrebsOnSecurity. “Moreover, it doesn’t enforce any kind of decryption key when opening a note and the key after # in the URL can be replaced by arbitrary characters and the note will still open.”

But that’s not the half of it. KrebsOnSecurity has learned that the phishing site Privnotes.com uses some kind of automated script that scours messages for bitcoin addresses and replaces any it finds with its own bitcoin address. The script apparently only modifies messages if the note is opened from a different Internet address than the one that composed it.

Here’s an example, using the bitcoin wallet address from bitcoin’s Wikipedia page. The following note was composed at Privnotes.com from a computer with an Internet address in New York, with the message, “please send money to bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq thanks”:

A test message composed on privnotes.com, which is phishing users of the legitimate encrypted message service privnote.com. Pay special attention to the bitcoin address in this message.

When I visited the Privnotes.com link generated by clicking the “create note” button on the above page from a different computer with an Internet address in California, this was the result. As you can see, it lists a different bitcoin address, albeit one with the same first four characters.

The altered message. Notice the bitcoin address has been modified and is not the same address that was sent in the original note.

Several other tests confirmed that the bitcoin modifying script does not seem to change message contents if the sender and receiver’s IP addresses are the same, or if one composes multiple notes with the same bitcoin address in it.

Allison Nixon, the security expert who helped me with this testing, said the script also only seems to replace the first instance of a bitcoin address if it’s repeated within a message, and the site stops replacing a wallet address if it is sent repeatedly over multiple messages.

“And because of the design of the site, the sender won’t be able to view the message because it self destructs after one open, and the type of people using privnote aren’t the type of people who are going to send that bitcoin wallet any other way for verification purposes,” said Nixon, who is chief research officer at Unit 221B. “It’s a pretty smart scam.”

Given that Privnotes.com is phishing bitcoin users, it’s a fair bet the phony service also is siphoning other sensitive data from people who use its site.

“So if there are password dumps in the message, they would be able to read that, too,” Nixon said. “At first, I thought that was their whole angle, just to siphon data. But the bitcoin wallet replacement is probably much closer to the main motivation for running the fake site.”

Even if you never use or plan to use the legitimate encrypted message service Privnote.com, this scam is a great reminder why it pays to be extra careful about using search engines to find sites that you plan to entrust with sensitive data. A far better approach is to bookmark such sites, and rely exclusively on those instead.

The Good, the Bad and the Ugly in Cybersecurity – Week 24

The Good

With the first half of 2020 more or less behind us, and the U.S. election season fast approaching in what is already the most turbulent year in decades, it’s good to see cybersecurity being ramped up at the national level. To that end, the National Guard and U.S. Cyber Command have teamed up to provide timely data and response to cyber attacks, from ransomware infections to election security incidents, through the Cyber 9-Line initiative.

Cyber 9-Line uses a common framework that allows rapid reporting of incidents by National Guard units, which is fed into USCYBERCOM’s Cyber National Mission Force. The CNMF can then diagnose the incident and provide unclassified feedback to help address it. While only 12 states have so far completed the registration process, most others are now working through the steps to establish accounts and undertake training. USCYBERCOM has said that defense of the 2020 election is “the number-one priority of both the command and the National Security Agency”. Cyber 9-Line is expected to play a crucial role in ensuring election integrity.

Meanwhile, over the pond in the UK, the Ministry of Defence has also been gearing up to fight off digital attacks with the launch of a new cyber regiment, the 13th Signal Regiment. In a statement, the British Army said that the new outfit would match “cutting edge technology with cyber-fit soldiers to compete and win in the information age.”

The Bad

Remember Meltdown and Spectre? And the side-channel attacks RIDL, Fallout and ZombieLoad? These processor-level vulnerabilities from yesteryear (OK, 2018 and 2019, actually) made it possible for attackers to extract sensitive information as it passed through an Intel CPU’s microarchitectural buffers.

The source of the problem, dubbed Microarchitectural Data Sampling (MDS), was so deeply rooted it wasn’t possible to prevent the buffers leaking; the best Intel could do was update existing processors’ microcode so that buffers would be overwritten whenever the CPU switched to a new security-sensitive task. Intel subsequently released their 8th-gen Whiskey Lake CPUs that were supposed to be resistant to these kinds of MDS attacks. Alas, the bad news is it seems these mitigation strategies didn’t entirely work. New research from two separate teams has shown that even on Whiskey Lake machines, it’s possible to bypass the countermeasures.

SGAxe builds on an earlier attack, CacheOut, and exploits CVE-2020-0549 to steal user data from Software Guard Extensions (SGX) secure enclaves, while CrossTalk makes it possible for attackers to leak data protected in an SGX enclave even if the attacker’s code is running on a different CPU core to the one holding the sensitive data.

The researchers said that “it is almost trivial to apply these attacks to break code running in Intel’s secure SGX enclaves” and that “mitigations against existing transient execution attacks are largely ineffective”.

Intel refers to CrossTalk as Special Register Buffer Data Sampling (SRBDS) and has said that its Atom, Xeon Scalable and 10th Gen Intel Core families of processors are not affected. For processor families that are affected, expect vendors to provide updates in the coming weeks. Patches against an earlier vulnerability, along with developers following recommended guidelines, should also help protect against CacheOut and SGAxe, Intel has said.

The Ugly

Human rights defenders, environmentalists and journalists, as well as politicians and CEOs, are among the tens of thousands of people who have been targeted by a hitherto unknown hackers-for-hire group dubbed ‘Dark Basin’, according to Citizen Lab, a Canadian research group focused on digital threats to civil society.

American non-profit organizations have been extensively targeted by the Dark Basin group, who also engaged in phishing campaigns against organizations advocating net neutrality and fighting to expose climate denial activities. A partial list of targets who agreed to be named includes:

  • 350.org
  • Climate Investigations Center
  • Conservation Law Foundation
  • Center for International Environmental Law
  • Greenpeace
  • Public Citizen
  • Union of Concerned Scientists

The Dark Basin group was uncovered through its use of a custom URL shortener in its phishing campaigns. The researchers were able to identify almost 28,000 URLs containing email addresses of targets after they discovered that the shortener created URLs with sequential shortcodes. The malicious links led to credential phishing sites: attacker-controlled clones of login pages for popular services like Facebook, LinkedIn and Google Mail, among others.

Initially suspecting the threat actor may have been a state-sponsored APT, Citizen Lab instead unearthed links between the targets and individuals working at a private, India-based company called “BellTrox InfoTech Services” and “BellTrox D|G|TAL Security”. While the researchers say they have “high confidence” that BellTrox employees are behind Dark Basin’s activities, they do not have strong evidence pointing to any party who may have commissioned the hacking.



OpenStack adds the StarlingX edge computing stack to its top-level projects

The OpenStack Foundation today announced that StarlingX, a container-based system for running edge deployments, is now a top-level project. With this, it joins the main OpenStack private and public cloud infrastructure project, the Airship lifecycle management system, Kata Containers and the Zuul CI/CD platform.

What makes StarlingX a bit different from some of these other projects is that it is a full stack for edge deployments — and in that respect, it’s maybe more akin to OpenStack than the other projects in the foundation’s stable. It uses open-source components from the Ceph storage platform, the KVM virtualization solution, Kubernetes and, of course, OpenStack and Linux. The promise here is that StarlingX can provide users with an easy way to deploy container and VM workloads to the edge, all while being scalable, lightweight and providing low-latency access to the services hosted on the platform.

Early StarlingX adopters include China UnionPay, China Unicom and T-Systems. The original codebase was contributed to the foundation by Intel and Wind River Systems in 2018. Since then, the project has seen 7,108 commits from 211 authors.

“The StarlingX community has made great progress in the last two years, not only in building great open source software but also in building a productive and diverse community of contributors,” said Ildiko Vancsa, ecosystem technical lead at the OpenStack Foundation. “The core platform for low-latency and high-performance applications has been enhanced with a container-based, distributed cloud architecture, secure booting, TPM device enablement, certificate management and container isolation. StarlingX 4.0, slated for release later this year, will feature enhancements such as support for Kata Containers as a container runtime, integration of the Ussuri version of OpenStack, and containerization of the remaining platform services.”

It’s worth remembering that the OpenStack Foundation has gone through a few changes in recent years. The most important of these is that it is now taking on other open-source infrastructure projects that are not part of the core OpenStack project but are strategically aligned with the organization’s mission. The first of these to graduate out of the pilot project phase and become top-level projects were Kata Containers and Zuul in April 2019, with Airship joining them in October.

Currently, the only pilot project for the OpenStack Foundation is its OpenInfra Labs project, a community of commercial vendors and academic institutions, including the likes of Boston University, Harvard, MIT, Intel and Red Hat, that are looking at how to better test open-source code in production-like environments.

 

Gauging growth in the most challenging environment in decades

Traditionally, measuring business success requires a deep understanding of your company’s go-to-market lifecycle, how customers engage with your product and the macro-dynamics of your market. But in the most challenging environment in decades, those metrics are out the window.

Enterprise application and SaaS companies are changing their approach to measuring performance and preparing to grow when the economy begins to recover. While there are no blanket rules or guidance that applies to every business, company leaders need to focus on a few critical metrics to understand their performance and maximize their opportunities. This includes understanding their burn rate, the overall real market opportunity, how much cash they have on hand and their access to capital. Analyzing the health of the company through these lenses will help leaders make the right decisions on how to move forward.

Play the game with the hand you were dealt. Earlier this year, our company closed a $40 million Series C round of funding, which left us in a strong cash position as we entered the market slowdown in March. Nonetheless, as the impact of COVID-19 became apparent, one of our board members suggested that we quickly develop a business plan that assumed we were running out of money. This would enable us to get on top of the tough decisions we might need to make on our resource allocation and the size of our staff.

While I understood the logic of his exercise, it is important that companies develop and execute against plans that reflect their actual situation. The reality is, we did raise the money, so we revised our plan to balance ultra-conservative forecasting (and as a trained accountant, this is no stretch for me!) with new ideas for how to best utilize our resources based on the market situation.

Burn rate matters, but not at the expense of your culture and your talent. For most companies, talent is both their most important resource and their largest expense. Therefore, it’s usually the first area that goes under the knife in order to reduce monthly spend and optimize efficiency. Fortunately, heading into the pandemic, we had not yet ramped up hiring to support our rapid growth, so we were spared from having to make enormously difficult decisions. We knew, however, that we would not hit our 2020 forecast, which required us to make new projections and reevaluate how we were deploying our talent.

Email Reply Chain Attacks | What Are They & How Can You Stay Safe?

As recent data confirms, email phishing remains the number one vector for enterprise malware infections, and Business Email Compromise (BEC) the number one cause of financial loss due to internet crime in organizations. While typical phishing and spearphishing attacks attempt to spoof the sender with a forged address, a more sophisticated attack hijacks legitimate email correspondence chains to insert a phishing email into an existing conversation. The technique, known variously as a ‘reply chain attack’, ‘hijacked email reply chain’ and ‘thread hijack spamming’, was observed by SentinelLabs researchers in their recent analysis of Valak malware. In this post, we dig into how email reply chain attacks work and explain how you can protect yourself and your business from this adversary tactic.

How Do Email Reply Chain Attacks Work?

Hijacking an email reply chain begins with an email account takeover. Either through an earlier compromise and credential dump, or through techniques such as credential stuffing and password spraying, hackers gain access to one or more email accounts and then begin monitoring conversation threads for opportunities to send malware or poisoned links to one or more of the participants in an ongoing chain of correspondence.

The technique is particularly effective because a bond of trust has already been established between the recipients. The threat actor neither inserts themselves as a new correspondent nor attempts to spoof someone else’s email address. Rather, the attacker sends their malicious email from the genuine account of one of the participants.

Since the attacker has access to the whole thread, they can tailor their malspam message to fit the context of an ongoing conversation. This, on top of the fact that the recipient already trusts the sender, massively increases the chance of the victim opening the malicious attachment or clicking a dangerous link.

To see how this works, suppose an account belonging to “Sam” has been compromised, and the attacker sees that Sam and “Georgie” (and perhaps others) have been discussing a new sales campaign. The attacker can use this context to send Georgie a malicious document that appears related to the conversation they are currently having.

In order to keep the owner of the compromised account ignorant of the attacker’s behaviour, hackers will often use an alternate Inbox to receive messages.

This involves using the email client’s rules to route particular messages away from the usual Inbox and into a folder that the genuine account holder is unlikely to inspect, such as the Trash folder. With this technique, if Georgie in our example replies to Sam’s phishing email, the reply can be diverted so the real Sam never sees it.

Alternatively, when a hacker successfully achieves an account takeover, they may use the email client’s settings to forward mail from certain recipients to another account.

Another trick that can help keep an account holder in the dark is to create an email rule that scans incoming messages for keywords such as “phish”, “phishing”, “hack” and “hacked”, and either deletes them or auto-replies to them with a canned message. This prevents any suspicious or concerned colleagues from alerting the account holder with emails like “Have you been hacked?” and so on.
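To make the rule mechanics concrete, here is a minimal toy simulation in Python of the kind of filtering described above. It is not tied to any real mail client or API; the keyword list, sender address and folder names are all illustrative.

```python
# Toy simulation of the "hide the evidence" mail rules described above.
# Purely illustrative: real attackers configure equivalent rules inside the
# compromised account's own mail client or webmail settings.

SUSPICIOUS_KEYWORDS = {"phish", "phishing", "hack", "hacked"}   # illustrative keyword list
DIVERTED_SENDERS = {"georgie@example.com"}                      # replies the attacker wants hidden


def route_incoming(message: dict) -> str:
    """Decide where an incoming message goes: 'inbox', 'trash' or 'auto-reply'."""
    text = (message.get("subject", "") + " " + message.get("body", "")).lower()

    # Rule 1: intercept anything that might tip off the account owner.
    if any(keyword in text for keyword in SUSPICIOUS_KEYWORDS):
        return "auto-reply"  # or simply divert it to trash

    # Rule 2: divert replies from targeted correspondents so the owner never sees them.
    if message.get("from", "").lower() in DIVERTED_SENDERS:
        return "trash"

    return "inbox"


if __name__ == "__main__":
    print(route_incoming({"from": "georgie@example.com", "subject": "Re: sales campaign", "body": "Got the doc"}))
    print(route_incoming({"from": "colleague@example.com", "subject": "Have you been hacked?", "body": ""}))
```

In a real compromise the equivalent logic lives in the mail client’s own rules engine, which is why the advice later in this post to regularly review your mailbox rules matters.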

Which Malware Families Have Used Reply Chain Attacks?

Email reply chain attacks began appearing in 2017. In 2018 Gozi ISFB/Ursnif banking trojan campaigns also started using the technique, although in some cases the chain of correspondence itself was faked simply to add legitimacy; in others, the attackers compromised legitimate accounts and used them both to hijack existing threads and to spam other recipients.

Malicious attachments may leverage VBScript and PowerShell through Office Macros to deliver payloads such as Emotet, Ursnif and other loader or banking trojan malware.

SentinelLabs researchers have shown how Valak malware uses specialized plugins designed to steal credentials specifically for use in email reply chain attacks.

As the researchers point out:

“If you are going to leverage reply chain attacks for your spamming campaigns, then you obviously need some email data. It’s interesting to see that when campaigns shifted more towards Valak and away from Gozi, the addition of a plugin surrounding the theft of exchange data showed up.”

Why Are Email Reply Chain Attacks So Effective?

Although spearphishing and even blanket spam phishing campaigns are still a tried-and-trusted method of attack for threat actors, email reply chain attacks raise the bar for defenders considerably.

In ordinary phishing attacks, it is common to see tell-tale grammar and spelling errors.

Also, mass spoofing emails are often sent with subjects or body messages that bear little meaningful context to most recipients, immediately raising suspicion.

Even with more targeted spearphishing attacks, awareness training and safe email practices, such as not clicking links or opening attachments from unknown senders and not replying to unsolicited emails, can help reduce risk. With email reply chain attacks, however, the usual warning indicators may be missing.

Email reply chain attacks are often carefully-crafted with no language errors, and the leap in credibility gained by inserting a reply to an existing thread from a legitimate sender means that even the most cautious and well-trained staff are at risk of falling victim to this kind of tactic.

How Can You Prevent a Reply Chain Attack?

Given their trusted, legitimate point of origin and the fact that the attacker has email history and conversational context, it can be difficult to spot a well-crafted reply chain attack, particularly if it appears in (or appears to be part of) a long thread with multiple, trusted participants.

However, there are several recommendations that you can follow to avoid becoming a victim of this type of fraud.

First, since reply chain attacks rely on account compromises, ensure that all your enterprise email accounts follow best practices for security. That must include two-factor or multi-factor authentication and unique passwords of at least 16 characters on every account. Users should also be encouraged to regularly inspect their own email client settings and mail rules to make sure that messages are not unknowingly being diverted or deleted.

Second, lock down or entirely forbid use of Office Macros wherever possible. Although these are not the only means by which malicious attachments can compromise a device, Macros remain a common attack vector.

Third, knowledge is power, so expand your user awareness training to include discussion of email reply chain attacks and how they work by referring staff to articles such as this one. Email users need to raise their awareness of how phishing attacks work and how attackers are evolving their techniques. Crucially, they need to understand why it’s important to treat all requests to open attachments or click links with a certain amount of caution, no matter what the source.

Fourth, and most importantly, ensure that your endpoints are protected with a modern, trusted EDR security solution that can stop the execution of malicious code hidden in attachments or links before it does any damage. Legacy AV suites that rely on reputation and YARA rules were not built to handle modern, fileless and polymorphic attacks. A next-gen, automated AI-powered platform is the bare minimum in today’s cyber security threatscape.

Conclusion

Email reply chain attacks are yet another form of social engineering deployed by threat actors to achieve their aims. Unlike the physical world with its hardcoded laws of nature, there are no rules in the cyber world that cannot be changed either by manipulating the hardware, the software or the user. This, however, is just as true for defenders as it is for attackers. By keeping control of all aspects of our cyber environment, we can defeat attacks before they occur or create lasting damage to the organization. Secure your devices, educate your users, train your staff, and let the criminals find another target.



Next-Gen AV and The Challenge of Optimizing Big Data at Scale

At SentinelOne, we provide full visibility into managed endpoint data for our customers. Over time, the volume of event data we need to store, search and retrieve has become huge, and we currently handle around 200 billion events per day. While collection and storage are easy enough, querying the data quickly and efficiently is the main challenge. In this post, we will share how we overcame this challenge and achieved the ability to quickly query tremendous amounts of data.

Architectural Overview

Our event stream results in a lot of small files, which are written to S3 and HDFS. We store our partitions in the Hive Metastore and query them with Presto. Files are partitioned by date, with a new partition created automatically each day.
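As a rough illustration of this layout, the sketch below creates a date-partitioned external ORC table through the Presto Hive connector and queries a single day’s partition. It uses the presto-python-client (prestodb); the host, schema, bucket and column names are assumptions, not the actual production setup.

```python
# Sketch only: host, schema, bucket and column names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="analyst",
    catalog="hive", schema="events",
)
cur = conn.cursor()


def run(sql):
    cur.execute(sql)
    return cur.fetchall()  # drain the result so the statement completes


# A date-partitioned external ORC table; every day adds a new dt partition.
run("""
    CREATE TABLE IF NOT EXISTS endpoint_events (
        event_time timestamp,
        agent_id   varchar,
        payload    varchar,
        dt         date
    )
    WITH (format = 'ORC',
          partitioned_by = ARRAY['dt'],
          external_location = 's3a://example-bucket/endpoint_events/')
""")

# Queries that filter on dt only touch that day's partition.
print(run("SELECT count(*) FROM endpoint_events WHERE dt = DATE '2020-06-12'"))
```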


We started in a naive way, by simply aggregating events in the pipeline into files being periodically pushed to S3. While this worked well at the beginning, as scale surged, a serious problem emerged.

To allow near real-time search across our managed endpoints, we wrote many small files, rather than attempting to combine them into larger files. We also support the arrival of late events, so data might arrive with an old timestamp after a few days.

Unfortunately, Presto doesn’t work well with many small files. We were working with tens of thousands of files, ranging from hundreds of kilobytes to tens of megabytes. Leaving data in many small files made queries very slow, so we faced the challenge of solving the common “small files” problem.

Attempt 1 — Reduce Frequency of Writing

Our first thought was simply to write less often. While this reduced the number of files, it conflicted with our business constraints of having events searchable within a few minutes. Data is flushed frequently to allow queries on recent events, thus generating millions of files.

Our files are written in ORC format, so appending ORC files is possible, but not effective. Appending ORC stripes without decompressing the data and restructuring the stripes produces big files that query very slowly.

Attempt 2 — In-place Compaction

Next, we tried to compact files on the fly in S3. Since our data volumes were small, we were able to compact the files in-memory with Java code. We maintained Hive Metastore for partition locations. It was quite a simple solution, but it turned out to be a headache.


Compacting files in place is challenging since there’s no way to make atomic writes in S3. We had to take care of deleting the small files that were replaced by compacted files, and because the Hive Metastore partition kept pointing to the same S3 location throughout, we ended up with duplicate or missing data for a while.

S3 listing is eventually consistent:

“A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.”

Although a file would be uploaded successfully, it might not show up in the listing until perhaps half an hour later. Those issues were unacceptable, so we were back to the small files problem.

Attempt 3 — Write files to HDFS, Then Copy to S3

To mitigate S3’s eventual consistency, we decided to move to HDFS, which Presto supports natively, so the transition required zero work.

The small files problem is also a known issue in HDFS. The HDFS NameNode holds all file system metadata in memory, and each entry takes about 1 KB, so as the number of files grows, HDFS requires a lot of RAM.

We experienced even worse degradation when trying to save our real-time small files in S3:

  • When querying the data, Presto retrieves the partition mapping from the Hive Metastore and lists the files in each partition. As mentioned above, S3 listing is eventually consistent, so in real time we sometimes missed a few files in the list response. Listing in HDFS is deterministic and immediately returns all files.
  • The S3 list API response is limited to 1,000 entries, so when listing a directory with a large number of files, Hive executes several API requests, which costs time (see the sketch after this list).
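The listing overhead is easy to demonstrate with a quick check like the sketch below, which uses boto3 against a hypothetical bucket and prefix: every additional thousand objects in a partition means another ListObjectsV2 round trip before Presto can even start reading data.

```python
# Sketch: count how many ListObjectsV2 calls it takes to enumerate one partition.
# Bucket and prefix names are hypothetical.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

objects, api_calls = 0, 0
for page in paginator.paginate(
    Bucket="example-bucket",
    Prefix="endpoint_events/dt=2020-06-12/",
):
    api_calls += 1                           # each page is one API request (max 1,000 keys)
    objects += len(page.get("Contents", []))

print(f"{objects} files required {api_calls} list requests")
```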

We kept a different location for each partition in the Hive Metastore: the current day’s partition pointed to HDFS, while partitions holding older data pointed to S3.
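A minimal sketch of how such mixed partition locations can be registered is shown below, here using PyHive to issue the Hive DDL; the host, table and path names are assumptions rather than the production configuration.

```python
# Sketch: register one partition on HDFS (today) and one on S3 (older data).
# Host, table and path names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000)
cur = conn.cursor()

# Today's partition lives on HDFS for fast, consistent listing.
cur.execute(
    "ALTER TABLE endpoint_events ADD IF NOT EXISTS PARTITION (dt='2020-06-12') "
    "LOCATION 'hdfs://namenode:8020/warehouse/endpoint_events/dt=2020-06-12'"
)

# Older data stays on S3.
cur.execute(
    "ALTER TABLE endpoint_events ADD IF NOT EXISTS PARTITION (dt='2020-06-11') "
    "LOCATION 's3a://example-bucket/endpoint_events/dt=2020-06-11'"
)
```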


This solution solved our consistency issues, but still, small files are problematic! How can we avoid that? Let’s try compaction again.

Attempt 4 — Compaction with Presto Clusters

Spawning a Presto cluster and running compaction with it is simple.

At the end of each day, we created EMR clusters that handled compaction of the previous day’s files. Our clusters had hundreds of memory-optimized nodes, with the compaction done in memory.


When you set up a Presto cluster for compaction, you need to follow these steps:

  • Set session parameters that optimize the output of the compacted files.
  • Create two external tables, a source table over the raw data and a destination table for the compacted data, sharing the same schema.
  • Add partitions to the Hive Metastore so that the source and destination tables point to the correct locations.
  • Finally, run the magical INSERT command that performs the compaction. A combined sketch of these steps follows below.
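Here is a combined sketch of those four steps, issued through the presto-python-client against hypothetical table, bucket and path names; the site-specific writer-tuning session parameters are only indicated by a placeholder comment.

```python
# Sketch of the daily compaction job; table, bucket and path names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-compaction-coordinator", port=8080, user="compaction",
    catalog="hive", schema="events",
)
cur = conn.cursor()


def run(sql):
    cur.execute(sql)
    return cur.fetchall()  # drain the result so the statement completes


# Step 1: writer-tuning session parameters would be set here (site-specific, omitted).

# Step 2: two external tables over the same schema, one for the raw small files
# and one for the compacted output.
for table, location in [
    ("endpoint_events_raw", "s3a://example-bucket/raw/"),
    ("endpoint_events_compacted", "s3a://example-bucket/compacted/"),
]:
    run(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            event_time timestamp, agent_id varchar, payload varchar, dt date
        )
        WITH (format = 'ORC',
              partitioned_by = ARRAY['dt'],
              external_location = '{location}')
    """)

# Step 3: partitions for both tables are registered in the Hive Metastore
# (e.g. with ALTER TABLE ... ADD PARTITION, as in the earlier sketch).

# Step 4: the "magical" insert -- reads many small files, writes a few big ones.
run("""
    INSERT INTO endpoint_events_compacted
    SELECT * FROM endpoint_events_raw
    WHERE dt = DATE '2020-06-12'
""")
```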

The rapid growth in SentinelOne’s data made this system infeasible from a cost and maintenance perspective. We encountered several problems as our data grew:

  • Each big partition held about 200 GB on disk, which translates to roughly 2 TB of raw, uncompressed data in memory every day. Compaction is done on uncompressed data, so holding it in memory through the entire compaction process required huge clusters.
  • Running a few clusters with hundreds of nodes is quite expensive. At first, we ran on Spot Instances to reduce costs, but as our clusters grew, it became hard to get a lot of big nodes for several hours; at peak, one cluster ran for three hours to compact one big partition. When we moved to On-Demand machines, costs increased dramatically.
  • Presto has no built-in fault-tolerance mechanism, which is very disruptive when running on Spot Instances. If even one Spot Instance failed, the whole operation failed and we had to run it all over again. This caused delays in switching to compacted data, which resulted in slower queries.

As files were compacted, a job at the end of the compaction process switched the Hive Metastore partition locations from the current-day, small-file data to the compacted data.

Attempt 5 — Compact the Files Take 2: Custom-Made Compaction

At this point, we decided to take control. We built a custom-made solution for compaction, and we named it Compaction Manager.

  • When small files are written to our S3 bucket from our event stream (1), we use AWS event notifications from S3 to SQS on object creation events (2).
  • Our service, the Compaction Manager, reads messages from SQS (3) and inserts S3 paths to the database (4).
  • Compaction Manager aggregates files ready to be compacted by internal logic (5), and assigns tasks to worker processes (6).
  • Workers compact files by internal logic and write big files as output (8).
  • The workers update the Compaction Manager on success or failure (7). A minimal sketch of this flow follows the list.
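The sketch below shows what such a loop might look like. It assumes an SQS queue fed by S3 event notifications and uses a hypothetical compact_and_upload worker function; the queue URL, batch threshold and retry logic are illustrative, not the production implementation.

```python
# Sketch of the Compaction Manager loop; queue URL, thresholds and the
# worker function are hypothetical placeholders.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-small-files"
BATCH_TARGET_BYTES = 512 * 1024 * 1024   # compact once ~512 MB of small files has piled up

sqs = boto3.client("sqs")
pending = []          # stands in for the database of (path, size) rows
pending_bytes = 0


def compact_and_upload(paths):
    """Placeholder for a worker: merge the small ORC files and write one big file."""
    print(f"compacting {len(paths)} files")


while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        # S3 event notifications arrive as JSON in the SQS message body.
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            obj = record["s3"]["object"]
            pending.append((f"s3://{bucket}/{obj['key']}", obj.get("size", 0)))
            pending_bytes += obj.get("size", 0)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Once enough bytes have accumulated, hand a batch to a worker; retry on failure.
    if pending_bytes >= BATCH_TARGET_BYTES:
        batch, pending, pending_bytes = pending, [], 0
        try:
            compact_and_upload([path for path, _ in batch])
        except Exception:
            # Put the batch back so it can be retried without restarting the whole run.
            pending = batch + pending
            pending_bytes += sum(size for _, size in batch)
```

Batching by size is one way to keep each worker’s memory footprint predictable, in contrast to the Presto approach, where a whole partition had to fit in cluster memory.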

What Did We Gain from the Next Generation Solution?

  • We control EVERYTHING. We own the compaction logic, the size of the output files and the retry handling.
  • Our compaction is done continuously, giving us fine-grained control over the number of workers we trigger. Due to the seasonality of the data, resources are utilized effectively, and our worker cluster is autoscaled over time according to the load.
  • Our new approach is fault-tolerant. Failure is not a deal breaker any more; the Manager can easily retry the failed batch without restarting the whole process.
  • Continuous compaction means that late files are handled as regular files, without special treatment.
  • We wrote the entire flow as a continuous compaction process that happens all the time, and thus requires less computation power and is much more robust to failures. We choose the batches of files to compact, so we control memory requirements (as opposed to Presto, where we load all data with a simple select query). We can use Spots instead of On-Demand machines and reduce costs dramatically.
  • This approach introduced new opportunities to implement internal logic for compacted files. We choose what files to compact and when. Thus, we can aggregate files by specific criteria, improving queries directly.

Conclusion

We orchestrated our own custom solution to handle huge amounts of data and let our customers query it really quickly, using a combination of S3 and HDFS for storage. For the current day’s data, we enjoy the advantages of HDFS; for the rest of the data, we rely on S3 because it’s a managed service.

Compaction with Presto is nice, but as we learned, it is not enough when you are handling a lot of data. Instead, we solved the challenge with a custom solution that both improved our performance and cut our costs by 80% relative to the original approach.

This post was written by Ariela Shmidt, Big Data SW Engineer and Benny Yashinovsky, Big Data SW Architect at SentinelOne.

If you want to join the team, check out this open position: Big Data Engineer



API platform Postman delivers $150M Series C on $2B valuation

APIs provide a way to build connections to a set of disparate applications and data sources, and can help simplify a lot of the complex integration issues companies face. Postman has built an enterprise API platform and today it got rewarded with a $150 million Series C investment on a whopping $2 billion valuation — all during a pandemic.

Insight Partners led the round with help from existing investors CRV and Nexus Venture Partners. Today’s investment brings the total raised to $207 million, according to the company. That includes a $50 million Series B from a year ago, making it $200 million raised in just a year. That’s a lot of cash.

Abhinav Asthana, CEO and co-founder at Postman, says that what’s attracting all that dough is an end-to-end platform for building APIs. “We help developers, QA, DevOps — anybody who is in the business of building APIs — work on the same platform. They can use our tools for designing, documentation, testing and monitoring to build high-quality APIs, and they do that faster,” Asthana told TechCrunch.

He says that he was not actively looking for funding before this round came together. In fact, he says that investors approached him after the pandemic shut everything down in California in March, and he sees it as a form of validation for the startup.

“We think it shows the strength of the company. We have phenomenal adoption across developers and enterprises and the pandemic has [not had much of an impact on us]. The company has been receiving crazy inbound interest [from investors],” he said.

He didn’t want to touch the question of going public just yet, but he feels the hefty valuation sends a message to the market that this is a solid company that is going to be around for the long term.

Jeff Horing, co-founder and managing director at lead investor Insight Partners, certainly sees it that way. “The combination of the market opportunity, the management team and Postman’s proven track record of success shows that they are ready to become the software industry’s next great success,” he said in a statement.

Today the company has around 250 employees divided between the U.S. and Bangalore, India, and Asthana expects to double that number in the next year. One thing the pandemic has shown him is that his employees can work from anywhere, and he intends to hire people across the world to take advantage of the most diverse talent pool possible.

“Looking for diverse talent as part of our large community as we build this workforce up is going to be a key way in which we want to solve this. Along with that, we are bringing people from diverse communities into our events and making sure that we are constantly in touch with those communities, which should help us build up a very strong diverse kind of hiring function,” he said.

He added, “We want to be deliberate about that, and over the coming months we will also shed more light on what specifically we are doing.”

Tulsa is trying to build a startup ecosystem from scratch

When you think about startup hubs, Tulsa, Oklahoma is probably not the first city that comes to mind.

A coalition of business, education, government and philanthropic leaders is working to foster a startup ecosystem in a city that’s better known for its aerospace and energy companies. These community leaders recognized that raising the standard of living for a wide cross-section of citizens required a new generation of companies and jobs, which takes commitment from a broad set of interested parties.

In Tulsa, that effort began with the George Kaiser Family Foundation (GKFF), a philanthropic organization, and led to the creation of Tulsa Innovation Labs (TIL), a partnership between GKFF, Israeli cybersecurity venture capital firm Team8, several area colleges and local government.

Why Tulsa?

Tulsa is a city of more than 650,000 people, with a median household income of $53,902 and a median house price of $150,500. Glassdoor reports that the average salary for a software engineer in Tulsa is $66,629; in San Francisco, the median home price is over $1.1 million, household income comes in at $112,376 and Glassdoor’s average software engineer salary is $115,822.

Home to several universities and a slew of cultural attractions, the city has a lot to offer. To sweeten the deal, GKFF spun up “Tulsa Remote,” an initiative that offers $10,000 to remote workers who will relocate and make the city their home base. The goal: draw in new, high-tech workers who will help build a more vibrant economy.

Tulsa is the second-largest city in the state of Oklahoma and 47th-most populous city in the United States. Photo Credit: DenisTangneyJr/Getty Images

Local colleges are educating the next generation of workers; Tulsa Innovation Labs is working with the University of Tulsa in partnership with Team8 through the university’s Cyber Fellows program. There are also ongoing discussions with Oklahoma State University-Tulsa and the University of Oklahoma-Tulsa about building a similar relationship.

These constituencies are trying to grow a startup ecosystem from the ground up. It takes cooperation and hard work, and it will probably take some luck, but they are starting with $50 million for startup investments through TIL, announced just this week by GKFF.