
Databricks Data + AI Summit 2022

Hot on the heels of Snowflake Summit, Databricks held their annual Data + AI user conference from June 27 – 30 in San Francisco. The event was packed with announcements and informative sessions for 5,000 in-person attendees and 60,000 virtual ones. Having these two user conferences in close proximity provides us with an opportunity to compare product direction and strategy. On the surface, the two companies appear to be rapidly converging towards a common vision of becoming the single platform needed for analytics, data engineering and machine learning. At the same time, the two companies currently cater to rather distinct audiences, use cases and implementation tolerances.

In this post, I will review Databricks’ product strategy, what was announced at Data + AI and how it all relates to Snowflake.


Setting the Stage

Databricks kicked off the conference with a keynote headlined by their CEO. He delivered a very useful overview of the state of the market for large scale data processing, some of its assumptions and challenges, and how the Databricks Lakehouse Platform is positioned to address them. This set the stage for the whole conference and underpinned the goals of the various product announcements.

First, he introduced the Data Maturity Curve. This consists of two axes. Data and AI Maturity (the x axis) represents the progression of data processing stages. It starts with aggregation and cleansing of all an enterprise’s data into one location for analysis. Then, it moves through different types of analytics including ad hoc queries, canned reports/dashboards and broader data exploration. These are all grouped into a historical context around “what happened?”.

Databricks Data + AI Summit, Keynote

Further to the right, we progress into “what will happen”, or predicting outcomes. After that come determining how an enterprise should act on those predictions and even automating decisions based on those recommended actions. These activities fall into the realm of data science – machine learning and AI. They involve taking large data sets and trying to predict the likelihood of a future event. Based on those predictions, the system can even be instructed to take an action, or rely on a human operator as the trigger. Predictions may be personalized shopping recommendations, transportation routing, supply availability, drug efficacy, etc.

The items further to the right are assumed to be more mature because they are more difficult to achieve. The y axis represents the competitive advantage generated for an enterprise by executing the data processing activity well. Predictive outcomes are assumed to create more competitive advantage than simply reporting the past. I suppose this makes sense – if an enterprise could perfectly predict the future, it would obviously be in a better competitive position. As a counterpoint, however, many companies compete effectively on strong granular analytics alone. Further, the ROI and effectiveness of predictive functions can be mixed.

Databricks Data + AI Summit, Keynote

The CEO then bifurcated the diagram, with historical reporting (BI) traditionally served by a data warehouse and predictive functions (AI) better served from a data lake. This creates a state where many enterprises have two separate sets of data – one set that lands in the data warehouse and another set in the data lake. The inference is that structured data from transactional databases and SaaS apps is sent to the data warehouse, and raw semi-structured or unstructured data from many other sources are dropped into the data lake.

Databricks Data + AI Summit, Keynote

Because of this, the data warehouse and data lake create two distinct data silos. The CEO contends that enterprises waste resources copying a lot of data between the two. Each data store employs a data governance and security model, controlling access to the individual data items (whether rows and tables or files). Then, each data source is harnessed for their function in the data maturity model – ranging from analytics and BI to data science and ML.

This model has traditionally existed in many enterprises. Depending on the type of company, their age and even product category, they typically started with a data warehouse function and later added a data lake as the place to deposit all their raw data. Typically, raw data is stored in S3 or some other cloud-based object store. If a company’s primary function revolved around data services, they may have stood up a data lake first and utilized a simple transactional database for their reporting.

Databricks Data + AI Summit, Keynote

To solve for this, Databricks created the Lakehouse Platform based on the lakehouse model. The lakehouse combines the functionality of a traditional data lake and data warehouse into one platform. For Databricks, the Lakehouse Platform consists of three layers. At the bottom is an open, performant data storage engine that can handle all types of data efficiently (structured, unstructured, semi-structured). For this, Databricks built an open format storage layer on top of a standard data lake. This is called Delta Lake and represents the foundation of the Databricks Lakehouse Platform.

Delta Lake supports all types of data formats, including real-time streaming. It provides the query engine that makes running traditional data warehousing analytics workloads feasible and performant. It adds support for ACID transactions and schema enforcement, bringing consistency to storage in a data lake. Databricks has also been iterating on the performance of the Delta Lake query engine and claims benchmark tests demonstrating better query performance than comparable cloud data warehouses for simulated workloads.
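
To make the storage layer concrete, here is a minimal sketch of what working with a Delta table looks like from PySpark. It assumes a Spark environment with the open source delta-spark package on the classpath; the path and column names are hypothetical.

```python
# Minimal sketch of Delta Lake's ACID table behavior in PySpark.
# Assumes the open source delta-spark package is available; the table path
# and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Writing a DataFrame to a Delta table is an ACID transaction on object storage.
events = spark.createDataFrame(
    [(1, "page_view"), (2, "purchase")], ["user_id", "event_type"]
)
events.write.format("delta").mode("append").save("/tmp/delta/events")

# Schema enforcement: an append with mismatched columns is rejected unless
# schema evolution is explicitly enabled.
spark.read.format("delta").load("/tmp/delta/events").show()
```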

This includes reports of better performance than Snowflake, which the Snowflake team has disputed and the Databricks team has in turn rebutted. While interesting drama, most technology leaders (myself included) on the customer side are skeptical of these database benchmark tests in general. No matter how objective each side argues it is, the assumption is that the test methodology is somehow flawed or set up in a way that optimizes for that vendor’s solution. I do find it interesting that Databricks is obsessed with these performance tests, as they published more of them during Summit. For the purposes of this analysis, I will assume that both solutions are comparable and that each customer needs to base their purchase decision on how each platform performs on their workload.

On top of Delta Lake, the lakehouse paradigm layers governance and security. This applies to structured tables (through ACLs), individual files and blobs. It ensures that all users, regardless of function, can only access the data for which they have permission. This governance layer is something that has been lacking in traditional data lakes, and it makes the lakehouse enterprise-ready by supporting regulations like GDPR and HIPAA. The Databricks Lakehouse Platform delivers governance through its Unity Catalog.

The top layer consists of the technologies that enable the different data processing use cases of Data Science & ML, Data Streaming, Business Intelligence and SQL Analytics. These use cases are supported by several Databricks products including Databricks Machine Learning, Structured Streaming, Databricks Workflows and Databricks SQL. Collectively, these enable the Lakehouse Platform to address all ML, BI, SQL and streaming use cases in one place.

Databricks Data + AI Summit, Keynote

In the press release accompanying the Data + AI Summit announcements, Databricks’ CEO expressed their high level goal as “our customers want to be able to do business intelligence, AI, and machine learning on one platform, where their data already resides.” This is at the core of their product strategy and encapsulated many of the product announcements for the week.

To close on his overview, the CEO laid out three objectives that enterprises are seeking in their move to the Databricks Lakehouse Platform.

  • Simplicity. One system instead of two with all data in one place.
  • Multi-cloud. The solution needs to work across all three cloud vendors, as 80% of their customers have operations on more than one cloud.
  • Open source and open standards. Enterprises want to avoid lock-in with a single vendor or proprietary format.

Interestingly, these were presented as distinguishing Databricks from other cloud data warehouses, particularly Snowflake. However, Snowflake addressed some of these points at its own Summit a couple of weeks ago and is rapidly evolving its market positioning as well.

Databricks and Snowflake

The idea of simplicity, by performing all data processing on one platform adjacent to all of an enterprise’s data, mirrors the messaging coming out of Snowflake’s Summit conference. As I discussed in my blog post summary of Summit, Snowflake’s directive is to “bring the applications to the data, not the data to the applications.” In this regard, Snowflake and Databricks are converging on the same goal to address all of an enterprise customer’s big data needs.

Snowflake Platform Diagram, Snowflake Summit 2022

As part of Summit, Snowflake introduced an updated platform diagram to emphasize a few items. First, they identify three data storage types: data warehouse, data lake and their new transactional application database, Unistore. The addition of a data lake reflects the ability to work with unstructured data. Second, they depict a cross-cutting layer, Snowgrid, representing Snowflake’s ability to work across all three major clouds. Finally, they include Data Science & ML as a top-level workload, along with data engineering, collaboration (data sharing) and applications.

The applications workload represents the fact that Snowflake is positioning itself as a suitable data store for high-traffic data applications to query. With Unistore, they are targeting query response times as fast as 10ms, suitable for data-rich applications with a UI. Unistore is also transactional, providing some support for data capture use cases. In this vein, Snowflake has also elevated Cybersecurity as a top-level use case. I see this more as an implementation type, but Snowflake clearly sees a lot of potential for Cybersecurity as a discrete use case driven through their Powered By program.

Similarly, to signal their willingness to adhere to open data standards, Snowflake announced support for the open industry format Apache Iceberg. Iceberg is a popular open table format for analytics that brings high performance, reliability and the simplicity of SQL tables to big data. It is compatible with other data processing engines like Spark, Trino, Flink, Presto and Hive. This allows customers to integrate their Snowflake data with other external stores.
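
As a rough illustration of what “open table format” means in practice, the sketch below creates and queries an Iceberg table from Spark SQL. It assumes the iceberg-spark runtime package is on the classpath; the catalog name, warehouse path and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Configure a local Iceberg catalog named "demo" (names and paths are hypothetical;
# assumes the iceberg-spark-runtime package is available).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Iceberg tables look and behave like ordinary SQL tables.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.analytics.orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2),
        ts       TIMESTAMP
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.analytics.orders VALUES (1, 19.99, current_timestamp())")

# Because the format is open, other engines (Trino, Flink, Presto, Hive, or
# Snowflake's Iceberg support) can read the same metadata and data files.
spark.sql("SELECT count(*) AS order_count FROM demo.analytics.orders").show()
```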

Snowflake Summit Investor Session, June 2022

Iceberg support represents Snowflake’s counterargument to being labelled as a closed system. To round out the three goals laid out by Databricks’ CEO, Snowflake has always supported a multi-cloud deployment. Databricks’ and Snowflake’s recent product moves appear to be converging on the same set of goals and the strategic distinction between them is diminishing.

With that said, the two companies are approaching the vision of becoming a single “big data” platform from two different sides of the spectrum. Snowflake started as a cloud-based data warehouse and has been gradually moving into enabling more advanced data processing through their Snowpark runtime. With the ability to run scripts in Python, Java and other languages directly against a customer’s data set, this is increasingly supporting more sophisticated data analytics, data engineering and data science workloads. These go beyond simple SQL scripts run directly against the data warehouse. Particularly with the addition of full Python support and Anaconda libraries, customers are more empowered to address data science workloads without having to move data outside of Snowflake.
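
For context on what that Snowpark experience looks like, the Python DataFrame API pushes the work down into Snowflake’s engine so data never leaves the platform. This is only a hedged sketch; the connection parameters and table/column names are placeholders.

```python
# Hedged sketch of the Snowpark for Python DataFrame API; connection values
# and table/column names are placeholders, not a real account.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# The DataFrame operations are translated to SQL and executed inside Snowflake,
# so the data set is processed where it already resides.
shipped = (
    session.table("ORDERS")
    .filter(col("STATUS") == "SHIPPED")
    .group_by("REGION")
    .agg(avg(col("AMOUNT")).alias("AVG_AMOUNT"))
)
shipped.show()
```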

As Databricks has moved from a data lake to enabling data warehouse workloads (Lakehouse), Snowflake has added the ability to process semi-structured and unstructured data. They position Snowflake as a suitable solution for a data lake, offering the ability to query object storage directly or to load all data into Snowflake. Data can then be queried using Snowflake’s elastic processing engine or manipulated with Snowpark. Over all this, customers retain the benefit of Snowflake’s granular governance controls and security at multiple steps in the data lifecycle.

Further, any new insights created from this processing can be written right back into the Snowflake Data Cloud and used as the data source for applications. This is being enabled by their new transactional database, Unistore. Streamlit integration was announced as completed as well, providing another mechanism to build an application directly on the Snowflake data set.

Given Snowflake’s strategy, what does that mean for Databricks? I think Databricks is wise to layer analytics functionality over the data lake, emulating a data warehouse. A data lake by itself is useful, but limits market scope. Some enterprises that adopted the data lake model early became frustrated with how hard it was to extract value from their raw data. This situation has improved with the Lakehouse, which makes the data lake actionable.

In terms of market size, every enterprise at scale needs an analytics function (or its equivalent) to consolidate data from all of its sales and operational channels. Every commercial enterprise has to generate reports, dashboards and business intelligence to distribute its financials, operate the company and inform its stakeholders. So, while a data warehouse sounds antiquated, it is practically necessary.

Machine learning and AI are emerging as a competitive differentiator, but are still early in the adoption cycle. The ROI of machine learning and AI is sometimes harder to find for enterprises. Sophisticated data science driven companies have generated significant benefit from machine learning. Large enterprises in multiple categories are finding new insights from their data continuously, particularly when combined with outside sources. Retailer Kohl’s, for example, uses large data sets to create personalized offers for repeat customers and to automate merchandising decisions at a hyperlocal level based on thousands of factors.

“We’ve got about two more years to go to get to a place where I would describe us as a fully data-native organization, using automated decision processes instead of using data just augmenting human decision processes,” says Gaffney.

Key to that push is a strategy to make the most of machine learning and third-party data in service of customer personalization and the “hyper-localization” of merchandising decisions, Gaffney says.

Kohl’s CTO, quoted in CIO Magazine

On the other hand, many companies claim to be using machine learning and AI, when in reality they are generating insights through programmatic algorithms or deterministic outcomes. When they calculate product pricing, schedule shipments, determine shipping routes, optimize marketing performance or improve customer service, the outcomes are based on some scoring calculation, decision tree or brute force algorithm. They aren’t using ML or AI to discover an outcome that was unexpected.

This isn’t to say that there isn’t enormous promise for machine learning and AI, but I think the market for deterministic analytics is just as large. This probably explains why Databricks is so intent on demonstrating that the Lakehouse Platform can be used for traditional data warehousing (analytics) functions. If the value of analytical workloads were low (as it is positioned on the Data Maturity Curve), then why would they care? I think it’s because the market for analysis of historical data is at least as large as that for predictive processing at this point. As predictive data processing becomes more accurate at actually determining in advance what will happen, that balance will certainly shift.


Sponsored by Cestrian Capital Research

Cestrian Capital Research provides extensive investor education content, including a free stocks board focused on helping people become better investors, webinars covering market direction and deep dives on individual stocks in order to teach financial and technical analysis.

The Cestrian Tech Select newsletter delivers professional investment research on the technology sector, presented in an easy-to-use, down-to-earth style. Sign-up for the basic newsletter is free, with an option to subscribe for deeper coverage.

Software Stack Investing members can subscribe to the premium version of the newsletter with a 33% discount.

Cestrian Capital Research’s services are a great complement to Software Stack Investing, as they offer investor education and financial analysis that go beyond the scope of this blog. The Tech Select newsletter covers a broad range of technology companies with a deep focus on financial and chart analysis.


Product Announcements

The Databricks team introduced a number of new features and product extensions during the Data + AI user conference. These were highlighted in the keynotes and detailed in subsequent individual product deep-dives.

The announcements revolved around a few major themes and spanned most of the major functional areas of the Databricks Lakehouse Platform. First, Databricks is clearly making the case for the Lakehouse to be considered a suitable source of analytics workloads. They revealed new data warehousing performance improvements, functionality and expanded data governance. Second, they acknowledge the benefit of leveraging third-party data sets and enabling secure data sharing and collaboration. Third, they want to keep operational costs low and reduce the complexity of managing data pipelines and machine learning jobs.

These themes help reinforce the overall product strategy for Databricks. With that set-up, let’s dig into what Databricks announced during the Data + AI Summit and how these bring them closer to their product vision.

Data Warehousing on the Lakehouse

Databricks is laser focused on making the case that the Lakehouse Platform is a full-featured and performant data warehouse. They want to enable analytics workflows on both structured and unstructured data. Customers should be able to run SQL queries against the Lakehouse, just as they would with any data warehouse. The Databricks Lakehouse Platform accomplishes this by adding a SQL query engine on top of the data storage layer. Data analysts simply provision a SQL endpoint and query it like they would any database.

Analysts expect the SQL endpoint to perform similarly to a query against a traditional data warehouse. The natural assumption is that it wouldn’t, simply because of the levels of abstraction between the lakehouse and a traditional data warehouse (which is optimized for this type of performance). This explains why Databricks puts so much emphasis on performance improvements for its SQL engine and on comparing it to other data warehouse offerings.

During the conference, Databricks announced several enhancements that support this idea of shifting standard analytics workloads to the Lakehouse Platform. These product improvements included:

Databricks SQL Serverless

First, Databricks announced the availability of serverless compute for Databricks SQL (DBSQL) in Public Preview on AWS. It provides instant, secure and fully managed elastic compute with high performance for SQL workloads. This is an improvement over their previous implementation, which required customers to manage the virtual instances that backed SQL endpoints. With serverless, customers don’t need to worry about provisioning servers themselves.

Databricks Blog Post

Behind the scenes, Databricks manages an active server fleet. When users issue queries, Serverless SQL allocates compute capacity to those queries within 15 seconds of the first request. By default, if the cluster is idle for 10 minutes, Serverless SQL will automatically shut it down and release the resources, restarting the provisioning process for the next query. This is how Serverless SQL helps lower overall costs – by matching capacity to usage in a way that avoids over-provisioning and idle capacity when users are inactive.

The benefit to customers is lower cost and less system management overhead. Pricing is usage-based and removes costs for server resource idling. Databricks SQL Serverless can expand or contract resources on demand to handle varying workloads. Additionally, customers no longer incur the burden of capacity management, patching, upgrading and performance optimization of their SQL cluster.

Data can be secured, using customer managed keys for encrypting data at rest. This allows customers to bring sensitive, production workloads and maintain governance controls. To activate Serverless SQL, customers can create new serverless SQL warehouses (previously called endpoints) from their account console, or convert existing SQL warehouses to operate in serverless mode.

Photon is Ready for Primetime

Databricks announced that their new query engine, Photon, will be generally available on Databricks Workspaces in the coming weeks. Photon provides fast query performance at low cost. It can be applied to multiple functions like data ingestion, ETL, streaming, data science and interactive queries. Photon provides the SQL-like experience on top of the data lake. It is an ANSI-compliant engine designed to be compatible with modern Apache Spark APIs and works with existing SQL, Python, R, Scala and Java code.

Databricks Web Site

Photon was built for performance, written in C++ and designed to work with modern hardware to generate faster queries. The benefits to customers are lower total cost of ownership (TCO) and faster data processing. Databricks claims that Photon can achieve TCO savings up to 80% over comparable systems and accelerate data and analytics workloads by up to 12x.
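
The key point for customers is that Photon is enabled at the cluster or SQL warehouse level rather than in application code. Something like the following notebook snippet runs unchanged whether or not Photon is vectorizing the execution underneath; the table name is hypothetical and `spark` is the session Databricks provides in notebooks.

```python
# The same Spark SQL / DataFrame code runs with or without Photon; enabling
# Photon is a cluster/warehouse setting, not an API change. Table name is
# hypothetical; `spark` is the session predefined in Databricks notebooks.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()

# Equivalent DataFrame form, also eligible for Photon acceleration.
from pyspark.sql.functions import sum as _sum

(
    spark.table("sales.orders")
    .groupBy("order_date")
    .agg(_sum("amount").alias("revenue"))
    .orderBy("order_date")
    .show()
)
```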

New Open Source Data Connectors

Databricks announced a full set of open source connectors for Go, Node.js and Python to make it easier for developers to connect to Databricks SQL from any application. They also introduced a new CLI (Command Line Interface), which allows developers and analysts to run queries directly from their local computers. The official Databricks JDBC driver is now available in the Maven central repository. This makes it possible to use the driver with enterprise build systems and package it within applications.

Databricks Blog Post

Since its GA earlier this year, the Databricks SQL Connector for Python has experienced heavy adoption from the developer community, averaging over 1M downloads a month. With these additions, Databricks SQL now has native connectivity to Python, Go, Node.js, the CLI, ODBC/JDBC, as well as a new SQL Execution REST API that is in Private Preview.
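
As a rough sketch of the developer experience, the open source Python connector exposes a standard cursor interface over a SQL warehouse. The hostname, HTTP path, access token and table below are placeholders.

```python
# Sketch using the open source databricks-sql-connector package
# (pip install databricks-sql-connector). Hostname, HTTP path, access token
# and table name are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT event_type, COUNT(*) AS events "
            "FROM analytics.events GROUP BY event_type"
        )
        for row in cursor.fetchall():
            print(row)
```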

Data Governance

One necessary feature of a generalized data warehouse function is granular control over data access. Governance and security of data has been a challenge in the past for data lakes, as there is no common control layer across the different data types. Additionally, users of data lakes often struggle to catalog and locate data, wasting time combing through different data stores for the appropriate data to feed their models. Snowflake leadership contends that their customers value data governance and security features the most.

Databricks addresses governance through their Unity Catalog offering. It provides a central governance solution for all data and AI assets, including files, tables and machine learning models within the lakehouse. Users can define access policies once at the account level and enforce them across all workloads and workspaces. 

Supporting the catalog function, Unity Catalog includes built-in search and discovery of data assets. Users can quickly find, understand and reference relevant data from across the data estate with a unified search experience for data analysts, data engineers and data scientists. Data search in Unity Catalog is secure by default, limiting results based on the access privileges of users and adding an additional layer of security for privacy considerations.

As part of the Data + AI Summit, Databricks announced several improvements to the Unity Catalog. These included the following:

  • Automated Data Lineage. Unity Catalog now automatically tracks data lineage across queries executed in any language. Lineage is captured down to the table and column level, and it also covers key assets such as notebooks, dashboards and jobs.
  • Search. Unity Catalog now includes a built-in search capability. Once data is registered in Unity Catalog, end users can easily search across metadata fields including table names, column names, and comments to find the data they need for their analysis.
  • Simplified Access Controls. Unity Catalog offers a simple model to control access to data via a UI or SQL. Databricks has now extended this model to allow data admins to set up access to thousands of tables via a single click or SQL statement.
  • Information Schema. Unity Catalog brings the concept of the Information Schema to the lakehouse. This offers a pre-defined set of views that describe the objects within the database, including what tables have been created, when, by whom, and what access levels have been granted on each. This metadata is often leveraged by users to understand what data is available in the system, but also to automate report generation on topics such as access levels per table.
  • Governance and Catalog Partners. Databricks offers an ecosystem of partners who further support and extend the capabilities of the Unity Catalog. 

The Unity Catalog will be generally available on AWS and Azure in the upcoming weeks. Currently, customers can apply for a public preview through their sales rep.
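
To give a flavor of the governance model, the sketch below uses Unity Catalog’s three-level namespace and SQL-based grants from a Databricks notebook (where `spark` is predefined). The catalog, schema, table and group names are hypothetical.

```python
# Hedged sketch of Unity Catalog's namespace and SQL-based access controls.
# Catalog, schema, table and group names are hypothetical.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT,
        region   STRING,
        amount   DECIMAL(10, 2)
    )
""")

# Define the access policy once at the account level; it is enforced across
# every workspace attached to the same metastore.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# The information schema exposes object and grant metadata for reporting.
spark.sql("SELECT * FROM main.information_schema.tables").show()
```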

Enhanced Data Sharing Capabilities

Databricks’ addition of data sharing capabilities appeared to represent an effort to reach parity with Snowflake. Snowflake has offered robust data sharing, clean rooms and a data marketplace for some time. I think these capabilities contribute to network effects for customers on the Snowflake platform. Having their data hosted on Snowflake makes exchanging it with other Snowflake customers seamless.

Databricks appreciates these benefits. They already support a basic data sharing function through their Delta Sharing product, announced in May of last year. This capability is based on a new open source project, created by Databricks for the purpose. Delta Sharing is also offered by Databricks as a hosted solution on their platform.

Delta Sharing Functional Diagram, Databricks Data + AI 2021 Summit Keynote Video

The main components of Delta Sharing are the data provider, the Delta Sharing Server and the Delta Sharing Client. The provider can be any modern cloud data storage system, like AWS S3, Azure ADLS or GCS – and of course the Databricks Lakehouse – that exposes its source data in the Delta Lake format (an open source project sponsored by Databricks). The Delta Sharing Server is hosted by the data provider and serves as a proxy to receive client requests, check permissions and return links to approved data files in Parquet format.

The Delta Sharing Client runs an implementation of the Delta Sharing protocol. Databricks has already published open source connectors for several popular tools and languages (pandas, Spark, Rust, Python) and is working with partners to publish more. Through the client connectors, end customers can query the shared data as permissioned by the provider.
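
For illustration, the open source Python client looks roughly like this. The profile file is issued by the data provider, and the share, schema and table names are hypothetical.

```python
# Sketch using the open source delta-sharing Python client
# (pip install delta-sharing). The profile file comes from the data provider;
# share/schema/table names are hypothetical.
import delta_sharing

profile = "config.share"  # JSON credentials file issued by the provider

# Discover what the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table directly into pandas; the sharing server returns
# short-lived links to the underlying Parquet files.
table_url = f"{profile}#retail_share.sales.orders"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```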

Building on the benefits of Delta Sharing, Databricks announced two new capabilities at the Data + AI Summit.

Cleanrooms

Available in the coming months, Cleanrooms will provide a way to share and join data across organizations within a secure environment. No data replication is required. Organizations can easily collaborate with customers and partners on any cloud and provide them the flexibility to run complex computations and workloads using both SQL and data science-based tools – including Python, R, and Scala – with consistent data privacy controls.

The Databricks team shared a few potential use cases for Cleanrooms. These are similar to examples cited by Snowflake.

  • A consumer packaged goods (CPG) company could measure sales uplift by joining their first-party advertisement data with the point of sale (POS) transactional data of their retail partners.
  • Media advertisers and marketers can deliver more targeted ads, with broader reach, better segmentation, and greater ad effectiveness transparency while safeguarding data privacy.
  • Financial services companies can collaborate across the value chain to establish proactive fraud detection or anti-money laundering strategies.

Databricks Marketplace

Available in the coming months, Databricks Marketplace will provide an open marketplace to package and distribute data and analytics assets. The Databricks Marketplace will enable data providers to securely package and monetize digital assets like data tables, files, machine learning models, notebooks and analytics dashboards. Data consumers can discover new data and AI assets, potentially accelerating their own discovery and analysis.

For example, instead of acquiring access to a dataset and investing their own time to develop and maintain dashboards to report on it, customers can choose to simply subscribe to pre-existing dashboards that already provide the necessary analytics. Databricks Marketplace is powered by Delta Sharing, allowing data providers to share their data without having to move or replicate the data from their cloud storage.

MLflow 2.0

Databricks introduced the next version of MLflow. Getting a machine learning pipeline into production requires setting up infrastructure. This can be tedious for data teams at scale. MLflow Pipelines, included in MLflow 2.0, will handle the operational details for users. Instead of setting up orchestration of notebooks, users can simply define the elements of the pipeline in a configuration file. MLflow Pipelines manages the execution automatically. MLflow Pipelines offers data scientists pre-defined, production-ready templates based on the model type they’re building to allow them to bootstrap and conduct model development without requiring intervention from production engineers.
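
As a rough sketch of the intended workflow (based on the MLflow Pipelines preview; the exact module and step names may differ in the released version), the ingest/split/transform/train/evaluate/register steps are declared in a pipeline.yaml and then executed from a thin Python entry point.

```python
# Hedged sketch of the MLflow Pipelines workflow as previewed for MLflow 2.0;
# module path and step names are as previewed and may differ in the release.
# The pipeline steps live in pipeline.yaml, not in notebook orchestration code.
from mlflow.pipelines import Pipeline

# A profile selects environment-specific settings (e.g. "local" vs. "databricks").
p = Pipeline(profile="local")

# Run the whole templated pipeline, or iterate on a single step.
p.run()
p.run("train")

# Inspect cached step outputs, metrics and the produced model.
p.inspect("train")
```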

Databricks also added Serverless Model Endpoints to directly support production model hosting, as well as built-in Model Monitoring dashboards to help teams analyze the real-world model performance.

Delta Live Tables – New Capabilities and Performance Enhancements

Delta Live Tables (DLT) is an ETL framework that uses a simple, declarative approach to build data pipelines. Since its launch earlier this year, Databricks continues to expand DLT with new capabilities. During Summit, they introduced a new performance optimization layer designed to speed up execution and reduce costs of ETL. Additionally, new Enhanced Autoscaling is purpose-built to intelligently scale resources with the fluctuations of streaming workloads. Change Data Capture (CDC) tracks every change in source data for both compliance and machine learning experimentation purposes.
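
The declarative style looks roughly like the following, which would run inside a DLT pipeline on Databricks (where the `dlt` module and `spark` session are provided). The paths, table names and quality rule are hypothetical.

```python
# Hedged sketch of Delta Live Tables' declarative Python API; intended to run
# inside a DLT pipeline where `dlt` and `spark` are provided. Paths and names
# are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders landed in cloud storage")
def orders_raw():
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(comment="Cleaned orders ready for analytics")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # declarative data quality rule
def orders_clean():
    return (
        dlt.read("orders_raw")
        .select("order_id", "region", col("amount").cast("decimal(10,2)").alias("amount"))
    )
```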

Project Lightspeed

In collaboration with the Spark community, Databricks announced Project Lightspeed, the next generation of the Spark streaming engine. As more applications have moved toward streaming data, new requirements have emerged for the engine to support. Spark Structured Streaming has been widely adopted because of its ease of use, performance, large ecosystem and developer community. Going forward, Databricks will collaborate with the community, encouraging participation in Project Lightspeed to improve performance, add new connectors and simplify deployment, operations, monitoring and troubleshooting.
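
For reference, a typical Structured Streaming job of the kind Project Lightspeed targets looks roughly like this. It assumes a Databricks or Spark environment where `spark` is available; the paths and schema are hypothetical.

```python
# Sketch of a Spark Structured Streaming job; paths and schema are hypothetical,
# and `spark` is an existing SparkSession (predefined in Databricks notebooks).
from pyspark.sql.functions import window, col

events = (
    spark.readStream
    .format("json")
    .schema("user_id BIGINT, event_type STRING, ts TIMESTAMP")
    .load("/mnt/raw/events")
)

# Count events per type over 5-minute tumbling windows, writing results
# incrementally to a Delta table with checkpointing for recovery.
counts = events.groupBy(window(col("ts"), "5 minutes"), "event_type").count()

(
    counts.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/mnt/checkpoints/event_counts")
    .start("/mnt/tables/event_counts")
)
```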

Customer Presentations

During the keynotes at Data + AI, several customers presented their data journey and use of the Databricks Lakehouse Platform. These included Adobe, Amgen, Coinbase, Intuit and John Deere. Two of these are also Snowflake customers. I think this highlights the reality that while the two companies are vying for the same space, at least in the near term, many enterprise customers are comfortable maintaining both solutions. For their analytics teams, they use Snowflake as the central data warehouse. For data science, they utilize Databricks running on top of their data lake.

Over the longer term, of course, it will be interesting to see if one solution supersedes the other. In a presentation from shared customer Coinbase, the senior data engineer presented his vision that more analytics workloads would be run using Databricks Lakehouse Platform on top of their data lake.

In the opening keynote, the Databricks CEO also highlighted their strategy to form vertical solutions and marketing strategies around individual industries. This is similar to Snowflake’s Industry Solutions. In fact, the two companies share many of the same labels (and partners) for each industry vertical.

Databricks Data + AI Summit, Keynote

For each vertical, the strategy is similar. Both companies plan to offer value-added services, like ready-made templates and enhanced governance and regulatory compliance, based on the requirements of each vertical. These are designed to lure in enterprise customers and partners, with the expectation that each additional participant benefits from network effects. Within the verticals, participants can exchange data through sharing and clean rooms. They can also subscribe to relevant data feeds from third-parties and packaged data applications within a shared Marketplace on each platform.

The formation of these industry verticals, influence of network effects and desire to build-out participation will likely up the competitive ante between Databricks and Snowflake. While they both claim a bias towards openness, the formation of industry verticals creates an undercurrent of exclusivity and commitment.

Investor Take-aways

Databricks certainly demonstrated their continued product momentum during the Data + AI Summit. They are executing quickly towards a broad vision. They enjoy strong customer relationships and enthusiastic engagement from a cohort of sophisticated users. Innovation and strong technical chops lie at the core of the Databricks culture.

A large market opportunity exists in enabling machine learning and AI for enterprises at scale. This is because the data sets and compute necessary to derive meaningful insights and make accurate predictions are much greater than what is required for analytics. With that said, all enterprises need analytics, providing a larger customer base. Also, while many companies appreciate the potential of ML and AI, they have been slow to invest in new data science initiatives or have been disillusioned by failed efforts that didn’t deliver the expected ROI.

While Snowflake and Databricks are converging on the same goals, there are still differences in terms of their general approach, architecture and go-to-market.

  • Pre-packaged versus configurable. Snowflake provides a packaged, off-the-shelf solution that allows companies to get basic analytics off the ground in minutes. Databricks offers much greater customization and configuration, giving customers full control over their set-up.
  • Open source versus proprietary. Putting aside support for Iceberg, Snowflake is generally a closed platform. Databricks, on the other hand, is based on several open source projects. The transparency of openness builds trust with customers and alleviates concerns about lock-in. It also constrains product development flexibility, due to the need to maintain community interoperability. A closed, cloud-based system is usually easier to evolve and roll out changes to customers. Drawing a parallel to the observability market, Datadog and Elastic provide a reasonable proxy.
  • Staffing Requirements. Databricks implementations tend to be supported by a more sophisticated data team, with varied functional skills and more experience with infrastructure operations. A basic Snowflake implementation, on the other hand, can be managed by a small team of data analysts.

Given the size of the market and the rapid acceleration of enterprise data creation, I think there will be plenty of room for both companies to grow for some time. While they share a common goal to be the tool of choice (to the exclusion of the other), I think each will appeal to different audiences, enterprise types and depth of data operations in the near term. The go-to-market strategy for both companies will determine market share, as much as technology capabilities.

In fact, many customers still have both solutions in their data stack. Snowflake is used for analytics and Databricks for data science. Some companies have expressed aspirations to consolidate the two solutions onto one platform. In this regard, I think we will see data-native companies trying to run some or all of their analytics workloads on the Databricks Lakehouse Platform. On the other hand, enterprises with a heavy Snowflake investment may add data science and application workloads by taking advantage of Snowflake’s product extensions in these areas. For at least the next couple of years, I see both companies co-existing, much as the three hyperscalers do.

If Databricks were to come to the public markets, I would be very interested as a pair investment to Snowflake. I think both companies are capitalizing on the massive secular trends in data processing, sophisticated analytics and machine learning. These will provide tailwinds to both companies for some time. Investors should monitor the competitive jockeying between these two closely, particularly as it relates to any impact on Snowflake’s growth while we wait for Databricks to go public.

NOTE: This article does not represent investment advice and is solely the author’s opinion for managing his own investment portfolio. Readers are expected to perform their own due diligence before making investment decisions. Please see the Disclaimer for more detail.

Additional Reading

  • For more in-depth background on the Snowflake and Databricks platform, product offerings and competitive positioning, peer analyst Muji at Hhhypergrowth offers in-depth coverage as part of his premium content.

7 Comments

  1. Michael Orwin

    Thanks for the great content. Are any of these migrations much harder than the others?
    1) Snowflake to Databricks for analytics
    2) Databricks to Snowflake for analytics
    3) Snowflake to Databricks for data science
    4) Databricks to Snowflake for data science

    • poffringa

      Thanks, Michael. I don’t think we are seeing many migrations at this point from one solution to the other completely. What is happening in some cases is the following two scenarios, I think:
      – Companies are trying to remain on one of the platforms primarily/exclusively by adding new use cases for either data science or analytics on that platform. Examples would be a Snowflake user introducing data science workloads by leveraging Snowpark. Or, Databricks users addressing analytics with their new SQL query engine.
      – Some companies that are using both solutions trying to move some workloads one way or the other. An example would be Coinbase’s intention to try running some of their new analytics workloads on Databricks instead of Snowflake.

  2. Yuva

    Love to see you initiate coverage on Palantir Foundry. Appears to be a data platform that warrants comparison to both Snowflake and Databricks. Thoughts on Foundry?

    • poffringa

      I am watching Palantir, but not covering them closely at this point. While they have advanced data processing capabilities, use cases and customers thus far have been primarily in government and regulated industries. I would like to see them break into mainstream adoption. I also feel like their platform has been extremely closed, where they build a lot of supporting software infrastructure themselves, rather than partnering. Observability is a good example. This is probably necessary due to security requirements, but could make for an enormous amount of support and tech debt over time.

      • Yuva

        I agree with your points. I’m an investor and the only way they’re going to go through a step-change in growth is to go mainstream and sell to anyone / make Foundry easily accessible. There’s a guy on YouTube called CodeStrap who covers them. Love to know your thoughts if you can connect with him.

    • Michael Orwin

      I’ll just mention that some people who want to know more about Palantir might want to try Codestrap’s videos on youtube. He’s used some of their software, and acknowledges the possibility of bias because of some relation he has with Palantir, though I’ve forgotten the details. Some of the short pieces (under 5 minutes) don’t have much substance, IMO. Also just IMO, sometimes the video is slick enough that it seems like an advert, but he’s been critical of management. He said that Palantir needed to sell modules instead of making customers take on the whole lot, they needed to appeal to developers and be less antagonistic to the IT department, and they couldn’t just rely on the CEO to make sales. Palantir is now addressing those issues, but I think execution still remains to be seen. (The biggest goof might be not launching the minimal viable product years ago, if they were really as far ahead as claimed.) That’s all just my opinion or from memory.

  3. Liberty

    Great detailed overview, thanks Peter