AI crashed onto the investing stage in 2023, driving significant stock price gains for several companies. Some, like Nvidia and Microsoft, have already projected a direct revenue benefit in recent earnings reports. Others have indicated in management commentary that they expect AI to drive demand tailwinds going forward.
Eventually, most software service and infrastructure providers should benefit from increased demand, as AI services proliferate and contribute to all areas of the economy. Because many AI services are delivered through Internet-based applications, the same guardrails of security, delivery, monitoring and operational data storage will be needed. This is in addition to the increased consumption of data services to collect, prep, distribute and process the inputs for various AI models.
AI-driven expert systems and co-pilots will raise the productivity of information workers. Enterprises will need fewer of them to accomplish the same amount of work. This will free up budget to increase spend on AI software services, similar to the efficiencies gained from the proliferation of SaaS tools over the last decade that helped internal business teams automate most aspects of their operations.
Software development teams, in particular, will experience a significant boost in output per employee. Enterprises will be able to clear their application backlogs more quickly, increasing the demand for hosting infrastructure and services. At steady state, fewer developers will be needed, supporting a shift of IT budget from salaries to software.
As data is the largest ingredient in these enterprise AI development efforts, software vendors providing data processing and infrastructure services stand to benefit. AI has further elevated the value of data, incentivizing enterprise IT leadership to review and accelerate efforts to improve their data collection, processing and storage infrastructure. Every siloed data store is now viewed as a valuable input for fine-tuning an even more sophisticated AI model.
In the realm of big data processing, enterprises need a place to consolidate, clean and secure all of their corporate data. Given that more data makes better AI, enterprise data teams need to ensure that every source is tapped. They are scrambling to combine a modernized data stack with an AI toolkit, so that they can rapidly, efficiently and securely harness AI capabilities to launch new application services for their customers, partners and employees.
At the center of these efforts are the big data solution providers. These include legacy on-premises data warehouses, cloud-based data platforms and, of course, the hyperscalers. Among these, Snowflake and Databricks are well-positioned, representing the fastest-growing modern data platforms that can operate across all three of the hyperscalers. While the hyperscalers will win their share of business, enterprise data team leadership often expresses a preference for an independent data platform where feasible.
Fortunately for investors, Snowflake and Databricks held their annual user conferences recently. Perhaps it was intentional that they fell within the same week – at least the events were staggered between the first and second half of the week. Both companies made major product and partnership announcements, leading to many comparisons between the two and speculation about changes in relative product positioning.
The market for the combination of big data and AI processing will be enormous, with some projections reaching the hundreds of billions of dollars in annual spend. While the Snowflake and Databricks platforms are clearly converging in feature set scope, they still retain different approaches based on their historical user types. Such a large market will likely support multiple winners.
In this post, I will review the major themes around AI and then discuss the primary announcements from Snowflake Summit and Databricks’ Data + AI Summit. As part of this, I will try to extrapolate how each is positioning themselves relative to the major trends emerging in the AI and data processing market. Investors can use this information to position their portfolio to capitalize on AI secular trends and specifically consider the opportunity for SNOW. Hopefully, Databricks will enter the public market in the next year or two, providing another investment option.
AI as a New Secular Tailwind (Maybe the biggest yet)
First, let’s discuss what has changed in the last year to bring AI to the forefront of the market’s attention and ignite a rush of investment into the space. Investors are being inundated with references to AI and are left to interpret what impact this may have on cloud infrastructure spending. They also hear entrenched software providers claim that they have been using ML and AI all along. So, what is new and how does this change things? Haven’t we seen this before?
If we focus just on generative AI and LLMs, the primary change is how humans can interact with these new digital experiences and the value they can expect from them. Specifically:
Better User Interface. The method of interaction between human and machine is evolving from point, click, select on a screen (web browser or mobile app) to natural language queries and task instruction. This increases the efficiency of the interface by an order of magnitude or more. Natural language as an interface to data makes everyone a business analyst, programmer or power user, without needing to learn an obscure scripting language or complex interaction protocol.
As natural language models improve and add contextual awareness for specific business domains, humans will be able to interact with data services through conversation, rather than code. Expert systems will be built on top of the data, providing guidance to operators and employees, making everyone a specialist. This will increase efficiency, disperse expertise and speed up decision making. In many cases, actions can be automated and the human interface will shift to quality control rather than work product creation.
Better Value Extraction. The latest generation of machine learning tools can model much more complexity than in the past. They can represent billions of parameters across a neural network, consisting of millions of nodes and the relative weights between them. With high-powered GPUs, these neural networks can be trained on large data sets in relatively short amounts of time. Combined with new architectures like transformers for deep learning, these capabilities have spawned next-level AI engines that are orders of magnitude more powerful and sophisticated than even the best recommendation engines of a decade ago.
Microsoft’s CEO calls these “reasoning engines”, discussing these two contributors as part of his keynote at the recent Microsoft Inspire partner event. This provides a useful label to wrap the underlying AI complexity. It captures the idea of information processing – delivering a model that represents people, places, things and the relationships between them. These models can generate insights, make predictions and complete structured work items. They encapsulate all of the data available, whether scraped from the public Internet or loaded from enterprise systems.
With a simplified, natural language user interface sitting in front of a sophisticated reasoning engine, enterprises can improve efficiency and outcomes across a wide range of use cases. These improvements represent the two major additions that have spawned a new wave of interest in unlocking the capabilities of AI.
Better Outcomes
More efficient user interaction will drive much higher utilization of existing software services. LLMs and ChatGPT-like interfaces allow humans to interact with software applications through natural language. Rather than being bound to traditional GUIs with preset choices or requiring use of a scripting language (like SQL) to define tasks, chat interfaces allow users to engage software applications through text-based prompts. Additionally, larger machine learning models can represent more complex information sets, greatly expanding the scope of problems that can be addressed by AI training.
As an example in the consumer space, Priceline is working with Google AI to create a virtual travel concierge as the entry point for users to plan a trip. A simple text-based instruction with some rough parameters could kick off a large number of queries to multiple application data services (flight, car, hotel, entertainment, dining, etc). This replaces complex interfaces with many combinations of drop-downs, selectors, submit buttons, etc., which are then repeated for each aspect of trip planning. The user efficiency of querying for all of this in one or two sentences would result in more overall usage. Not to be outmaneuvered, other travel providers like Expedia are working on similar features.
Beyond consumer shopping and generalized knowledge aids like ChatGPT, there is an even larger opportunity within enterprises to harness machine learning to improve internal business operations. These could take the form of better customer service, predicting failures, increasing productivity or speeding up business processes. These all promise to contribute to higher sales and lower costs. As one enterprise in an industry rolls out an AI-driven service that improves their competitive position, all other players will need to follow suit.
AI is a transformative technology that has the potential to unlock tremendous business value. According to a recent McKinsey study, AI could add up to $4.4 trillion annually to the global economy. Our focus is on enterprise AI, designed to address these opportunities and solve business problems. The list of use cases is long and includes IT operations, code generation, improved automation, customer service, augmenting HR, predictive maintenance, financial forecasting, fraud detection, compliance monitoring, security, sales, risk management, and supply chain amongst others.
IBM CEO Opening Remarks, Q2 2023 Earnings Call
An emerging area of investment that delivers many of these benefits revolves around creating digital twins of common physical business services. At the Databricks Data + AI Summit, JetBlue provided a great example of this. Their senior manager of data science and analytics described how JetBlue is using the Databricks platform to enable generative AI across a number of use cases.
What struck me is the extent of their application of AI across multiple business domains. JetBlue is leveraging LLMs to create chatbots to serve information across all of their operational specializations (maintenance, operations, planning, ticketing, etc.). Additionally, they are creating digital twins for most real-world functions (customer activity, flights, airports, crews, etc) in order to run gaming scenarios to predict potential service issues as well as business opportunities.
The scope of these operations and the planned expansion provides investors with confidence that this AI trend is real. I imagine that if JetBlue is engaged in building this many AI models, then every other airline is likely pursuing a similar strategy. If they aren’t, then they risk creating a competitive disadvantage. This dynamic shifts investment in AI from a nice-to-have to a must-have.
Business operations will become more efficient and lower cost, as expertise is dispersed to all employees, rather than concentrated in a few highly paid, time-constrained specialists. This will have implications across many industries, whether health care (everyone is a doctor), legal, product design, finance, transportation, supply chain management, etc. Enterprises will be able to produce more output with fewer people, driving profitability. Savings from a smaller staff can be invested into more AI enablement.
I think that these factors will drive a large increase in consumption of new AI-enabled digital experiences. Existing software applications will be reimagined and redesigned to make use of the improvements in interaction, efficiency and effectiveness. While public Internet consumers experience the most visible benefits, we will likely see a much larger investment and catalyst from internal business applications. The creation of these expert systems for use within enterprises will drive a whole new level of business investment and productivity improvement.
This process may resemble the scramble to launch new mobile applications in the early 2010s, but likely at an even larger scale. Mobile apps increased usage of software infrastructure because humans could access those applications from anywhere. Instead of interacting for an hour a day while seated at their computer, users could engage over many hours from anywhere. Additionally, new hand gestures (touch / swipe) made the interface more efficient.
Yet, mobile apps didn’t make the base software applications more effective. Most mobile apps involved reproducing a similar experience to that exposed in a web browser. With AI-enabled applications, though, we get both benefits. The interface is more efficient due to natural language and the effectiveness of the application will be much greater as a consequence of more powerful reasoning engines. Combined, these two factors should generate more application usage (more interaction, more data scope, more processing).
If employees are more productive, then enterprises will need fewer of them. This reduction of department headcount will free up budget to pay for the software that drives this productivity. Whether it is $20/month for ChatGPT or $30/month for Microsoft’s new 365 Copilot, a $100k annual all-in cost per corporate information worker (salary, benefits, space, etc.) will pay for a lot of software.
A recent study from consultancy McKinsey reported that AI has the potential to improve productivity in a number of common information worker functions in large enterprises. Some examples include:
- Sales productivity increased by 3% to 5%.
- Marketing productivity increased by 5% to 15%.
- Companies saved 10% to 15% on R&D costs.
- Software engineering productivity increased by 20% to 45%.
- Customer service productivity increased by 30% to 45%.
As these productivity enhancing AI services ramp up, I think the rate of hiring will slow down within the Global 2000 for traditional knowledge workers. Obviously, enterprises will be careful with this messaging, as they don’t want to fuel the “AI is replacing jobs” narrative, but I think the writing is on the wall. In some cases, executives aren’t avoiding it. IBM previously announced that they are pausing hiring for information worker roles that they think could be replaced by AI at some point. This is projected to impact about 7,800 workers over several years.
Hiring in back-office functions — such as human resources — will be suspended or slowed, Krishna said in an interview. These non-customer-facing roles amount to roughly 26,000 workers, Krishna said. “I could easily see 30% of that getting replaced by AI and automation over a five-year period.”
BLOOMBERG ARTICLE, MAY 2023
When fewer high cost information workers are required to accomplish the same output, that savings can offset the cost of additional software automation. IBM’s CEO intends to invest more in AI and automation to address these corporate functions. That implies a shift of more corporate budget to IT and associated software services.
As part of Accenture’s latest earnings call, they cited an internal survey in which executives at customer companies were asked about their plans for AI. The results indicated that 97% of executives expect generative AI to be transformative to their industry and that 67% of organizations are planning to increase their spending on technology in general, with prioritization for investments in data and AI.
And so while it is early days, we see generative AI as a key piece of the digital core and a big catalyst for even bigger and bolder total enterprise reinvention going forward. In fact, in a survey of global executives that we completed just last week, 97% of executives said Gen AI will be transformative to their company and industry and 67% of organizations are planning to increase their level of spending in technology, prioritizing investments in data and AI.
ACCENTURE Q3 2023 EARNINGS CALL, JUNE 2023
Themes and Likely Beneficiaries
With that background, let’s explore how the rush to pursue AI strategies may change the prioritization of enterprise IT spend and what capabilities become more important. In some cases, AI is accelerating the need for some services (real-time data) or even reversing the direction of some trends (processed data versus raw). Additionally, while themes around governance have always been important, AI injects new considerations into data privacy and even data sharing.
In the Snowflake Summit keynote, their CEO said that in order to have an AI strategy, a customer has to have a data strategy. He is referring to the fact that most enterprises are sitting on a treasure trove of data, unique to their industry. When OpenAI rapidly rolled out iterations of ChatGPT, many technology analysts thought those generalized capabilities might disrupt a number of traditional industries. This was because ChatGPT’s intelligence was easy to extrapolate to solve all kinds of problems.
However, what has become clear is that a large language model is only as effective as its training data set. Trained only on the public Internet, ChatGPT struggles to surface unique insights that require industry-specific context in business segments like manufacturing, retail, health care, supply chain, transportation and finance.
ChatGPT and other generative AI tools have served an important purpose, though. They provide a very visible example of what is possible with generative AI and the application of new large language models. Every C-level executive can extrapolate those capabilities to envision uses within their business. Their personal experience with ChatGPT (or their kids’) develops a new frame of reference.
Pre-trained models, like those from OpenAI, serve as a valuable starting point for industry solutions. It’s estimated that bootstrapping an AI service with a pre-trained foundation model saves 95% of the data processing. The last 5% represents the critical and unique contribution that transforms the custom model into a working solution applicable to a whole bevy of industry-specific use cases. This last 5% of data comes from enterprises.
After extensive pre-training, the final steps of inference and fine-tuning can be accomplished with a targeted data set. This highlights the need for enterprise data to be clean, structured and recent. It is what Snowflake’s CEO was referring to. In order for enterprises to realize all the AI value from their proprietary data sets, they need to update the infrastructure that collects, processes and stores that data.
Databricks’ CEO expressed the opportunity for enterprises just as succinctly. Enterprises will create competitive advantage by harnessing their proprietary data sets to build their own unique AI models. These will differentiate them not just from other players in their industry, but also from the first wave of popular AI services built on content from the public Internet.
The use cases for enterprises will be specific to their business operations, offering huge opportunities to automate decision making, improve customer service and empower employees. These services can also be shared within their industry’s information value chain, made available to key partners, suppliers and service providers. These use cases aren’t as interesting to public Internet users, but stand to drive enormous productivity gains for businesses. I think the market for these internal AI enterprise services will be much, much larger than what the public has experienced thus far through popular chat agents like ChatGPT.
And this is the core of the opportunity for a number of AI-adjacent data providers. These companies can ride the coattails of the rush to capitalize on new AI-driven capabilities and the desire to unlock whole new services, insights and automation within many industries. As enterprises invest in AI, data infrastructure providers can both facilitate direct access to AI models and help improve the end-to-end pipeline of data that will feed them. This makes the whole AI investment surge a potential tailwind for data processing and storage companies.
Databricks and Snowflake stand at the center of this, offering extensive platforms of capabilities with a large set of engaged enterprise customers. Both are evolving their platforms quickly to position themselves as the ideal platform for the convergence of data and AI. Let’s review some of the data themes elevated by AI and examine how these two data providers are positioning their platforms, viewed through the lens of what they prioritized for announcement at their respective user conferences at the end of June.
Get as Much High Quality Data as Possible
With high processing costs to train extensive AI models, the data storage industry is returning to its bias towards succinct, clean data to load into AI models. This increases the importance of high-quality data sources, favoring structured models like those found in a data warehouse. Raw, unstructured storage, like that found in data lakes, still exists, but is increasingly viewed as a preliminary stage ahead of AI processing.
This represents a reversal from the industry’s general trend over the last few years, where data vendors were trying to accommodate vast data lakes supporting the ability to query across multiple data types. The emergence of AI is pushing the industry back towards an appreciation for clean, formatted, well structured data sources. Generalized LLMs are powerful, but expensive. Custom models can leverage curated data to fine-tune pre-trained models for less money.
A shift towards pre-processed data as an input for AI inference favors those solution providers that maintain clean data structures with tight controls over data access. Performance is important to contain costs, as data sets expand. Granular governance over data sources ensures proprietary enterprise data isn’t leaked into public models or exposed to partners. A platform architecture that favors centralization of data would support the simplest way to ensure security, control and performance.
Given these trends, Snowflake is doubling down on the idea that the Data Cloud represents the best place for enterprises to consolidate all of their data. By migrating all enterprise data onto Snowflake’s platform, customers can ensure they have full control over it. This also minimizes latency associated with pulling data across a remote network connection from a distant source for processing. Governance is easy to enforce if the data is all in a vendor-specific format.
Where customers need to manage data outside of the Snowflake platform, they can leverage Iceberg Tables as Snowflake’s universal open format. At Summit, Snowflake announced the extension of this open format to imbue governance concepts into the management of the data. Customers will be able to designate whether their Iceberg Tables inherit governance from Snowflake (managed) or allow another engine to handle access controls (unmanaged). Native and External Tables are being consolidated as part of this.
The important distinction that Snowflake makes on their approach is that this solution introduces no loss of performance. Customers who choose to store data in open formats will get comparable performance to data stored in internal Snowflake tables. This performance guarantee represents the trade-off between supporting multiple open formats (like Databricks) and consolidating on a single format (like Snowflake Iceberg Tables).
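For a flavor of the mechanics, below is a minimal sketch of creating a managed Iceberg Table through the Snowflake Python connector. The object names are hypothetical and the DDL follows the syntax described at Summit, so treat it as a sketch rather than a reference.

```python
# Hedged sketch: a Snowflake-managed Iceberg Table created via the Python
# connector. Object names are hypothetical; DDL follows the Summit preview.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
    CREATE ICEBERG TABLE orders_iceberg (
        order_id NUMBER,
        amount   NUMBER(10, 2),
        ts       TIMESTAMP_NTZ
    )
    CATALOG = 'SNOWFLAKE'             -- managed: Snowflake enforces governance
    EXTERNAL_VOLUME = 'my_s3_volume'  -- data stays in open Iceberg format on S3
""")
```

Switching the CATALOG designation is what determines whether governance is inherited from Snowflake (managed) or delegated to another engine (unmanaged).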
Databricks Moves Towards Controlled Openness through Unity Catalog
Databricks, on the other hand, is taking a more open approach to data quality and management of input sources. This starts with accommodating multiple data formats. Where several data management providers (hyperscalers, Snowflake) are trying to select a single data format for storage, Databricks is pivoting towards making the issue a moot point with their introduction of Delta Lake 3.0.
With Delta Lake 3.0, Databricks essentially created an abstraction layer over the three major table formats (Delta Lake, Hudi and Iceberg) and called it UniForm. Since all three formats store their data as Parquet files and differ mainly in their metadata, UniForm generates the metadata for each format, so the data can be operated upon without worrying about the underlying table format.
This is an interesting, and arguably strategic, move by Databricks to sit above the fray and become the connective layer between all these data sources. For the other players in the ecosystem, it likely increases their determination to make their format the standard. Or, they will acquiesce and gravitate towards a similar approach as Databricks’ universal format. All of that abstraction would impact performance, though, so there is an advantage to having a single format for raw processing speed.
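For a sense of how lightweight this is for users, enabling UniForm amounts to a table property on a Delta table. Here is a minimal sketch, assuming hypothetical table names and the property name from the Delta Lake 3.0 announcement:

```python
# Hedged sketch: a Delta table that also emits Iceberg metadata via UniForm.
# Table name is hypothetical; property name per the Delta Lake 3.0 launch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the ambient session on Databricks

spark.sql("""
    CREATE TABLE main.sales.orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2)
    )
    USING DELTA
    -- generate Iceberg metadata over the same underlying Parquet data files
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")
```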
This commitment to openness by Databricks is being manifested in other ways as well. They are hanging governance and data source auditing on their Unity Catalog, which has been expanded to support more data types and tracking functions. Unity Catalog provides a central view of all resources and workflows within a customer’s Databricks instance. Besides just managing permissions and access control, it has deep auditing and lineage capabilities, allowing users to track the source of data through history.
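Governance in Unity Catalog is expressed as familiar SQL grants, with lineage and audit events captured automatically as queries run. A minimal sketch, assuming a hypothetical catalog, schema and group:

```python
# Hedged sketch: Unity Catalog access control via standard SQL GRANTs.
# Catalog, schema and group names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Allow an analyst group to read one governed table; Unity Catalog records the
# grant and captures lineage/audit events as the data is queried.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")
```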
During the Data + AI Summit, Databricks announced three major additions to the Unity Catalog. These extend its reach to more data sources, expand governance capabilities and allow for broader monitoring.
Enterprise-wide reach. The reality in most enterprises is that data is scattered across multiple silos. While the ideal from Snowflake’s perspective is that data teams consolidate all of this data onto a single platform through a series of migrations, that will take time and teams need to harness this disparate data now.
To address this need, Databricks introduced Lakehouse Federation into public preview. This supports a data middleware that can connect to multiple data sources (including competing ones) for discovery and querying. It can set and enforce data access controls on each data source down to a granular level. To improve performance, it can optimize queries and cache frequently accessed data.
In the future, it may even push data access control policies back down to the data sources. This would be a very interesting extension, providing a central control plane for governance across all data sources.
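As a sketch of the flow, federation is configured by registering a connection and a foreign catalog, after which the external tables behave like any other Unity Catalog object. The connection details are hypothetical and the DDL may differ by release:

```python
# Hedged sketch: registering an external Postgres source with Lakehouse
# Federation, then querying it like any other Unity Catalog object.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE CONNECTION pg_conn TYPE postgresql
    OPTIONS (host 'pg.example.com', port '5432',
             user 'svc_databricks', password 'REDACTED')
""")

spark.sql("""
    CREATE FOREIGN CATALOG pg_sales USING CONNECTION pg_conn
    OPTIONS (database 'sales')
""")

# Foreign tables are now discoverable, governed and queryable in place
spark.sql("SELECT COUNT(*) FROM pg_sales.public.orders").show()
```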
Governance for AI. Beyond working with structured tables in data sources, AI requires interaction with other types of artifacts. Examples include unstructured files, models and feature definitions. Databricks has extended Unity Catalog to work across these new AI structures, allowing data teams to manage access all in one tool. This also injects lineage across data and these AI artifacts, so that teams can keep track of all dependencies. These capabilities are part of Unity Catalog for AI, which is available in public preview.
AI for Governance. AI can be applied to governance to make it easier to manage a large-scale data pipeline. Databricks is making new capabilities available as part of Lakehouse Monitoring that perform quality profiling, anomaly and drift detection, and data classification across both data tables and AI models. The output of this monitoring can be exposed in a variety of formats, including tables, visualizations, alerts and policies.
As a useful example, Lakehouse Monitoring can be trained to look for PII within data tables and then generate an alert when it is identified. This capability is in public preview for select customers to use.
With these expanded capabilities, Unity Catalog underscores the emerging core strategy for Databricks. They intend to become the connective tissue across all data sets within the enterprise, allowing users to securely access, share and process all data, no matter where it is located (even if on Snowflake). This isn’t being done in a loosey-goosey way either, with strict governance controls spanning all sources.
This elevates Databricks to a status of spanning all clouds and data providers (sometimes referred to as a supercloud function). This posture aligns well with their positioning around openness and open source, as opposed to a closed system. Over this architecture, they are allowing enterprises to build AI models and support generative AI services. Given that AI performs better with access to more data, this positioning of supporting an open ecosystem plays well.
Snowflake has a similar broad vision, which is to make all of an enterprise’s data available in one system, allowing data teams to easily build and serve AI models. Their strong preference is that all data is stored in the Snowflake platform. Where Snowflake needs to connect to external data sources, they make that possible through Iceberg Tables, but the bias is towards bringing the data onto the platform eventually. All of Snowflake’s supporting services around deep governance, collaboration and fast processing just work better this way. This makes Snowflake often referred to as a “closed” system.
Closed systems aren’t inherently at a disadvantage, though. They are generally easier to manage (fewer interfaces to maintain), perform better (fewer layers of abstraction) and have a simpler infrastructure footprint. Security and privacy are straightforward to control and therefore less likely to fail. Historically, closed systems have had more commercial success in software infrastructure than open ones.
However, Databricks’ bias towards openness and connectivity allows new customers to harness their disparate data sources faster. Lakehouse Federation enables them to plug in all their siloed data sources quickly. They don’t necessarily give up governance, as Unity Catalog enforces access controls. Performance will likely be slower than having all the data in one place, but customers don’t need to wait on a migration to start fine-tuning their AI models.
Bring AI Compute to the Data
After enterprises have ensured they have maximized the high quality data available to feed their AI models, they need to actually run them. A major trend driven by the growing consolidated data sets is the ability to bring the AI model processing closer to the data itself. There are a couple of justifications for this. First, the pure gravity of the data set increases the cost to move it around. Second, the requirements of privacy and security are more easily accomplished if the data remains within a controlled data storage environment.
These drivers have led the big data storage vendors to expand their capabilities to deliver AI runtimes within the same environment as the data. Those providers have historically already invested extensively in data governance capabilities to manage access to approved data sets at a very granular level. By expanding their capabilities to provision an application runtime adjacent to the data, these providers can extend the same data governance controls to the applications themselves.
With AI workloads, this requirement is just as important. A model trained on proprietary data can reveal sensitive information. Even internally, there may be different employee cohorts with varying levels of access. Enterprises will likely fine-tune hundreds of smaller custom AI models to support compartmentalization of data between users. Systems of governance can be extended to the AI models and the interfaces used to access them within the same environmental boundary already managed by the data provider.
The alternative is to copy proprietary data outside of the system of record to train a model or perform fine-tuning. This copying is not just costly, but also dilutes the security model for that data. User access permissions would need to be duplicated for each application that has access to each custom AI model, creating a lot of overhead for the security team.
Data providers have been working hard to address these issues by bringing a development environment and runtime into the core data platform. At their annual Summit user conference in June, Snowflake announced a number of new capabilities to accomplish this. These go beyond existing modules, like Snowpark, that support application code written in specific languages (Java, Python, Scala). The new capabilities span partnerships, products and extensions of previous announcements, greatly increasing the extensibility of the platform and flexibility for developers.
The most encompassing product development is the introduction of Snowpark Container Services. Containers provide developers with a reproducible runtime environment in which to execute their packaged code. Snowpark Container Services allows data engineering teams to either build their own applications from scratch or import ready-made containers from partners. Partner contributions are served as Native Apps and distributed through the Snowflake Marketplace. For portability, Snowflake’s Containers are packaged and distributed as Docker containers.
Containers built and packaged by developers can be written in any programming language (C, Java, Node.js, Python, etc.). The container itself can be executed on either traditional CPUs or GPUs. This broad flexibility goes beyond what had been available through Snowpark previously, which offered more limited language support through structured frameworks like UDFs (User Defined Functions) and Stored Procedures. Snowflake effectively short-circuited the long road of incrementally adding more languages to Snowpark.
With Snowpark Container Services, developers have the ability to create any application that they could on a standard hyperscaler platform. These can still be data science or engineering specific programs, like executing machine learning Python libraries for training or processing scripts for transforming and loading data. Developers can even layer on rich front-end user interfaces with popular Javascript frameworks like React. Applications can be deployed as scheduled jobs, triggered service functions or long-running services with a UI.
Snowpark Container Services provides customers with multiple benefits. It simplifies the overhead for developers and data science teams by outsourcing the configuration and management of the hosting environment to Snowflake. Further, containers can access customer data directly within the Snowflake platform, bringing existing governance controls along.
This management of the data is an important distinction. One could argue that Container Services just mirrors what the hyperscalers already offer. The difference is that Container Services can only access the customer’s data through the existing Snowflake governance layer, allowing Snowflake to guarantee access controls and privacy restrictions. Regular containers on the hyperscalers don’t come with this built-in governance.
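To make the shape of this concrete, here is a hedged sketch of a small inference service that could be packaged as a Docker image and deployed into Snowpark Container Services. The endpoint and deployment steps are assumptions based on the preview announcement:

```python
# Hedged sketch: a small inference service that could be containerized and run
# in Snowpark Container Services. Endpoint and names are assumptions.
from fastapi import FastAPI

app = FastAPI()

@app.get("/predict")
def predict(amount: float) -> dict:
    # Placeholder scoring logic; a real service would load a trained model and
    # read governed Snowflake data directly, with nothing leaving the platform.
    return {"risk_score": min(1.0, amount / 10_000)}

# Deployment outline (assumption, per the preview materials): build and push
# the Docker image to the account's image repository, then register it with
# SQL along the lines of:
#   CREATE SERVICE risk_service
#     IN COMPUTE POOL my_pool
#     FROM SPECIFICATION '...yaml referencing the image and endpoint...';
```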
Further, data teams can leverage Snowflake’s new relationship with Nvidia to unlock a number of AI services and processing power within Snowpark Container Services. This gives developers access to Nvidia’s NeMo framework for extending third party large language models (LLMs), as well as Nvidia’s own internally developed models. Snowflake enterprise customers can use their Snowflake data to create custom LLMs for advanced generative AI services that are specific to their industry domain. These might include chatbots, recommenders or summarization tools.
The collaboration also brings NVIDIA AI Enterprise to Snowpark Container Services, along with support for NVIDIA accelerated computing. NVIDIA AI Enterprise includes over 100 frameworks, pre-trained models and development tools like PyTorch for training, NVIDIA RAPIDS for data science and NVIDIA Triton Inference Server for production AI deployments.
The big advantage of this relationship is that enterprise customers can make use of Nvidia AI services without having to move their proprietary data outside of the fully secured and governed Snowflake platform.
“Data is essential to creating generative AI applications that understand the complex operations and unique voice of every company,” said Jensen Huang, founder and CEO, NVIDIA. “Together, NVIDIA and Snowflake will create an AI factory that helps enterprises turn their own valuable data into custom generative AI models to power groundbreaking new applications — right from the cloud platform that they use to run their businesses.”
Snowflake Press release, June 2023
In addition to Nvidia, Snowflake secured partnerships with a number of leaders in the AI space and related data analytics providers. For example, customers can run Hex’s industry-leading Notebooks for analytics and data science. They can tap into popular AI platforms and ML features from Alteryx, Dataiku and SAS to run more advanced AI and ML processing. Other launch partners include AI21 Labs, Amplitude, CARTO, H2O.ai, Kumo AI, Pinecone, RelationalAI and Weights & Biases. All of these partners are delivering their products and services within Snowpark Container Services.
Snowpark Container Services is in private preview. It enhances the rapid adoption that Snowpark has already achieved. In his opening remarks, Snowflake’s CEO shared that in Q1 more than 800 customers used Snowpark for the first time. About 30% of all customers are now using Snowpark on at least a weekly basis, up from 20% in the prior quarter. Snowpark consumption has increased nearly 70% q/q, just six months after release.
Snowflake’s strategy with these marquee partnerships like Nvidia is interesting. Understanding that they are not an AI company and unlikely to compete on AI capabilities, they are sticking to their core competencies and leveraging partnerships to offer their customers best-of-breed AI capabilities within the Snowflake environment. This contrasts somewhat with Databricks’ approach, which is to develop or own (through the MosaicML acquisition) AI models that they can share with customers.
As an example of the potential benefits of Snowflake’s approach to deliver best-of-breed AI processing through partnerships, Nvidia claims that their RAPIDS processing architecture can generate significant performance improvements over other pipelines for machine learning data prep and model training. RAPIDS is a suite of open-source software libraries and APIs for executing data science pipelines entirely on GPUs. RAPIDS leverages Nvidia’s years of development in graphics, machine learning, deep learning and high-performance computing (HPC) to deliver high throughput. This capability is available to Snowflake customers directly adjacent to their data within the secure Data Cloud.
Snowflake Streamlit Delivers a Pre-packaged User Interface
Streamlit provides an important front-end for AI powered experiences on Snowflake. It is an open-source Python library for app development that is natively integrated into Snowflake for secure deployment onto the platform. Customers can use Python to create interactive applications from data and ML models. Besides supporting custom code in Python, the tool provides a visual interface for selecting UI components and configuring them in a preview screen. Deployment is then initiated with a single click.
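For a sense of how little code this requires, here is a minimal Streamlit sketch fronting a model, where call_model is a hypothetical stand-in for a deployed LLM or ML model:

```python
# Minimal Streamlit sketch: a chat-style front end over a model endpoint.
import streamlit as st

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for querying an LLM or ML model on the platform
    return f"(model response to: {prompt})"

st.title("Sales Data Q&A")
question = st.text_input("Ask a question about the sales data")
if question:
    st.write(call_model(question))
```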
Streamlit in Snowflake has been in private preview since last year with a lot of customer interest. Now, it is being leveraged as a ready-made front end to bootstrap AI applications. At the Summit conference, Snowflake leadership shared that over 6,000 Streamlit-powered apps with generative AI or ML models behind them have already been built on the Snowflake platform, and committed that Streamlit in Snowflake will go into public preview within a few weeks.
Snowflake’s vision is the same as it has been, as they look towards new opportunities from generative AI and LLMs. Enterprises on Snowflake have already invested significant cycles in organizing their data, creating roles, setting access policies and establishing overall governance. Snowflake wants to honor all of that investment and allow customers to capitalize on the full value of generative AI without having to move data off of the platform.
Customers can leverage either Snowflake’s internal first party models or import third party models from any source. They can perform full model training, or just fine-tune the models with their custom data. This can all be performed securely within the Snowflake platform without exposing their proprietary data to outside parties.
With that said, Snowflake is developing some enhanced AI capabilities internally. The product team introduced several new capabilities to improve Snowflake’s AI processing support on the platform. The first is Document AI, which allows customers to extract data from unstructured documents. These can even be an image of the document and the service will use OCR to translate it into text.
Once the text is extracted, it can be made queryable through a large language model. Document AI will be useful for customers where documents constitute a large part of their data workloads, like businesses in healthcare. The content of these documents can be extracted and utilized as part of model training in the same way as structured data within the Snowflake platform. Document AI is the manifestation of the Applica acquisition announced last year. The capability is now available in private preview.
Snowflake also announced ML-Powered Functions in public preview. These extend machine learning capabilities to a broader audience, specifically analysts who traditionally operate in SQL. This allows analysts to create standard functions using SQL, but access three different machine learning techniques to enhance the response.
The three ML frameworks available in the first release are Forecasting, Anomaly Detection and Contribution Explorer (what conditions caused a problem). The business benefit for the customer is that it empowers the business analyst to be more self-reliant to address common machine learning investigations themselves. For Snowflake, these types of queries would drive more consumption.
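Here is a hedged sketch of the forecasting flavor, issued as plain SQL through the Python connector. The function syntax follows the public preview announcement and the table and column names are hypothetical:

```python
# Hedged sketch: ML-Powered forecasting from plain SQL (preview-era syntax).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Train a forecasting model on a time-series table, then project 30 periods
cur.execute("""
    CREATE SNOWFLAKE.ML.FORECAST revenue_model(
        INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'daily_revenue'),
        TIMESTAMP_COLNAME => 'day',
        TARGET_COLNAME => 'revenue'
    )
""")
cur.execute("CALL revenue_model!FORECAST(FORECASTING_PERIODS => 30)")
print(cur.fetchall())
```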
Snowflake also provided an update on Unistore. It is still in private preview, taking longer than expected to be ready for public release. As a show of confidence, five of the private preview customers are using Unistore in production, in spite of Snowflake’s caveats. They are targeting public preview near the end of 2023.
This is a bit disappointing, as Unistore brings the promise of consolidating transactional and analytical workloads onto a common platform. The value proposition remains the same and customers appear engaged so far. I look forward to seeing the actual use cases that enterprises adopt once this data storage engine is publicly available. At the very least, it could power read-heavy applications that are data rich.
Snowflake is moving more quickly with the Native App Framework, announced at Summit last year. This allows third-parties to build, distribute and monetize apps natively in the Data Cloud. The big advantage of these is that they run next to a customer’s data, keeping that data private, even from the app developer.
The Native App Framework was in private preview and has been promoted to public preview on AWS. As part of that, Snowflake announced that 25 partners had already produced about 40 apps that are available in the Marketplace. The number is relatively low because the Snowflake team is performing thorough quality control (much like Apple’s App Store), testing applications for security, performance and reliability before allowing an App to be listed. Snowflake leadership claims that Apps from providers like DTCC and Bloomberg are bringing new customers to Snowflake.
To allow customers access to more robust ML models, Snowflake made two other product announcements. First, they introduced two new libraries in public preview. The first supports feature engineering, which enables users to prepare data to be fed into an AI model. The other supports actual model training within Snowflake. Fidelity was an early private preview customer of these libraries for their internal use cases.
A final major feature was the announcement of a Model Registry for Snowpark. This allows customers to manage their growing repository of ML models to help support ML Ops. With the Model Registry, customers can discover, publish and share models with a governed view of the model artifacts and associated metadata. Registered models can be easily deployed for inference. Customers can expose the models and results for internal developers and SQL users to consume.
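Below is a heavily hedged sketch of what the registry workflow might look like. The module path and method names are assumptions based on the preview materials and may differ in shipped releases:

```python
# Heavily hedged sketch: publishing a model to the Snowpark Model Registry.
# Module and method names are assumptions based on the preview materials.
import numpy as np
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark import Session
from snowflake.ml.registry import model_registry  # preview-era path (assumption)

# A trivially trained model standing in for a real training pipeline
X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)
clf = LogisticRegression().fit(X, y)

session = Session.builder.configs({
    "account": "my_account", "user": "my_user", "password": "my_password",
}).create()

registry = model_registry.ModelRegistry(session=session, database_name="ML_MODELS")
registry.log_model(model=clf, model_name="churn_classifier", model_version="1")
```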
Databricks Adds Adjacent Compute as Well
Not to be outdone, Databricks introduced a number of new capabilities at their Data + AI Summit which add compute capabilities in close proximity to the data. In many ways, Databricks has already been there, being built on top of the Spark engine. If Snowflake’s CEO says that “in order to have an AI strategy, you have to have a data strategy”, Databricks makes a similar call to action of “data needs to be at the center of your AI strategy”.
The Databricks platform supports building end-to-end AI applications through three main steps. These encompass what they call Lakehouse AI. It starts with the data. The user has to collect the right data and prepare it in the optimal format for machine learning. Next, they need to identify the appropriate models and tune them. Finally, users can make model output available to end consumers through applications, with monitoring of performance and strict governance.
Databricks’ Unity Catalog supports all of these steps, delivering data source discovery, access controls, governance, lineage, auditing and monitoring. This AI workflow should be performed inside of the data platform. That way, proprietary data used to build models isn’t leaked out into the public space. Managing the data is the hardest part of these three steps – collecting large data sets, running them through models and using data to measure the effectiveness of the AI solution. Keeping all these steps in one platform allows for a common user access, governance, data and UI model.
To make each of these steps more AI-centric, Databricks announced a number of new features at the Summit. Over the last year, they have upgraded the Lakehouse AI platform to work better with new generative AI practices and artifacts.
To support more robust preparation of data for AI processing, Databricks management introduced two new technologies to be released in upcoming months.
- Native Vector Search. This applies to text-based documents or unstructured data stored in Databricks, like internal business process documentation. Vector Search converts this text into embeddings and builds an index of the semantic relationships between documents, exposing an endpoint for the AI application to query (see the sketch after this list).
- Support for Online Feature Serving. Allows the data model to include contextual transactional data to customize the response for the user querying the model. This is critical to enabling enterprises to add their proprietary context and customer data to AI-powered applications.
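Databricks had not published the Vector Search API at the time of the announcement, so the sketch below is a generic illustration of what a vector index does (using open-source sentence-transformers and FAISS), not the Databricks interface:

```python
# Generic vector search illustration (not the Databricks API): embed documents,
# index them, then retrieve the most relevant passage for an LLM to use.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Refunds are processed within 5 business days.",
    "Crew scheduling follows FAA duty-time rules.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.asarray(model.encode(docs), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact nearest-neighbor index
index.add(embeddings)

query = np.asarray(model.encode(["How long do refunds take?"]), dtype="float32")
_, hits = index.search(query, 1)
print(docs[hits[0][0]])  # the passage to feed the LLM as grounding context
```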
Combined, these new capabilities allow LLMs to craft a relevant response to the application query. The other half of the flow is, of course, the AI models themselves. The AI model is the most powerful part of the application and Databricks wants to provide developers with many choices for their models. Databricks has updated the Lakehouse AI stack to help application developers find, tune and customize those models. As a starting point, they have built in support for a number of the most popular proprietary models available as a service from third parties, including OpenAI, Bard, MosaicML, Cohere and Anthropic.
Many customers are also utilizing open source models. To help customers manage them, Databricks announced their open source Model Library. They identified all the best-of-breed open source models, packaged them up and made them available inside of Databricks. They also optimized access to accommodate high performance requirements.
Databricks will also be developing ways for users to take these open source models and customize them. This allows customers to bring additional data into the model, perhaps after it has been exercised in production for a period, to further enhance the model’s effectiveness. This could include responses that were scored positively as an example. They will also support tuning, latency and cost optimization trade-offs.
Once models are in production, they need to be evaluated. The user wants to constantly score the model for accuracy, bias, effectiveness and other metrics. To accomplish this, Databricks released support for MLflow evaluation. Users can measure the responses from multiple models and A/B test them. This might be useful when comparing the performance of a SaaS model to a tuned OSS model.
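As a sketch of how this comparison might look with MLflow 2.x’s built-in evaluation API (the model URI and column names are hypothetical):

```python
# Sketch: scoring a summarization model with MLflow's evaluate API (MLflow 2.x).
# The model URI and column names are hypothetical.
import mlflow
import pandas as pd

eval_df = pd.DataFrame({
    "inputs":  ["Summarize: Flight 123 was delayed two hours due to weather."],
    "targets": ["Flight 123 delayed two hours (weather)."],
})

results = mlflow.evaluate(
    model="models:/summarizer/1",     # a registered model URI (assumption)
    data=eval_df,
    targets="targets",
    model_type="text-summarization",  # built-in text metrics in MLflow 2.x
)
print(results.metrics)  # compare across candidate models for A/B testing
```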
To support outside model access, Databricks introduced the MLflow AI Gateway. This enables enterprises to consolidate and access all of their third-party AI use cases through a single gateway. This works across the major SaaS AI models, keeping credentials and logs in one place. It also manages shared credentials and enforces rate limiting where needed.
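A hedged sketch of querying a centrally configured route, using the module and function names from the initial MLflow AI Gateway release (these are assumptions and the component has since evolved):

```python
# Hedged sketch: querying a centrally configured route through the MLflow AI
# Gateway as initially released; function names are assumptions.
from mlflow.gateway import query, set_gateway_uri

set_gateway_uri("http://localhost:5000")  # a running gateway instance

# The route maps to a SaaS model (OpenAI, Anthropic, etc.); credentials and
# rate limits are managed centrally on the gateway, not in the application.
response = query(
    route="chat",
    data={"messages": [{"role": "user", "content": "Classify this ticket..."}]},
)
print(response)
```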
The last step is going to production. This requires continuously running infrastructure to serve the AI application. Databricks has offered an ML Inference product for two years, which has been growing rapidly. At the Data + AI Summit, they added GPU support for inferencing and also tuning and optimizing the most popular LLM families. This will deliver even higher performance at serving time.
Finally, as the application is being served, it should be monitored. Monitoring for ML models is a data problem. Customers should log and analyze every response given to end users. The model’s performance should be correlated to whatever business metrics are important. An example might be a successful customer service outcome without requiring escalation to a human.
These metrics can be collected and incorporated into a monitoring dashboard that users can track. To help customers with this, Databricks introduced Lakehouse Monitoring. This allows users to identify relevant performance data to collect, aggregate it into charts and graphs and then display that information in a historical context within a dashboard.
Secure Collaboration
AI raises the usefulness of secure data sharing as well. Data quality for a custom model would be further enhanced if an enterprise could increase the scope of data they have available to fine-tune it. While each enterprise will closely guard their proprietary data, there is an argument for a few companies to form a strategic alliance within their industry category to deliver a better solution than the rest of their competitors. This type of collaboration could be facilitated by secure data sharing.
Snowflake has long supported data collaboration through a number of capabilities. They started with secure data sharing, which allows one Snowflake customer to make a data set available to another customer in a protected location on the Snowflake platform. In this case, the recipient is not provided a copy of the data. To make a data share more actionable, Snowflake later introduced Clean Rooms, which allow the recipient to run data processing against the shared data and only store the result. This enables the sharing party to keep the majority of the data hidden from the recipient.
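The provider-side flow uses long-standing Snowflake SQL, sketched here through the Python connector with hypothetical object names. The consumer queries the share in place, without receiving a copy of the data:

```python
# Sketch: the provider side of classic Snowflake secure data sharing, using
# long-standing SQL via the Python connector. Object names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="provider_acct", user="my_user", password="my_password",
)
cur = conn.cursor()

cur.execute("CREATE SHARE sales_share")
cur.execute("GRANT USAGE ON DATABASE sales TO SHARE sales_share")
cur.execute("GRANT USAGE ON SCHEMA sales.public TO SHARE sales_share")
cur.execute("GRANT SELECT ON TABLE sales.public.orders TO SHARE sales_share")

# Entitle a partner account; the consumer queries the share in place
cur.execute("ALTER SHARE sales_share ADD ACCOUNTS = partner_account")
```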
As a result of Snowflake’s long-standing data sharing capabilities, the feature enjoys high penetration among customers, particularly Snowflake’s largest. At the Investor Day held during Summit, leadership revealed that 70% of their $1M+ ARR customers are using data sharing.
They even provided an example of data collaboration by technology service provider Fiserv, which spans multiple data sharing partners. What is notable is the cascading chains of data sharing and processing, extending out to multiple partners and customers. The benefit for Snowflake and other data providers with data sharing is that the capability is very sticky. In Snowflake’s case, the participants need a Snowflake account in order to use data sharing. This network effect both encourages new customers to join Snowflake and also discourages them from leaving.
Recall that Snowflake has invested heavily in the past in the creation of industry specific solutions. These represent an ecosystem of participants in a particular industry segment, who can interact through systems of shared data and special product offerings curated for that particular vertical. As each of these verticals tries to apply AI to their domain, Snowflake will be well-positioned to offer domain-specific capabilities. They can feed foundation models to create unique offerings that possess specific industry context with the latest data.
Snowflake can help ecosystem participants securely assemble larger data sets through controlled sharing. They can extend foundation models with domain specific context and offer them to ecosystem participants. While individual enterprises will closely guard their proprietary data, they will realize that collaborating with a couple of other participants in the same sector might result in an even better AI-driven offering. We should see the emergence of tightly controlled enterprise alliances that revolve around data sharing to create better AI models for their industry that disrupt other participants. Snowflake’s sophisticated data sharing capabilities will become a huge advantage here.
While providers in the Snowflake Marketplace have been focusing on selling curated data sets, AI models provide a whole new layer of services that Snowflake can offer through the Marketplace. As we saw with the Data Cloud Metrics in the Q1 earnings presentation, sequential growth of Marketplace offerings slowed down to 3% Q/Q growth. I’m not surprised, as there are likely only so many generic demographic, weather, financial, etc. datasets that can be sold. However, rich, contextually-aware AI models distributed through the Marketplace could provide a whole new growth vector for vendors.
Databricks Ramps up Collaboration Features
Acknowledging the opportunity in the collaboration space, Databricks significantly enhanced their capabilities around data sharing, clean rooms, marketplace and apps as part of the Data + AI Summit. These enhancements follow the same theme as Unity Catalog: making the Databricks platform open to both customers and non-customers.
For secure data sharing, Databricks introduced their open Delta Sharing service about 2 years ago. This works across any platform. The Delta Sharing protocol allows the provider to determine what data to share and who is allowed to access it. The consumer need only implement a Delta Sharing client and subscribe to the source published by the provider. With minimal configuration, the consumer will get access to the shared data in their client application.
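For illustration, here is roughly what the consumer side looks like with the open-source Python client; the .share profile file and table coordinates would come from the data provider:

```python
# Sketch: consuming a Delta Share with the open-source Python client. The
# .share profile file and table coordinates come from the data provider.
import delta_sharing

profile = "partner_profile.share"           # credentials issued by the provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())             # discover what has been shared

# Load one shared table into pandas: "<profile>#<share>.<schema>.<table>"
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.orders")
```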
Delta Sharing has become a popular capability, with 6,000+ active data consumers (not necessarily 1:1 with customers) and over 300 PB of data shared per day. The three primary features of Databricks’ collaboration platform – the Marketplace, Lakehouse Apps and Clean Rooms – are built on top of Delta Sharing and Unity Catalog.
The Databricks Marketplace spans multiple asset types beyond just data sets, including visualizations, AI models and applications. One of the big selling points is that these items can be accessed from multiple platforms, without requiring the consumer to be on Databricks.
The Marketplace had been in preview mode for several months, with hundreds of listings. At the Data + AI Summit, Databricks moved it to general availability. Providers can publish data sets and notebooks through the public marketplace, or through a private exchange to share securely with select organizations.
Coming soon, customers will be able to discover and share AI models through the Marketplace. The user can search for a pre-trained AI model that matches their use case (Databricks demo’ed a medical text summarization model), read about the model and even submit some sample queries against it. If the user decides to subscribe to the model, the provider then provisions the model and a notebook into the user’s workspace. The user can run the model directly within the notebook or invoke it through a model serving endpoint.
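For illustration, invoking a subscribed model through a serving endpoint might look something like the following. The workspace URL, endpoint name and token are hypothetical, and the exact request payload depends on the signature of the model being served:

```python
import requests

# Hypothetical workspace, endpoint and token.
WORKSPACE = "https://my-workspace.cloud.databricks.com"
ENDPOINT = "medical-text-summarizer"
TOKEN = "dapi..."  # personal access token

# Databricks model serving exposes each endpoint at a REST invocations URL.
resp = requests.post(
    f"{WORKSPACE}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"inputs": ["Patient presented with ..."]},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```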
The Marketplace will offer both open source and proprietary models. AI model sharing will be very useful for customers with access to proprietary data who want to fine-tune an existing model for their use case. On the other hand, some customers may not have access to sufficient data, or may want to short-circuit the process for use cases where the investment in customizing their own model isn’t warranted.
For those cases, a customer may prefer a full-blown application that delivers the desired functionality behind a simple interface. Databricks introduced Lakehouse Apps to address this need, providing a new way to build, deploy and manage applications on the Databricks platform.
Developers will be able to create Apps using any language and then run them on the Databricks platform within a customer’s instance. The proximity of the application to the customer’s data on Databricks provides a secure environment, without risk of data leakage. It also avoids lengthy security or privacy reviews, as the application never moves data outside of Databricks. The incentive for developers is to gain access to Databricks’ 10,000+ customers. Early App developer partners include Retool and Posit.
When a customer wishes to subscribe to an App, they can search the Marketplace for a provider that matches their target use case. Then, they can install the App and designate the data sets it will be allowed to access. Access permissions are controlled in Unity Catalog and can map to all of a customer’s data assets (whether on Databricks or not).
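As a hedged illustration of how that scoping might be expressed in Unity Catalog from a notebook (the catalog, table and principal names are hypothetical):

```python
# Run inside a Databricks notebook, where the `spark` session is provided.
# Grant the App's service principal read access to one table only;
# everything else in the metastore remains invisible to it.
spark.sql("""
    GRANT SELECT
    ON TABLE main.sales.customer_orders
    TO `app-retool-service-principal`
""")

# Verify what the principal can now see.
spark.sql("SHOW GRANTS ON TABLE main.sales.customer_orders").show()
```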
Finally, Databricks announced Clean Rooms as part of the user conference. This addresses the use case where multiple collaborators want to share data, but only after data processing scripts have been run to create a subset appropriate for sharing. A typical example involves one or more retailers and an advertiser: the retailers don’t want to expose all of their consumer data, only details for the intersection of their customers with the audience of a particular media site.
Clean Rooms provide the ability to run data processing jobs in a secure environment on Databricks and only share the output between parties. Those jobs can be written in any language and run as a controlled workload in a secure runtime on the Databricks platform. Privacy is maintained as the output is only made available to the participants, without exposing the source data (only Databricks can read that).
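Databricks didn’t detail the job API, but conceptually a clean room job for the retailer/advertiser example above could be a PySpark script that joins on a hashed identifier and releases only aggregates. All table and column names here are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-room-overlap").getOrCreate()

# Each party's source table is readable only inside the clean room.
retailer = spark.table("retailer.customers")      # hypothetical
advertiser = spark.table("advertiser.audience")   # hypothetical

# Join on a common hashed identifier and release only aggregates,
# never the underlying customer rows.
overlap = (
    retailer.join(advertiser, on="hashed_email", how="inner")
    .groupBy("campaign_id")
    .agg(F.countDistinct("hashed_email").alias("matched_customers"))
)

# Only this summarized output is shared back to the participants.
overlap.write.mode("overwrite").saveAsTable("shared.campaign_overlap")
```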
In the future, Databricks plans to add more capabilities to the Collaboration suite. These additions include transaction support in the marketplace and code approval workflows for Clean Rooms. For those following Snowflake’s history, these new features are very familiar. While Databricks is following Snowflake’s strategy with collaboration features, they significantly round out the Databricks platform’s feature set, bringing it closer to being on par with Snowflake’s capabilities in this area.
Real-time Data Enhances AI Performance
As enterprises realize that AI efficacy is improved with high quality, proprietary data, interest in keeping that data current is increasing as well. This provides a catalyst for upgrading data infrastructure stacks to include data streaming, stream processing and real-time data ingest. Both Snowflake and Databricks announced new capabilities to handle the ingestion of real-time data streams. As investors will recall, MongoDB also announced a new module for stream processing at their MongoDB.local user conference in June.
For a clearer view of how modern data infrastructure providers could benefit from the increased enterprise use of newer foundation models, we can refer to a diagram provided by Confluent at their Investor Day. With traditional machine learning processes, the primary focus for enterprises was custom training and feature engineering. These models were created by loading large enterprise data sets through a batch function from a data lake or data warehouse. Once the base enterprise model was established, inference would customize results for each request. Inference generated some data access, but not at the same volume as the original model construction.
With generative AI, LLMs and other foundation models, a third party often provides a generic model pre-trained on public data. The enterprise then applies much heavier inference to inject its contextual data into the generic model, often in real-time for every user interaction. Given the scope of functions and data addressable from a chat-based interface, the volume of data accessed in real-time to deliver a customized response for each user (based on their history) could actually be much larger and broader than what was required to build a standard ML model under the prior method.
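To make the pattern concrete, here is a simple Python sketch of per-request context injection. Both helper functions are hypothetical stand-ins, one for a real-time lookup against the data platform and one for a hosted foundation model API:

```python
def fetch_recent_activity(user_id: str) -> list[str]:
    # Hypothetical stand-in for a real-time lookup against a feature
    # store or streaming table; a real system would query fresh
    # operational data here.
    return [f"{user_id} viewed order #1234", f"{user_id} opened a support ticket"]


def call_foundation_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to a third-party pre-trained
    # model behind an API.
    return f"(model response to a {len(prompt)}-character prompt)"


def answer(user_id: str, question: str) -> str:
    # Each request pulls fresh, user-specific context from the data
    # platform and injects it into the generic model's prompt; this
    # per-request data access drives the incremental consumption.
    context = "\n".join(fetch_recent_activity(user_id))
    prompt = (
        "Using the customer history below, answer the question.\n"
        f"History:\n{context}\n\nQuestion: {question}"
    )
    return call_foundation_model(prompt)


print(answer("user-42", "What is the status of my order?"))
```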
To get a sense for the demand associated with leveraging real-time data streams to better inform AI services, Databricks commented on growth of streaming jobs that feed Delta Live Tables on customer instances. Over 50% of customers are now using real-time data streaming. Weekly streaming jobs have grown by 177% over the last 12 months.
The Databricks CEO described usage of this feature as being “on fire”. He said that while a lot of people are excited by the potential for generative AI, they aren’t paying attention to “how much momentum streaming applications now have.”
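As a rough sketch of what one of these streaming jobs can look like, a Delta Live Tables pipeline can declare a table fed continuously from Kafka. The broker and topic names are hypothetical, and DLT supplies the `spark` session when the pipeline runs:

```python
import dlt  # Delta Live Tables runtime, available inside a DLT pipeline


@dlt.table(comment="Raw click events ingested continuously from Kafka")
def raw_click_events():
    # Hypothetical broker and topic; DLT keeps this table up to date
    # as new events arrive on the stream.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "click-events")
        .load()
    )
```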
Snowflake has also supported continuous data ingestion for several years through Snowpipe, more recently adding Snowpipe Streaming. This supports standard ingestion frameworks and integrates with popular streaming sources. The Snowflake team’s recent improvements have focused on reducing the latency of the streaming load. At this point, data landed in Snowflake can be accessed within a few seconds, versus minutes previously.
Snowpipe Streaming is complemented by Dynamic Tables, which allow users to perform data transformations as data is being piped into Snowflake. Data can be joined and aggregated across multiple sources into one Dynamic Table. As those data sources update, the Dynamic Table refreshes to reflect the latest results.
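For illustration, declaring a Dynamic Table through the Snowflake Python connector might look like the following sketch; the connection parameters, warehouse and table names are all hypothetical:

```python
# pip install snowflake-connector-python
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",  # hypothetical credentials
    user="analyst",
    password="...",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# A Dynamic Table is declared once with a target freshness (TARGET_LAG);
# Snowflake then refreshes it automatically as the source tables change.
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE order_enriched
      TARGET_LAG = '1 minute'
      WAREHOUSE = TRANSFORM_WH
      AS
      SELECT o.order_id, o.amount, c.segment
      FROM raw_orders o
      JOIN customers c ON o.customer_id = c.customer_id
""")
conn.close()
```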
Databricks and Snowflake both use Kafka as a primary source for their real-time streaming data ingestion. If they are noting such significant growth in usage of data streaming capabilities, that likely implies strong demand for streaming services from providers like Confluent (CFLT). If investors are searching for another company that might be an indirect beneficiary of the rush to leverage AI, CFLT is worth a look.
Investment Plan
Software and data service providers would like to power as many of the steps in the AI value chain as possible. Foundation models are becoming ever more available through open source, the public domain and commercial offerings with API interfaces. The generic steps of serving the data inputs (structured and unstructured), training, adaptation and inference could be powered by a single platform. This platform would provide the foundation for an ever-increasing number of domain specific, AI-enhanced tasks that are incorporated into enterprise application functions.
This all implies that cloud-based data storage and processing engines like Snowflake and Databricks would be very useful for adding user-specific context to any pre-trained model. As this data is often requested in near real-time, overall consumption of storage and compute resources would logically increase for the customer. This increased demand should drive more revenue for data and AI platform providers.
New AI application investment shouldn’t require incremental budget from most enterprises, as new costs can be offset by productivity gains for knowledge workers. Enterprise departments will find that their employees can accomplish more with AI-driven software services and assistants, and will therefore require less headcount to complete the same amount of work. Payroll costs will decrease, providing savings to be invested in more sophisticated software services, whether digital co-pilots, workflow automation or system-to-system coordination.
As evidenced by the announcements at their respective user conferences in late June, both Snowflake and Databricks are pivoting to address the enormous opportunity presented by the rush to incorporate AI into nearly every business and consumer process. They both added substantial capabilities to their respective platforms. I think Databricks is progressing more quickly, but Snowflake already has a large customer base and generates more revenue.
Both companies are pursuing a huge market opportunity, allowing for multiple winners, at least in the near term. As part of the financial update during Snowflake’s Investor Day, they sized the market at $290B by 2027, about 3-4 years from now. This opportunity applies to both Snowflake and Databricks, as well as the hyperscalers. Between the two, combined current annual revenue is below $5B, and even Snowflake has only projected its revenue to reach $10B by 2028.
That leaves a lot of market share for Snowflake and Databricks to grow into, even with the hyperscalers positioning for their portions of spend. Both companies are rapidly converging in terms of platform capabilities, but still appeal to slightly different customer segments. I think this will allow them both to continue growing substantially from here, much like the hyperscalers coexisted during the surge in cloud migration.
Of course, the hyperscalers themselves are pursuing their own slices of this market. I purposely leave out comparisons of their product positioning partially for brevity, but primarily because many enterprises still prefer a neutral solution for their data storage and analytics. Many of the customer testimonials at both Snowflake and Databricks Summits emphasized the value of using an independent platform to avoid lock-in with one of the hyperscalers. This bias towards a hyperscaler-neutral data platform may not be a requirement for all enterprises, but will likely represent the preference for most.
While Databricks is a private company, investors have SNOW for consideration in the public market. I think Snowflake still has tremendous potential, particularly with the probable demand tailwind that generative AI introduces. We will get another update from SNOW’s Q2 earnings report in a couple of months. At some point, possibly in 2024, we might have Databricks to consider for investment as well.
Further Reading
- Peer analyst Muji over at Hhhypergrowth has published a detailed review of AI and its implications for software development. He has also published several posts covering the announcements from the Snowflake and Databricks user conferences, going into even more depth than I have. While this content is behind a paywall, I think it is necessary reading for investors interested in this rapidly evolving space.
NOTE: This article does not represent investment advice and is solely the author’s opinion for managing his own investment portfolio. Readers are expected to perform their own due diligence before making investment decisions. Please see the Disclaimer for more detail.
Many thanks for your insights. I hadn’t heard or thought of AI favoring structured models as in a data warehouse. It was very interesting to read your thoughts on Confluent. With such a big opportunity, is there much risk of Snowflake or Databricks trying to eat Confluent’s lunch, or might that be bad for their image, making them look like unsafe partners for other software firms? (I guess I’m a “hold and worry” kind of investor.)
Hi – thanks for the feedback. Regarding Confluent, I don’t see Snowflake or Databricks (or MongoDB, etc.) encroaching on Confluent’s core use case, which is the efficient distribution of data between many producer and consumer systems in near real-time. That would require a completely different architecture, resembling more of a queue (or series of queues, like Kafka) than a dedicated data store. It does make sense for them to improve their ability to ingest data in real-time, but they will always support Kafka as the primary distribution source. If anything, Confluent could evolve their capabilities to query data while in-flight, which in theory moves them closer to becoming a data store. But that too would never evolve to the point where it could be used for large-scale analytics. So, in summary, I still see a large market for both sets of providers without much encroachment.