Onehouse’s support for vector embeddings aims to reduce the cost of AI training

Onehouse Inc., a company that sells an Apache Hudi-based data lakehouse as a managed service, today announced that it has launched a vector embedding generator to automate embedding pipelines as part of its cloud service.

Vector embeddings are mathematical representations of objects such as words and images as points in a continuous space. Each point is a vector, an ordered list of numbers that encodes features of the object. Embeddings are typically used in machine learning and natural language processing to capture the semantic meaning of objects, or other relevant features, in a form that a computer can process.
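To make the idea concrete, here is a minimal Python sketch of how semantic closeness is commonly measured between embeddings, using cosine similarity. The three-dimensional vectors and their labels are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not the output of any real model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

# Related words score close to 1, unrelated words close to 0.
related = cosine_similarity(embeddings["cat"], embeddings["kitten"])
unrelated = cosine_similarity(embeddings["cat"], embeddings["invoice"])
print(related > unrelated)
```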

Vector embedding pipelines continuously deliver data from streams, databases, and files in cloud storage to the foundation models used in generative AI. Onehouse can now accept the embeddings those models return and store them in the data lakehouse.

Cheaper storage

This can be a big money saver, because vector databases typically require powerful hardware and fast storage that is tightly coupled to compute. Vector databases have been the hottest segment of the database management systems market since the hype around generative AI began last year. Forrester Research Inc. estimates that 75% of traditional databases, including relational and NoSQL models, will incorporate vector database capabilities by 2026.

Onehouse essentially positions its service as a clearinghouse for vector embeddings. Instead of storing data in a DBMS, customers can take advantage of lakehouse storage, which is built on low-cost, scalable object storage that is decoupled from compute resources.

“Companies need to store large amounts of data in their vector databases on local storage, so they need a much larger vector database instance to achieve the speed and scalability they need,” said Vinoth Chandar, CEO of Onehouse and co-developer of Apache Hudi. “Many companies run multiple vector databases for different parts of their data, so there is no single common source of truth they can use to manage vector embedding data.”

Hudi has unique capabilities in update management, handling of late-arriving data, concurrency control and other areas needed to scale to the data volumes that AI applications require. The company said Onehouse can also support low-latency vector serving for real-time use cases.

The data lakehouse serves vectors in batch, while hot vectors are dynamically moved into a vector database for real-time serving, which provides scale, cost and performance benefits for applications such as large language models.
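The batch-plus-hot-tier arrangement described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Onehouse's implementation: the class `TieredVectorStore`, its method names, and the promote-by-access-count policy are all hypothetical, and plain dictionaries stand in for both the lakehouse and the vector database.

```python
from collections import Counter


class TieredVectorStore:
    """Toy two-tier store: every vector lives in the cheap tier,
    and the most-requested ones are copied into a small fast tier."""

    def __init__(self, hot_capacity):
        self.lakehouse = {}       # stand-in for low-cost object storage
        self.hot = {}             # stand-in for a vector database
        self.hits = Counter()     # access counts per key
        self.hot_capacity = hot_capacity

    def write_batch(self, vectors):
        """Batch-write vectors to the cheap tier; refresh any hot copies."""
        self.lakehouse.update(vectors)
        for key, vec in vectors.items():
            if key in self.hot:
                self.hot[key] = vec

    def get(self, key):
        """Return (vector, tier_name), preferring the fast tier."""
        self.hits[key] += 1
        if key in self.hot:
            return self.hot[key], "hot"
        return self.lakehouse[key], "lakehouse"

    def promote_hot(self):
        """Rebuild the fast tier from the most frequently read keys."""
        top = [k for k, _ in self.hits.most_common(self.hot_capacity)]
        self.hot = {k: self.lakehouse[k] for k in top if k in self.lakehouse}


tiers = TieredVectorStore(hot_capacity=1)
tiers.write_batch({"a": [1.0, 0.0], "b": [0.0, 1.0]})
for _ in range(3):
    tiers.get("a")            # "a" is read often
tiers.get("b")                # "b" is read once
tiers.promote_hot()           # only "a" fits in the fast tier
```

The point of the design is that the expensive, fast tier only has to be sized for the hot subset, while the full set of vectors stays in cheaper storage.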

Fewer API calls

Chandar said using a lakehouse as an intermediary can also reduce the number of application programming interface calls to LLMs such as OpenAI LLC’s GPT-4 that are required to generate vector embeddings.

“Hudi is one of the few lakehouse technologies that supports advanced indexing and incremental queries, which can dramatically reduce the number of calls required to OpenAI or any other vector embedding generator,” said Chandar. Incremental queries are a Hudi feature that allows users to efficiently query only the data that has changed since the last query or a specific point in time.
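A simplified Python sketch shows why incremental processing cuts embedding calls. It does not use Hudi's API: the `Record` class, the integer `commit_time` field, and `fake_embed` (a stand-in for a paid embedding API) are assumptions made for illustration. The idea is only that a job remembers a checkpoint and embeds records committed after it.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    text: str
    commit_time: int  # hypothetical stand-in for a table commit timestamp


def fake_embed(text):
    """Hypothetical stand-in for one call to an embedding API."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def incremental_embed(records, last_checkpoint, store):
    """Embed only records committed after last_checkpoint.

    Returns (number of embedding calls made, new checkpoint).
    """
    changed = [r for r in records if r.commit_time > last_checkpoint]
    for r in changed:
        store[r.key] = fake_embed(r.text)
    new_checkpoint = max((r.commit_time for r in changed), default=last_checkpoint)
    return len(changed), new_checkpoint


table = [Record("a", "hello", 1), Record("b", "world", 2), Record("c", "lakehouse", 3)]
vector_store = {}

# First run: nothing has been embedded yet, so every record needs a call.
calls_first, checkpoint = incremental_embed(table, 0, vector_store)

# One record is updated upstream; only that record is re-embedded.
table[1] = Record("b", "world!", 4)
calls_second, checkpoint = incremental_embed(table, checkpoint, vector_store)
```

In this toy run the second pass makes one call instead of three; a full re-scan would pay for every record again regardless of what changed.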

“Hudi can give you a single image so you can run a job on each sheet asynchronously, and it can make an API call for N updates to an upstream data object,” he said.

Low cost and flexibility are two of the main reasons for the growing popularity of data lakehouses. A survey of senior executives, chief architects, and data scientists by MIT Technology Review, sponsored by Databricks Inc., found that nearly three-quarters of organizations have adopted a lakehouse architecture. Of those, 99% said the lakehouse helps them achieve their data and AI goals.

Image: SiliconANGLE/DALL-E
