The hidden reason for rising AI costs – and it’s not because Nvidia chips are more expensive

Building massive AI models can cost hundreds of millions of dollars today, and forecasts suggest that this amount could rise to a staggering billion dollars in a few years. Much of this cost is in the processing power of specialized chips—typically Nvidia GPUs, of which tens of thousands are needed and which cost up to $30,000 each.

But companies training AI models or optimizing existing models to improve performance on specific tasks also struggle with another, often overlooked and rising cost: data labeling. This is a laborious process that involves training generative AI models with data that is tagged so the model can recognize and interpret patterns.

Data labeling has long been used to develop AI models for self-driving cars. A camera captures images of pedestrians, street signs, cars, and traffic lights, and human annotators label the images with words like “pedestrian,” “truck,” or “stop sign.” The labor-intensive process has also raised ethical concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized for paying Kenyan workers less than $2 an hour to do the outsourced data labeling work that helped make the chatbot less toxic.

Today’s generic large language models (LLMs) go through a data labeling exercise called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback or ratings on the model’s output. This is a significant source of rising costs, as is the effort required to label private data that companies want to incorporate into their AI models, such as customer information or internal company data.

In addition, labeling highly technical, subject-specific data in areas like law, finance and healthcare drives up costs. That’s because some companies hire expensive doctors, lawyers, PhDs and scientists to label certain data, or outsource the work to third-party companies like Scale AI, which recently secured a staggering $1 billion in funding as its CEO forecast strong revenue growth by year’s end.

“You now need a lawyer to label things, (which) is a crazy use of lawyer hours,” said William Falcon, CEO of AI development platform Lightning AI. “Anything that’s high stakes” requires expert-level labeling, he explained. “Chatting with a ‘virtual BFF’ is not high stakes, providing legal advice is.”

Alex Ratner, CEO of data labeling startup Snorkel AI, says enterprise customers can spend millions of dollars on data labeling and other data tasks, which can eat up 80% of their time and AI budget. Over time, data also needs to be relabeled to stay up to date, he added.

Matt Shumer, CEO and co-founder of AI assistant startup Otherside AI, agreed that fine-tuning LLMs has become expensive. “In recent years, middle school-level data was fine, today you need high school, college and now expert skills,” he said. “It’s not cheap, of course.”

That can create budget problems for tech startups working in critical areas like healthcare. Neal Shah, CEO of CareYaYa, a platform for geriatric caregivers, says his company received a grant from Johns Hopkins University to develop “the world’s first AI care coach for dementia patients,” but the cost of data labeling is “eating us up.” Costs have shot up 40% in the past year because the labeling requires specialized input from gerontologists, dementia experts and experienced caregivers. He is working to reduce those costs by having health students and college professors do the labeling.

Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, said he has seen data labeling projects costing millions. Platforms like BeeKeeper AI, he said, could help reduce costs by allowing multiple companies to share experts, data and algorithms without exposing their private data to the others.

Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are cutting costs by using “synthetic” data, or data generated by the AI itself, to at least partially automate data collection and labeling. In some cases, models can fully automate data labeling. For example, biopharmaceutical companies train generative AI models to develop synthetic proteins for conditions such as colon cancer, diabetes and heart disease. The companies automatically run experiments based on the results of generative AI models, which provide new training data with labels.

The bottom line is that data labeling is costly and time-consuming, but it’s worth it. “Data labeling is a beast,” said CareYaYa’s Shah. “But the potential reward is enormous.”

Sharon Goldman

NEWSWORTHY

DeepMind military protest. Nearly 200 DeepMind employees want Google’s AI division to stop working with militaries, Time reports. A letter to company management says Google’s cloud business violates company rules by selling AI to military forces at war. No names are mentioned, but there are links to reports about Google’s dealings with the Israeli military and (allegedly) Israeli weapons companies. Google claims that only Israeli ministries use its cloud services, and that there are no “military workloads relevant to weapons or intelligence.”

China’s Amazon route. Reuters reports that state-linked companies in China are using Amazon’s cloud services to access the kind of sophisticated chips and AI that U.S. export controls seek to keep out of China. U.S. rules prohibit the export and transfer of sophisticated chips and AI software to Chinese companies, but access via the cloud is permitted. Amazon Web Services says it is not violating any regulations.

Cruise + Uber. GM’s Cruise robotaxi division, which is trying to get back on its feet after serious setbacks, has signed a deal with Uber to offer self-driving services in an unspecified US city, the Financial Times reported. Uber already has a similar agreement with Alphabet’s Waymo for robotaxi services in Phoenix. However, Cruise is not currently offering autonomous rides. The company is still testing its cars with human drivers, following a long hiatus that began when a pedestrian was dragged under one of its cars in San Francisco.

ON OUR FEED

“There are some interesting use cases, but overall there seems to be a lot of caution here… especially in larger companies that have complex permissions for SharePoint or Office 365 or something like that. There, the copilots are basically aggregating information that people technically have access to, but really shouldn’t have access to.”

Jack Berkowitz, chief data officer at Securiti, tells The Register that half of the peers he surveyed have stopped the rollout of Microsoft’s Copilot, an AI assistant he says can surface internal company data to employees who should not have access to it.

IN CASE YOU MISSED IT

AI is making self-driving cars possible, so why is the industry holding back?, by Sage Lazzaro

Alibaba upgrades its Hong Kong listing to primary market, potentially unlocking billions in new investment, by Lionel Lim

The stranded Boeing Starliner astronauts wanted to go home with SpaceX, but their spacesuits are not compatible with Elon Musk’s spacecraft, by Marco Quiroz-Gutierrez

A California woman outsmarted two suspected mail thieves by sending herself an AirTag (Source: Associated Press)

I sold a $1.4 billion big data startup to IBM—then started a wildlife refuge. Here are the dangers of AI’s energy consumption, by Chris Gladwin (Commentary)

BEFORE YOU GO

Jelly Pong. Scientists have managed to get a “soft and squishy, water-rich gel” to learn how to play the vintage video game Pong, The Guardian reported. In addition, the hydrogel actually gets better at the game over time because it has a memory, even though it is not sentient, the British researchers said. However, the jelly-like material is not as good a Pong player as another system, demonstrated a few years ago, that was based on a set of neurons in a dish. Happily, that system was called DishBrain.
