The hidden reason for rising AI costs – and it’s not because Nvidia chips are more expensive

Training massive AI models can cost hundreds of millions of dollars today, and forecasts suggest the figure could climb to a staggering $1 billion within a few years. Much of that cost goes to the processing power of specialized chips, typically Nvidia GPUs, which cost up to $30,000 each and are needed by the tens of thousands.

But companies training AI models, or fine-tuning existing ones to improve performance on specific tasks, also wrestle with another, often overlooked and rising cost: data labeling, the laborious process of tagging training data so that a model can learn to recognize and interpret patterns.

Data labeling has long been used to develop the AI models behind self-driving cars. A camera captures images of pedestrians, street signs, cars, and traffic lights, and human annotators tag the images with words like “pedestrian,” “truck,” or “stop sign.” The labor-intensive process has also raised ethical concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized because the data labeling that helped make the chatbot less toxic had been outsourced to Kenyan workers earning less than $2 an hour.
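Ethics aside, it helps to see what that annotation work actually produces. Below is a minimal, hypothetical sketch of a single labeled camera frame; the file name, labels, and coordinates are invented for illustration.

```python
# A hypothetical annotated frame from a driving dataset (all values invented).
labeled_frame = {
    "image": "frames/intersection_0421.jpg",
    "annotations": [
        {"label": "pedestrian", "bbox": [412, 188, 460, 310]},  # x1, y1, x2, y2 in pixels
        {"label": "stop sign",  "bbox": [603,  95, 655, 150]},
        {"label": "truck",      "bbox": [120, 160, 380, 330]},
    ],
}
# Human annotators produce thousands of records like this; a vision model then
# learns to predict the labels and boxes from the raw pixels.
```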

Today’s general-purpose large language models (LLMs) go through a data labeling exercise called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback or ratings on the model’s output. This is a significant source of rising costs, as is the effort required to label the private data that companies want to fold into their AI models, such as customer information or internal company data.
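As a rough illustration of why that feedback loop is labor-intensive, a single RLHF data point might look something like the record below. This is a generic, assumed schema, not any particular lab’s format; every field has to be filled in by a paid human rater.

```python
# A hypothetical RLHF preference record (schema assumed for illustration).
rlhf_record = {
    "prompt": "Summarize this contract clause in plain English.",
    "response_a": "<model output A>",
    "response_b": "<model output B>",
    "preferred": "response_a",   # the rater's side-by-side judgment
    "rating": 4,                 # optional 1-5 quality score
    "rater_notes": "A is accurate; B drops the termination condition.",
}
# Labs collect large numbers of judgments like this to train a reward model,
# which then nudges the LLM toward outputs humans prefer. Expert raters
# (lawyers, doctors, PhDs) make each record far more expensive to collect.
```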

In addition, labeling highly technical, subject-specific data in areas like law, finance and healthcare drives up costs. That’s because some companies hire expensive doctors, lawyers, PhDs and scientists to label certain data, or outsource the work to third-party companies like Scale AI, which recently secured a staggering $1 billion in funding as its CEO forecast strong revenue growth by year’s end.

“You now need a lawyer to label things, (which) is a crazy use of lawyer hours,” said William Falcon, CEO of AI development platform Lightning AI. “Anything that’s high stakes” requires expert-level labeling, he explained. “Chatting with a ‘virtual BFF’ is not high stakes, providing legal advice is.”

Alex Ratner, CEO of data labeling startup Snorkel AI, says enterprise customers can spend millions of dollars on data labeling and other data tasks, which can eat up 80% of their time and AI budget. Over time, data also needs to be relabeled to stay up to date, he added.

Matt Shumer, CEO and co-founder of AI assistant startup Otherside AI, agreed that fine-tuning LLMs has become expensive. “In recent years, middle school-level data was fine; today you need high school, college and now expert skills,” he said. “It’s not cheap, of course.”

That can create budget problems for tech startups working in critical areas like healthcare. Neal Shah, CEO of CareYaya, a platform for geriatric caregivers, says his company received a grant from Johns Hopkins University to develop “the world’s first AI care coach for dementia patients,” but the cost of data labeling is “eating us up.” Costs have shot up 40% in the past year because the labeling requires specialized input from gerontologists, dementia experts and experienced caregivers. He’s working to reduce those costs by having students in health fields and college professors handle the labeling.

Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, said he has seen data labeling projects costing millions. Platforms like BeeKeeper AI, he said, could help reduce costs by allowing multiple companies to share experts, data and algorithms without exposing their private data to the others.

Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are cutting costs by using “synthetic” data – data generated by the AI itself – to at least partially automate data collection and labeling. In some cases, models can fully automate data labeling. For example, biopharmaceutical companies train generative AI models to develop synthetic proteins for conditions such as colon cancer, diabetes and heart disease. The companies automatically run experiments based on the results of generative AI models, which provide new training data with labels.
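One common pattern behind that kind of cost-cutting, sketched below with hypothetical function and model names, is to let an existing model propose labels automatically and reserve human experts for the cases it is unsure about.

```python
# Hypothetical sketch of model-assisted labeling: auto-accept confident
# predictions, route uncertain ones to human experts.
def auto_label(records, model, confidence_threshold=0.9):
    """Split records into model-labeled items and items needing human review."""
    auto_labeled, needs_review = [], []
    for record in records:
        label, confidence = model.predict(record)  # assumed model interface
        if confidence >= confidence_threshold:
            auto_labeled.append({**record, "label": label, "source": "model"})
        else:
            needs_review.append(record)
    return auto_labeled, needs_review
```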

The bottom line is that data labeling is costly and time-consuming, but it’s worth it. “Data labeling is a beast,” said CareYaya’s Shah. “But the potential reward is enormous.”

Sharon Goldman

This story originally appeared on Fortune.com
