AI models perform only as well as the data used to train or fine-tune them.
Labeled data, information tagged so that models can learn what a correct output looks like during training, has been a foundational element of machine learning (ML) and generative AI for much of their history.
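To make that concrete, the snippet below shows an illustrative (hypothetical) labeled record alongside the label-free input that an approach like TAO starts from:

```python
# Illustrative only: the difference between labeled and unlabeled training data.
labeled_example = {
    "input": "What was Q3 revenue?",   # the prompt
    "label": "Q3 revenue was $4.2M.",  # human-written target answer (costly to produce)
}
unlabeled_example = {
    "input": "What was Q3 revenue?",   # the prompt alone; no target answer needed
}
```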
As enterprises race to implement AI applications, the hidden bottleneck often isn’t technology – it’s the months-long process of collecting, curating and labeling domain-specific data. This “data labeling tax” has forced technical leaders to choose between delaying deployment or accepting suboptimal performance from generic models.
Databricks is taking direct aim at that challenge.
This week, the company released research on a new approach called Test-time Adaptive Optimization (TAO). The basic idea is to enable enterprise-grade large language model (LLM) tuning using only input data that companies already have, with no labels required, while achieving results that outperform traditional fine-tuning on thousands of labeled examples. Databricks started as a data lakehouse platform vendor and has increasingly focused on AI in recent years. The company acquired MosaicML for $1.3 billion and is steadily rolling out tools that help developers create AI apps rapidly; its Mosaic research team developed the new TAO method.
“Getting labeled data is hard and poor labels will directly lead to poor outputs, this is why frontier labs use data labeling vendors to buy expensive human-annotated data,” Brandon Cui, reinforcement learning lead and senior research scientist at Databricks, told VentureBeat. “We want to meet customers where they are, labels were an obstacle to enterprise AI adoption, and with TAO, no longer.”
The technical innovation: How TAO reinvents LLM fine-tuning
At its core, TAO shifts the paradigm of how developers personalize models for specific domains.
Rather than the conventional supervised fine-tuning approach, which requires paired input-output examples, TAO uses reinforcement learning and systematic exploration to improve models using only example queries.
The technical pipeline employs four distinct mechanisms working in concert (a minimal code sketch follows the list):
Exploratory response generation: The system takes unlabeled input examples and generates multiple potential responses for each using advanced prompt engineering techniques that explore the solution space.
Enterprise-calibrated reward modeling: Generated responses are evaluated by the Databricks Reward Model (DBRM), which is specifically engineered to assess performance on enterprise tasks with emphasis on correctness.
Reinforcement learning-based model optimization: The model parameters are then optimized through reinforcement learning, which essentially teaches the model to generate high-scoring responses directly.
Continuous data flywheel: As users interact with the deployed system, new inputs are automatically collected, creating a self-improving loop without additional human labeling effort.
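The sketch below renders that four-step loop as minimal Python. Everything in it is a hypothetical stand-in, since Databricks has not published TAO's internals: ToyModel plays the LLM, toy_reward plays the DBRM reward model, and the "optimization" step simply caches the winning response rather than performing a real reinforcement learning update.

```python
import random

class ToyModel:
    """Stands in for an LLM; returns a cached answer once 'tuned' on a prompt."""
    def __init__(self):
        self.tuned = {}  # prompt -> preferred response (toy substitute for weight updates)

    def generate(self, prompt):
        if prompt in self.tuned:
            return self.tuned[prompt]
        draft = random.choice(["(draft A)", "(draft B)", "(draft C)"])
        return f"Answer to '{prompt}' {draft}"

def toy_reward(prompt, response):
    """Stands in for DBRM: arbitrarily prefers 'draft B' to simulate a quality score."""
    return 1.0 if "draft B" in response else 0.0

def tao_step(model, prompt, n_candidates=4):
    # 1. Exploratory response generation: sample several candidates per unlabeled prompt.
    candidates = [model.generate(prompt) for _ in range(n_candidates)]
    # 2. Reward modeling: score each candidate, with no human label involved.
    scored = [(c, toy_reward(prompt, c)) for c in candidates]
    # 3. "RL" optimization (toy version): reinforce the highest-scoring response.
    best, _ = max(scored, key=lambda pair: pair[1])
    model.tuned[prompt] = best
    return best

model = ToyModel()
# 4. Data flywheel: every new user prompt simply feeds another tuning step.
for prompt in ["summarize contract 17", "summarize contract 17"]:
    print(tao_step(model, prompt))
```

In the real pipeline, step 2 would call DBRM and step 3 would update model weights through a reinforcement learning objective; the structure of the loop, not the toy internals, is what the sketch is meant to show.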
Test-time compute is not a new idea. OpenAI used test-time compute to develop the o1 reasoning model, and DeepSeek applied similar techniques to train the R1 model. What distinguishes TAO from other test-time compute methods is that while it uses additional compute during training, the final tuned model has the same inference cost as the original model. This offers a critical advantage for production deployments where inference costs scale with usage.
“TAO only uses additional compute as part of the training process; it does not increase the model’s inference cost after training,” Cui explained. “In the long run, we think TAO and test-time compute approaches like o1 and R1 will be complementary—you can do both.”
Benchmarks reveal surprising performance edge over traditional fine-tuning
Databricks’ research suggests TAO doesn’t just match traditional fine-tuning – it surpasses it. Across multiple enterprise-relevant benchmarks, the company claims the approach delivers better results despite requiring significantly less human effort.
On FinanceBench (a financial document Q&A benchmark), TAO improved Llama 3.1 8B performance by 24.7 percentage points and Llama 3.3 70B by 13.4 points. For SQL generation using the BIRD-SQL benchmark adapted to Databricks’ dialect, TAO delivered improvements of 19.1 and 8.7 points, respectively.
Most remarkably, the TAO-tuned Llama 3.3 70B approached the performance of GPT-4o and o3-mini across these benchmarks—models that typically cost 10-20x more to run in production environments.
This presents a compelling value proposition for technical decision-makers: the ability to deploy smaller, more affordable models that perform comparably to their premium counterparts on domain-specific tasks, without the traditionally required extensive labeling costs.

TAO enables time-to-market advantage for enterprises
While TAO delivers clear cost advantages by enabling the use of smaller, more efficient models, its greatest value may be in accelerating time-to-market for AI initiatives.
“We think TAO saves enterprises something more valuable than money: it saves them time,” Cui emphasized. “Getting labeled data typically requires crossing organizational boundaries, setting up new processes, getting subject matter experts to do the labeling and verifying the quality. Enterprises don’t have months to align multiple business units just to prototype one AI use case.”
This time compression creates a strategic advantage. For example, a financial services company implementing a contract analysis solution could begin deploying and iterating using only sample contracts, rather than waiting for legal teams to label thousands of documents. Similarly, healthcare organizations could improve clinical decision support systems using only physician queries, without requiring paired expert responses.
“Our researchers spend a lot of time talking to our customers, understanding the real challenges they face when building AI systems, and developing new technologies to overcome those challenges,” Cui said. “We are already applying TAO across many enterprise applications and helping customers continuously iterate and improve their models.”
What this means for technical decision-makers
For enterprises looking to lead in AI adoption, TAO represents a potential inflection point in how specialized AI systems are deployed. Achieving high-quality, domain-specific performance without extensive labeled datasets removes one of the most significant barriers to widespread AI implementation.
This approach particularly benefits organizations with rich troves of unstructured data and domain-specific requirements but limited resources for manual labeling – precisely the position in which many enterprises find themselves.
As AI becomes increasingly central to competitive advantage, technologies that compress the time from concept to deployment while simultaneously improving performance will separate leaders from laggards. TAO appears poised to be such a technology, potentially enabling enterprises to implement specialized AI capabilities in weeks rather than months or quarters.
Currently, TAO is only available on the Databricks platform and is in private preview.