Large language models (LLMs) are prone to factual and logical errors, especially when dealing with complex reasoning tasks. To address this challenge, researchers often use verifiers or reward models to evaluate and select the most accurate responses from a set of LLM-generated outputs.
In a new paper, researchers at Google DeepMind, University of Toronto, Mila, and UCLA introduce GenRM, a novel approach that leverages the generative capabilities of LLMs to create more effective verifiers. GenRM can be a practical tool for LLM applications where current verification methods fail.
The limitations of classic verifiers and reward models
One of the common methods to improve the accuracy of LLMs is to have them generate several candidate answers and then use a separate component to select the best one. This approach requires a reliable verifier or reward model.
In reasoning domains, LLM-based verifiers are typically trained as discriminative reward models (RMs) to assign numerical scores to candidate solutions, which are then used to classify them as correct or incorrect. However, these RMs do not fully use the strengths of LLMs in generating and processing responses.
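To make the contrast concrete, here is a minimal sketch of how a discriminative reward model is typically structured and used to pick the best of several candidate answers. It assumes a Hugging Face-style causal LM backbone with a scalar scoring head; the model name, pooling choice, and helper function are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch (illustrative, not the paper's exact setup): a discriminative
# reward model is an LLM backbone with a scalar head that scores each
# (question, candidate solution) pair; the verifier never generates text.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DiscriminativeRM(nn.Module):
    def __init__(self, backbone_name: str = "google/gemma-2b"):  # assumed backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of the last non-padding token (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(pooled).squeeze(-1)  # higher = judged more correct

def best_of_n(rm, tokenizer, question, candidates):
    """Score each candidate solution and return the highest-scoring one."""
    batch = tokenizer([f"{question}\n{c}" for c in candidates],
                      return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = rm(batch["input_ids"], batch["attention_mask"])
    return candidates[int(scores.argmax())]
```

The scalar head is trained against correct/incorrect labels, and at inference the verifier only emits a number per candidate; it never produces text of its own, which is the generative capability GenRM sets out to reclaim.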
“Even though classic reward models (RMs) / verifiers are trained by fine-tuning LLMs, they do not leverage the text generation capabilities that LLMs are fundamentally designed for,” Rishabh Agarwal, co-author of the paper and Senior Research Scientist at DeepMind, told VentureBeat.
Another popular technique, LLM-as-a-Judge, relies on advanced prompting to evaluate responses. While flexible, it lacks the abilities that reward models acquire through training on domain-specific verification data.
Generative reward models
DeepMind’s GenRM proposes a different approach: training verifiers using next-token prediction to leverage the text generation capabilities of LLMs.
“Training RMs via next token prediction enables them to tap into numerous benefits of generative LLMs,” Agarwal said. “We showed how the same model can both verify and generate solutions, think ‘more’ before verification by using chain-of-thought, and use additional compute at test-time to improve accuracy.”
In GenRM, the verification decision is represented as a token. For example, to produce a numerical score for a solution, the verifier is given a prompt such as “Is the answer correct?” and the score is represented as the probability of a single text token (e.g., “Yes” or “No”) under the context and the prompt.
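As a rough sketch of that idea (the verification prompt wording, model name, and tokenization details below are assumptions, not the paper’s exact recipe), a direct generative verifier simply reads off the next-token probability of “Yes”:

```python
# Hedged sketch: score a candidate solution as P("Yes") for the next token,
# given the question, the solution, and a verification prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # assumed model
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

def p_yes(text: str) -> float:
    """Probability that the next token after `text` is "Yes" (token id may vary by tokenizer)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]                 # next-token logits
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return torch.softmax(logits, dim=-1)[yes_id].item()

def genrm_direct_score(question: str, solution: str) -> float:
    prompt = f"{question}\n{solution}\nIs the answer correct? "
    return p_yes(prompt)                                        # score in [0, 1]
```

Because the score is just a next-token probability, the same fine-tuned model can both generate solutions and verify them under one training objective.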
Since verification often involves complex reasoning, generative verifiers can naturally benefit from advanced prompting techniques such as chain-of-thought (CoT) reasoning, where the model is prompted to generate a thought process before the answer.
“Specifically, we can generate intermediate reasoning steps or critique (CoT) before making a decision about the solution correctness, which may identify subtle reasoning errors missed by direct verifiers,” the researchers write.
The CoT rationales used to train the GenRM model can either be generated by humans or by another LLM. During inference, the GenRM first generates a CoT rationale and then uses the probability of the “Yes” token to assign a correctness score.
The researchers further enhance the verification accuracy of CoT verifiers using majority voting. They sample multiple CoT chains and calculate the average score of the “Yes” token across all samples, making effective use of test-time computation.
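Continuing the sketch above (reusing its `model`, `tokenizer`, and `p_yes`; the prompt text, sampling settings, and number of samples are again assumptions), a CoT verifier with majority voting samples several rationales and averages the resulting “Yes” probabilities:

```python
# Sketch of CoT verification with majority voting: sample k verification
# rationales, score P("Yes") after each one, and average the scores.
def genrm_cot_score(question: str, solution: str, k: int = 8) -> float:
    prompt = f"{question}\n{solution}\nLet's verify the solution step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = []
    for _ in range(k):
        # Sample one chain-of-thought critique of the candidate solution.
        out = model.generate(**inputs, do_sample=True, temperature=0.7,
                             max_new_tokens=256)
        rationale = tokenizer.decode(out[0], skip_special_tokens=True)
        # Condition on the sampled rationale, then ask for the final verdict.
        scores.append(p_yes(rationale + "\nIs the answer correct? "))
    return sum(scores) / len(scores)   # averaged score used to rank candidates
```

Raising `k` spends more test-time compute in exchange for a more reliable verification score.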
“GenRM can be viewed as unifying LLM-as-a-Judge with classic verifiers: it corresponds to a trained LLM-as-a-Judge on domain-specific verification data,” Agarwal said. “As such, GenRM makes sense for any domain where off-the-shelf prompted LLMs are not good enough.”
GenRM in action
To evaluate GenRM’s effectiveness, the DeepMind researchers tested it on several reasoning tasks, including last-letter concatenation, word sorting, and math word problems. They compared GenRM against standard approaches, including discriminative reward models, LLM-as-a-Judge, and “self-consistency,” where the model generates several answers and the most common one is selected as the final response.
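For reference, the self-consistency baseline amounts to a majority vote over sampled answers, as in this toy example (the sampled values are made up):

```python
# Toy illustration of self-consistency: keep the most common sampled answer.
from collections import Counter

sampled_answers = ["42", "42", "41", "42", "40"]   # hypothetical LLM samples
final_answer = Counter(sampled_answers).most_common(1)[0][0]
print(final_answer)  # -> 42
```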
Across all tasks, GenRM with CoT consistently outperformed the other methods, including the specially trained discriminative reward model, by several percentage points. On the GSM8K math reasoning benchmark, a Gemma-9B model trained for GenRM solved 92.8% of the problems, surpassing the performance of GPT-4 and Gemini 1.5 Pro.
“Unifying solution generation with verification, as done by GenRM using the next-token-prediction objective, consistently improves verification performance across all tasks,” the researchers write. “This improvement is observed for both direct and CoT-based generative verifiers, suggesting that teaching the verifier to imitate correct solutions generally helps.”
The experiments also showed that GenRM scales favorably with increasing dataset size and model capacity. Furthermore, GenRM with CoT continues to improve when allowed to sample more responses. This gives more flexibility to LLM application developers to balance accuracy and compute costs.
“Compared to classic verifiers, GenRM using the same data can still outperform them (by jointly training on generation and verification), and GenRM training is just standard fine-tuning,” Agarwal said. “That said, to fully utilize the GenRM abilities, we need critiques/verification rationales that explain the reward label. For high-quality data, this can be done using humans, but a more scalable option would be to use synthetic LLM-generated rationales.”
Possible future directions for GenRM could include scaling synthetic verification rationales on open-ended generation tasks, integrating GenRMs into reinforcement learning pipelines, and leveraging advanced LLM capabilities such as few-shot learning, retrieval-augmented generation, ReAct, and code generation and execution to enhance verification.