Building with LLMs — Key considerations and emerging research topics

Fine-tuning LLMs vs RAG: A technical overview

Generative AI has revolutionised the field of machine learning by introducing large language models (LLMs) that can understand and generate human-like text. As these foundation models grow in capability, adapting them to specific use cases and clients becomes essential. Both fine-tuning and Retrieval-Augmented Generation (RAG) are pivotal techniques in this customisation process, and both address the issue of “hallucinations”: instances where models generate plausible but factually incorrect information.

A notable challenge when deploying LLMs is this tendency to hallucinate. Recent advancements in the mathematical foundations of prompt engineering offer new perspectives on mitigating the problem. Research that formalizes LLMs as discrete stochastic dynamical systems and studies them through the lens of control theory reveals significant insights into how far prompts can steer model outputs.

Mathematical and Empirical Analysis

This approach analyzes the limits on the controllability of self-attention mechanisms, which are central to the functioning of LLMs. The analysis focuses on how the singular values of the attention parameter matrices constrain the model’s response to prompts. Empirical studies on models such as Falcon-7b, Llama-7b, and Falcon-40b have shown that, with prompts of up to 10 tokens prepended to initial states drawn from Wikitext, the “correct” next token (the one aligned with factual accuracy) is reachable at least 97% of the time. Additionally, the top 75 most likely next tokens are reachable 85% of the time, indicating a high degree of control over the output probabilities.
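As a rough illustration of the quantities involved, the sketch below (assuming PyTorch, the Hugging Face transformers library, and GPT-2 as a stand-in model; the published analysis targets larger models) simply inspects the singular values of a self-attention projection matrix, the values that the controllability analysis ties to how strongly a prompt can steer the output.

```python
# Illustrative sketch only: GPT-2 stands in for the models studied above; the
# controllability results are analytical, this merely shows where the relevant
# singular values live.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# c_attn stacks the query/key/value projections of the first attention block
W = model.transformer.h[0].attn.c_attn.weight.detach()
singular_values = torch.linalg.svdvals(W)

# the largest singular values constrain how strongly a prompt can move the attention output
print(singular_values[:5])
```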

Intriguingly, this research highlights that even short prompt sequences can dramatically shift the likelihood of specific outputs, potentially turning the least likely tokens into the most probable ones. This insight is pivotal for both fine-tuning and RAG approaches that rely heavily on prompt engineering to guide LLM outputs towards more accurate and relevant responses.
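A small experiment makes this concrete. The sketch below is illustrative only (it assumes the Hugging Face transformers library, uses GPT-2 as a stand-in model, and the prompt strings are invented); it compares the probability a model assigns to a target next token with and without a short steering prefix.

```python
# Hedged sketch: model, prompts, and target token are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(context: str, target: str) -> float:
    """Probability assigned to `target` as the very next token after `context`."""
    target_id = tokenizer.encode(" " + target)[0]        # leading space follows GPT-2's BPE convention
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]          # logits for the next position only
    return torch.softmax(logits, dim=-1)[target_id].item()

base = "The capital of France is"
print(next_token_prob(base, "Paris"))                                        # baseline probability
print(next_token_prob("Answer with the city name only. " + base, "Paris"))   # with a short steering prefix
```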

Understanding the Techniques: Fine-Tuning vs. RAG

Fine-Tuning: Fine-tuning involves retraining an existing model on a new dataset to refine its knowledge or adjust its reasoning capabilities. This approach can incorporate new knowledge directly into the model’s parameters, enhancing its contextual relevance. Techniques like Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) are instrumental here: rather than updating all of the model’s weights, they train small low-rank (and, for QLoRA, quantized) adapter matrices, sharply reducing the compute and memory required while maintaining performance.

However, fine-tuning is not foolproof; the model remains susceptible to generating hallucinations under certain conditions. Recent findings from an OpenAI study highlight an inherent limitation of fine-tuned language models. The study establishes a statistical lower bound on the rate at which even well-calibrated, pretrained language models hallucinate certain types of facts. Specifically, for arbitrary facts (those whose truth cannot be ascertained from the training data), hallucinations are statistically unavoidable if the model meets a minimal calibration criterion. Unfortunately, in most interesting use cases the training data does not fully cover the facts required to answer queries, so a residual rate of hallucination is inevitable. This intrinsic predisposition arises regardless of the transformer architecture or data quality, presenting a significant challenge for fine-tuning efforts aimed at mitigating such errors.
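For readers unfamiliar with LoRA, the minimal sketch below (assuming the Hugging Face transformers and peft libraries; the base model and hyperparameters are illustrative rather than a recommended configuration) shows how low-rank adapters are attached so that only a small fraction of the parameters is actually trained.

```python
# Hedged sketch of LoRA-style parameter-efficient fine-tuning; all values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling applied to the update
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

# Training then proceeds as usual (e.g. with transformers.Trainer) on the new dataset;
# the frozen base weights are left untouched.
```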

Retrieval-Augmented Generation (RAG): Conversely, RAG leverages external data sources to supplement the model’s responses, enhancing accuracy and reducing hallucinations. By dynamically pulling in relevant information during the generation process, RAG provides a robust solution to maintain factual correctness. However, integrating RAG is an engineering challenge, requiring sophisticated systems to manage data retrieval across multiple sources effectively.
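The core retrieval step behind RAG is conceptually simple, as the minimal sketch below illustrates (it assumes the sentence-transformers library; the documents, model name, and prompt template are invented for illustration and do not describe our production pipeline).

```python
# Hedged sketch of the retrieve-then-generate loop; the document store is a toy example.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity on normalized vectors)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "How long do I have to return an item?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` is then passed to the LLM, grounding the answer in retrieved sources.
```

In practice the hard part is everything around this loop: chunking and indexing at scale, ranking across heterogeneous sources, and keeping the store up to date.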

At UTHEREAL.ai, we are building the future of RAG solutions, offering unparalleled accuracy and flexibility. Our platform is designed to work seamlessly with a diverse set of formats, adapting to both textual and multimedia information sources. With a multi-stage approach to retrieval and data processing, we deliver high-quality retrieval across every type of query. Our fact-based methodology guarantees that all user interactions yield references to specific textual sources, providing transparency and reliability. Furthermore, our data ingestion and retrieval solution supports more than 10 languages, enabling seamless integration across diverse linguistic landscapes and simultaneous multilingual document processing.

Conclusion: Weighing the Options

While both fine-tuning and RAG offer significant benefits, choosing between them depends on specific project requirements and constraints. Fine-tuning is effective for adding tailored knowledge to a model but comes with inherent risks of inaccuracies. RAG, though complex to implement, provides a safeguard against such errors by ensuring that external, validated information supports generated content.

J@UTHEREAL.AI

CTO & Chief Scientist
