Data Processing Challenges for RAG-based Applications

The retrieval and processing of data is the cornerstone of any RAG-based solution, as it provides the means to generate a knowledge base that can be adapted to serve different purposes. Any retrieval pipeline is ultimately a transformation of input files into a set of facts that the user can query. The particularities of the RAG application determine the expected nature of the knowledge to be delivered, and therefore must be taken into account in the fact generation process.

Adapting the document-to-fact transformation to our purposes presents several challenges, namely text extraction from differently formatted files, the selection of relevant information, and the processing of the retrieved information in a way that serves the expected user demands. These three challenges correspond to the main steps in the pathway from raw text to useful knowledge: (raw) extraction, information extraction, and knowledge generation.

The Extraction Challenge

The first step of any knowledge retrieval pipeline is the extraction of raw text from user input files. Different approaches are required depending on the nature of the files, ranging from format-specific document libraries to general optical character recognition (OCR) models. Ideally, extraction should capture more than just the textual content of a file; it should also include additional metadata such as the position of the content within the file or its relative size.

Assuming that reliable models exist for our formats of interest, the extraction challenge lies in the fact that every extraction model exposes a different subset of metadata, which makes it impossible to treat all input types the same way, even if the textual content itself can be retrieved for all of them.

For instance, text extraction from PDF files can be achieved with OCR models such as Tesseract, which for every extracted fragment provides its page number and bounding box coordinates. These measures provide a way of classifying fragments into categories (e.g. titles, paragraphs, or bullet points) and semantic groups, such as sections or tables. This information, which is fundamental for reconstructing the semantic relationships between fragments that ultimately constitute the knowledge base, is not available in other document formats such as EPUB files, in which only raw text is available, nor in other general information carriers such as images, videos, or voice notes.
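As a concrete illustration, here is a minimal sketch of layout-aware extraction from a scanned PDF. It assumes the pytesseract and pdf2image Python packages (plus a Tesseract installation) are available; the fragment dictionary keys are our own illustrative choice, not a fixed schema.

```python
from pdf2image import convert_from_path  # renders each PDF page as an image
import pytesseract
from pytesseract import Output

def extract_fragments(pdf_path: str) -> list[dict]:
    """Return word-level fragments with page number and bounding box."""
    fragments = []
    for page_number, image in enumerate(convert_from_path(pdf_path), start=1):
        # image_to_data returns parallel lists: text, left, top, width, height, block_num, ...
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
        for i, word in enumerate(data["text"]):
            if not word.strip():
                continue  # skip empty detections
            fragments.append({
                "text": word,
                "page": page_number,
                "bbox": (data["left"][i], data["top"][i],
                         data["left"][i] + data["width"][i],
                         data["top"][i] + data["height"][i]),
                "block": data["block_num"][i],  # Tesseract's own coarse layout grouping
            })
    return fragments
```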

Ultimately, the only information that is common to all cases is the textual content itself, which will therefore be the source of factual information. Nevertheless, knowledge will still have to be retrieved (or inferred) from it by leveraging the metadata gathered by the different extraction processes.
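One way to make this explicit is a format-agnostic fragment type in which only the text is guaranteed, and every other field is optional metadata that a given extractor may or may not fill in. The sketch below is an assumption about how such a type could look, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Fragment:
    text: str                                    # the only field common to all formats
    source: str = "unknown"                      # e.g. "pdf", "epub", "audio"
    page: Optional[int] = None                   # paginated formats only
    bbox: Optional[tuple[float, float, float, float]] = None  # OCR-style layout info
    timestamp: Optional[float] = None            # e.g. audio or video transcripts
    extra: dict = field(default_factory=dict)    # any extractor-specific metadata
```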

The Information Challenge

Once the textual content and its corresponding metadata have been extracted from the user files, a data processing pipeline must be designed to assess the relevance of each extract and its relationship with neighbouring extracts. This requires making sense of the extracted metadata so that the whole information content of the extract is retrieved.

In most cases, a human generates the content of the files, and another human (sometimes the same one, at a later time) will retrieve that information in some way. For that reason, the querying human will expect to find the same implicit semantic relations that were encoded by the content-generating human. These implicit semantic relations are encoded in many ways:

  • Assumption of causality or sequentiality of the text corpus. As a result, the semantic similarity of a text extract with its neighbours is expected to decay with increasing neighbour distance (a small empirical check is sketched after this list).

  • Implicit clustering of the text corpus into paragraphs through punctuation marks and line spacing, and into sections and subsections through titles. These groups represent semantically differentiable units that have been implicitly set by the human writer, and the querying human will very likely expect to find these associations in the same way as they were encoded. The same argument can be made with rhythm and tone for voice messages or videos.

  • Implicit encoding of information density through formatting. For example, data contained in tables is expected to be denser and is therefore consumed (and queried!) in a different way than information contained in paragraphs or images.
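The decay assumption from the first point can be checked empirically. The sketch below computes the mean cosine similarity between paragraphs as a function of their distance in the document; embed is a placeholder for whatever sentence-embedding model is in use.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one L2-normalised embedding per text."""
    raise NotImplementedError("plug in an embedding model here")

def mean_similarity_by_offset(paragraphs: list[str], max_offset: int = 5) -> dict[int, float]:
    vectors = embed(paragraphs)  # shape (n_paragraphs, dim), rows normalised
    decay = {}
    for offset in range(1, min(max_offset, len(paragraphs) - 1) + 1):
        # cosine similarity between each paragraph and the one `offset` positions later
        sims = np.sum(vectors[:-offset] * vectors[offset:], axis=1)
        decay[offset] = float(sims.mean())
    return decay  # typically decreases as the offset grows
```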

It is only through metadata that we can infer those implicit relationships and therefore generate relevant information out of our raw extracts. For instance, any knowledge extraction pipeline for text documents must account, at least, for these four aspects (a heuristic sketch covering the first two follows the list):

  • Paragraph grouping

  • Section grouping

  • Tabular format

  • Figures and other graphics
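As an illustration of the first two aspects, the following heuristic groups reading-ordered OCR fragments into sections of paragraphs using only layout metadata. The field names (font_size, bbox) and thresholds are assumptions for the sketch, not part of any particular extractor's output.

```python
def group_fragments(fragments: list[dict],
                    paragraph_gap: float = 12.0,
                    title_font_ratio: float = 1.3) -> list[dict]:
    """Group line-level fragments into sections of paragraphs using layout cues."""
    sizes = sorted(f["font_size"] for f in fragments)
    body_size = sizes[len(sizes) // 2]            # median font size ~ body text
    sections, paragraph = [], []
    current = {"title": None, "paragraphs": []}
    previous_bottom = None
    for frag in fragments:                         # assumed sorted in reading order
        if frag["font_size"] > title_font_ratio * body_size:
            # Noticeably larger text is treated as a section title
            if paragraph:
                current["paragraphs"].append(" ".join(paragraph))
                paragraph = []
            if current["title"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"title": frag["text"], "paragraphs": []}
        else:
            top = frag["bbox"][1]                  # bbox = (x0, y0, x1, y1)
            if previous_bottom is not None and top - previous_bottom > paragraph_gap and paragraph:
                # A large vertical gap closes the current paragraph
                current["paragraphs"].append(" ".join(paragraph))
                paragraph = []
            paragraph.append(frag["text"])
        previous_bottom = frag["bbox"][3]
    if paragraph:
        current["paragraphs"].append(" ".join(paragraph))
    sections.append(current)
    return sections
```

Tables and figures need their own handling (e.g. detecting grid-aligned bounding boxes or image regions), which is precisely why a single generic chunking strategy is rarely enough.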

The Knowledge Challenge

Extracting relevant pieces of information is not the last step in building a knowledge base, or at least not one that is suitable for RAG-based applications. That is because knowledge refers not only to the information contained in the facts themselves but also to the relationships existing between them. Ideally, our retrieval pipeline should be able to provide at least one fact for any relevant question, and, at the same time, each fact should implicitly represent an answer to at least one relevant question. It is clear that the ability to pair a query with an existing fact is heavily determined by the way facts are organised into a knowledge base.

To begin with, a properly organised knowledge base groups information by similarity, and therefore by the probability of being retrieved when a query about a specific topic is asked. This helps navigate the knowledge space efficiently, which in turn allows for a greater degree of scalability of the RAG pipeline.
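A minimal sketch of such similarity-based grouping, assuming scikit-learn is available and fact_vectors holds one precomputed embedding per fact:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_facts_by_topic(fact_vectors: np.ndarray, n_topics: int = 8) -> np.ndarray:
    """Assign each fact to one of n_topics similarity clusters."""
    kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    return kmeans.fit_predict(fact_vectors)  # label[i] = topic of fact i
```

At query time, only the cluster(s) whose centroid lies closest to the query embedding need to be searched, which keeps retrieval cost roughly constant as the knowledge base grows.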

In addition, these relations should be formed at different levels of contextual precision. For instance, relations can be made at the topic level, document level, section level, and even paragraph level. The knowledge encoded in the text is complete only when these different relationships are leveraged in a multi-step process, as each one encodes a specific degree of semantic precision that a query might (or might not) require. It follows that the semantic precision of the query itself will also have to be determined.
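A coarse-to-fine retrieval sketch over three such levels is shown below; embed and the nested document/section/paragraph structure are placeholders for whatever index is actually in place.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model here")

def best(query_vec: np.ndarray, items: list[dict], k: int) -> list[dict]:
    """Return the k items whose (normalised) embedding is most similar to the query."""
    scores = [float(query_vec @ item["vector"]) for item in items]
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:k]]

def retrieve(query: str, documents: list[dict], k_docs: int = 3,
             k_secs: int = 5, k_pars: int = 8) -> list[dict]:
    q = embed(query)
    docs = best(q, documents, k_docs)                                     # document level
    secs = best(q, [s for d in docs for s in d["sections"]], k_secs)      # section level
    pars = best(q, [p for s in secs for p in s["paragraphs"]], k_pars)    # paragraph level
    return pars
```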

Finally, knowledge formation should also utilise augmentation techniques, often requiring the intervention of language models. This is because knowledge can also be generated through a deductive process from a sequence of facts, which can sometimes be very lengthy, rather than solely from a specific set of interconnected facts without hierarchy or order. One example of this is narrative text. Only after reading the entire Harry Potter series can one truly explain who Harry Potter is.
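One way to generate such deductive, long-range knowledge is recursive summarisation: long sequences of facts are condensed bottom-up into higher-level facts that are indexed alongside the originals (the first paper linked under "Additional good reads" explores a tree-organised variant of this idea). In the sketch, summarise stands in for any language-model call rather than a specific API.

```python
def summarise(texts: list[str]) -> str:
    raise NotImplementedError("call a language model of choice here")

def build_summary_levels(chunks: list[str], group_size: int = 10) -> list[list[str]]:
    """Return [chunks, level-1 summaries, level-2 summaries, ...] until one node remains."""
    levels = [chunks]
    while len(levels[-1]) > 1:
        current = levels[-1]
        levels.append([summarise(current[i:i + group_size])
                       for i in range(0, len(current), group_size)])
    return levels  # every level can be embedded and retrieved like ordinary facts
```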

These and other challenges require intensive research and customised information retrieval pipelines that go well beyond the limited capabilities of large language models and their always insufficient context window size.

At UTHEREAL.ai we are leading the way in addressing these challenges: we have built our own state-of-the-art solutions for OCR, processing, embedding, and search, and we continue to push the edge of what is possible.

Additional good reads

https://arxiv.org/pdf/2401.18059v1

https://arxiv.org/pdf/1810.08473

J@UTHEREAL.AI

CTO & Chief Scientist
