Retrieval augmented generation (RAG) is an exciting paradigm in natural language processing that combines the strengths of neural retrieval and neural text generation. For data scientists, RAG opens up new possibilities for building AI systems that generate high-quality, knowledge-grounded text.
In RAG, a retriever module first searches a large corpus for information relevant to the current context. This retrieved information then conditions a generator module, which produces the final text output. By retrieving external knowledge, the generator can stay grounded in facts rather than hallucinating text.
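Conceptually, the whole pipeline is just two steps. Here is a minimal sketch, where retrieve and generate are placeholders standing in for whatever retriever and generator you plug in, not a specific library:

# Conceptual RAG pipeline (retrieve and generate are placeholder functions)
def rag_answer(query, corpus_index, k=5):
    docs = retrieve(corpus_index, query, k)     # 1. find relevant passages
    prompt = "\n".join(docs) + "\n\n" + query   # 2. condition the generator on them
    return generate(prompt)                     # 3. produce grounded text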
RAG has shown promising results on open-domain QA, dialogue systems, summarization, and other language generation tasks. For data scientists, it provides a way to imbue AI systems with reasoning capabilities, common sense, and a shared understanding of the world.
In this comprehensive guide, we will cover the fundamentals of RAG, use cases, risks of not using RAG, implementation examples, and the future outlook for this burgeoning research area. By the end, data scientists will have a solid grasp of RAG and how to apply it in real-world systems.
Use Cases
RAG combines the best of search and text generation. This one-two punch makes it ideal for use cases like:
- Open-domain question answering – RAG systems like REALM from Google can answer factual questions by retrieving evidence from Wikipedia before generating a response.
- Dialogue systems – RAG allows conversations to stay grounded in facts rather than hallucinating responses. The retriever can find persona/profile information and conversation history to contextualize the dialogue.
- Summarization – The retriever can find other similar documents or passages to augment the summary. This improves factual consistency.
- Creative writing – The retriever can provide writers with relevant world knowledge or pieces of information to inspire the writing process.
- Any application that benefits from external knowledge – By scaling to corpus sizes in the billions of documents, RAG systems can incorporate knowledge that would be difficult to memorise in a standard seq2seq model.
Data scientists working on any kind of generative AI can consider RAG to increase the longevity, factuality, and coherence of systems. Next we discuss the risks of not utilizing RAG in such systems.
Risks of Not Using RAG
Neural text generators like GPT-4 make producing convincing text easy, perhaps too easy. Without any grounding, these models are prone to hallucinating facts, repeating disinformation, or responding inconsistently in conversation. Some risks include:
- Factual inaccuracy – Lack of grounding can lead models to generate plausible-sounding but incorrect statements. This is dangerous for any applications in medicine, science, or journalism.
- Lack of a consistent personality – With no memory, models will repeatedly change personalities when conversing. This hurts believability.
- Repetition of false premises – Neural generators are prone to affirming and repeating false premises if stated to them. Real-world grounding is needed to avoid this.
- No shared understanding – Without common grounding, different models will not align in their understanding of concepts, people, places, and things.
- Difficult to maintain and extend – Knowledge memorized in model weights can only be updated by retraining, and new knowledge must be re-learned from scratch. An external index, by contrast, can simply be re-indexed.
RAG provides a solution that grounds generation in shared facts. This mitigates the risks above to increase robustness. Data scientists should strongly consider RAG for deployable systems where these risks could prove costly over time.
Getting Started with RAG
For data scientists new to working with large pretrained language models, RAG can appear daunting to implement from scratch. Thankfully, frameworks like HuggingFace Transformers make it easy to utilize state-of-the-art RAG systems with just a few lines of code.
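To see just how few lines, here is the canonical Transformers usage, which mirrors the library's documented RAG example. Passing use_dummy_dataset=True loads a small demo index so the snippet runs without downloading a full Wikipedia index:

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_ids = tokenizer("who holds the record in 100m freestyle", return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

The rest of this section builds the same pipeline by hand, which is what you would do with your own corpus and retriever.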
Here is a step-by-step guide to start applying RAG:
- Pick a pretrained model like RAG-Token or RAG-Sequence. These embed retriever and generator modules in one system.
- Prepare a corpus of documents to index as external knowledge. This can be Wikipedia, company docs, dialogue transcripts, etc.
- Index the corpus for fast nearest neighbor search. Dense vector indexes like Annoy or FAISS work well, or a search engine like Elasticsearch for sparse keyword retrieval.
- Tokenize queries/inputs and retrieve relevant documents from the index. HuggingFace integrations like RagRetriever make this simple.
- Concatenate or embed the retrieved documents with the original input. This conditions the generator.
- Feed the conditioned input to the model and generate text as usual!
While simple conceptually, additional care should be taken to:
- Select corpus relevant to the end domain.
- Optimize retrieval throughput and latency for production.
- Employ redundancy (e.g., retrieving more documents than strictly needed) to improve recall.
- Scale model size for longer generated texts.
Let’s walk through a Python example to make RAG concepts more concrete. We will build a simple QA system on Wikipedia data using a pre-trained RAG model from HuggingFace.
To start, we need a corpus. We will use Wikipedia as an example; you can always substitute a domain-specific corpus. We can install the HuggingFace Datasets library and load a pre-processed subset:
from datasets import load_dataset

# The config name below is an assumption; check the dataset card for available configs
wiki = load_dataset("wiki_snippets", "wiki40b_en_100_0", split="train")
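Before indexing, it is worth peeking at one record to confirm the schema; the passage_text field assumed below is specific to this config:

print(wiki[0].keys())                  # inspect available fields
print(wiki[0]["passage_text"][:200])   # preview the snippet text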
This provides titled Wikipedia snippets we can index for retrieval. The snippets come as raw text, so we first embed each passage with a pretrained DPR context encoder, then index the vectors with Annoy for approximate nearest neighbor search:
from annoy import AnnoyIndex
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
index = AnnoyIndex(768, "angular")  # DPR vectors are 768-dim; angular distance approximates cosine
for i, doc in enumerate(wiki):
    emb = ctx_enc(**ctx_tok(doc["passage_text"], return_tensors="pt", truncation=True)).pooler_output[0]
    index.add_item(i, emb.detach().numpy())
index.build(10)  # 10 trees
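In production (recall the throughput and latency point above), you would typically build the index offline and persist it. Annoy supports this directly:

index.save("wiki_snippets.ann")  # build offline, ship the file with the service

loaded = AnnoyIndex(768, "angular")
loaded.load("wiki_snippets.ann")  # memory-mapped, so loading is fast and shareable across processes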
Now we can efficiently find snippets relevant to an input question. We’ll embed the question with the matching DPR question encoder and find the 5 nearest snippets:
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question = "When was the Eiffel Tower built?"  # example query
query_embedding = q_enc(**q_tok(question, return_tensors="pt")).pooler_output[0].detach().numpy()
nearest_doc_ids = index.get_nns_by_vector(query_embedding, 5)
retrieved_docs = [wiki[i]["passage_text"] for i in nearest_doc_ids]
Our retriever module is complete! Now for the most important step.
We can now initialize a RAG model and condition it on retrieved_docs to generate an answer. Because we retrieved the documents ourselves rather than through HuggingFace’s built-in RagRetriever, we format each (document, question) pair manually and pass it via context_input_ids; the uniform doc_scores below are a simplification:
import torch
from transformers import RagTokenizer, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq")
# " // " is the document/question separator RAG's generator expects
context = tokenizer.generator([doc + " // " + question for doc in retrieved_docs], return_tensors="pt", padding=True, truncation=True)
doc_scores = torch.ones(1, len(retrieved_docs))  # uniform scores over the retrieved docs
gen_output = model.generate(context_input_ids=context.input_ids, context_attention_mask=context.attention_mask, doc_scores=doc_scores, n_docs=len(retrieved_docs))
print(tokenizer.batch_decode(gen_output, skip_special_tokens=True)[0])
And we have an end-to-end RAG system! With the modular HuggingFace integrations, it’s easy to prototype and iterate. There are also options to scale up for production.
This shows how RAG can enable QA systems in a few dozen lines of code. The retriever provides relevant grounding documents to improve downstream answer generation. There are many possibilities to build from here.
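As a convenience, the pieces above can be wrapped into a single helper. This sketch assumes the index, encoders, tokenizer, and model objects defined earlier:

def answer(question, k=5):
    # Embed the question, retrieve k snippets, and generate a grounded answer
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output[0].detach().numpy()
    docs = [wiki[i]["passage_text"] for i in index.get_nns_by_vector(q_emb, k)]
    context = tokenizer.generator([d + " // " + question for d in docs], return_tensors="pt", padding=True, truncation=True)
    output = model.generate(context_input_ids=context.input_ids, context_attention_mask=context.attention_mask, doc_scores=torch.ones(1, k), n_docs=k)
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

print(answer("When was the Eiffel Tower built?"))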
What to Look Out For in the Future
RAG remains an active research area as part of the broader paradigm of augmented language models. Here are some promising directions as RAG evolves:
- Multi-step reasoning – Current systems retrieve information just once. Future work may iterate retrieval and generation for multi-step inference (see the sketch after this list).
- Discrete reasoning – Retrieval currently surfaces full documents. Retrieving specific entities, facts, or structured data could improve precision.
- Low-resource domains – RAG could help transfer strong linguistic models to specialized domains with limited training data.
- Personalisation – Retrieval can pull personalized information like chat history, user profiles, or emails to contextualize generation.
- Mixed systems – Combining retrieval, generation, and reasoning modules provides an appealing path to more capable AI assistants.
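To make the first of these directions concrete, here is a rough sketch of what iterated retrieval and generation could look like. The retrieve, generate, and looks_final calls are all placeholders, and the stopping rule is a simplification:

# Hypothetical multi-step RAG loop: alternate retrieval and generation
def multi_step_answer(question, n_steps=3):
    context = question
    for _ in range(n_steps):
        docs = retrieve(context, k=3)        # re-query with the evolving context
        thought = generate(context, docs)    # intermediate reasoning step
        if looks_final(thought):             # placeholder stopping criterion
            return thought
        context = context + "\n" + thought   # fold the step back into the query
    return thought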
So, the key takeaways from this discussion are as follows:
- RAG improves generative AI through grounding in external knowledge. This mitigates risks like hallucination.
- Retriever and generator modules can be combined in a wide range of applications needing knowledge.
- Frameworks like HuggingFace make state-of-the-art RAG models easy to implement in just a few lines of code.
- Active research is improving RAG reasoning, personalization, and precision capabilities.
We encourage data scientists to experiment with RAG on new datasets and domains. The possibilities are vast to create AI systems that learn, reason, and communicate at an expert level.
Didn’t get what you were looking for? Please write to me in the comments and I will try to cover it.
This post is written by Syed Luqman, a data scientist from Sheffield, South Yorkshire, and Derbyshire, United Kingdom. Syed Luqman is an Oxford University alumnus and works as a data scientist for a local company. He has founded a company in the health sciences space to tackle the ever-rising staff management problems in the National Health Service (NHS). You can contact Syed Luqman on his Twitter and LinkedIn. Please also like and subscribe to my YouTube channel.