The most common application for retrieval-augmented generation (RAG) is answering questions from a large source, like a codebase, a book, or documentation. I made a Colab notebook that demonstrates both a basic RAG workflow and a different use case: customer service.
Customer service teams often receive repetitive questions that require near-identical answers. The classic remedy is an FAQ, but even with one in place, those near-identical questions keep arriving.
This seems like a promising application for RAG. If you have a database of past questions and answers, you can use RAG to map an incoming question to the most closely related past questions. Then you provide an LLM with those past questions and answers, as well as the new question, and a generic prompt with basic instructions. If the past questions are very similar to the new one and the past answers are all very similar to each other, it’s easy for an LLM to construct the same answer for the new question, adding whatever slight tweaks might be necessary.
Below is a tour of the notebook.
The basic idea
Here’s what the notebook does:
- Embed every historical ticket (subject + metadata) and store the embeddings in a vector database.
- When a new question arrives, embed it as well, then search the vector database to retrieve the top-k most similar tickets, plus their accepted answers.
- Feed those examples to an LLM with a prompt instructing it to mimic the answer style but tailor it to the new question.
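Stripped of the Gemini and Chroma specifics, the three steps wire together like this. The stubs below are illustrative stand-ins, not the notebook's code; the injected callables let you see the flow without any API calls:

```python
def answer_ticket(question, embed_fn, search_fn, generate_fn, k=3):
    """The three steps wired together: embed the new question,
    retrieve the k nearest past Q/A pairs, and ask the LLM to
    imitate them. All three callables are injected stand-ins."""
    qvec = embed_fn(question)
    examples = search_fn(qvec, k)  # list of (past_question, past_answer)
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    prompt = (
        "Answer the new question in the same style and length as the "
        "past answers below.\n\n"
        f"{context}\n\nNew question: {question}"
    )
    return generate_fn(prompt)

# Toy stand-ins for the Gemini embedder, the vector DB, and the LLM.
def stub_embed(text):
    return [float(len(text))]

def stub_search(vec, k):
    return [("Password reset fails", "Clear the token cache.")]

def stub_generate(prompt):
    return "Clear the token cache."

reply = answer_ticket("Can't reset my password",
                      stub_embed, stub_search, stub_generate)
```

In the notebook, the real embedder, Chroma query, and Gemini call slot into those three roles.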
The result is an answer that stays on-brand, respects any domain-specific quirks (version numbers, escalation rules, tone), and lands in seconds. Because the model is conditioned on real historical answers, it’s less likely to hallucinate.
Dataset & preprocessing
I started with a customer-support-ticket corpus on Kaggle (link). A Python script (process_tickets.py, included in the Colab notebook) filters for language == "en", narrows to the Product Support queue (for the sake of getting good results without needing to embed too many tickets—more on that in a moment), and rolls tag columns plus other metadata into a single key string. That composite key is what I embed.
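As a rough sketch of that preprocessing step (the field names here are illustrative assumptions, not the Kaggle corpus's exact schema):

```python
def build_key(ticket):
    """Roll subject, tags, and other metadata into one composite
    string for embedding. Field names are hypothetical; the real
    script maps the CSV's actual columns."""
    tags = " ".join(ticket.get("tags", []))
    return f"{ticket['subject']} | queue={ticket['queue']} | tags={tags}"

def preprocess(tickets):
    """Keep English Product Support tickets and attach the key."""
    kept = [t for t in tickets
            if t.get("language") == "en"
            and t.get("queue") == "Product Support"]
    for t in kept:
        t["key"] = build_key(t)
    return kept

tickets = [
    {"subject": "App crashes on start", "language": "en",
     "queue": "Product Support", "tags": ["crash", "v2.1"]},
    {"subject": "Rechnung fehlt", "language": "de",
     "queue": "Billing", "tags": []},
]
rows = preprocess(tickets)
```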
Embedding & vector DB setup
I used the Gemini API both to generate the embeddings and for the LLM that generates the final response. (You’ll need an API key if you want to run the notebook yourself, but it’s easy and quick to get a free one. The notebook links to instructions.) The notebook sleeps for a minute after every batch of 50 embeddings to avoid issues with rate limits in Gemini’s free tier. I used 2,500 tickets for the database in my example, so it took a bit over an hour to create the database. But that cost is paid once. After that, querying is instant.
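The batching-with-backoff logic is simple. Here's a sketch with a pluggable embed function and a configurable pause, so it can be exercised without real API calls (the notebook's own code may differ in detail):

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=50, pause_s=60):
    """Embed texts in batches, sleeping between batches to stay
    under the free tier's rate limit. embed_fn takes a list of
    strings and returns a list of vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
        if start + batch_size < len(texts):
            time.sleep(pause_s)  # no sleep needed after the final batch
    return vectors
```

With 2,500 tickets at 50 per batch, that's 49 one-minute pauses, which accounts for the hour-plus build time.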
The custom GeminiEmbeddingFunction() wraps models/embedding-001 with task_type="retrieval_document", which tells the API to produce embeddings meant for documents in a retrieval database. I used Chroma for the vector database. Each ticket row is stored with its vector, the human-readable subject as metadata, and the row ID as the primary key, which I use to pull the ground-truth answer later. Chroma handles persistence locally in Colab. (At this small scale, you could honestly just use a pandas dataframe, but Chroma is more scalable and also provides convenience functions for querying the database.)
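The storage scheme itself is small enough to mimic without Chroma. Here's a minimal in-memory stand-in (not Chroma's actual API) that keeps the same three pieces per row — vector, subject metadata, and row ID — and answers top-k queries by cosine similarity:

```python
import math

class MiniStore:
    """Toy stand-in for the Chroma collection: each row keeps its
    embedding, a human-readable subject, and the row ID used later
    to look up the ground-truth answer."""
    def __init__(self):
        self.rows = []  # (row_id, vector, subject)

    def add(self, row_id, vector, subject):
        self.rows.append((row_id, vector, subject))

    def query(self, query_vec, k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.rows,
                        key=lambda r: cos(query_vec, r[1]),
                        reverse=True)
        return [(row_id, subject) for row_id, _, subject in ranked[:k]]
```

Chroma's real collection API plays the same role, with persistence and metadata filtering on top.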
Prompt engineering ⟶ Gemini
Two helper functions keep the business logic tidy:
- get_relevant_passages() → returns a Markdown blob containing the k best historical question/answer pairs for a given query.
- make_prompt() → wraps those exemplars in instructions (“imitate style, similar length, ignore irrelevant refs if any”).
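A plausible shape for the two helpers — the notebook's actual code may differ, and the signatures here are simplified to take the retrieved pairs directly rather than running the Chroma query:

```python
def get_relevant_passages(pairs):
    """Format retrieved (question, answer) pairs as a Markdown blob.
    In the notebook the pairs come from the vector-DB query; here
    they are passed in directly for illustration."""
    return "\n\n".join(
        f"**Past question:** {q}\n**Past answer:** {a}" for q, a in pairs
    )

def make_prompt(new_question, passages):
    """Wrap the exemplars in instructions for the LLM."""
    return (
        "You answer customer-support tickets. Imitate the style and "
        "length of the past answers below, and ignore any references "
        "that are irrelevant to the new question.\n\n"
        f"{passages}\n\n"
        f"New question: {new_question}\nAnswer:"
    )

prompt = make_prompt(
    "Can't log in after update",
    get_relevant_passages(
        [("Login fails after update", "Reinstall and clear the cache.")]
    ),
)
```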
I then send the prompt to gemini-2.5-flash (which seems plenty good enough for this task in practice) and stream back the answer.
Does it work?
Yes—on ten held-out tickets, the top three neighbors were near-duplicates every time. Gemini’s responses reused the correct solution steps (reset tokens, warranty SKUs, whatever) and mirrored the terse, polite tone of the historical answers. Because the retrieved context is factual, hallucinations were effectively zero.
Check it out yourself
Check out the full notebook here: RAG example.ipynb. You can simply browse the results I got, or plug in your Gemini API key, point process_tickets.py at your CSV (as explained in the notebook), and run it yourself.
Next steps & ideas
This is just a proof-of-concept. In production you’d likely:
- Replace Chroma’s local store with a managed vector database.
- Swap in incremental updates so new completed tickets are embedded on ingest.
- Layer a policy checker/guardrail on top of Gemini’s output before sending it to customers.
- Log user feedback and possibly fine-tune a smaller model to further minimise latency/cost.
(Featured image generated with DALL·E 3.)
