The second part of this year's project is on retrieval-augmented generation (RAG), a technique that combines neural information retrieval with generative models to build virtual assistants for domain-specific applications. RAG was also presented in lecture 10.
You are required to go through the following steps.
Choose a domain of interest for your assistant; this can be anything you like, for example culinary art, rock music, etc. Find a large enough textual dataset on the web for your domain and clean up the data. Long documents must be split into chunks of approximately 512 tokens with a 10-15% overlap, to ensure context isn't lost at the boundaries. This creates your document repository.
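The chunking step can be sketched as a sliding window over the token sequence. This is a minimal sketch under two assumptions: whitespace splitting stands in for the tokenizer of your embedding model, and the overlap is set to 12.5% (64 tokens out of 512), which falls inside the 10-15% range. The function name `chunk_tokens` and its parameters are illustrative only.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512,
                 overlap_ratio: float = 0.125) -> List[List[str]]:
    """Split a token list into fixed-size chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 64 tokens for 512 and 12.5%
    stride = chunk_size - overlap              # how far the window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # this window reached the end
            break
    return chunks


# Assumption: whitespace tokens stand in for real model tokens.
document = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_tokens(document.split())
print([len(c) for c in chunks])
```

With a real tokenizer the chunk lengths should be measured in that tokenizer's tokens, since a whitespace word often maps to more than one model token.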
Choose a sentence embedding model to encode both the chunks in your repository and the user queries. Then choose a library for setting up your vector store, that is, a vector database combined with a retrieval model that ranks chunks by vector similarity to the input query and selects the N-best chunks. Suggested library for the vector store: faiss-cpu (Facebook AI Similarity Search); this is a lightweight library that works with NumPy/PyTorch tensors and is well suited to small-scale projects.
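To make the retrieval step concrete, here is a dependency-free sketch of N-best selection by cosine similarity. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `top_n` is an illustrative helper, not part of any library. In the actual project a vector store would do this search for you; for instance, an exact inner-product index such as `faiss.IndexFlatIP` over L2-normalized vectors computes the same ranking.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_n(query: Sequence[float], chunk_vectors: Sequence[Sequence[float]],
          n: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, cosine similarity) for the n most similar chunks."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(chunk_vectors):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy embeddings (hypothetical values, one vector per chunk).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.6, 0.0]
results = top_n(toy_query, toy_vectors, n=2)
print(results)
```

The returned indices point back into the chunk list, so the corresponding texts can be passed on to the prompt template.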
Create a template that takes the query text and the N-best chunks as input and produces a suitable prompt as output. Choose an LLM that has been instruction-tuned as a chatbot, and run inference on your prompt to obtain the answer to your query. Suggested libraries: transformers for the LLM and sentence_transformers for the sentence embeddings (both from Hugging Face).
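One possible shape for such a template is sketched below. The wording of `PROMPT_TEMPLATE`, the `domain` parameter, and the numbered-chunk layout are all illustrative assumptions; the right phrasing depends on the LLM you choose and should be tuned as part of your prompt engineering.

```python
from typing import List

# Hypothetical template text; adapt the instructions to your own model.
PROMPT_TEMPLATE = (
    "You are an assistant for questions about {domain}.\n"
    "Answer using only the context below. If the context does not contain "
    "the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)


def build_prompt(query: str, chunks: List[str], domain: str) -> str:
    """Fill the template with the query and the numbered N-best chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(
        domain=domain, context=context, query=query.strip()
    )


example = build_prompt(
    "Who wrote the song?",
    ["Chunk A ", "Chunk B"],
    domain="rock music",
)
print(example)
```

Many chat models also expect their own chat formatting around the prompt, so the string built here would typically become the user message rather than the raw model input.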
To evaluate your system, you need to construct a micro-test set as follows. Choose around 10 to 20 passages from your document repository, and extract question/answer pairs from each passage. This can be done by prompting a chatbot that is more powerful, in terms of number of parameters, than the one used in your RAG system. To evaluate your system on the micro-test set, use the same chatbot that extracted the question/answer pairs and prompt it to act as a judge: pass it the question, the gold answer, and the answer produced by your system, and ask it to provide a score in a given range. Then analyse possible errors and inaccuracies, and discuss possible ways of improving your system.
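The judging step could be organized around two small helpers: one that builds the judge prompt and one that extracts the numeric score from the reply. This is only a sketch; the 1 to 5 range, the `Score: <integer>` reply format, and the names `build_judge_prompt` and `parse_score` are assumptions chosen for illustration. The call to the judge chatbot itself is left out, since it depends on the service you use.

```python
import re
from typing import Optional

# Hypothetical judge instructions; the score range is an assumption.
JUDGE_TEMPLATE = (
    "You are grading the answer of a question answering system.\n"
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "System answer: {candidate}\n"
    "Rate how well the system answer matches the gold answer on a scale "
    "from {low} to {high}.\n"
    "Reply with a first line of the form 'Score: <integer>' followed by a "
    "one-sentence justification."
)


def build_judge_prompt(question: str, gold: str, candidate: str,
                       low: int = 1, high: int = 5) -> str:
    """Build the prompt sent to the judge chatbot."""
    return JUDGE_TEMPLATE.format(
        question=question, gold=gold, candidate=candidate, low=low, high=high
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score from the judge reply, or None if missing or invalid."""
    match = re.search(r"Score:\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


judge_prompt = build_judge_prompt(
    "Who wrote the song?", "The guitarist.", "It was written by the guitarist."
)
print(parse_score("Score: 4\nThe answer matches but adds no detail."))
```

Returning `None` for unparsable or out-of-range replies makes it easy to count how often the judge did not follow the requested format, which is itself worth reporting in the error analysis.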
If your LLM has 8B parameters or more, you need to use special libraries for model quantization. These libraries reduce the size of LLMs by lowering the precision of their weights. This is needed to fit the model in the amount of memory usually available in Google Colab. Suggested library for model quantization: bitsandbytes.
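A quick back-of-the-envelope calculation shows why lower precision helps. The sketch below estimates the memory taken by the weights alone as parameters times bits per weight; it deliberately ignores activations, the key/value cache, and the small overhead that quantization schemes add, so real usage will be somewhat higher. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1024 ** 3


# An 8B-parameter model at common precisions (weights only).
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(8e9, bits):.1f} GiB")
```

Under these simplifying assumptions, going from 16-bit to 4-bit weights cuts the footprint of the weights by a factor of four.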
Your project should be presented as a notebook and uploaded to the shared Google folder; see the 'Project registration' post in this forum. Use a file name prefixed with your family names. The notebook should be organized as follows.
Student names, badge numbers and master programs should be reported at the top of the notebook.
Domain and dataset should be described in a dedicated section. Dataset profiling (a summary of your dataset through descriptive statistics) is also required.
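As an illustration of what dataset profiling might include, the sketch below computes a few descriptive statistics with the standard library only. The choice of statistics, the whitespace tokenization (punctuation stays attached to words), and the toy corpus are assumptions; a real profile would likely add plots and domain-specific figures.

```python
import statistics
from collections import Counter
from typing import Dict, List


def profile(documents: List[str]) -> Dict[str, float]:
    """Summarize a corpus with simple length and vocabulary statistics."""
    if not documents:
        raise ValueError("the corpus is empty")
    # Assumption: whitespace tokens, lowercased, punctuation not stripped.
    lengths = [len(doc.split()) for doc in documents]
    vocabulary = Counter(
        word.lower() for doc in documents for word in doc.split()
    )
    return {
        "n_documents": len(documents),
        "total_tokens": sum(lengths),
        "mean_tokens": statistics.mean(lengths),
        "median_tokens": statistics.median(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "vocabulary_size": len(vocabulary),
    }


# Toy corpus, for illustration only.
toy_corpus = [
    "Simmer the sauce for ten minutes.",
    "Knead the dough.",
    "Rest the dough, then bake the bread until golden.",
]
stats = profile(toy_corpus)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The length statistics are also useful for deciding how many documents actually need to be split into chunks.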
Basic libraries for sentence embedding, vector store, LLM quantization, etc. should all be introduced and briefly documented.
The prompt engineering step should be carefully described in a dedicated section. Design choices should be discussed, including those that were discarded because of low-quality results.
If you use generative AI to write lines of your code or if you use solutions that have been inspired by similar projects publicly available on the web, you should add a proper acknowledgement section in your notebook and report the prompts you have used.
If you have any questions about the second part of the project, please post a message in this thread.