2023-IN2547-001PD-2022-INQ0091105-N0-IN2547: Building FAISS Index takes way too much time

Hi,

I am currently working on the second assignment and I am having trouble building the FAISS index for the document retrieval part.
Specifically, the problem is the time it takes to build the index. I am working with a data collection with the following statistics (after applying RecursiveCharacterTextSplitter):

Total number of words: 116472266
Number of documents: 466443

From what I have seen online, this is not even a big dataset (<500k documents), but it takes a very long time to build. Because of timeout errors in Colab, I had to switch to the DEI cluster to get enough runtime, and after 25 hours it is still building the index.

I am using faiss-cpu (faiss-gpu didn't show any significant improvement) and invoking it quite literally as done here.
Am I doing something wrong? Perhaps the dataset is too large? What am I missing?

I understand that the build time depends on how much data I am feeding to FAISS, but from what I have read online I did not expect it to take more than a day to compute. If anyone could help me, I would really appreciate it!

Re: Building FAISS Index takes way too much time

by Giorgio Satta - Tuesday, 14 May 2024, 11:26 PM

Pietro, I think the problem here is the size of the dataset. As mentioned in class, for this type of project it is recommended to use a much smaller collection of a few thousand documents. To do this, it is perfectly OK to narrow down to a very specific topic. As an example, if you are working on travel, narrow down to just one city or district.
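
A minimal sketch of that narrowing step, assuming the documents are plain strings held in memory. The helper name `narrow_collection`, the example keyword, the sample documents, and the default cap of 3000 are hypothetical and only illustrate the idea of filtering by topic before building the index.

```python
from typing import Iterable, List


def narrow_collection(
    docs: Iterable[str], keywords: List[str], max_docs: int = 3000
) -> List[str]:
    """Keep documents that mention at least one keyword, up to max_docs."""
    lowered = [k.lower() for k in keywords]
    selected: List[str] = []
    for doc in docs:
        text = doc.lower()
        # Case-insensitive substring match against every keyword.
        if any(k in text for k in lowered):
            selected.append(doc)
            if len(selected) >= max_docs:
                break  # Stop early once the cap is reached.
    return selected


# Hypothetical sample collection about travel.
docs = [
    "Best cafes in Padua near the university",
    "Hiking trails in the Dolomites",
    "Padua: visiting the Scrovegni Chapel",
    "Street food in Bangkok",
]
subset = narrow_collection(docs, keywords=["padua"], max_docs=2)
print(len(subset), "documents kept")
```

Only the resulting subset would then be split into chunks and passed to the embedding model and FAISS.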

Re: Building FAISS Index takes way too much time

by PIETRO GIROTTO - Sunday, 19 May 2024, 3:39 PM

Thank you!
I have narrowed down the dataset, and this helped a lot.
Furthermore, I would also suggest choosing the embedding model for the FAISS index carefully: I was using a 768-dimensional one, and switching to 368 dimensions greatly reduced computation time and gave even better results.
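
To give a rough idea of why the embedding dimension matters, here is a small sketch that estimates the raw vector storage for the original collection at both dimensions. It assumes an uncompressed flat index that stores every vector as float32 (4 bytes per component); the function name `flat_index_bytes` is hypothetical. It does not model the time the embedding model itself needs, which is a separate cost.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for num_vectors float32 vectors of dimension dim."""
    return num_vectors * dim * bytes_per_float


# Number of chunks reported in the original post.
n = 466_443

for dim in (768, 368):
    size = flat_index_bytes(n, dim)
    print(f"dim={dim}: {size / 2**30:.2f} GiB of raw vectors")
```

Under this assumption the storage grows linearly with the dimension, and so does the cost of each exhaustive distance computation at search time.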