Isn’t it basically a traditional search (keyword-based, vector-based, or a combination of both; embeddings have been around for years) where you take the top N results (usually not even full docs, but chunks, due to context-length limitations) and pass them to an LLM to regurgitate a response (hopefully without hallucinations), instead of simply listing the results right away? I think some implementations also ask the LLM to rewrite the user query to “capture the user intent”.
What am I missing here? What makes it so useful?
ru552 108 days ago [-]
> What makes it so useful?
One example is in finance: you have a lot of 45-page PDFs lying around and you're pretty sure one of them has the regulation or info you need. You aren't sure which, so you open them one by one and search for a word, then jump through a bunch of those results and decide it's not this PDF. You do that till you find the "one". There's a non-trivial number of executive-level jobs where this is pretty much half the work week.
RAG purports to let you search one time.
jumploops 108 days ago [-]
This is true for traditional full-text document search as well.
When most people mention RAG, they’re using a vector store to surface results that are semantically similar to the user’s query (the retrieval part). They then pass these results to an LLM for summary (the generation part).
In practice, the problems with RAG are similar to the traditional problems of search: indices, latency, and correctness.
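A minimal sketch of that retrieve-then-generate loop, with `embed()` and `chat()` as placeholders for whatever embedding model and LLM client you actually use (nothing here is a specific library's API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # placeholder: your embedding model

def chat(prompt: str) -> str:
    raise NotImplementedError  # placeholder: your LLM call

def build_index(chunks: list[str]) -> np.ndarray:
    # Indexing: embed every chunk once, up front, and unit-normalize
    # so cosine similarity reduces to a dot product.
    vecs = np.stack([embed(c) for c in chunks])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def answer(query: str, chunks: list[str], index: np.ndarray, top_n: int = 5) -> str:
    q = embed(query)
    q = q / np.linalg.norm(q)
    top = np.argsort(index @ q)[::-1][:top_n]   # retrieval: top-N nearest chunks
    context = "\n\n".join(chunks[i] for i in top)
    # Generation: the LLM answers grounded in the retrieved chunks.
    return chat(f"Using only the context below, answer the question.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}")
```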
traverseda 108 days ago [-]
* indices
Doesn't vector search solve a lot of these problems? These AI vector spaces seem like a really easy win here, and they're reasonably lightweight compared to a full LLM.
* Latency
I don't want to call this a solved problem, but it is one that scales horizontally very easily and that a lot of existing tech can take advantage of.
* Correctness
The LLM tooling doesn't necessarily need to make things worse here, although, poorly designed, it definitely could. AI can do a first pass at fact checking, even though I suspect we'll need humans in the loop for a long while.
---
I think that vector spaces at least bring some big advantages for indexing here: being able to search for more abstract concepts.
jumploops 108 days ago [-]
* indices
> Doesn't vector search solve a lot of these problems? These AI vector spaces seem like a really easy win here, and they're reasonably lightweight compared to a full LLM.
Yes and no. What do you vectorize? The whole document? The whole page? The whole paragraph? How you split your data, and then index into it, is still problem-space dependent (see the chunking sketch after this comment).
* Latency
> I don't want to call this a solved problem, but it is one that scales horizontally very easily and that a lot of existing tech can take advantage of.
Any time you add steps, you increase latency. This is similar to traditional search, where you e.g. need to fetch relevant data but score it based on some user-specific metric. Every lookup adds latency. The same is true for RAG.
* Correctness
> The LLM tooling doesn't necessarily need to make things worse here, although, poorly designed, it definitely could. AI can do a first pass at fact checking, even though I suspect we'll need humans in the loop for a long while.
Again, this comes back to how you index your data and what results are returned; similar to traditional search. This is problem-space dependent. Plus, we haven't solved LLM hallucinations -- there are strategies to mitigate them, but no clear-cut solution.
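To make the splitting question concrete, here is one of many possible strategies: a greedy paragraph-level chunker with character overlap. It's a sketch, not the "right" answer; the max size and overlap are exactly the knobs that end up problem-space dependent:

```python
def chunk(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Greedy paragraph-level chunking with a character-overlap tail.

    Splitting on paragraphs keeps semantic units intact; the overlap
    reduces the chance an answer straddles a chunk boundary.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paras:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a tail into the next chunk
        current = (current + "\n\n" + p).strip()
    if current:
        chunks.append(current)
    return chunks
```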
cpursley 108 days ago [-]
Any tips on effectively getting financial data out of PDFs into a RAG system (especially data contained in tables)? And locally, not via proprietary cloud PDF parsing thingy. That's the current nut I'm trying to crack.
I’m probably missing the point: doesn’t https://pdfgrep.org solve this problem?
soneca 108 days ago [-]
What if they don’t remember the regulation code?
”What is the regulation that covers M&A of companies in the pharmaceutical industry?”
It seems much easier to get that response from an LLM than by searching words with grep.
rawsh 108 days ago [-]
I built a web version with WASM at https://pdfgrep.com a few years ago in case it’s helpful to anyone
ww520 108 days ago [-]
RAG is not just traditional search. It's any augmented data that can be fed to the LLM.
The most useful and verifiable RAG setup I've seen is hooking up an RDBMS to an LLM and asking questions in English to retrieve table data. You can do it in several steps.
1. Extract the metadata of the tables, e.g. table names, columns of each table, related columns of the tables, indexed columns, etc. This is your RAG data.
2. Build the RAG context with the metadata, i.e. listing each table, its columns, relationships, etc.
3. Feed the RAG context and the user's question to the LLM. Tell the LLM to generate SQL for the question given the RAG context.
4. Run the SQL query on the database.
It's uncannily good. And it can be easily verified given the SQL.
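A minimal sketch of those four steps against SQLite (the `chat()` helper is a placeholder for whatever LLM call you use, not a real API):

```python
import sqlite3

def chat(prompt: str) -> str:
    raise NotImplementedError  # placeholder: your LLM call

def schema_context(conn: sqlite3.Connection) -> str:
    # Steps 1-2: extract table metadata and format it as the RAG context.
    # SQLite keeps the CREATE TABLE statements in sqlite_master.
    rows = conn.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    return "\n".join(r[0] for r in rows if r[0])

def ask(conn: sqlite3.Connection, question: str):
    # Step 3: schema context + question in, SQL out.
    sql = chat(
        f"Schema:\n{schema_context(conn)}\n\n"
        f"Write one SQL SELECT statement that answers: {question}\n"
        "Return only the SQL."
    )
    # Step 4: run it. The generated SQL is plain text, so it's easy
    # to inspect and verify before or after execution.
    return sql, conn.execute(sql).fetchall()
```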
Xenoamorphous 107 days ago [-]
Is that RAG though? Perhaps I’m missing something but I don’t see where the retrieval step is. Extracting the metadata and passing it to the LLM in the context sounds like a non-RAG LLM application. Or you’re saying that the DB schema is so big and/or the LLM context too small so not all the metadata can be passed in one go and there’s some search step to prune the number of tables?
ww520 107 days ago [-]
RAG is augmenting the LLM generation with external data. How the external data is retrieved is irrelevant; a search is not necessary.
Of course, you can do a search on the related tables with regard to the question to narrow down the table list, to help the LLM come up with the correct answer.
simonw 108 days ago [-]
That's exactly what it is, and it's useful because when it works it means you can ask a question and get an answer to your question, rather than having to read the documents and then answer that question yourself.
lukev 108 days ago [-]
It also lets a language model answer questions while citing a source, something it fundamentally cannot do on its own.
Everyone talks about "reducing hallucinations" but from a system perspective, everything a LLM emits is equally hallucinated.
Putting the relevant data in context gets around this and provides actual provenance of information, something that is absolutely required for real "knowledge" and which we often take for granted in practice.
Of course, the ability to do so is entirely reliant on the retrieval's search quality. Tradeoffs abound. But with enough clever tricks it does seem possible to take advantage of both the LLM's broad but unsubstantiated content and specific fact claims.
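One common way to get that provenance is to number the retrieved chunks in the prompt and ask the model to cite them -- a sketch, with `chat()` as a placeholder LLM call:

```python
def chat(prompt: str) -> str:
    raise NotImplementedError  # placeholder: your LLM call

def answer_with_citations(question: str, chunks: list[dict]) -> str:
    # Each retrieved chunk carries its source; the model cites by number,
    # so every claim in the answer can be traced back to a document.
    numbered = "\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return chat(
        "Answer using only the sources below, citing them inline as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )
```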
esafak 108 days ago [-]
You just described RAG: augmenting an LLM with external memory. Perhaps the part you are skipping is that the LLM synthesizes the retrieved information with its own knowledge into one coherent whole.
It's abstractive (new) versus extractive (old) summarization.
What makes it useful is that it does the work of synthesizing the information. Imagine you ask a question that involves bits and pieces of numerous articles. In the past you had to read them all and mentally synthesize them.
thefourthchime 108 days ago [-]
I've used something like RAG for finding solutions to questions in Slack. I take the question, break it into searchable terms, search Slack, and get a haystack of results. Then I use an LLM to figure out if the results are relevant; finally, I take the top 10 results, summarize them, and link back to the Slack discussion.
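A sketch of that pipeline shape; `chat()` and `slack_search()` are hypothetical placeholders, not Slack's actual API:

```python
def chat(prompt: str) -> str:
    raise NotImplementedError  # placeholder: your LLM call

def slack_search(term: str) -> list[dict]:
    raise NotImplementedError  # placeholder: returns {"text", "permalink"} dicts

def answer_from_slack(question: str) -> str:
    # 1. Break the question into searchable terms.
    terms = chat(f"List 3-5 comma-separated search keywords for: {question}")
    # 2. Search Slack: a noisy haystack of candidate messages.
    hits = [m for t in terms.split(",") for m in slack_search(t.strip())]
    # 3. LLM pass to filter the haystack down to relevant messages.
    relevant = [
        m for m in hits
        if "yes" in chat(f"Relevant to '{question}'? yes/no:\n{m['text']}").lower()
    ]
    # 4. Summarize the top 10 and link back to the discussions.
    top = relevant[:10]
    summary = chat(
        f"Summarize these messages as an answer to: {question}\n\n"
        + "\n---\n".join(m["text"] for m in top)
    )
    return summary + "\n\nSources:\n" + "\n".join(m["permalink"] for m in top)
```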
dragonwriter 108 days ago [-]
The intent is usually not to simply regurgitate the results, but to augment the prompt with them to enable a better, focussed answer to the user question than either search or an LLM alone would provide.
ingvar77 106 days ago [-]
The buzz is because it is really one of the most widely used new AI things, easily applicable to millions of businesses. Everyone has some large store of unstructured data they want to search through and ask questions about: legal docs, candidates, books, articles... At the same time it's relatively straightforward to implement, so there are already tens or hundreds of startups/products pushing the RAG agenda (all those "it seems easy but it's not!" pitches). Hopefully soon it will be added as a built-in LLM feature: the ability to upload your own data for the LLM to use. It has also made many more developers aware of embeddings and vector search, which is great.
oriel 106 days ago [-]
I'm still building my understanding in this space, but so far I've seen its value when using chains and graphs of agents.
The overall system suggests degrees of freedom in search that might not have been available. This is by having a knowledge store in a format (vectors) primed for search, then having it be accessible in full or in partitions, by agents, working on one or more concurrent flows around a query.
I also see value in having a full circuit of native-format components that can be pieced together to make higher-order constructs. Agents are just the most recent one to emerge, and I can easily see a mixture of fine-tuned experts alongside stores of relevant material.
/2c
jxnlco 108 days ago [-]
nothing, all i really say is 'add monitoring, do topic clustering'
which is how i did 'search' and 'recommendation' systems
1) are there filters we need to build
2) do we have inventory
nutanc 108 days ago [-]
It's useful because you get to increase your startup valuation if you use "RAG".
rldjbpin 108 days ago [-]
to me it feels like people are waking up to the fact that with current access to sw/hw, you can now make your own search engine and answering tool based on the data you own.
7thpower 108 days ago [-]
This is a great intro. I am amazed how many people don’t use the LLMs to analyze the questions themselves and apply filters to avoid pulling back irrelevant documents in the first place.
We run as many methods as practical in parallel (sql, vector, full text, other methods, etc.) and return the first one that meets our threshold. Vector search is almost never the winner relative to full text.
Instead, I see a lot of people in sister companies using the most robust models they can find and having agents to do chain of thought, while their users are wondering when, if ever, they’ll get a response back.
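A sketch of that parallel, first-past-threshold retrieval described above, using asyncio; the three search functions are placeholders for real backends, each assumed to return a (score, results) pair:

```python
import asyncio

async def sql_search(q: str):      raise NotImplementedError  # placeholder
async def vector_search(q: str):   raise NotImplementedError  # placeholder
async def fulltext_search(q: str): raise NotImplementedError  # placeholder

async def retrieve(query: str, threshold: float = 0.8):
    # Fire every retrieval method concurrently.
    tasks = [asyncio.ensure_future(f(query))
             for f in (sql_search, vector_search, fulltext_search)]
    # Return the first result set that clears the quality threshold.
    for fut in asyncio.as_completed(tasks):
        score, results = await fut
        if score >= threshold:
            for t in tasks:
                t.cancel()  # stop the slower methods
            return results
    return []  # nothing met the bar
```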
schmidt_fifty 108 days ago [-]
> Vector search is almost never the winner relative to full text.
Full text search is certainly the winner in the time dimension, but can it compete in quality? Presumably which method is likely to provide relevant results depends greatly on the query. Invoking LLMs to pre-process the query and select a retrieval method is going to be quite expensive compared to each of the search methods.
7thpower 108 days ago [-]
I mean from a retrieval quality perspective, not a latency perspective. Search latency is not a constraint because the long pole in the tent for us is always the user facing model.
We also have a lot of numbers in our customer requests, which do not typically play to the strengths of the vector searches.
COGS is not a large concern, as our audience is internal along with a few of our partners, so inference and infrastructure costs are nothing compared to engineering time, as we don't have a way to amortize our costs across a bunch of customers.
It is also a very high value use case for us.
The other factor is that we’re using fast and cheap models like Haiku and Mixtral to do the preprocessing before we hand things to the retrieval steps, so it’s not much of a cost driver.
treprinum 108 days ago [-]
We are optimizing for latency, and vector search is sufficient in 80-90% of cases; 0.6s is about the threshold for an acceptable end-user experience. Hybrid search with SPLADE is marginally better, but it limits the number of human languages we can use. I am wondering when full-text search is better than vector search, outside of very specific keywords.
7thpower 108 days ago [-]
Latency of search isn’t much of a concern, I was speaking to quality but did not word it well.
We have just found that vector search does not play well with numbers and does not provide consistent results, so we end up needing more chunks, which compounds token usage, slows responses, and raises the chances of incorrect responses due to the customer-facing model getting confused by similar results. I'm sure we could optimize our approach, but full text has worked far more reliably than expected, so we have invested more resources into how we handle documents, latency reduction, and pulling in structured data.
cpursley 108 days ago [-]
This sounds really interesting. Do you have any longer-form writeup on this approach (or could you point us towards related info)?
7thpower 108 days ago [-]
I do not but my twitter handle is in my profile and I am always more than happy to hop on a call and share what I know.
For reference our subject matter is engineering specs for high precision electronics manufacturing. We have ~100k products and a lot of them have identical documentation except for a few figures (which make all the difference in the world), so it’s a challenging use case that is very unforgiving. Totally doable though and the basis for a lot of capabilities we’ll be investing in moving forward.
Happy to share, as I think we're ahead in a few areas but believe others will catch up, and we've learned so much from others willing to share info, so we always try to pay it forward.
danenania 108 days ago [-]
This all seems pretty sensible. Another area that would be nice to see addressed are strategies for balancing latency/cost/performance when data is frequently updated. I'm building a terminal-based AI coding tool[1] and have been thinking about how to bring RAG into the picture, as it clearly could add value, but the tradeoffs are tricky to get right.
The options, as far as I can tell, are:
- Re-embed lazily as needed at prompt-time. This should be the cheapest as it minimizes the number of embedding calls, but it's the most expensive in terms of latency.
- Re-embed eagerly after updates (perhaps with some delay and throttling to avoid rapid-fire embed calls). Great for latency, but can get very expensive.
- Some combination of the above two options. This seems to be what many IDE-based AI tools like GH Copilot are doing. An issue with this approach is that it's hard to ever know for sure what's updated in the RAG index and what's stale, and what exactly is getting added to context at any given time.
I'm leaning toward the first option (lazy on-demand embedding) and letting the user decide whether the latency cost is worth it for their task vs. just manually selecting the exact context they want to load.
1 - https://github.com/plandex-ai/plandex
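For the lazy option, a content-hash check keeps prompt-time embedding work proportional to what actually changed. A sketch, with `embed()` as a placeholder:

```python
import hashlib

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder: your embedding call

# path -> (content hash, cached vector)
_cache: dict[str, tuple[str, list[float]]] = {}

def vector_for(path: str, content: str) -> list[float]:
    # Re-embed a file at prompt time only if its content changed.
    digest = hashlib.sha256(content.encode()).hexdigest()
    cached = _cache.get(path)
    if cached and cached[0] == digest:
        return cached[1]          # unchanged: no embedding call, no latency
    vec = embed(content)          # changed (or new): lazy re-embed now
    _cache[path] = (digest, vec)
    return vec
```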
I've been using this as a starter: https://developers.cloudflare.com/workers-ai/tutorials/build... I put in text, but I feel like my conception of what should get high relevancy scores doesn't match the percentages that come out.
The article talks about full-text search and metadata, so maybe that's the path I should be taking instead of vector search? Where would I store the metadata in this case? A regular DB?
I wish articles like this would go into more details about the nitty gritty. But I appreciate high level overview in the article as well.
PheonixPharts 108 days ago [-]
Once you have vector representations the "similarity" scores are just basic linear algebra. It's fundamentally no different than any other IR/recsys task.
A good overview is chapter 6 of the Stanford NLP group's IR book [0].
Engineering LLMs still requires a good foundation in the basics of ML/NLP, so it's worth the time to catch up a bit.
0. https://web.archive.org/web/20231207074155/https://nlp.stanf...
I'd recommend taking a look at lancedb as they support text, vectors, and SQL.
High relevancy scores are not percentages; they only make sense for ordering. A 0.7 does not mean "relevant", but 0.9 vs. 0.7 means "maybe more relevant".
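To make the ordering point concrete, a quick check with raw cosine scores (the numbers are purely illustrative):

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [0.2, 0.9, 0.1]
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.2]}

# The ranking is meaningful; the absolute scores are not percentages.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b'] -- doc_a is *more* similar, not "70% relevant"
```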
Vectorize's experiments platform generates synthetic questions and tests different embedding models, chunking strategies, etc. You end up with clear data that shows you what will give you the optimal results for your RAG app: https://platform.vectorize.io/public/experiments/ca60ce85-26...
An implementation: github.com/infiniflow/ragflow
I’m always suspicious of low-signal articles written about LLM-based systems, as I suspect that the crowd involved is very trigger-happy about using an LLM to write human-facing text.
Maybe not what’s happening in this case, but it’s what springs to mind.
minimaxir 108 days ago [-]
The author is legitimately knowledgeable about LLM/RAG systems and develops open-source tooling around it.
But yes, this isn't a good HN submission without detail.
jxnlco 108 days ago [-]
how can i make it better?
this was a quick post that i wrote up after a 30 minute call with someone, mostly notes to take in preparation for a bigger talk im giving.
minimaxir 108 days ago [-]
Bullet points in general aren't good for HN discussion even if it's on an interesting/complex topic. Examples are good.
A talk + written summary/transcript of the talk when it's made would be much better.
jxnlco 108 days ago [-]
I'll try to add more details!
yumraj 108 days ago [-]
> this was a quick post that i wrote up after a 30 minute call with someone
Then you shouldn't have posted it on HN until there was more meat. Notes from a 30min call are OK for personal consumption, but not mass sharing as it is not useful in general and devalues personal brand. My 2 cents..
groby_b 108 days ago [-]
fwiw, I found this high value. I'm currently in the space of "conceptually, what do I need to think about in this space", and this was a great set of notes for me.
Awesome if the author wants to flesh it out further, but sometimes raw knowledge is more than enough.
all based on consulting calls and advisory work (i sell the implementation so sorry i don't post one click deploys)
groby_b 108 days ago [-]
Oooh, nice - thank you for sharing those! (At least from my POV, that's already a boatload of value you're giving away for free!)
timack 108 days ago [-]
This comment seems a bit harsh.
My 2 cents.
jxnlco 108 days ago [-]
its ok, im on the front page, the market has decided the worth
timack 108 days ago [-]
The invisible hand of collective interest!
ofermend 108 days ago [-]
Building RAG can be easy for a simple example, but it's much more nuanced than you might think when you try to do it at larger scale.
With larger-scale, real-world enterprise RAG-based applications, you soon realize the enormous time and effort required to experiment with all these levers to optimize the RAG pipeline: which vector DB to use and how, which embedding model to use, pure vector search or hybrid search, chunking strategies, and on and on...
With Vectara's RAG-as-a-service (www.vectara.com) we try to help address exactly this issue: you get an optimized, high performance, secure and scalable RAG pipeline, so you don't need to go through this massive hyper-parameter tuning exercise. Yes, there are still some very useful levers you can experiment with, but only where it really matters.