Running an LLM query through a GPU is very high latency
How much better would this be on a SoC with unified memory, like Apple Silicon? Or would the latency advantage, if any, on such systems come from the limited size of LLMs that such systems can run?
Add to that:
Compare the cost of using an LLM to the cost of using a rule-based system. One of the examples is looking up the capital of Delaware. A database of capitals plus a tiny bit of parsing to extract questions like that is far cheaper than any of their proposed approaches. If you don’t need to handle arbitrary unstructured text, there may be much simpler ways of solving the problem. If you do need to handle arbitrary unstructured text, you may still want to look at the distribution of queries. For example, if you’re providing a news service, queries like ‘what are the results of {latest sports event / election}’ are likely to account for a significant fraction of your queries at different times, and can be handled far more cheaply by some quick parsing and a lookup than by an LLM. In contrast, more open-ended queries that require summarising data from multiple sources are going to be much harder to handle in a rule-based system, to the point where an LLM may be vastly cheaper.
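To make the "quick parsing and lookup" idea concrete, here is a minimal sketch of a rule-based fast path that falls back to a model only when the rules don't match. Everything in it is an assumption for illustration: the `CAPITALS` table, the regex, and `call_llm` (a placeholder, not a real API) are all hypothetical.

```python
import re

# Hypothetical lookup table; a real one would cover every state or country.
CAPITALS = {"delaware": "Dover", "texas": "Austin", "france": "Paris"}

# Matches questions such as "What is the capital of Delaware?"
CAPITAL_QUESTION = re.compile(
    r"^\s*what(?:'s| is) the capital of ([a-z .'-]+?)\s*\??\s*$",
    re.IGNORECASE,
)


def call_llm(query: str) -> str:
    """Placeholder for the expensive path; stands in for a real model call."""
    return f"<LLM answer for: {query}>"


def answer(query: str) -> tuple[str, str]:
    """Return (route, answer), trying the cheap lookup before the LLM."""
    match = CAPITAL_QUESTION.match(query)
    if match:
        place = match.group(1).strip().lower()
        if place in CAPITALS:
            return "lookup", CAPITALS[place]
    # Anything the rules cannot handle goes to the model.
    return "llm", call_llm(query)


if __name__ == "__main__":
    print(answer("What is the capital of Delaware?"))
    print(answer("Summarise today's election coverage from three sources"))
```

Logging the returned route per query would also give you the query distribution, which tells you how much traffic the cheap path actually absorbs.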
Sounds like many of these numbers might be outdated in years, if not months :)
All the more reason to start tracking it now!