Have you ever asked an LLM a complex question and received a disappointing answer? This is a common shortcoming of LLMs, which often struggle to reason through multi-step problems.
In this post, I'll show you a couple of techniques for improving an LLM's performance on reasoning tasks, alongside some interactive demos.
The Problem
Let's look at an example I've cherry-picked from GSM8K, a collection of math word problems which are used to test a model's arithmetic reasoning capabilities.
The correct response here is 52 - (21 + 4) = 27. At the time I wrote this, Llama 3.1 8b returned 35, an incorrect answer. To solve this class of problem more reliably, we need to make our model smarter.
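If you'd like to reproduce the baseline outside of the interactive demo, here's a minimal sketch against an OpenAI-compatible chat endpoint. The word problem below is a stand-in I made up with the same shape as the demo's (52 - (21 + 4) = 27), and the base URL and model name are placeholders, not Mixlayer specifics.

```ts
import OpenAI from "openai";

// Any OpenAI-compatible endpoint serving Llama 3.1 8b will do;
// the base URL and model name here are placeholders.
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "not-needed-for-a-local-server",
});

// A stand-in problem with the same arithmetic shape as the demo's.
const question =
  "Maria baked 52 cookies. She gave 21 to her neighbor and ate 4 herself. " +
  "How many cookies does she have left?";

const response = await client.chat.completions.create({
  model: "llama-3.1-8b-instruct",
  messages: [{ role: "user", content: question }],
});

// Asked zero-shot like this, a small model will often get it wrong.
console.log(response.choices[0].message.content);
```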
The most obvious way is to use a bigger model. A model with more parameters will typically produce better answers, but that comes with the downsides of higher cost and slower execution (all else being equal).
Let's see how the big brother to Llama 3.1 8b, Llama 3.3 70b, performs:
It successfully solved the problem. Not surprising given that it's almost nine times larger! But what if we could make our smaller model smarter?
The approaches we'll introduce today allow you to get the same answer from a smaller model, more quickly, and for a fraction of the cost.
In-Context Learning
One way to improve the model's reasoning on our problem is to show it a few examples of how to solve similar problems.
27! Excellent!
This technique is commonly referred to as "few-shot prompting", a form of "in-context learning". It allows you to provide the model with some examples of how to solve a problem, which it can then use to generalize to new, similar problems.
One thing you may have noticed is that the examples we gave the model had different names and operands from the word problem we asked it to solve. Despite these differences, the model was able to figure out the "shape" of the problem and generalize to the new one. Pretty remarkable for such a small model!
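In code, few-shot prompting is just a matter of prepending worked examples to the conversation before the real question. Here's a minimal sketch under the same assumptions as before: the endpoint, model name, and example problems are all made up for illustration.

```ts
import OpenAI from "openai";

// Placeholder endpoint and model name; any OpenAI-compatible server
// hosting Llama 3.1 8b should behave similarly.
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "not-needed-for-a-local-server",
});

const response = await client.chat.completions.create({
  model: "llama-3.1-8b-instruct",
  messages: [
    // Two worked examples with different names and operands than the target problem.
    { role: "user", content: "Tom has 40 marbles. He gives 12 to Ana and 5 to Ben. How many marbles does he have left?" },
    { role: "assistant", content: "40 - (12 + 5) = 23" },
    { role: "user", content: "A baker made 75 rolls and sold 30 in the morning and 20 in the afternoon. How many rolls are left?" },
    { role: "assistant", content: "75 - (30 + 20) = 25" },
    // The real question comes last; the model infers the expected answer format.
    { role: "user", content: "Maria baked 52 cookies. She gave 21 to her neighbor and ate 4 herself. How many cookies does she have left?" },
  ],
});

console.log(response.choices[0].message.content); // expect something like "52 - (21 + 4) = 27"
```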
Chain of Thought
We've demonstrated that with few-shot prompting we can give the model more input to help it reason through a problem. But what if it could also get better by producing a specific type of output?
This is the idea behind Chain of Thought (CoT), a technique originally described in a paper by Google's Brain team. CoT enhances an LLM's reasoning capabilities by encouraging step-by-step problem-solving in its responses.
Let's see it in action:
By guiding the model to break the problem down into a series of steps prior to responding, we were able to elicit a correct answer!
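Outside the demo, the simplest zero-shot variant of CoT is just one extra instruction in the prompt. A sketch under the same assumptions as before (placeholder endpoint, model name, and word problem):

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1", // placeholder OpenAI-compatible endpoint
  apiKey: "not-needed-for-a-local-server",
});

const question =
  "Maria baked 52 cookies. She gave 21 to her neighbor and ate 4 herself. " +
  "How many cookies does she have left?";

const response = await client.chat.completions.create({
  model: "llama-3.1-8b-instruct", // placeholder model name
  messages: [
    {
      role: "system",
      // The key nudge: ask for intermediate steps before the final answer.
      content:
        "Solve the word problem. Think through it step by step, then give " +
        "the final answer on its own line prefixed with 'Answer:'.",
    },
    { role: "user", content: question },
  ],
});

// The response should now contain the reasoning steps followed by the answer.
console.log(response.choices[0].message.content);
```

The exact wording of the instruction matters less than the nudge to produce intermediate steps before committing to an answer.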
Discussion
Coaxing LLMs into reasoning more is an active area of research. This is part of a broader type of model scaling known as test-time compute. Several prominent AI labs have released models (e.g., OpenAI's o1 and Alibaba's QwQ) that are designed to trade additional inference time for higher-quality answers. Models like these are also trained to perform better when prompted to produce reasoning chains, so they combine several of these methods to achieve their superior results.
Hidden Tokens
Sometimes the end-user experience of interacting with a reasoning model hides the intermediate steps and only shows the final answer. With Mixlayer, you can choose to hide these steps as well. Add { hidden: true } to your gen or prompt parameters to hide any tokens you don't want to appear in the output.
Here's how it works:
In this example, you'll see the model receive the initial word problem followed by a period of "thinking". At the end of the thinking period, the model should output its answer. You can use this technique to hide your reasoning logic behind an API, for example.
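To make this concrete, here's a rough TypeScript sketch of a handler that hides the reasoning tokens. The gen and prompt calls and the { hidden: true } option come from the description above; everything else (the handler shape, parameter types, and the thinking instruction) is my assumption rather than the documented Mixlayer API, so treat it as a sketch and check the Mixlayer docs for the real signatures.

```ts
// Sketch only: the post confirms gen and prompt accept { hidden: true },
// but these type signatures are assumptions, not the documented Mixlayer API.
type Prompt = (text: string, opts?: { hidden?: boolean }) => Promise<void>;
type Gen = (opts?: { hidden?: boolean }) => Promise<string>;

export async function solve({ prompt, gen }: { prompt: Prompt; gen: Gen }) {
  // Feed the word problem, then let the model "think" without exposing those tokens.
  await prompt("Maria baked 52 cookies. She gave 21 to her neighbor and ate 4 herself. How many are left?");
  await prompt("Think through the problem step by step before answering.", { hidden: true });
  await gen({ hidden: true }); // hidden reasoning tokens never reach the caller

  // Ask for just the final answer; these tokens stay visible.
  await prompt("Final answer:");
  return gen();
}
```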
Conclusion
In this post, we've explored two techniques for improving an LLM's reasoning capabilities: few-shot prompting and Chain of Thought. Both can be applied to almost any LLM.
When you feel like you've gotten the hang of what's going on, here are some next steps you can try out:
- If you show the model only one example, does it still work?
- What happens when you combine these techniques?
- Change the prose structure of the example prompts. Can the model still generalize?
- Deploy your reasoning logic to Mixlayer's cloud to let others plug in their own inputs and use it as an API!
If you decide to try any of these out, I'd love to hear how it goes!
Join us on Discord to share your results and get help with any issues you run into.
Cheers and happy prompting!