What's going on here?
A few months ago, Llama 2 took the world by storm. Many claimed its performance was similar to that of OpenAI's GPT-3.5.
I put that to the test by gamifying a comparison between the two. It ended in a decisive defeat for gpt-3.5-turbo at the hands of llama2-70b. Here is the final score:
![llama boxing](/images/llama-3af71cdf7dacde3e9b95aa57270c22f7.png?vsn=d)
WINNER
Llama 2 70B Chat
22679
![gpt-3.5 boxing](/images/gpt-085b845e4ae2a5068d6abb7a43f1887f.png?vsn=d)
GPT 3.5
14013
Now a new challenger is on the scene: Mistral 7B. Lightweight, fast, and equipped with a nasty uppercut, Mistral talks big, claiming to outperform Llama 2 13B on all benchmarks. Let's see who wins this time! Results so far:
![llama boxing](/images/babyllama-0445bc648b33b7ddd2f5e26feb9fe8b1.png?vsn=d)
Llama 2 13B Chat
20275
![mistral boxing](/images/mistral-8bcf2d3fc470887e72c0c81aa41ba223.png?vsn=d)
Mistral 7B Instruct
24142
How it works
- Questions are generated by GPT-4 using this prompt:
I'm creating an app that compares large language model completions. Can you write me some prompts I can use to compare them? They should be in a wide range of topics. For example, here are some I have already:
Example outputs:
How are you today?
My wife wants me to pick up some Indian food for dinner. I always get the same things - what should I try?
How much wood could a wood chuck chuck if a wood chuck could chuck wood?
What's 3 x 5 / 10 + 9
I really like the novel Anathem by Neal Stephenson. Based on that book, what else might I like?
Can you give me another? Just give me the question. Separate outputs with a \n. Do NOT include numbers in the output. Do NOT start your response with something like "Sure, here are some additional prompts spanning a number of different topics:". Just give me the questions.
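
As a rough illustration of the question-generation step above, here is a minimal Python sketch that assembles the prompt from a list of seed questions and splits a newline-separated completion back into individual questions. The names `build_prompt` and `parse_questions` are hypothetical, the call to GPT-4 itself is left out, and the cleanup rules (dropping stray numbering and duplicates) are assumptions, not a description of how the app actually handles responses.

```python
import re

# Two of the seed questions quoted in the prompt above.
SEED_EXAMPLES = [
    "How are you today?",
    "What's 3 x 5 / 10 + 9",
]


def build_prompt(examples):
    """Assemble the generation prompt around a list of seed questions."""
    intro = (
        "I'm creating an app that compares large language model completions. "
        "Can you write me some prompts I can use to compare them? "
        "They should be in a wide range of topics. "
        "For example, here are some I have already:"
    )
    outro = (
        "Can you give me another? Just give me the question. "
        "Separate outputs with a \\n. Do NOT include numbers in the output. "
        "Just give me the questions."
    )
    return "\n".join([intro, "Example outputs:", *examples, outro])


def parse_questions(raw):
    """Split a newline-separated completion into clean, unique questions."""
    questions = []
    seen = set()
    for line in raw.splitlines():
        # Assumption: strip list numbering or bullets in case the model
        # ignores the "no numbers" instruction.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if text and text not in seen:
            seen.add(text)
            questions.append(text)
    return questions
```

For example, `parse_questions("1. Explain recursion.\n\nExplain recursion.")` yields a single entry, since the numbering is stripped and the duplicate is dropped.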