What's going on here?
A few months ago, Llama 2 took the world by storm. Many claimed it performed on par with OpenAI's GPT-3.5.
I put that to the test by gamifying a comparison between the two. It ended in disastrous defeat for gpt-3.5-turbo, as llama2-70b decisively beat its opponent. Here was the score.
Llama 2 70B Chat (WINNER): 22679
GPT 3.5: 14013
Now, a new challenger is on the scene... Mistral 7B. Lightweight, fast, and equipped with a nasty uppercut, Mistral talks big: it claims to outperform Llama 2 13B on all benchmarks. Let's see who wins this time! Results so far:
Llama 2 13B Chat: 20772
Mistral 7B Instruct: 24694
How it works
- Questions are generated by GPT-4 using this prompt:
I'm creating an app that compares large language model completions. Can you write me some prompts I can use to compare them? They should be in a wide range of topics. For example, here are some I have already:
Example outputs:
How are you today?
My wife wants me to pick up some Indian food for dinner. I always get the same things - what should I try?
How much wood could a wood chuck chuck if a wood chuck could chuck wood?
What's 3 x 5 / 10 + 9
I really like the novel Anathem by Neal Stephenson. Based on that book, what else might I like?
Can you give me another? Just give me the question. Separate outputs with a \n. Do NOT include numbers in the output. Do NOT start your response with something like "Sure, here are some additional prompts spanning a number of different topics:". Just give me the questions.
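As a rough illustration of the question-generation step, here is a minimal Python sketch of how a prompt like the one above could be assembled and how a newline-separated reply could be split into individual questions. The helper names (`build_prompt`, `parse_questions`), the abridged closing instructions, and the made-up completion are all assumptions for illustration; the actual GPT-4 call is left out on purpose, and this is not necessarily how the app does it.

```python
# Hypothetical helpers; the names below are illustrative, not taken from the app.
INTRO = (
    "I'm creating an app that compares large language model completions. "
    "Can you write me some prompts I can use to compare them? "
    "They should be in a wide range of topics. "
    "For example, here are some I have already:"
)
# Abridged version of the closing instructions from the prompt above.
OUTRO = (
    "Can you give me another? Just give me the question. "
    "Separate outputs with a \\n. Do NOT include numbers in the output."
)

SEED_QUESTIONS = [
    "How are you today?",
    "What's 3 x 5 / 10 + 9",
]


def build_prompt(seed_questions):
    """Join the intro, the seed questions, and the closing instructions."""
    return "\n".join([INTRO, "Example outputs:", *seed_questions, OUTRO])


def parse_questions(completion):
    """Split a completion into one question per non-empty line."""
    # Defensive assumption: treat a literal backslash-n as a separator too.
    normalized = completion.replace("\\n", "\n")
    return [line.strip() for line in normalized.splitlines() if line.strip()]


if __name__ == "__main__":
    print(build_prompt(SEED_QUESTIONS))
    # A made-up completion standing in for a real GPT-4 response.
    fake_completion = "What is a good beginner houseplant?\n\nWhy is the sky blue?\n"
    print(parse_questions(fake_completion))
```

Dropping empty lines and trimming whitespace keeps the parsed list usable even if the model adds blank lines between questions.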