What's going on here?
I put that to the test by gamifying a comparison between the two. It ended in disastrous defeat for gpt-3.5-turbo, as llama2-70b decisively beat its opponent. Here was the score.
Llama 2 70B Chat
Now, a new challenger is the scene... Mistral 7B. Lightweight, fast, and equipped with a nasty uppercut, Mistral talks big — it claims to outperform Llama 2 13B on all benchmarks. Let's see who wins this time! Results so far:
Llama 2 13B Chat
Mistral 7B Instruct
How it works
- Questions are generated by GPT-4 using this prompt:
I'm creating an app that compares large language model completions. Can you write me some prompts I can use to compare them? They should be in a wide range of topics. For example, here are some I have already: Example outputs: How are you today? My wife wants me to pick up some Indian food for dinner. I always get the same things - what should I try? How much wood could a wood chuck chuck if a wood chuck could chuck wood? What's 3 x 5 / 10 + 9 I really like the novel Anathem by Neal Stephenson. Based on that book, what else might I like? Can you give me another? Just give me the question. Separate outputs with a \n. Do NOT include numbers in the output. Do NOT start your response with something like "Sure, here are some additional prompts spanning a number of different topics:". Just give me the questions.