What's going on here?

A few months ago, Llama 2 took the world by storm. Many claimed its performance was similar to that of OpenAI's GPT-3.5.

I put that to the test by gamifying a comparison between the two. It ended in a disastrous defeat for gpt-3.5-turbo: llama2-70b decisively beat its opponent. Here was the score.

Now, a new challenger is on the scene... Mistral 7B. Lightweight, fast, and equipped with a nasty uppercut, Mistral talks big: it claims to outperform Llama 2 13B on all benchmarks. Let's see who wins this time! Results so far:

How it works

Questions are generated by GPT-4 using this prompt:

I'm creating an app that compares large language model completions. Can you write me some prompts I can use to compare them? They should be in a wide range of topics. For example, here are some I have already:

Example outputs:
How are you today?
My wife wants me to pick up some Indian food for dinner. I always get the same things - what should I try?
How much wood could a wood chuck chuck if a wood chuck could chuck wood?
What's 3 x 5 / 10 + 9
I really like the novel Anathem by Neal Stephenson. Based on that book, what else might I like?

Can you give me another? Just give me the question. Separate outputs with a \n. Do NOT include numbers in the output. Do NOT start your response with something like "Sure, here are some additional prompts spanning a number of different topics:". Just give me the questions.
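
The prompt asks for one question per line with no numbering, so the completion can be split into individual questions with very little code. The following is a minimal sketch, not the app's actual code: the function name `parse_questions` is hypothetical, and stripping stray list markers is an added precaution in case the model ignores the formatting instructions.

```python
import re


def parse_questions(raw: str) -> list[str]:
    """Split a newline-separated completion into clean question strings.

    Hypothetical helper: the original app's parsing code is not shown.
    """
    questions = []
    for line in raw.splitlines():
        # Defensive step (an assumption): remove a leading "1. ", "2) ",
        # "- " or "* " in case the model numbers or bullets its output.
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if cleaned:  # skip blank lines between questions
            questions.append(cleaned)
    return questions
```

Passing the raw GPT-4 completion text to `parse_questions` would then yield a list that can be fed to each model in turn.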

GPT-3.5, Llama 2 70B, Llama 2 13B, and Mistral 7B respond to the questions. I'm running Llama 2 and Mistral on Replicate, and GPT-3.5 through OpenAI's API.
Humans decide which answers are best.
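
The last two steps can be sketched as follows: each model answers the same question, and human votes are tallied into a scoreboard. This is a rough illustration under stated assumptions, not the site's implementation. The names `collect_answers` and `scoreboard` are hypothetical, each model is represented by a plain callable standing in for the real Replicate or OpenAI request, and a vote is assumed to be simply the name of the winning model.

```python
from collections import Counter
from typing import Callable


def collect_answers(question: str, models: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Ask every model the same question and return its answer by model name.

    Each callable is a stand-in (an assumption) for a hosted-model API call.
    """
    return {name: ask(question) for name, ask in models.items()}


def scoreboard(votes: list[str]) -> list[tuple[str, int]]:
    """Count human votes per model, highest win count first.

    `votes` holds one winning model name per judged question (an assumption
    about how votes are recorded).
    """
    return Counter(votes).most_common()
```

Keeping the models behind plain callables means the same tallying code works regardless of where each model is hosted.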


