A few months ago, Llama 2 took the world by storm. Many claimed its performance was on par with OpenAI's GPT-3.5.

I put that to the test by gamifying a comparison between the two. It ended in a disastrous defeat for gpt-3.5-turbo: llama2-70b decisively beat its opponent. Here is the score:

Llama 2 70B Chat

22679 🏆

GPT 3.5

14013 🏆

Now a new challenger is on the scene: Mistral 7B. Lightweight, fast, and equipped with a nasty uppercut, Mistral talks big: it claims to outperform Llama 2 13B on all benchmarks. Let's see who wins this time! Results so far:

Llama 2 13B Chat

20205 🏆

Mistral 7B Instruct

24030 🏆

How it works

  • Questions are generated by GPT-4 using this prompt:
    I'm creating an app that compares large language model completions. Can you write me some prompts I can use to compare them? They should be in a wide range of topics. For example, here are some I have already:
    Example outputs:
    How are you today?
    My wife wants me to pick up some Indian food for dinner. I always get the same things - what should I try?
    How much wood could a wood chuck chuck if a wood chuck could chuck wood?
    What's 3 x 5 / 10 + 9
    I really like the novel Anathem by Neal Stephenson. Based on that book, what else might I like?
    Can you give me another? Just give me the question. Separate outputs with a \n. Do NOT include numbers in the output. Do NOT start your response with something like "Sure, here are some additional prompts spanning a number of different topics:". Just give me the questions.
  • The models (GPT-3.5, Llama 2 70B, Llama 2 13B, and Mistral 7B) respond to the questions. I'm running Llama 2 and Mistral on Replicate, and GPT on OpenAI's API.
  • Humans decide which answers are best.
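
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code behind the app: the function names, the stand-in models, and the shuffling of answers before display are all assumptions made for the example. The only detail taken from the prompt itself is that GPT-4 is asked to separate questions with a newline, which is what `parse_questions` relies on. Real calls to Replicate or the OpenAI API are deliberately left out and replaced by plain callables.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function that takes a prompt and returns a completion.
Model = Callable[[str], str]


def parse_questions(raw: str) -> List[str]:
    """Split a newline-separated reply into clean, non-empty questions."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def build_matchup(question: str, models: Dict[str, Model],
                  rng: random.Random) -> List[Tuple[str, str]]:
    """Ask every model the same question, then shuffle the answers.

    Shuffling is an assumption of this sketch: it keeps the display
    order from hinting at which model wrote which answer.
    """
    answers = [(name, ask(question)) for name, ask in models.items()]
    rng.shuffle(answers)
    return answers


def tally(votes: List[str]) -> Counter:
    """Count one point per human vote, keyed by model name."""
    return Counter(votes)


if __name__ == "__main__":
    # Stand-in models; a real run would call the hosted models here.
    models = {
        "model-a": lambda q: f"A says: {q}",
        "model-b": lambda q: f"B says: {q}",
    }
    raw_reply = "How are you today?\n\nWhat's 3 x 5 / 10 + 9\n"
    rng = random.Random(0)
    for question in parse_questions(raw_reply):
        for _name, answer in build_matchup(question, models, rng):
            print(answer)
    # Each vote is the name of the model whose answer a human picked.
    print(tally(["model-a", "model-b", "model-a"]).most_common())
```

Passing the models in as callables keeps the sketch independent of any particular client library, so the same loop works whichever hosting service answers the question.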
