The Most Effective Methods for Evaluating LLMs

Measuring the Effectiveness of Large Language Models

Evaluating large language models (LLMs) is a complex task due to their ability to generate human-quality text, translate languages, write different kinds of creative content, and answer questions in an informative way. There are a number of different methods that can be used to evaluate LLMs, each with its own strengths and weaknesses.

Automatic evaluation metrics

Automatic evaluation metrics are a quick and efficient way to evaluate the performance of LLMs. They typically score how well the model predicts text or how closely its output matches a human-generated reference text, which makes them a proxy for fluency and coherence rather than a direct measure of either. Some common automatic evaluation metrics include:

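Perplexity: Perplexity measures how well an LLM predicts the next token in a sequence. It is the exponential of the average negative log-likelihood per token. Lower perplexity indicates better performance.
BLEU score: BLEU measures n-gram precision, that is, the share of n-grams in a generated text that also appear in a human-generated reference text, with a penalty for outputs that are too short. Higher BLEU scores indicate better performance.
ROUGE: ROUGE measures the overlap between n-grams in a generated text and a human-generated reference text, usually reported as recall (the share of reference n-grams that the generated text covers). Higher ROUGE scores indicate better performance.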
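
The three metrics above can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: it assumes whitespace tokenization and a single reference text, and the function names are hypothetical. Full BLEU also combines several n-gram orders and applies a brevity penalty, which this sketch leaves out.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram matches.

    Precision is the quantity BLEU builds on; recall is what ROUGE-N reports.
    Assumption: plain whitespace tokenization and one reference text.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matches = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
print(ngram_overlap("the cat sat on the mat", "the cat is on the mat", n=2))
```

In practice, established libraries handle tokenization, multiple references, and smoothing, so a sketch like this is mainly useful for understanding what the scores mean.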

Human evaluation

Human evaluation is a more subjective method of evaluating LLMs, but it can provide more nuanced feedback than automatic evaluation metrics. Human evaluators can assess the quality of the generated text in terms of fluency, coherence, relevance, and creativity. They can also assess the LLM's ability to understand and respond to natural language prompts.

Real-world testing

Real-world testing is the most challenging method of evaluating LLMs, but it is also the most realistic. In real-world testing, LLMs are deployed in real-world applications and their performance is measured in terms of their ability to complete tasks and meet user needs.

Hybrid approaches

The most effective approach to evaluating LLMs often involves a combination of these methods. For example, an LLM may be evaluated using automatic evaluation metrics and then further evaluated by human evaluators. Additionally, LLMs may be evaluated in both synthetic and real-world settings.

The choice of evaluation method will depend on the specific LLM and the purpose of the evaluation. For example, if the goal is to evaluate the LLM's ability to generate fluent and coherent text, then automatic evaluation metrics may be sufficient. However, if the goal is to evaluate the LLM's ability to understand and respond to natural language prompts, then human evaluation may be necessary.

Here are some additional considerations for evaluating LLMs:

The diversity of the evaluation dataset: The evaluation dataset should be diverse enough to represent the range of tasks that the LLM will be expected to perform.
The relevance of the evaluation criteria: The evaluation criteria should be relevant to the purpose of the LLM.
The fairness of the evaluation process: The evaluation process should be fair and unbiased.

Evaluating LLMs is an ongoing challenge, but it is important to ensure that LLMs are evaluated in a way that is both rigorous and meaningful. This will help to ensure that LLMs are developed and deployed in a responsible way.

Current Approaches and Leaderboards

The usual first step in evaluation is to run the model on several curated datasets and examine its performance. Hugging Face created an Open LLM Leaderboard where open-access large models are evaluated using four well-known datasets (AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA). This corresponds to automatic evaluation and checks the model's ability to get the facts right for some specific questions.

This is an example of a question from the MMLU dataset.

Subject: College Medicine

Question: An expected side effect of creatine supplementation is:

A) Muscle weakness
B) Gain in body mass
C) Muscle cramps
D) Loss of electrolytes

Answer: (B)

Scoring the model on this type of question is an important metric and serves well for fact-checking, but it does not test the generative ability of the model. This is probably the biggest disadvantage of this evaluation method, because generating free text is one of the most important features of LLMs.
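
To make the scoring concrete, here is a minimal sketch of multiple-choice accuracy using the question above. The `format_prompt` and `accuracy` helpers and the `always_b` stand-in model are hypothetical names introduced for illustration. The sketch assumes the model replies with a single letter, whereas real evaluation harnesses often score the answer options differently, for example by comparing their likelihoods.

```python
def format_prompt(item):
    """Render one multiple-choice item as a plain-text prompt."""
    lines = [f"Subject: {item['subject']}", f"Question: {item['question']}"]
    lines += [f"{letter}) {text}" for letter, text in zip("ABCD", item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items, predict):
    """Fraction of items where predict(prompt) returns the gold answer letter."""
    correct = sum(
        predict(format_prompt(item)).strip().upper() == item["answer"]
        for item in items
    )
    return correct / len(items)


items = [{
    "subject": "College Medicine",
    "question": "An expected side effect of creatine supplementation is:",
    "choices": ["Muscle weakness", "Gain in body mass",
                "Muscle cramps", "Loss of electrolytes"],
    "answer": "B",
}]


def always_b(prompt):
    """Stand-in for a real model call (assumption: the model returns one letter)."""
    return "B"


print(accuracy(items, always_b))  # 1.0
```

Note that the score only records whether the chosen letter matches, which is exactly why this kind of benchmark says little about free-text generation.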

There seems to be a consensus within the community that to evaluate the model properly we need human evaluation. This is typically done by comparing the responses from different models.

Figure: Comparing two prompt completions in the LMSYS project (screenshot by the author)

Annotators decide which response is better, as seen in the example above, and sometimes quantify the difference in quality of the prompt completions. LMSYS Org has created a leaderboard that uses this type of human evaluation and compares 17 different models, reporting the Elo rating for each model.
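
For readers unfamiliar with Elo, the sketch below shows how pairwise preferences can be turned into ratings. It uses the standard Elo update rule; the starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, not the leaderboard's actual settings. Ties are not handled.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_from_battles(battles, initial=1000.0, k=32.0):
    """Replay (winner, loser) pairs in order and return the final ratings.

    Assumption: initial and k are illustrative defaults.
    """
    ratings = {}
    for winner, loser in battles:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # The winner gains more when the win was unexpected.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical annotator decisions: (preferred model, other model).
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
ranked = sorted(elo_from_battles(battles).items(), key=lambda kv: -kv[1])
for name, rating in ranked:
    print(f"{name}: {rating:.0f}")
```

One property worth knowing is that sequential Elo depends on the order in which comparisons are replayed, so results from a small number of battles should be read with caution.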

Because human evaluation can be hard to scale, there have been efforts to speed up the evaluation process, and this resulted in an interesting project called AlpacaEval. Here each model is compared to a baseline (text-davinci-003, a model provided by OpenAI), and human evaluation is replaced with GPT-4 judgment. This is indeed fast and scalable, but can we trust the model to perform the scoring? We need to be aware of model biases. The project has actually shown that GPT-4 may favor longer answers.
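
One simple way to look for this kind of length bias in any judge, human or model, is to measure how often the longer of the two responses wins. The sketch below is a hypothetical check, not part of AlpacaEval: the function name and the `(response_a, response_b, winner)` record format are assumptions, and length is measured in characters for simplicity.

```python
def longer_preferred_rate(judgments):
    """Share of comparisons in which the longer response was preferred.

    judgments: list of (response_a, response_b, winner), winner is "a" or "b".
    Pairs with equal length are skipped. Returns None if nothing can be counted.
    """
    longer_wins = 0
    counted = 0
    for response_a, response_b, winner in judgments:
        if len(response_a) == len(response_b):
            continue
        counted += 1
        longer = "a" if len(response_a) > len(response_b) else "b"
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else None


# Toy data for illustration only.
judgments = [
    ("Paris.", "The capital of France is Paris, a city on the Seine.", "b"),
    ("A long and detailed explanation of the answer.", "Yes.", "a"),
    ("No.", "It depends on several factors.", "a"),
]
print(longer_preferred_rate(judgments))
```

A rate well above 0.5 is consistent with a length bias but does not prove one, since longer answers may also be genuinely better. Comparing the judge's rate with that of human annotators on the same pairs gives a fairer picture.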

LLM evaluation methods are continuing to evolve as the AI community searches for easy, fair, and scalable approaches. The latest development comes from the team at Toloka with a new leaderboard to further advance current evaluation standards.

Using Humans to Evaluate LLMs - A New Approach

The new leaderboard compares model responses to real-world user prompts that are categorized by useful NLP tasks as outlined in the InstructGPT paper. It also shows each model’s overall win rate across all categories.

The evaluation used for this project is similar to the one performed in AlpacaEval. The scores on the leaderboard represent the win rate of the respective model in comparison to the Guanaco 13B model, which serves here as the baseline. The choice of Guanaco 13B is an improvement on the AlpacaEval method, which uses the soon-to-be outdated text-davinci-003 model as the baseline.

The actual evaluation is done by human expert annotators on a set of real-world prompts. For each prompt, annotators are given two completions and asked which one they prefer. You can find details about the methodology here.
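
As a rough sketch of how such preferences become leaderboard scores, the code below computes a per-category and an overall win rate against the baseline. The record format, the category names, and the choice to pool all prompts for the overall figure are assumptions made for illustration; the leaderboard's actual aggregation may differ.

```python
from collections import defaultdict


def win_rates(preferences):
    """Compute win rates of a model against a baseline.

    preferences: iterable of (category, preferred), one record per prompt,
    where preferred is "model" or "baseline".
    Returns (per_category, overall); overall pools all prompts (assumption).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for category, preferred in preferences:
        totals[category] += 1
        if preferred == "model":
            wins[category] += 1
    per_category = {c: wins[c] / totals[c] for c in totals}
    overall = sum(wins.values()) / sum(totals.values())
    return per_category, overall


# Toy annotations with illustrative category names.
preferences = [
    ("brainstorming", "model"),
    ("brainstorming", "baseline"),
    ("open_qa", "model"),
    ("open_qa", "model"),
]
per_category, overall = win_rates(preferences)
print(per_category)  # {'brainstorming': 0.5, 'open_qa': 1.0}
print(overall)       # 0.75
```

Reporting the per-category breakdown alongside the overall number is what lets a reader see that a model may be strong at one task type and weak at another.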

This type of human evaluation is more useful than any automatic evaluation method and should improve on the human evaluation used for the LMSYS leaderboard. The downside of the LMSYS method is that anybody with the link can take part in the evaluation, raising serious questions about the quality of data gathered in this manner. A closed crowd of expert annotators has better potential for reliable results, and Toloka applies additional quality control techniques to ensure data quality.

Summary

In this article, we have introduced a promising new solution for evaluating LLMs: the Toloka Leaderboard. The approach is innovative, combines the strengths of existing methods, adds task-specific granularity, and uses reliable human annotation techniques to compare the models.