The study raises concerns about the fairness and consistency of artificial intelligence (AI) model testing. (Image credit: Getty Images)
A popular AI chatbot benchmark is under scrutiny after scientists claimed that its testing favors Big Tech companies' proprietary models.
LM Arena essentially pits two unidentified large language models (LLMs) against each other to see which one can do a better job of solving a given prompt, with users of the benchmark voting for their favorite result. The results are then posted to a leaderboard that tracks which models are doing the best and how they are improving.
However, the researchers argue that the benchmark is skewed because large companies are granted closed, private testing of their proprietary LLMs, giving them an advantage over open-source LLMs. The researchers published their findings on April 29 in the preprint database arXiv, so the study has not yet undergone peer review.
Source: www.livescience.com