Large Language Models (LLMs) & Algorithmic Software
- Ivan - Kystopia
- Feb 29, 2024
- 2 min read
Since many algorithmic trading platforms are starting to use large language models (LLMs) to help with market analysis, I think it’s important to know which AI language models currently perform best and how they compare to each other. Let’s dive in.
There are many ways to compare LLMs and measure their performance. Luckily, there is a resource maintained by computer-science researchers that combines several of these measures, which you can find here (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). It presents a simple table with three different performance metrics.
The first score column is based on human evaluation. On a separate website, people compare the answers of different LLMs side by side, and their votes are aggregated into an Elo-style rating — the score you see in the first column.
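To make the idea concrete, here is a minimal sketch of how pairwise human votes can be turned into Elo-style ratings. The model names and the batch of votes are invented for illustration; the leaderboard's actual aggregation pipeline is more involved.

```python
def expected_score(r_a, r_b):
    """Expected win probability of model A against model B under Elo."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, outcome, k=32):
    """Update both ratings after one comparison.

    outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1 - outcome) - (1 - e_a))

# Start both models at 1000 and replay a batch of human verdicts:
# model_a won 3 comparisons, tied 1, and lost 1.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [1.0, 1.0, 0.5, 1.0, 0.0]
for v in votes:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], v
    )
```

After replaying the votes, model_a ends up rated above model_b, and since each update is zero-sum, the two ratings still add up to the starting total.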
The second score column does pretty much the same, but the judge here is the best-performing LLM, GPT-4. That’s why it can quickly assess many answers without any human evaluation. Interestingly, the results closely match the human evaluation, as described in this scientific article (https://arxiv.org/abs/2306.05685). One caveat: GPT-4 tends to prefer its own answers, so it is biased when scoring its own responses.
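A minimal sketch of this LLM-as-a-judge setup is below. The `call_llm` parameter is a hypothetical stand-in for whatever chat-completion API you use, and the prompt wording is my own illustration, loosely modeled on the pairwise format described in the paper linked above.

```python
# Hypothetical pairwise-judging sketch: build a prompt, ask a judge
# model for a one-token verdict, and map it to a score for answer A.
JUDGE_PROMPT = """You are an impartial judge. Compare the two answers below
to the user's question and reply with exactly one letter:
"A" if assistant A's answer is better, "B" if assistant B's is better,
"C" for a tie.

[Question]
{question}

[Assistant A's answer]
{answer_a}

[Assistant B's answer]
{answer_b}
"""

def judge_pair(question, answer_a, answer_b, call_llm):
    """Return 1.0 if the judge prefers A, 0.0 if B, 0.5 for a tie.

    call_llm: any function mapping a prompt string to a reply string
    (a placeholder for a real chat-completion call).
    """
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b,
    ))
    verdict = reply.strip().upper()[:1]
    # Fall back to a tie if the judge replies with anything unexpected.
    return {"A": 1.0, "B": 0.0, "C": 0.5}.get(verdict, 0.5)
```

Because the judge is itself an LLM, you can score thousands of answer pairs this way for a fraction of the cost of human annotation.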
The third score column is based on MMLU, a huge benchmark of multiple-choice questions spanning many topics, from elementary mathematics to US history, computer science, and law. It was first introduced in 2020 (https://arxiv.org/abs/2009.03300) and has since been used to evaluate many LLMs, starting with GPT-3.
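The scoring itself is simple: the benchmark reports the fraction of questions where the model picks the correct answer letter. Here is a toy sketch with two made-up questions (the real benchmark has thousands):

```python
# MMLU-style scoring sketch: each item has four choices labeled A-D,
# and the score is plain accuracy against the answer key.
questions = [
    {"question": "2 + 2 = ?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?",
     "choices": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
]

def accuracy(predictions, dataset):
    """Fraction of items where the predicted letter matches the key."""
    correct = sum(
        pred == item["answer"] for pred, item in zip(predictions, dataset)
    )
    return correct / len(dataset)

# Suppose a model answered "B" and then "C":
score = accuracy(["B", "C"], questions)  # → 0.5
```

Averaging this accuracy over all subject areas gives the single number you see in the leaderboard column.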
What can we learn from this? The obvious conclusion is that GPT-4 is the best-performing large language model, however you measure it. That’s why it’s the base model for the Perceptrader AI. The second-best model, Claude, comes from Anthropic, OpenAI’s rival. It outperforms GPT-3.5 and all the other competitors while also offering a larger context window of up to 100k tokens (about 75,000 words). If you want to summarize a book, that’s your LLM of choice.
Bard, which is based on the PaLM model, lags behind the main competitors. However, it has access to the internet, which makes it a better choice in some scenarios, as you don’t always need to explicitly provide it with up-to-date data. Moreover, forecasters expect Google’s next LLM to surpass GPT-4 (https://manifold.markets/brubsby/will-googles-gemini-beat-gpt4-in-te).
In conclusion, while GPT-4 stands out as the premier choice, other contenders like Claude and Bard offer unique features that suit specific needs. Moreover, open-source models, like the Llama family, can be trained on specific data to be better than competitors at certain narrow tasks. That's why it's essential to select an LLM based not just on raw performance but also on the unique requirements of the task at hand. As we move forward, the landscape is bound to evolve, with companies like Google gearing up to challenge the supremacy of current frontrunners. Staying updated on these developments will be crucial for anyone seeking to harness the power of AI in trading or any other domain.