Large Language Models (LLMs) like those developed by Google’s DeepMind can be seen as strong data compressors, according to a recent research paper. The authors propose viewing LLMs from a compression perspective, highlighting their ability to transform input data into a smaller, compressed form. The researchers repurposed LLMs to perform arithmetic coding, a lossless compression algorithm that turns a model's predicted probabilities into a compact code, so that better predictions yield shorter output. They found that the models not only excelled at compressing text but also achieved impressive compression rates on image and audio data. However, because of their large size and slow processing speed, LLMs are not currently practical for data compression compared with existing compression algorithms.
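The link between prediction and compression described above can be illustrated with a small sketch. It is not the paper's setup: a simple adaptive byte-frequency model stands in for the LLM (an assumption made purely for illustration), and the function names are hypothetical. The sketch sums the negative log2 probability of each symbol, which is the code length an arithmetic coder approaches to within a couple of bits.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Approximate bits an arithmetic coder would emit for `data`.

    The predictor here is a toy adaptive byte-frequency model with
    Laplace smoothing, standing in for an LLM's next-token probabilities.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        # Probability the model assigns to this byte before seeing it.
        p = (counts[byte] + 1) / (seen + 256)
        # A symbol with probability p costs about -log2(p) bits.
        total_bits += -math.log2(p)
        # Update the model so later predictions can improve.
        counts[byte] += 1
        seen += 1
    return total_bits


def compression_rate(data: bytes) -> float:
    """Compressed size divided by raw size (lower is better)."""
    return ideal_code_length_bits(data) / (8 * len(data))


if __name__ == "__main__":
    repetitive = b"a" * 1000
    print(f"rate on repetitive input: {compression_rate(repetitive):.3f}")
```

A predictable input gets a low rate because the model quickly learns to assign high probability to the next byte. An LLM plays the same role with far stronger predictions, at the cost of a much larger and slower model.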
One of the key insights gained by viewing LLMs as compressors is that larger models lose their advantage on smaller datasets. The researchers found that bigger models do achieve better compression rates on larger datasets, but there is a critical point beyond which the adjusted compression rate, which counts the model’s own size as part of the compressed output, starts to rise again (that is, to worsen) because the model’s size overwhelms the dataset. This suggests that bigger models are not necessarily better for all tasks and that compression can serve as an indicator of how well a model learns from its dataset. These findings could have implications for how LLMs are evaluated in the future, particularly in addressing the problem of test set contamination in LLM training.
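The trade-off above can be made concrete with a short calculation. The sketch below assumes the adjusted rate is the compressed size plus the model size, divided by the raw size. All figures in it are hypothetical and chosen only to show the crossover, not taken from the paper.

```python
def raw_rate(compressed_bytes: float, raw_bytes: float) -> float:
    """Compressed size over raw size, ignoring the model itself."""
    return compressed_bytes / raw_bytes


def adjusted_rate(compressed_bytes: float, model_bytes: float, raw_bytes: float) -> float:
    """Compressed size plus model size, over raw size (lower is better)."""
    return (compressed_bytes + model_bytes) / raw_bytes


GB = 10**9
TB = 10**12

# Hypothetical models: the large one predicts better but is far bigger.
SMALL_MODEL_BYTES, SMALL_RAW_RATE = 1 * 10**6, 0.20
LARGE_MODEL_BYTES, LARGE_RAW_RATE = 100 * GB, 0.10


def compare(dataset_bytes: float) -> tuple[float, float]:
    """Return (small-model, large-model) adjusted rates for a dataset size."""
    small = adjusted_rate(SMALL_RAW_RATE * dataset_bytes, SMALL_MODEL_BYTES, dataset_bytes)
    large = adjusted_rate(LARGE_RAW_RATE * dataset_bytes, LARGE_MODEL_BYTES, dataset_bytes)
    return small, large


if __name__ == "__main__":
    for name, size in [("1 GB", 1 * GB), ("10 TB", 10 * TB)]:
        small, large = compare(size)
        print(f"{name}: small={small:.4f} large={large:.4f}")
```

With these illustrative numbers the small model wins on the 1 GB dataset, where the large model's own weight dominates, and the large model wins on the 10 TB dataset, where its better predictions outweigh its size.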
Overall, while LLMs show potential as data compressors, their size and slow processing speed currently make them less practical than existing compression algorithms. The compression perspective nonetheless yields useful insights into how model performance scales with dataset size, and it offers a new approach to evaluating LLMs, especially for challenges such as test set contamination.