Purists may disagree, but Attention, like Calculus, is a generational advancement that will shape the incoming machine era. Paying attention to Attention (pun intended) matters for everyone, not just so we do not miss the forest for the trees (the trees here being LLMs), but also to grasp its likely implications for discovering new interrelations in fields as far apart as genetics, social networks, or even what might connect earthquakes.
From Calculus to tensors, complex numbers, and Hilbert Space, mathematical advancements have been critical to most of the great leaps in our scientific quests. Attention, unlike these pure methods, operates as a powerful computational tool, with an impact far exceeding that of other heuristic methods of the past.
The core of Attention lies in its ability to analyze data within its full context. Before Attention, neural networks processed information sequentially or with a very narrow contextual window. This approach was akin to trying to understand a novel by reading one word at a time without remembering the previous page. Attention revolutionizes this by enabling AI models to consider the broader context, looking at information 'as a whole' and understanding how different parts relate, no matter their distance in the sequence.
Effectively, Attention allows machines to consider information from across an entire document, picture, video, codebase, or even external sources when interpreting a particular element.
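To make this concrete, here is a minimal sketch of the scaled dot-product self-attention computation that sits at the heart of the idea. The function name and the toy inputs are illustrative, not taken from any particular library: each position in a sequence scores its affinity with every other position, and its output becomes a weighted blend of the whole sequence, which is how relationships survive regardless of distance.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Illustrative self-attention: every position attends to every other
    position, so relationships are captured regardless of distance."""
    d_k = queries.shape[-1]
    # Similarity of each query with every key, scaled by sqrt(d_k) for stability.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax turns scores into attention weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a context-aware blend of all values in the sequence.
    return weights @ values

# Toy example: a "sequence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(output.shape)  # (4, 8): every token now carries information from the whole sequence
```

Because every token attends to every other token, the word on the last page can directly inform the interpretation of a word on the first, exactly the property a narrow sequential window lacks.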
Early applications of Attention, when its capabilities were still modest, focused on natural language processing, leading to the rise of large language models (LLMs). As context windows have grown, so has our ability to capture long-range dependencies and relationships, opening doors to new applications in diverse fields.
At the first level of scaling beyond text, Attention allows machines to analyze images and videos with a deeper understanding of the relationships between objects and regions, enabling breakthroughs in object detection, image captioning, and visual reasoning.
We are only now waking up to the potential at higher scaling levels. The human genome, for example, with its complex interplay of genes and regulatory elements spread across vast stretches of DNA, is a prime case where long-range dependencies are crucial; the work at https://go.nature.com/3JfdeEZ shows that Genomic Language Models are already here. Expect similar models for climate science, earthquake prediction, financial market analysis, and social network dynamics. At even higher levels, Attention is likely to play a critical role in the applications we may yet develop in Quantum Computing.
In sum, Attention is not just LLMs!