ATTENTION: The Rise Of Transformers


From humble beginnings, the transformer architecture has risen to remarkable heights in the field of artificial intelligence. It adapts to a wide range of tasks, such as language translation, sentiment analysis, and summarization. With its self-attention mechanism, the transformer can capture relationships in data with high precision, and its parallelism removes the bottleneck of sequential processing. Its remarkable ability to understand context and generate human-like text has changed the way we perceive and interact with these machines.


Let’s understand attention and its capabilities by diving deep into the transformer architecture.


The evolution of attention in biological systems paved the way for its integration into artificial intelligence models. Inspired by the human brain, researchers have tried to replicate attention mechanisms in artificial neural networks. Early architectures such as LeNet introduced attention-like mechanisms that focus on specific features of an image, but it was the rise of transformers that led to attention's real breakthrough.



The transformer architecture was proposed by Vaswani et al. in the 2017 research paper “Attention Is All You Need”, and it has revolutionized the field of Natural Language Processing. Transformers embrace the essence of attention mechanisms, particularly self-attention, to capture dependencies between words in a sequence. Beyond self-attention there are other variants, such as sparse attention, local attention, and global attention. Here, we limit our discussion to self-attention and its mechanism in transformers.


Self-attention or scaled dot-product attention is a pivotal mechanism within the Transformer architecture. It allows the model to capture dependencies between different positions or words within a sequence by selectively focusing on relevant information.

Initially, each word in the input sequence is represented by three learnable vectors derived from its input embedding: the query vector, the key vector, and the value vector.

Let’s understand these vectors using a treasure hunt example:

  • Key Vector: Identifies important information

 Ex: A key vector is like a treasure map that tells you where the valuable things are located.

  • Value Vector: Holds the actual information

 Ex: These are like treasures that you find by following the map.

  • Query Vector: Asks a question or seeks the information we’re looking for

 Ex: It is the question you ask when you want something specific, like “Where is the treasure hidden?”
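In code, these three vectors are obtained by multiplying each input embedding by three learned weight matrices. Here is a minimal NumPy sketch; the sequence length, embedding size, and random initialization are illustrative assumptions, not values from a real model:

```python
import numpy as np

# Toy setup: a sequence of 4 "words", each embedded as an 8-dimensional vector.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.standard_normal((seq_len, d_model))   # input embeddings

# Three learnable weight matrices (here randomly initialized) project each
# embedding into its query, key, and value vectors.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what information do I carry?"

print(Q.shape, K.shape, V.shape)   # each is (4, 8)
```

In a trained transformer these weight matrices are learned by backpropagation; here they only illustrate the shapes involved.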


Next, attention scores are computed between the query (Q) and key (K) vectors, typically as dot products. These scores measure the compatibility or similarity between each word's query and the keys of all other words in the sequence; the higher the compatibility, the more attention the word will pay to those words. This is followed by a scaling operation: the scores are divided by the square root of the key dimension, which keeps the dot products from growing too large.

To convert the attention scores into probabilities, we apply a softmax function, which normalizes the scores so that they add up to 1. These probabilities represent how much each word should contribute to the output for a particular query.

The attention weights obtained in the previous step are multiplied with the value (V) vectors to compute a weighted sum. This means that words with higher attention weights have a more significant influence on the final representation of the query. The weighted sum of these values represents the final output of the self-attention mechanism.
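The normalization step can be seen in isolation. Below is a small sketch with made-up scores of one query against four keys (the numbers are arbitrary, chosen only to show the effect):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical raw attention scores of one query against four keys.
scores = np.array([2.0, 1.0, 0.1, -1.0])
weights = softmax(scores)

print(weights)         # four probabilities, largest for the highest score
print(weights.sum())   # 1.0
```

Note how the softmax preserves the ranking of the scores while squashing them into a probability distribution.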

This process is repeated for every word in the sentence, allowing each word to attend to other words and gather contextual information.
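Putting the three steps together (scores, softmax, weighted sum), scaled dot-product attention fits in a few lines. This is a minimal sketch with toy shapes and random inputs, not an optimized implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility of each query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out, w = self_attention(Q, K, V)
print(out.shape)   # (4, 8): one context-aware vector per word
```

Because the whole computation is matrix multiplication, every word attends to every other word in parallel; no sequential loop over positions is needed.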


In simpler terms,

Self-attention in Transformers works by looking at each word in a sentence and deciding how much attention to give to the other words. It does this by comparing the similarity between the word's meaning (query) and the meanings of all the other words (keys). These similarities are turned into attention weights, which determine how much each word contributes to the final representation. By doing this for each word, the model can capture the relationships and dependencies between words, resulting in a better understanding of the sentence.



Multi-head attention is like having multiple sets of self-attention mechanisms working together, each focusing on different aspects of the input. It splits the embeddings into multiple sets of queries, keys, and values. Each set represents a "head" of attention. Each head independently learns to pay attention to different patterns and relationships in the input.

Once we have multiple sets of attention heads, we calculate attention scores for each head separately, just like in self-attention. Then we combine the outputs from all the attention heads by concatenating them and passing the result through a final linear layer.

This way, the model becomes more powerful and can learn to attend to different parts of the input simultaneously.
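The split-attend-concatenate pattern can be sketched as follows. Again the shapes, head count, and random projections are illustrative assumptions (and the final linear layer is omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads   # each head works in a smaller subspace
    heads = []
    for _ in range(n_heads):
        # Each head has its own (here randomly initialized) projections,
        # so each head can learn to attend to different patterns.
        W_q = rng.standard_normal((d_model, d_head))
        W_k = rng.standard_normal((d_model, d_head))
        W_v = rng.standard_normal((d_model, d_head))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V)
    # Concatenate the heads back up to d_model dimensions
    # (a final linear layer would normally follow).
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out = multi_head_attention(X, n_heads=2, rng=rng)
print(out.shape)   # (4, 8)
```

Real implementations compute all heads in one batched matrix multiplication rather than a Python loop, but the result is the same.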


Masked multi-head attention is an extension of multi-head attention that uses a special trick to hide future words. Imagine that we want our model to predict the next word in a sentence, but we don’t want it to look ahead at the words that come after the target word. To prevent this, we mask the future words by setting their attention scores to a very low value, so the model won’t pay attention to them. This ensures that during training, the model can only use the words it has already seen when predicting the next word.


Steps involved in Masked Multi-head attention:

  1. Split the input into queries, keys, and values.
  2. Compare each query to the keys to see how related they are.
  3. Apply a mask to prevent attending to future words: create a matrix where the values in the upper triangle above the diagonal are all ones and the rest are zeros.
  4. Multiply this mask by a large negative value such as -1e9 and add it to the attention scores. This pushes the scores for future words towards negative infinity; since -inf itself cannot be used directly, a very large negative number stands in for it.
  5. The masked attention scores are then passed through a softmax function, which converts them into probabilities. The future positions receive (near-)zero probability, so the model attends only to the relevant words while avoiding future information.
  6. The remaining steps are the same as in multi-head attention: the weighted sum of values is computed, the attention heads are concatenated, and the result is passed through a linear layer to obtain the final output.
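The steps above can be sketched for a single head as follows (toy shapes and random inputs, as before; `np.triu` builds the upper-triangular mask from step 3):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Ones above the diagonal mark the "future" positions for each query.
    mask = np.triu(np.ones_like(scores), k=1)
    # Adding a large negative number drives those scores towards -inf,
    # so softmax assigns them (near-)zero probability.
    scores = scores + mask * -1e9
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

out, w = masked_attention(Q, K, V)
print(np.round(w, 3))   # lower-triangular: each word attends only to itself and earlier words
```

The printed weight matrix is (effectively) lower-triangular: the first word can only attend to itself, the second to the first two words, and so on.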

I hope that you have gained a better understanding of attention! If you'd like to explore more insightful resources and deepen your knowledge, I recommend checking out this link.


From its origins in the human mind to its revolutionary impact on artificial intelligence, attention has come a long way. Its journey has only just begun, and its influence will continue to shape the future of AI, transforming our interactions and revolutionizing the way machines perceive and understand the world around us.


  1. Attention is All You Need:
  2. The Illustrated Transformer:
  3. Transformer’s Self-attention:
  4. Understanding Attention:



By Soumya G