📝 Summary
- Content: Took a look inside BERT
- Conclusion: I could vaguely see how it extracts the meaning of a sentence
- 📝 Summary
- 🎯 Point
- 💬 Prompt and Chat Log
- Consultation on how to proceed
- Implementation of BERT Attention Visualization Code
- Implementation of Submission Code
- Question about CLS
- Question about Query, Key, Value this time
- Question about the embedding matrix X
- Question about differences from GloVe and FastText
- Implementation of visualization code for computing embedding matrix X
- Question about how CLS is computed
- About BERT’s training tasks (its “pretext tasks”)
- Relationship between single-sentence tasks and Next Sentence Prediction
- 📊 Results (Code, Score, Ranking)
- 📚 Insights
- 🔮 Next
🎯 Point
Observe, as much as possible, what kind of computations BERT performs and what they mean
💬 Prompt and Chat Log
*Blue text indicates input prompts to the AI
Consultation on how to proceed
Please support me with Kaggle again! I want to continue participating in the Disaster Tweets NLP competition.
Last time, I traced the behavior of Self-Attention and observed how parallel computation works.
This time, if possible, I’d like to actually run BERT and visualize the Transformer’s behavior so that I can understand how it works. But I’m worried that this might be too big a jump in granularity and become too difficult. If that’s too hard, I’m also interested in “building a mini Transformer Encoder (1 layer) from scratch.”
First, please give me advice on how to proceed!
Copilot’s Answer
Both “visualizing the inside of Transformer while running BERT” and
“building a mini Transformer Encoder from scratch” are fully possible.
However, if you get the order or granularity wrong, the difficulty spikes quickly.
So it’s important to proceed strategically.
Given your previous steps
(Self-Attention manual calculation → understanding parallel computation → Kaggle experiments),
this is the route that will deepen your understanding most efficiently.
🎯 Conclusion: Recommended approach (best balance of difficulty and learning efficiency)
Step 1: Visualize BERT’s Attention “as is”
This is the fastest way to grasp how the Transformer behaves.
HuggingFace’s BertModel can return attention weights directly (pass output_attentions=True).
In other words, you can see the inside of the Transformer without building anything yourself.
And if you feed Disaster Tweets directly into it,
→ you can immediately see “which parts BERT looks at to judge disaster tweets.”
Benefits:
- Very little code
- Works immediately in Kaggle Notebook
- Gives an intuitive grasp of “what the Transformer is doing”
- Directly useful for misclassification analysis
What you will understand here:
- How Multi-Head Attention divides roles
- What the CLS token attends to
- How contextual relationships are captured
Step 2: Compare the attention values with your previous manual calculations
Extract BERT’s attention weights and compare them with your previous:
Q/K/V → score → softmax → weighted sum
Doing this will give you the strong intuition:
“Oh, what’s happening inside BERT is exactly the same as the manual calculations I did.”
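As a rough illustration of that comparison, here is a minimal NumPy sketch that computes attention twice: once token by token (the "manual" way) and once in matrix form. The inputs are random toy values rather than real BERT tensors, and the function names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix_form(Q, K, V):
    """All tokens at once: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights, weights @ V

def attention_by_hand(Q, K, V):
    """The same computation, one query token at a time."""
    n, d_k = Q.shape
    weights = np.zeros((n, n))
    for i in range(n):
        scores = np.array([Q[i] @ K[j] for j in range(n)]) / np.sqrt(d_k)
        weights[i] = softmax(scores)
    context = np.stack(
        [sum(weights[i, j] * V[j] for j in range(n)) for i in range(n)]
    )
    return weights, context

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # toy: 4 tokens, d_k = 8

w_mat, ctx_mat = attention_matrix_form(Q, K, V)
w_hand, ctx_hand = attention_by_hand(Q, K, V)
print(np.allclose(w_mat, w_hand), np.allclose(ctx_mat, ctx_hand))
```

Both versions should agree up to floating-point error, which is the point of the exercise: the matrix form is just the manual calculation done for every token in parallel.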
Step 3: Build a mini Transformer Encoder (1 layer) from scratch
This is the best time to move on to “building it yourself.”
Reasons:
- If you start building from scratch without seeing the correct behavior first, you’ll get lost easily
- After seeing real BERT’s behavior, it becomes much easier to understand
“what differs” and “why it differs” when you build your own
Scope of what to build (just the right amount):
- Multi-Head Self-Attention (1 layer)
- Feed Forward Network (2-layer MLP)
- LayerNorm
- Residual connections
Things you don’t need to build:
- Positional Encoding (fixed values are fine at first)
- Tokenizer (just use BERT’s)
- Weight training (random weights are fine initially)
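To make that scope concrete, here is one possible sketch of a single encoder layer in NumPy, using random weights as suggested. The class name and the toy sizes (d_model=32, 4 heads) are assumptions chosen for illustration; biases and the learnable LayerNorm scale/shift are left out to keep it short, so this is not a faithful copy of BERT.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-12):
    # Normalize each token vector; learnable scale/shift are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MiniEncoderLayer:
    """One Transformer encoder layer with random (untrained) weights."""

    def __init__(self, d_model=32, n_heads=4, d_ff=64, seed=0):
        assert d_model % n_heads == 0
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.02, size=shape)
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q, self.W_k, self.W_v, self.W_o = (init(d_model, d_model) for _ in range(4))
        self.W_1, self.W_2 = init(d_model, d_ff), init(d_ff, d_model)

    def _split_heads(self, x):
        # (seq, d_model) -> (heads, seq, d_head)
        seq = x.shape[0]
        return x.reshape(seq, self.n_heads, self.d_head).transpose(1, 0, 2)

    def __call__(self, x):
        q, k, v = (self._split_heads(x @ W) for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head)  # (heads, seq, seq)
        attn = softmax(scores, axis=-1)
        ctx = (attn @ v).transpose(1, 0, 2).reshape(x.shape)      # merge heads back
        x = layer_norm(x + ctx @ self.W_o)                        # residual + LayerNorm
        ffn = gelu(x @ self.W_1) @ self.W_2                       # 2-layer MLP
        x = layer_norm(x + ffn)                                   # residual + LayerNorm
        return x, attn

layer = MiniEncoderLayer()
tokens = np.random.default_rng(1).normal(size=(6, 32))  # 6 toy token vectors
out, attn = layer(tokens)
print(out.shape, attn.shape)
```

The layer returns the attention weights alongside the output so they can be plotted as heatmaps in the same way as BERT's.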
Implementation of BERT Attention Visualization Code
Glad to hear that!
I’ll also send the training data, so keeping the competition’s disaster tweet classification in mind, please output code that visualizes BERT’s Attention “as is.”
Uploaded the first part of the training data 📝 train_100.csv
Copilot’s Answer
First, assuming this will run directly in a Kaggle Notebook:
- Load train_100.csv
- Use bert-base-uncased to output Attention
I’ll provide code that visualizes token-to-token Attention for a single tweet as a heatmap.
Examples:
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
For tweets like these with target=1, we’ll visualize where BERT is focusing.
Kaggle-ready: BERT Attention Visualization Code (copy & paste as-is)
# =========================
# 0. Install libraries (first cell in Kaggle Notebook)
# =========================
!pip install -q transformers sentencepiece
~ Omitted below — full code is linked further down ~
Implementation of Submission Code
Thanks! It does look like “earthquake” is slightly more highlighted.
For now, I’d like to move toward submission code. Is my understanding correct that the current state is: we have computed Attention for each token in the sentence? If so, to classify whether a tweet is a disaster or not, do we need to create some features from BERT’s Attention and then build something like a linear regression classifier?
If possible, assuming the BERT model is already loaded, please output the remaining code needed for submission:
STEP1 feature creation
STEP2 classifier construction
STEP3 load test data & convert to features
STEP4 inference
STEP5 create submission file
If I’m misunderstanding something major and this approach is difficult, please feel free to point it out.
Copilot’s Answer
With just a slight course correction, you can get much closer to “submission-ready code.”
First, let’s confirm the premise (this is the most important part):
Your understanding is that the current state is “we have computed Attention for each token in the sentence,” right?
This is mostly correct, but strictly speaking:
What you are currently looking at is:
“Token-to-token Attention weights for each layer and each head.”
However, for the Kaggle task of classifying tweets as ‘disaster or not disaster,’
what is normally used is not Attention, but the “hidden states (embeddings).”
In BERT classification tasks, the typical approach is:
- Extract the final-layer hidden vector (768 dimensions) of the [CLS] token
- Use that as the feature vector
- Feed it into a linear layer (like logistic regression) for classification
So,
“Create features from Attention and classify with linear regression”
is conceptually possible,
but using “CLS embeddings instead of Attention” is far more natural,
simpler to implement, and tends to give better accuracy.
Therefore, here’s what we’ll do this time:
Assuming your BERT model is already loaded:
STEP1: Extract the [CLS] embedding from BERT and use it as a feature
STEP2: Train a classifier (logistic regression) using those features
STEP3: Load test.csv and convert it into the same features
STEP4: Run inference with the classifier
STEP5: Create the submission file (submission.csv)
We’ll write everything in one go.
*You can play with Attention-based features later,
but first let’s build something you can actually submit.*
Premise: BERT is already loaded
from transformers import BertTokenizer, BertModel
import torch
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
~ Omitted below — full code is linked further down ~
Question about CLS
What is CLS?
Copilot’s Answer
What is CLS (the [CLS] token)?
Understanding this makes it instantly clear
“how BERT converts an entire sentence into a single vector for classification.”
🎯 Conclusion: CLS is a special token used to create a “representative vector of the entire sentence.”
In BERT, a special token [CLS] is always added at the beginning of the input sentence.
[CLS] Our Deeds are the Reason ... [SEP]
The final-layer hidden state of this [CLS] (768 dimensions)
is used as the vector summarizing the meaning of the entire sentence.
In classification tasks,
only this [CLS] vector is extracted and passed to a linear layer.
🔍 Why does CLS contain “the meaning of the entire sentence”?
Transformer Self-Attention allows
all tokens to attend to each other.
As training progresses:
- [CLS] attends to important words in the sentence
- It incorporates that information into its own vector
- This process is repeated across 12 layers
- Eventually, it becomes a “summary vector of the entire sentence”
Therefore, in BERT classification:
sentence → BERT → [CLS] vector → linear layer → 0/1
🧠 Intuitive image:
- [CLS] is the “representative of the sentence”
- It absorbs information from other words
- Eventually it becomes capable of deciding “Is this sentence disaster-related?”
In your Attention visualization,
the fact that “earthquake” looked slightly more highlighted
is consistent with this mechanism (attention weights alone are suggestive, not proof).
📌 Important: Why we use CLS as features instead of Attention
Attention only shows “where the model is looking,”
but it is not a vector that directly represents meaning.
On the other hand, the CLS vector:
- Is a dense 768-dimensional semantic representation
- Compresses the entire sentence’s information
- Is the vector BERT is designed to feed into a classification head
Therefore, in Kaggle submission code,
using the CLS vector as features is the correct approach.
✨ Summary (super important)
- CLS = Classification token
- Placed at the beginning of the sentence
- The final-layer CLS vector contains the meaning of the entire sentence
- BERT classification uses only this CLS vector
Question about Query, Key, Value this time
This time, my understanding is that the w and Q/K/V between each word are read from the pretrained BERT model, and based on the words and their order in the train/test sentences, those Q/K/V values are used to compute matrices to obtain CLS and Attention. Is this understanding correct?
Copilot’s Answer
Your understanding is almost correct.
But if we clarify one more level of detail about
“which parts are stored in BERT beforehand, and which parts are computed depending on the input,”
your understanding becomes perfect.
Since you’ve already done manual Self-Attention calculations,
getting this right will make BERT’s behavior fully intuitive.
🎯 Conclusion (correct understanding)
In one sentence:
BERT stores pretrained weights (W_Q, W_K, W_V, W_O, etc.),
applies them to the input embeddings to generate Q/K/V,
and computes Attention and the CLS vector from them.
This is completely correct.
But more precisely, the process is:
🧠 Exact flow (what BERT actually does)
① BERT internally stores “pretrained weights”
- Token embeddings (WordPiece embedding)
- Positional embeddings
- Segment (token type) embeddings
- Self-Attention weights:
- W_Q (Query)
- W_K (Key)
- W_V (Value)
- W_O (Output projection)
- FFN (Feed Forward Network) weights
- LayerNorm weights
These are fixed parameters learned during pretraining.
② Convert input sentences (train/test) into embeddings
Example:
Our Deeds are the Reason of this earthquake
→ Split into WordPiece tokens
→ Convert each token into a 768-dimensional embedding
→ Add positional embeddings (and segment embeddings)
③ Multiply the embeddings by pretrained W_Q, W_K, W_V
④ Compute Attention scores via Q·Kᵀ / √d_k (d_k = 64 per head in bert-base)
⑤ Softmax → Attention weights
⑥ Attention weights × V → contextual vectors
⑦ Repeat this for 12 layers → final CLS vector is obtained
CLS is trained from the beginning as the “representative of the sentence,”
so the final-layer CLS vector becomes a 768-dimensional compressed meaning representation.
✔ So your understanding can be summarized like this:
Your understanding (correct):
- BERT’s W_Q/W_K/W_V are pretrained
- When you input train/test sentences
- Q/K/V are computed according to the sentence
- Attention is computed
- CLS is computed through Attention
→ Completely correct
❗ One important clarification
You wrote:
“Based on the words and their order in train/test sentences,
Q/K/V values are used to compute CLS and Attention.”
The phrase “use Q/K/V values” is slightly misleading.
More precisely:
- Q/K/V are computed *fresh for each sentence* (input-dependent)
- But W_Q/W_K/W_V are fixed (pretrained)
So:
Q/K/V → dynamic (changes per sentence)
W_Q/W_K/W_V → static (fixed parameters)
If you understand this, your understanding is perfect.
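The static/dynamic distinction can be seen in a few lines of NumPy. This is only a toy sketch: W_Q is a random stand-in for a pretrained matrix, and the two X matrices are random stand-ins for the embeddings of two different tweets.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy size; bert-base uses 768

# Stand-in for a pretrained projection matrix: created once, never modified below.
W_Q = rng.normal(size=(d_model, d_model))
W_Q_before = W_Q.copy()

# Two different "sentences" (random stand-ins for embedding matrices X).
X_tweet_a = rng.normal(size=(5, d_model))  # 5 tokens
X_tweet_b = rng.normal(size=(7, d_model))  # 7 tokens

# Q is recomputed for every input, using the same fixed W_Q.
Q_a = X_tweet_a @ W_Q
Q_b = X_tweet_b @ W_Q

print(Q_a.shape, Q_b.shape)
print("W_Q unchanged:", np.array_equal(W_Q, W_Q_before))
```

Q changes in both shape and values from sentence to sentence, while W_Q stays exactly as it was loaded.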
Question about the embedding matrix X
“How is X — the ‘input sentence embedding matrix’ — computed in this case?”
Copilot’s Answer
To give the conclusion first:
🎯 X (the embedding matrix of the input sentence) is:
① Token Embedding + ② Positional Embedding + ③ Segment Embedding
added together.
And BERT computes this automatically.
🧩 So how is X computed in the code you used?
In your code, the moment you call:
model(input_ids, attention_mask)
BERT internally performs the following steps.
🧠 STEP 1: Convert tokens to IDs (tokenizer)
Example:
[CLS] our deeds are the reason ... [SEP]
→ Convert to IDs
→ input_ids (a sequence of integers) is created
🧠 STEP 2: Convert IDs into embedding vectors (Embedding Layer)
BERT internally has an embedding matrix of size:
vocab_size × 768 dimensions.
token_embedding = embedding_matrix[input_id]
This is the token embedding (768 dimensions).
🧠 STEP 3: Add positional embeddings
Self-Attention on its own ignores token order (unlike an RNN),
so BERT adds a learned vector that represents the position (order) of each token.
X = token_embedding + position_embedding
🧠 STEP 4: Add segment embeddings
BERT is originally designed for “sentence A + sentence B” input,
so it adds a 0/1 embedding depending on the sentence.
For single-sentence tasks like this one, all segments are 0.
X = token_embedding + position_embedding + segment_embedding
✔ So the X in this case is computed as follows:
Although your code does not explicitly write this,
BERT automatically computes:
X = E_token + E_position + E_segment
This is the embedding matrix (X) of the input sentence.
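A minimal sketch of that sum, assuming toy table sizes and hypothetical token IDs (the real tables are the pretrained ones inside the model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, n_segments, hidden = 50, 16, 2, 8  # toy sizes, not BERT's

# Three lookup tables (random stand-ins for the pretrained ones).
token_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_positions, hidden))
segment_table = rng.normal(size=(n_segments, hidden))

input_ids = np.array([1, 7, 23, 7, 2])    # hypothetical IDs; 7 appears twice
position_ids = np.arange(len(input_ids))  # 0, 1, 2, 3, 4
segment_ids = np.zeros_like(input_ids)    # single sentence -> all 0

E_token = token_table[input_ids]          # row lookup by token ID
E_position = position_table[position_ids] # row lookup by position
E_segment = segment_table[segment_ids]    # row lookup by segment (all row 0)

X = E_token + E_position + E_segment      # shape (5, 8)
print(X.shape)
```

The repeated ID 7 gets the same token embedding at positions 1 and 3, but the two rows of X differ because their position embeddings differ. In the HuggingFace implementation the sum is additionally passed through LayerNorm and dropout before the first layer, which this sketch leaves out.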
🔍 After that, Q/K/V are computed:
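Q = X W_Q, K = X W_K, V = X W_V
(in the first layer; each later layer applies its own W_Q / W_K / W_V to the previous layer’s output)
The W_Q / W_K / W_V used here are pretrained fixed parameters.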
✨ Summary
X is the “embedding matrix of the input sentence.”
Its components are:
- Token embedding
- Positional embedding
- Segment embedding
BERT computes this internally.
Then BERT multiplies X by pretrained W_Q/W_K/W_V to create Q/K/V.
From there, Attention → CLS is computed.
Question about differences from GloVe and FastText
How do you convert IDs into embedding vectors? Is this like GloVe or FastText?
Copilot’s Answer
Understanding this correctly will instantly clarify the difference between
“BERT embeddings” and “Word2Vec / GloVe / FastText.”
To give the conclusion:
🎯 **The conversion from ID → embedding vector is NOT GloVe or FastText.
It uses BERT’s own internal pretrained embedding matrix.**
Meaning:
GloVe / FastText
→ External “word vector dictionaries” that you load manually
BERT embeddings
→ A “pretrained embedding layer” stored inside BERT itself
These two look similar but are fundamentally different in mechanism and training.
🧠 So how does BERT convert ID → vector?
✔ ① The tokenizer converts words into IDs
Example:
[CLS] our deeds are the reason ... [SEP]
→ Split into WordPiece tokens
→ Convert to IDs (a sequence of integers)
input_ids = [101, 2256, 6709, 2024, ... , 102]
✔ ② BERT has an internal “embedding matrix”
BERT contains a huge matrix:
vocab_size × hidden_size = 30522 × 768
This is the Embedding Matrix.
Meaning:
row index = token ID
row contents = 768-dimensional vector
✔ ③ Simply “look up the row” to get the embedding
embedding_vector = embedding_matrix[input_id]
This is the token embedding.
🔥 Key point:
BERT’s representations differ fundamentally from GloVe or FastText.
Item            | GloVe / FastText                                   | BERT
---------------------------------------------------------------
Embedding type  | Static (one vector per word)                       | Static lookup at the input; contextual after the Transformer layers
Training method | Co-occurrence (GloVe) / subword n-grams (FastText) | Pretrained together with the Transformer
Where stored?   | External file                                      | Inside BERT
Usage           | word → vector dictionary                           | ID → Embedding Layer
Most importantly:
The ID → vector lookup itself is context-independent (the same ID always returns the same row), just like GloVe/FastText.
What becomes *contextual* is each token’s hidden state after it passes through the Self-Attention layers.
This contextualization is why BERT is usually more powerful.
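One way to see the difference between a static lookup and a contextual representation is the toy sketch below. It assumes a random embedding table, random projection matrices, a hypothetical token ID for "fire", and a single attention step without position embeddings, so it only illustrates the principle.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product attention over one sentence.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
hidden = 8
table = rng.normal(size=(20, hidden))  # stand-in embedding matrix
W_Q, W_K, W_V = (rng.normal(size=(hidden, hidden)) for _ in range(3))

FIRE = 5                               # hypothetical token ID
sentence_a = np.array([3, FIRE, 9])    # e.g. "forest fire nearby"
sentence_b = np.array([11, FIRE, 14])  # e.g. "you are fire today"

# Static: the lookup returns the same row no matter which sentence it is in.
static_a, static_b = table[sentence_a][1], table[sentence_b][1]

# Contextual: after attention, the vector at the same position depends on its neighbors.
ctx_a = self_attention(table[sentence_a], W_Q, W_K, W_V)[1]
ctx_b = self_attention(table[sentence_b], W_Q, W_K, W_V)[1]

print("same lookup row:   ", np.array_equal(static_a, static_b))
print("same after context:", np.allclose(ctx_a, ctx_b))
```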
✨ Summary
- ID → vector conversion is done by BERT’s internal Embedding Layer
- GloVe / FastText are NOT used
- BERT’s output representations (after the Transformer layers) are contextual and strong
X is:
- token embedding
- position embedding
- segment embedding
added together
Then BERT multiplies X by W_Q/W_K/W_V to create Q/K/V.
Implementation of visualization code for computing embedding matrix X
Please output code that visualizes Token / Position / Segment embeddings individually so I can understand how each works. Assume BERT and the text are already loaded, and provide only the continuation that can be added to the previous code. I would appreciate it if the computation steps are displayed in detail (especially the Position embedding).
Copilot’s Answer
# =========================
# Extract and visualize Token / Position / Segment embeddings individually
# (Assumes BERT, tokenizer, device, model, train_df, etc. already exist)
# =========================
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
~ Omitted below — full code is linked further down ~
Question about how CLS is computed
If I understand how CLS is computed, I feel like I’ll grasp the whole picture. How exactly is CLS calculated?
Copilot’s Answer
This is the “final missing piece” that completes your understanding of Transformers.
Once you understand how CLS is computed,
you will clearly see what is happening inside BERT.
To give the conclusion first:
🎯 **CLS is not a “special-case token.”
It goes through exactly the same computations as all other tokens,
but it is trained to “absorb information from the entire sentence” through Self-Attention.**
Meaning:
- CLS starts as just a learnable vector (randomly initialized)
- During training, it is optimized to serve as a summary of the entire sentence
- It undergoes the same computations as all other tokens
- The final-layer CLS hidden state becomes the “sentence meaning vector”
Let’s explain step by step.
🧩 ① CLS first enters X as an “embedding vector”
The input sentence becomes:
[CLS] our deeds are ... [SEP]
BERT internally computes:
X = E_token + E_position + E_segment
Here, CLS is treated like any other token:
- token embedding (learned)
- position embedding (position 0)
- segment embedding (sentence A)
These are added together to form X[0].
So CLS starts as just a vector.
🧩 ② CLS goes through Self-Attention exactly like other tokens
In each layer:
Q = X W_Q
K = X W_K
V = X W_V
CLS also has its own Q/K/V.
Then Attention is computed:
context_CLS = Σ α_CLS,j V_j
Here is the key point:
✔ CLS is trained to “gather information from the entire sentence.”
As training progresses:
- CLS learns to attend to important words (fire, earthquake, flood…)
- It incorporates that information into its own vector
After stacking 12 layers,
CLS becomes a compressed representation of the entire sentence.
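The CLS row of that computation can be sketched in isolation. The Q/K/V values here are random, so the resulting ranking carries no meaning; with real BERT tensors the same few lines would show which tokens [CLS] attends to in a given head.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["[CLS]", "forest", "fire", "near", "la", "[SEP]"]
rng = np.random.default_rng(0)
d_k = 8  # toy head size
Q = rng.normal(size=(len(tokens), d_k))
K = rng.normal(size=(len(tokens), d_k))
V = rng.normal(size=(len(tokens), d_k))

# Attention of [CLS] (row 0) over every token in the sentence.
alpha_cls = softmax(Q[0] @ K.T / np.sqrt(d_k))

# context_CLS = sum over j of alpha_CLS,j * V_j
context_cls = alpha_cls @ V

# Rank tokens by how much attention [CLS] gives them.
ranking = sorted(zip(tokens, alpha_cls), key=lambda pair: -pair[1])
for tok, weight in ranking:
    print(f"{tok:>6s}  {weight:.3f}")
```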
🧩 ③ CLS passes through FFN and LayerNorm
After Self-Attention:
- Residual connection
- LayerNorm
- FFN (2-layer MLP)
- Residual connection
- LayerNorm
CLS undergoes the same processing as all other tokens.
🧩 ④ After 12 layers, CLS becomes the “sentence meaning vector”
The final-layer CLS (768 dimensions) contains:
- contextual information
- important word information
- sentence structure
- features indicating “disaster-likeness”
Therefore, in classification:
logit = W_cls · CLS_final + b
A single vector is enough to make the decision.
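A minimal sketch of that last step, assuming a random stand-in for BERT's final hidden states and hypothetical, untrained classifier weights (so the predicted label itself is meaningless):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
seq_len, hidden = 10, 768

# Stand-in for the final-layer output of BERT for one tweet: (tokens, 768).
final_hidden = rng.normal(size=(seq_len, hidden))

# [CLS] is always the first token, so its vector is row 0.
cls_final = final_hidden[0]

# Hypothetical untrained classifier parameters.
W_cls = rng.normal(scale=0.01, size=hidden)
b = 0.0

logit = W_cls @ cls_final + b  # logit = W_cls · CLS_final + b
prob = sigmoid(logit)          # probability of "disaster"
label = int(prob >= 0.5)       # 0/1 prediction

print(f"logit={logit:.4f}  prob={prob:.4f}  label={label}")
```

Training a logistic regression on the CLS vectors amounts to learning W_cls and b from the labeled tweets.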
🎯 Final Summary
- CLS starts as just an embedding vector
- Through Self-Attention, it absorbs information from other tokens
- After 12 layers, it becomes the sentence meaning vector
- Only the final-layer CLS is used for classification
- CLS is not treated specially in computation
- It is trained to become the “representative of the entire sentence”
About BERT’s training tasks (its “pretext tasks”)
Does the 12-layer neural network have some kind of pretext task? For example, predicting surrounding words?
Copilot’s Answer
BERT’s 12-layer Transformer is trained using “pretext tasks (pre-training tasks).”
In other words:
🎯 **The 12 layers of BERT did NOT magically learn to understand language on their own.
They were given *two specific tasks* during pretraining,
and the internal structure (Attention, CLS behavior, etc.) emerged as a result of learning to solve them.**
These two tasks are:
🧩 ① Masked Language Modeling (MLM) — Predicting a hidden word from surrounding context
This is close to what you described, with the direction reversed: instead of predicting the surrounding words, BERT predicts a hidden word from them.
Example:
I love eating [MASK] in the morning.
BERT uses:
“eating”
“in the morning”
“I love”
and other context to guess that [MASK] = “eggs.”
To solve this task:
- Self-Attention learns to “look at context”
- All tokens (not just CLS) learn to understand surrounding words
- Q/K/V weights evolve to capture “contextual meaning”
Thus, MLM is what gives BERT its *context understanding ability*.
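How the training examples for MLM are produced can be sketched as follows. The 15% selection rate and the 80/10/10 split follow the recipe in the BERT paper; the function name, the toy vocabulary, and the higher mask_prob used in the demo call are assumptions made for illustration.

```python
import random

MASK = "[MASK]"
VOCAB = ["fire", "flood", "storm", "coffee", "music"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok                        # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

tokens = ["[CLS]", "i", "love", "eating", "eggs", "in", "the", "morning", "[SEP]"]
# mask_prob is raised to 0.5 only so that this short sentence shows some masking.
corrupted, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not None, which is why every token, not just [CLS], is pushed to encode its context.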
🧩 ② Next Sentence Prediction (NSP) — Predict whether sentence B follows sentence A
Example:
Sentence A: The sky is full of smoke.
Sentence B: A wildfire is spreading nearby.
The task is to judge whether B naturally follows A.
To solve this:
- CLS learns to summarize the meaning of the entire sentence pair
- Segment Embeddings (A/B distinction) become necessary
- Attention learns to examine relationships between sentence A and B
Thus, NSP is what gives CLS its role as a “sentence summary vector.”
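The input layout for NSP can be sketched like this. The 50/50 choice between the true next sentence and a random one follows the BERT paper; the function name and the tiny corpus are hypothetical.

```python
import random

def build_nsp_example(sent_a, true_next, corpus, rng):
    """Pack a sentence pair as [CLS] A [SEP] B [SEP] with segment IDs and a label."""
    if rng.random() < 0.5:
        sent_b, label = true_next, "IsNext"
    else:
        # Pick an unrelated sentence that is neither A nor its true successor.
        candidates = [s for s in corpus if s != true_next and s != sent_a]
        sent_b, label = rng.choice(candidates), "NotNext"
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + final [SEP].
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, label

corpus = [
    ["the", "sky", "is", "full", "of", "smoke"],
    ["a", "wildfire", "is", "spreading", "nearby"],
    ["i", "love", "eating", "eggs"],
    ["my", "cat", "is", "asleep"],
]
rng = random.Random(0)
tokens, segment_ids, label = build_nsp_example(corpus[0], corpus[1], corpus, rng)
print(tokens)
print(segment_ids)
print(label)
```

For a single tweet there is no sentence B, so the input is just [CLS] A [SEP] and every segment ID is 0.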
🔥 So the answer to your question is:
“Does the 12-layer NN have some kind of pretext task?
Like predicting surrounding words?”
👉 YES. Exactly.
BERT’s 12 layers were trained to solve:
- MLM (predict masked words from context)
- NSP (judge whether sentence B follows sentence A)
While solving these tasks:
- Self-Attention evolved to understand context
- CLS evolved to summarize entire sentences
- Q/K/V weights were optimized to capture “meaning”
Relationship between single-sentence tasks and Next Sentence Prediction
In this tweet competition, each sample is basically a single sentence, so there is no “next sentence.” But NSP is a task used during BERT’s pretraining to learn parameters like Q/K/V — is that understanding correct?
Copilot’s Answer
Your understanding is almost correct.
But with just a tiny refinement, the mechanism of
“how NSP contributes to learning Q/K/V during BERT pretraining”
becomes completely clear.
Let’s break down what you said:
🎯 Conclusion
In this tweet competition, NSP is not used because the task is single-sentence.
However, NSP *is indeed one of the tasks used during BERT pretraining
to train W_Q/W_K/W_V and all other parameters.*
Meaning:
MLM (word prediction)
NSP (sentence relationship prediction)
These two tasks together train the entire 12-layer Transformer
(including W_Q/W_K/W_V and all other weights).
Your understanding is on the right track.
🧠 So what exactly does NSP do?
You asked:
“NSP is a task used during BERT’s pretraining to learn parameters like Q/K/V?”
👉 YES. Exactly.
But more precisely:
🧩 NSP trains CLS to become a “sentence summary vector”
The NSP task is:
Does sentence B follow sentence A? (IsNext)
Or are they unrelated? (NotNext)
To solve this:
- CLS must understand the relationship between A and B
- Attention must examine cross-sentence relationships
- Q/K/V weights must encode “sentence-level meaning”
Thus, NSP:
- Creates the role of CLS
- Shapes the structure of Attention
- Trains Q/K/V weights
🧩 Updating your understanding precisely:
✔ MLM (word prediction)
→ W_Q/W_K/W_V learn “relationships between words”
→ Builds contextual understanding
✔ NSP (sentence relationship prediction)
→ CLS learns to be a “sentence summary vector”
→ W_Q/W_K/W_V learn “relationships between sentences”
🔥 So even though NSP is not used in the tweet competition…
The Q/K/V weights learned through NSP are *still used as-is*.
Meaning:
- The Attention you are visualizing
- The CLS vector you are using as features
All of these abilities come from BERT’s pretraining on MLM + NSP.
This is the end of this conversation.
📊 Results (Code, Score, Ranking)
Here is the code created this time:
Score: 0.79282 (previous: 0.80937)
Rank 🏆: unchanged (previous: 314/693 = 45.3%)

📚 Insights
1. I gained a vague understanding of BERT’s internal computations and implemented submission code.
2. I learned the roles and positions of Attention and Embedding inside BERT.
3. I feel I got a bit closer to understanding LLMs, but multi-layer structures are quite difficult, so I want to continue patiently 😉
🔮 Next
I want to try improving the performance of the BERT submission code.



