Takeaways

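The sum of squares in RMSNorm can overflow fp16, so upcast to fp32 for the computation and downcast afterwards.
If the attention scores are not divided by sqrt(d_k), the attention distribution becomes too peaked.
[fable5] Running AdamW without warmup can leave the second-moment estimate inaccurate early on, which can make the model diverge.
Do not use dropout during pretraining.
The initial loss should be roughly loss ≈ -ln(1 / vocab_size) = ln(vocab_size).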
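A quick way to see the fp16 overflow from the first takeaway, as a minimal sketch using only the standard library. The `struct` format code "e" is IEEE half precision; the helper name `fits_in_fp16` and the example activations (1600 entries of 8.0) are made up for illustration.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be stored as a finite float16."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


# Hypothetical activations: d_model = 1600 entries, each only 8.0.
activations = [8.0] * 1600
sum_of_squares = sum(a * a for a in activations)  # accumulated in float64 here

print(fits_in_fp16(max(activations)))  # every individual input fits in fp16
print(fits_in_fp16(sum_of_squares))    # the accumulated sum does not
print(sum_of_squares > FP16_MAX)
```

The inputs are small, but the accumulated sum is already past the fp16 range, which is why the reduction is done in a wider type.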

Hw 1:

Problem (unicode1): Understanding Unicode (1 point)

(a) null character
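(b) string representation: it shows unambiguously what the object is. For example, entering chr(0) at the interactive prompt echoes '\x00'.
- The echoed form is __repr__(); print(chr(0)) displays nothing visible.
(c) Inside text, chr(0) takes up no visible space when printed. Only the interactive echo shows \x00.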
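The difference between the two views can be checked directly. A minimal sketch; the sample sentence is only an illustration:

```python
null = chr(0)

# repr() is what the interactive prompt echoes: an escaped, unambiguous form.
print(repr(null))  # '\x00'

# print() emits the raw character, which most terminals render as nothing.
text = "this is a test" + null + "string"
print(text)
print(repr(text))  # the escape sequence is visible again

# The character is still present and still counts toward the length.
print(len(text))
```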

Problem (unicode2): Unicode Encodings (3 points)

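(a): UTF-32 and UTF-16 encodings are longer than UTF-8 (100:50:34) and contain many zero bytes, which may make them harder to learn from.
(b): The function tries to decode each byte into a character on its own, which breaks on multi-byte characters.

>>> decode_utf8_bytes_to_str_wrong("牛".encode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data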

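(c): Not every byte sequence is valid UTF-8. [231, 137] (0xE7 0x89) does not decode to any Unicode character: 0xE7 opens a three-byte sequence, and only one continuation byte follows.

bytes([231, 137]).decode()
# failed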
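The three answers above can be reproduced with a few lines. This is a sketch with made-up helper names (`decode_per_byte`, `decode_whole`, `is_valid_utf8`); the per-byte variant only imitates the behaviour described in (b) and is not copied from the handout.

```python
def decode_per_byte(data: bytes) -> str:
    """Decode each byte on its own; only correct for pure-ASCII input."""
    return "".join(bytes([b]).decode("utf-8") for b in data)


def decode_whole(data: bytes) -> str:
    """Decode the full byte string at once so multi-byte characters stay intact."""
    return data.decode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


encoded = "牛".encode("utf-8")            # three bytes: e7 89 9b
print(decode_whole(encoded))               # round-trips
print(decode_per_byte(b"hello"))           # ASCII happens to work byte by byte
print(is_valid_utf8(bytes([231, 137])))    # lead byte with one continuation byte missing

# Encoded sizes for an ASCII-only string (little-endian variants, so no BOM).
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, len("hello".encode(codec)))
```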

Problem (train_bpe): BPE Tokenizer Training (15 points)

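Nastiest change: 3e90c96

Adjusted PAT so that runs of consecutive \n are grouped into one pretoken.
Adjusted the special-token pattern so that special tokens swallow trailing newlines (the test cases behave this way; I do not know why).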
Finished in Rust. test ok.
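One possible way to get the "special token swallows trailing newlines" behaviour, sketched in Python with the standard `re` module. The function name and the exact pattern are assumptions for illustration; the Rust code in the commit above may do this differently.

```python
import re


def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    """Split text at special tokens, also consuming newlines right after each token."""
    if not special_tokens:
        return [text]
    # Longest first, so a longer token wins over a shorter one that is its prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    alternatives = "|".join(re.escape(tok) for tok in ordered)
    pattern = re.compile(rf"(?:{alternatives})\n*")
    # Drop empty chunks that appear when a token sits at the start or end.
    return [chunk for chunk in pattern.split(text) if chunk]


doc = "Once upon a time.<|endoftext|>\n\nThe end."
print(split_on_special_tokens(doc, ["<|endoftext|>"]))
```

Each returned chunk can then be pretokenized independently, so no merge ever crosses a document boundary.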

Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)

/usr/bin/time -v cargo test --release test_re4 -- --nocapture (at commit c254184)

  Percent of CPU this job got: 1991%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.70
  Maximum resident set size (kbytes): 2247060 (about 2.2 GB)

uv run maturin develop --release --features pyo3-extension && /usr/bin/time -v uv run train_bpe_tinystories.py

  Percent of CPU this job got: 1505%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.23  (the time limit is 30 min, so this is roughly 176x faster)
  Maximum resident set size (kbytes): 2649476

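The Python version is a bit slower, possibly because of serialization.

Longest token: b' responsibility', which makes sense.
b) Profile: pretokenization took six to seven seconds, while popping took only 0.5 s.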
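The speedup figures relative to the 30 min limit can be recomputed from the wall-clock strings above. A small sketch; the helper names are made up:

```python
def elapsed_to_seconds(elapsed: str) -> float:
    """Convert GNU time's 'h:mm:ss' or 'm:ss.xx' wall-clock string to seconds."""
    seconds = 0.0
    for field in elapsed.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds


def speedup(limit_seconds: float, elapsed: str) -> float:
    """How many times faster than the limit a run was."""
    return limit_seconds / elapsed_to_seconds(elapsed)


LIMIT = 30 * 60  # the 30 min limit mentioned above, in seconds
print(round(speedup(LIMIT, "0:07.70")))  # Rust test run
print(round(speedup(LIMIT, "0:10.23")))  # run through the Python extension
```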

Problem (train_bpe_expts_owt): BPE Training on OpenWebText (2 points)

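Because a buffer may be cut in the middle of a character and then fail to convert to UTF-8, I switched to matching on u8 (raw bytes) here.
Some pretokens turned out to be as long as 120,000, so i16 was not enough; I switched straight to i64.
longest: 19 b' telecommunications'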
command time -v uv run train_bpe_expts_owt.py

    Percent of CPU this job got: 283%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:17.27
    Maximum resident set size (kbytes): 14715528 (about 14 GB)

The time limit is 100 h, so this is roughly 1,100x faster.
yyzhang2025: ~30min
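The two implementation notes in this section (truncated buffers and the i16 overflow) can be illustrated in Python. This is a sketch, not the Rust code: the pattern below is a toy stand-in rather than the real PAT, and the sample data is made up.

```python
import re

data = "牛牛".encode("utf-8")   # two characters, three bytes each
chunk = data[:4]                # a buffer boundary landing inside the second character

try:
    chunk.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)

# A bytes pattern never needs the chunk to be valid UTF-8.
# (Toy pattern for illustration only.)
TOY_PAT = re.compile(rb"\s+|\S+")
pieces = TOY_PAT.findall(b"ab  " + chunk)
print(pieces)

# Range check: a value of 120,000 against the i16 and i64 maxima.
I16_MAX = 2**15 - 1
I64_MAX = 2**63 - 1
print(120_000 <= I16_MAX, 120_000 <= I64_MAX)
```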

Problem (tokenizer): Implementing the tokenizer (15 points)

Had Codex write all of it; test ok.

Problem (tokenizer_experiments): Experiments with tokenizers (4 points)

a): What is each tokenizer’s compression ratio (bytes/token)? Tiny: 4.09 Owt: 4.53
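b): What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer?: the compression ratio drops to only 3.34
c): Tiny throughput: 3506110 bytes/second
- Owt throughput: 3123067 bytes/second
- Estimated time: 235303 s (about 65 hours)
d): The vocabulary has 32000 entries in total, so u16 (values < 65536) is enough.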
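The estimate in c) and the dtype choice in d) are plain arithmetic. In the sketch below the 825 GB corpus size is an assumption: it is not stated in these notes, but it is the size that reproduces the 235303 s figure from the Tiny throughput.

```python
TINY_BYTES_PER_SECOND = 3_506_110   # measured TinyStories-tokenizer throughput
CORPUS_BYTES = 825 * 10**9          # ASSUMPTION: an 825 GB corpus, not stated in the notes

seconds = CORPUS_BYTES / TINY_BYTES_PER_SECOND
print(int(seconds), round(seconds / 3600, 1))

VOCAB_SIZE = 32_000
UINT16_MAX = 2**16 - 1
UINT8_MAX = 2**8 - 1
print(VOCAB_SIZE - 1 <= UINT16_MAX)  # the largest token id fits in 16 unsigned bits
print(VOCAB_SIZE - 1 <= UINT8_MAX)   # 8 bits would not be enough
```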

Problem (linear): Implementing the linear module (1 point)

Problem (embedding): Implement the embedding module (1 point)

Problem (rmsnorm): Root Mean Square Layer Normalization (1 point)

Problem (rope): Implement RoPE (2 points)

Problem (softmax): Implement softmax (1 point)

Problem (scaled_dot_product_attention): Implement scaled dot-product attention (5 points)

On how the math notation maps to the Linear parameters: Linear's in_features is b and out_features is a.

Or rather, the original text already states this.

Problem (multihead_self_attention): Implement causal multi-head self-attention (5 points)

ok (2 tests)
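Where did I get stuck?
1. A mask used as an index does not broadcast automatically; use .expand_as().
2. Split into heads first, then apply RoPE. How did I find out? Only the first row matched in the test, which most likely meant the positional embedding was broken; yet the RoPE test itself passed, so the way RoPE was applied had to be wrong.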
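Point 2 can be reproduced without any framework. The sketch below assumes the usual RoPE formulation (consecutive pairs rotated by pos / theta^(2k/d), theta = 10000) and uses made-up helper names and a toy d_model of 8. At position 0 every rotation is the identity, which fits the observation that only the first row matched.

```python
import math


def rope(vec: list[float], pos: int, theta: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos / theta ** (i / d)   # pair k = i / 2 uses theta^(-2k/d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def split_heads(vec: list[float], num_heads: int) -> list[list[float]]:
    d_head = len(vec) // num_heads
    return [vec[h * d_head:(h + 1) * d_head] for h in range(num_heads)]


def heads_then_rope(vec, pos, num_heads):
    """Intended order: frequencies are defined over d_head."""
    return [rope(head, pos) for head in split_heads(vec, num_heads)]


def rope_then_heads(vec, pos, num_heads):
    """Buggy order: frequencies are defined over the full d_model."""
    return split_heads(rope(vec, pos), num_heads)


def flatten(heads):
    return [x for head in heads for x in head]


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy d_model = 8, two heads of size 4
for pos in (0, 3):
    a = flatten(heads_then_rope(q, pos, 2))
    b = flatten(rope_then_heads(q, pos, 2))
    same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
    print(pos, same)
```

A standalone RoPE test cannot catch this, because `rope` itself is correct in both orders; only the dimension it is applied over differs.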

Problem (transformer_block): Implement the Transformer block (3 points)

Got stuck on a mistyped variable.

Problem (transformer_lm): Implementing the Transformer LM (3 points)

ok
Got stuck because I had added an extra softmax; also refactored the token_positions interface along the way.

Problem (transformer_accounting): Transformer LM resource accounting (5 points)

skip

    vocab_size=50257,
    context_length=512,
    d_model=1600,
    num_layers=48,
    num_heads=25,
    d_ff=6400,
TransformerLM (2,127,057,600 params)
├── Embedding (80,411,200 params)
├── ModuleList (1,966,233,600 params)
│   ├── TransformerBlock (40,963,200 params)
│   │   ├── MultiheadSelfAttention (10,240,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │       └── Linear (2,560,000 params)
│   │   ├── RMSNorm (1,600 params)
│   │   ├── SwiGLU (30,720,000 params)
│   │   │   ├── Linear (10,240,000 params)
│   │   │   ├── Linear (10,240,000 params)
│   │       └── Linear (10,240,000 params)
│   │   ├── RMSNorm (1,600 params)
│       └── RotaryPositionalEmbedding
│   ├── TransformerBlock (40,963,200 params)

With the context length changed to 512:
TransformerLM (2,127,057,600 params)
├── Embedding (80,411,200 params)
├── ModuleList (1,966,233,600 params)
│   ├── TransformerBlock (40,963,200 params)
│   │   ├── MultiheadSelfAttention (10,240,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │   │   ├── Linear (2,560,000 params)
│   │       └── Linear (2,560,000 params)
│   │   ├── RMSNorm (1,600 params)
│   │   ├── SwiGLU (30,720,000 params)
│   │   │   ├── Linear (10,240,000 params)
│   │   │   ├── Linear (10,240,000 params)
│   │       └── Linear (10,240,000 params)
│   │   ├── RMSNorm (1,600 params)
│       └── RotaryPositionalEmbedding
│   ├── TransformerBlock (40,963,200 params)

The parameter count does not change.
The only function whose compute grows quadratically (n^2) with sequence length: computing self-attention.

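from jaxtyping import Bool, Float
from torch import Tensor

def scaled_dot_product_attention(
    q: Float[Tensor, "b ... seq_len d_k"],
    k: Float[Tensor, "b ... seq_len d_k"],
    v: Float[Tensor, "b ... seq_len d_v"],
    mask: Bool[Tensor, "b ... seq_len seq_len"] | None,
) -> Float[Tensor, "b ... seq_len d_v"]:
    d_k = q.shape[-1]
    # Scaled scores, shape [b, ..., q_len, k_len]; this matmul is the O(seq_len^2) part.
    attn = q @ k.mT / d_k ** 0.5
    if mask is not None:
        # Boolean-mask indexing does not broadcast, so expand the mask first.
        mask = mask.expand_as(attn)
        attn[~mask] = float("-inf")  # positions where mask is False get zero weight
    # v: [..., k_len, d_v]; softmax is assumed to be the one from the softmax problem.
    return softmax(attn, dim=-1) @ v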
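The totals in the tree above can be re-derived by hand. A minimal sketch with made-up function names; the final RMSNorm and the output projection are inferred from the total (they are not visible in the truncated tree), and the FLOP count uses the usual 2·m·n·p convention for a matmul.

```python
def transformer_lm_params(vocab_size, context_length, d_model, num_layers, d_ff):
    """Parameter count for the architecture in the tree above (untied embeddings)."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    swiglu = 3 * d_model * d_ff         # three Linear layers
    norms = 2 * d_model                 # two RMSNorm gains per block
    block = attention + swiglu + norms
    final_norm = d_model                # inferred from the total
    lm_head = d_model * vocab_size      # inferred from the total
    # context_length is deliberately unused: RoPE has no learned parameters.
    return embedding + num_layers * block + final_norm + lm_head


def attention_matmul_flops(seq_len, d_model):
    """FLOPs of Q @ K^T plus attn @ V for one layer, summed over heads."""
    return 2 * (2 * seq_len * seq_len * d_model)


for ctx in (512, 1024):
    print(ctx, transformer_lm_params(50257, ctx, 1600, 48, 6400))

# Doubling the sequence length quadruples the attention matmul cost.
print(attention_matmul_flops(1024, 1600) // attention_matmul_flops(512, 1600))
```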

Problem (cross_entropy): Implement Cross entropy

use LogSumExp
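Time sinks: getting the dimensions wrong (use gather; also, sum reduces over all dimensions by default), and dropping a log partway through my own derivation.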
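For reference, the LogSumExp form written out in plain Python. This is a sketch on nested lists with made-up function names, not the tensor implementation; picking `row[target]` plays the role of gather, and the average is taken over the batch only.

```python
import math


def log_sum_exp(values: list[float]) -> float:
    """log(sum(exp(v))) computed stably by subtracting the maximum first."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cross_entropy(logits: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-likelihood over a batch of logit rows."""
    losses = [
        log_sum_exp(row) - row[target]   # picking row[target] is the gather step
        for row, target in zip(logits, targets)
    ]
    return sum(losses) / len(losses)     # average over the batch dimension only


# Uniform logits give ln(vocab_size), the expected initial loss.
vocab_size = 10_000
print(cross_entropy([[0.0] * vocab_size], [42]), math.log(vocab_size))

# Large logits do not overflow thanks to the max subtraction.
print(cross_entropy([[1000.0, 0.0]], [0]))
```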

Problem (learning_rate_tuning): Tuning the learning rate (1 point)

In the toy SGD example, lr=1, 1e1, 1e2 converge progressively faster, while 1e3 diverges. See /toy/sgd.py.
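The same qualitative behaviour shows up in a stripped-down version. The setup below is an assumption, not the contents of /toy/sgd.py: 100 weights all starting at 5.0, loss = mean(w^2), and a step size of lr / sqrt(t + 1).

```python
import math


def toy_sgd_loss(lr: float, steps: int = 10, n: int = 100, init: float = 5.0) -> float:
    """Run SGD with step size lr / sqrt(t + 1) on loss = mean(w**2); return the final loss."""
    weights = [init] * n
    for t in range(steps):
        grads = [2.0 * w / n for w in weights]   # gradient of mean(w**2)
        step = lr / math.sqrt(t + 1)
        weights = [w - step * g for w, g in zip(weights, grads)]
    return sum(w * w for w in weights) / n


initial_loss = 5.0 ** 2
for lr in (1.0, 1e1, 1e2, 1e3):
    print(lr, toy_sgd_loss(lr))
```

With these settings each update multiplies every weight by 1 - 2·lr / (n·sqrt(t + 1)); for lr = 1e3 that factor has magnitude above 1 during the first steps, so the loss grows instead of shrinking.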

Problem (adamw): Implement AdamW (2 points)

test ok

Problem (adamwAccounting): Resource accounting for training with AdamW

skip

Problem (learning_rate_schedule): Implement cosine learning rate schedule with warmup

test ok

Problem (gradient_clipping): Implement gradient clipping (1 point)

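Compute one L2 norm over the gradients of all parameters in the network; if it exceeds lim, scale every gradient by the same factor so that the combined L2 norm becomes lim.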
test ok
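The rule above, sketched on plain lists of floats. The function name and the small `eps` in the denominator are illustrative choices, not taken from the tested implementation.

```python
import math


def clip_gradients(grads: list[list[float]], max_norm: float, eps: float = 1e-6) -> list[list[float]]:
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / (total_norm + eps)   # eps guards against division by zero
    return [[g * scale for g in tensor] for tensor in grads]


# Two "parameters" whose gradients have a combined norm of 5.
grads = [[3.0], [4.0]]
clipped = clip_gradients(grads, max_norm=1.0)
print(clipped)
print(math.sqrt(sum(g * g for tensor in clipped for g in tensor)))
```

Because a single factor is applied everywhere, the direction of the overall gradient is preserved; only its length changes.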

Problem (data_loading): Implement data loading (2 points)

test ok

Problem (checkpointing): Implement model checkpointing (1 point)

test ok

Problem (training_together): Put it together (4 points)

ok (no test)

Problem (decoding): Decoding (3 points)

ok (no test)

Problem (experiment_log): Experiment logging (3 points)

ok (no test)
Good enough; calling it done.

336 hw1
