Hw 1:
Problem (unicode1): Understanding Unicode (1 point)
- (a) null character
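- (b) String representation (__repr__()): it makes clear what the object is. For example, evaluating chr(0) at the interactive prompt echoes '\x00'.
  - Printed, chr(0) shows nothing at all.
- (c) Inside printed text chr(0) takes up no visible space; only the interactive echo shows \x00.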
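A small sketch of the difference between the echoed representation and the printed form of chr(0). The sample sentence is only an illustration, not taken from the notes.

```python
null = chr(0)

# repr() is what the interactive prompt echoes: an escaped, unambiguous form.
echoed = repr(null)
print(echoed)

# The character is invisible when printed, but it is still one code point.
text = "this is a test" + null + "string"
print(len(null))
print(repr(text))  # the escape sequence is visible here
print(text)        # the null character is written out but not displayed
```

The string still contains the character, so its length grows by one even though nothing visible is printed for it.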
Problem (unicode2): Unicode Encodings (3 points)
- (a): UTF-32 and UTF-16 produce longer byte sequences than UTF-8 (100:50:34) and contain many zero bytes, which may make them harder to learn from.
- (b): The function tries to decode each byte into a character on its own, which fails for any multi-byte character.
    >>> decode_utf8_bytes_to_str_wrong("牛".encode("utf-8"))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
- (c): UTF-8 is not dense: not every byte sequence is valid. [231, 137] does not correspond to any Unicode character.
    bytes([231, 137]).decode()
    # failed
Problem (train_bpe): BPE Tokenizer Training (15 points)
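Most painful change: 3e90c96
- Adjusted PAT so that several consecutive \n end up in one group.
- Adjusted the special token pattern so that special tokens swallow a trailing newline (this is what the test case expects; I don't know why).
- Finished in Rust. test ok.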
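The special-token adjustment could look roughly like this in Python. The actual change lives in the Rust code at the commit above and is not shown in these notes, so the helper name split_on_special_tokens and the exact pattern are assumptions made for illustration.

```python
import re


def split_on_special_tokens(text, special_tokens):
    """Split text on special tokens, letting each one swallow one trailing newline.

    Hypothetical helper: a sketch of the idea, not the author's implementation.
    """
    if not special_tokens:
        return [text]
    # Longest first, so that a special token that is a prefix of another
    # one does not win the alternation.
    ordered = sorted(special_tokens, key=len, reverse=True)
    alternation = "|".join(re.escape(tok) for tok in ordered)
    # Non-capturing group: the delimiters themselves are dropped by split().
    pattern = re.compile("(?:" + alternation + ")\n?")
    return [chunk for chunk in pattern.split(text) if chunk]


chunks = split_on_special_tokens(
    "story one<|endoftext|>\nstory two", ["<|endoftext|>"]
)
print(chunks)
```

Each chunk can then be pretokenized separately, so no merge ever crosses a special token.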
Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)
/usr/bin/time -v cargo test --release test_re4 -- --nocapture at c254184
    Percent of CPU this job got: 1991%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.70
    Maximum resident set size (kbytes): 2247060
That is about 2.2G.
uv run maturin develop --release --features pyo3-extension && /usr/bin/time -v uv run train_bpe_tinystories.py
    Percent of CPU this job got: 1505%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.23
    Maximum resident set size (kbytes): 2649476
The time limit is 30min, so this is about 180 times faster. The Python version is a bit slower, possibly because of serialization.
- Longest token: b' responsibility', which makes sense.
- b) Profile: pretokenization took six to seven seconds, while popping took only 0.5s.
Problem (train_bpe_expts_owt): BPE Training on OpenWebText (2 points)
- Because a buffer may be cut off in the middle of a character and then cannot be converted to UTF-8, I switched to matching on u8 (raw bytes).
- Some pretokens turn out to have a length of 120,000, which i16 cannot hold, so I switched straight to i64.
- longest: 19 b' telecommunications'
- command time -v uv run train_bpe_expts_owt.py
    Percent of CPU this job got: 283%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:17.27
    Maximum resident set size (kbytes): 14715528 (14G)

    The time limit is 100h, so this is about 1200 times faster
    yyzhang2025: ~30min
Problem (tokenizer): Implementing the tokenizer (15 points)
- Had Codex write all of it; test ok
Problem (tokenizer_experiments): Experiments with tokenizers (4 points)
- a): What is each tokenizer’s compression ratio (bytes/token)? Tiny: 4.09 Owt: 4.53
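- b): What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer?: the compression ratio drops to 3.34
- c): TinyStories tokenizer throughput: 3506110 bytes/second
- OWT tokenizer throughput: 3123067 bytes/second
- Estimated time: 235303s (65 hours)
- d): The vocabulary has 32000 entries in total, which is below 65536, so u16 is enough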
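The estimate in (c) and the claim in (d) can be checked with a few lines of arithmetic. The notes do not say which dataset the estimate refers to; the sketch assumes the 825 GB Pile from the assignment prompt, taken as 825e9 bytes, together with the TinyStories tokenizer throughput, since that combination reproduces the number above.

```python
# Assumption: the target dataset is 825 GB (825e9 bytes); not stated in the notes.
DATASET_BYTES = 825 * 10**9
TINY_THROUGHPUT = 3_506_110  # bytes/second, from (c)

seconds = DATASET_BYTES / TINY_THROUGHPUT
hours = seconds / 3600
print(int(seconds), round(hours, 1))


def fits_uint16(vocab_size):
    """Token ids run from 0 to vocab_size - 1; uint16 holds values up to 65535."""
    return vocab_size - 1 <= 2**16 - 1


print(fits_uint16(32_000))
```

Storing ids as uint16 halves the size of the encoded dataset compared with 32-bit integers, which is why the bound matters.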
Problem (linear): Implementing the linear module (1 point)
- ok
Problem (embedding): Implement the embedding module (1 point)
- ok
Problem (rmsnorm): Root Mean Square Layer Normalization (1 point)
- ok
Problem (rope): Implement RoPE (2 points)
- ok
Problem (softmax): Implement softmax (1 point)
- ok
Problem (scaled_dot_product_attention): Implement scaled dot-product attention (5 points)
- ok
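On how the math notation relates to the Linear parameters: Linear's in_features is b and out_features is a, i.e. the weight is an a × b matrix.
- Put differently, the original handout already has this.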
Problem (multihead_self_attention): Implement causal multi-head self-attention (5 points)
- ok (2 tests)
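- Where did I get stuck?
  - A mask used as an index does not broadcast automatically; use .expand_as().
  - Split into heads first, then apply RoPE. How did I find out? Only the first row matched in the test, which suggested that the positional embedding was broken; yet the RoPE test itself passed, so the way RoPE was being applied had to be wrong.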
Problem (transformer_block): Implement the Transformer block (3 points)
- Got stuck on a mistyped variable
Problem (transformer_lm): Implementing the Transformer LM (3 points)
- ok
- Got stuck because I had added an extra softmax; refactored the token_positions interface along the way
Problem (transformer_accounting): Transformer LM resource accounting (5 points)
- skip
    vocab_size=50257,
    context_length=512,
    d_model=1600,
    num_layers=48,
    num_heads=25,
    d_ff=6400,
    TransformerLM (2,127,057,600 params)
    ├── Embedding (80,411,200 params)
    ├── ModuleList (1,966,233,600 params)
    │   ├── TransformerBlock (40,963,200 params)
    │   │   ├── MultiheadSelfAttention (10,240,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   └── Linear (2,560,000 params)
    │   │   ├── RMSNorm (1,600 params)
    │   │   ├── SwiGLU (30,720,000 params)
    │   │   │   ├── Linear (10,240,000 params)
    │   │   │   ├── Linear (10,240,000 params)
    │   │   │   └── Linear (10,240,000 params)
    │   │   ├── RMSNorm (1,600 params)
    │   │   └── RotaryPositionalEmbedding
    │   ├── TransformerBlock (40,963,200 params)

After changing the context length to 512:
    TransformerLM (2,127,057,600 params)
    ├── Embedding (80,411,200 params)
    ├── ModuleList (1,966,233,600 params)
    │   ├── TransformerBlock (40,963,200 params)
    │   │   ├── MultiheadSelfAttention (10,240,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   ├── Linear (2,560,000 params)
    │   │   │   └── Linear (2,560,000 params)
    │   │   ├── RMSNorm (1,600 params)
    │   │   ├── SwiGLU (30,720,000 params)
    │   │   │   ├── Linear (10,240,000 params)
    │   │   │   ├── Linear (10,240,000 params)
    │   │   │   └── Linear (10,240,000 params)
    │   │   ├── RMSNorm (1,600 params)
    │   │   └── RotaryPositionalEmbedding
    │   ├── TransformerBlock (40,963,200 params)
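The totals in the printout can be re-derived by hand. The sketch below assumes bias-free Linear layers and two RMSNorms per block, as the tree shows, plus a final RMSNorm and an untied output projection; those last two are not visible in the truncated tree and are inferred from the grand total. count_params is a hypothetical helper, not part of the author's code.

```python
def count_params(vocab_size, d_model, num_layers, d_ff):
    """Re-derive the parameter counts of the printed module tree."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    swiglu = 3 * d_model * d_ff        # three d_model x d_ff matrices
    norms = 2 * d_model                # two RMSNorm gains per block
    block = attention + swiglu + norms
    final_norm = d_model               # assumed: not shown in the tree
    lm_head = d_model * vocab_size     # assumed: untied output projection
    total = embedding + num_layers * block + final_norm + lm_head
    return {
        "embedding": embedding,
        "attention": attention,
        "swiglu": swiglu,
        "block": block,
        "blocks": num_layers * block,
        "total": total,
    }


counts = count_params(vocab_size=50257, d_model=1600, num_layers=48, d_ff=6400)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

context_length is not an argument of the function at all: RoPE has no learned parameters, so the count is independent of it.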
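The parameter count does not change.
The only function whose compute grows quadratically in n: computing self-attention

    from jaxtyping import Bool, Float
    from torch import Tensor

    # softmax is the implementation from the softmax problem above.
    def scaled_dot_product_attention(
        q: Float[Tensor, "b ... seq_len d_k"],
        k: Float[Tensor, "b ... seq_len d_k"],
        v: Float[Tensor, "b ... seq_len d_v"],
        mask: Bool[Tensor, "b ... seq_len seq_len"] | None,
    ) -> Float[Tensor, "b ... seq_len d_v"]:
        d_k = q.shape[-1]
        attn = q @ k.mT / d_k ** 0.5  # [b, ..., q_len, k_len]
        if mask is not None:
            # A boolean mask used as an index does not broadcast, so expand it first.
            mask = mask.expand_as(attn)
            # Positions where mask is False get -inf and vanish after softmax.
            attn[~mask] -= float('inf')
        # v: k_len, d_v
        return softmax(attn, dim=-1) @ v
Problem (cross_entropy): Implement Cross entropy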
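- Use LogSumExp.
- Time sinks: getting the dimensions wrong (use gather, and note that reductions such as sum run over all dimensions by default), and dropping a log while deriving the formula myself.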
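A minimal pure-Python sketch of the LogSumExp formulation, written with plain lists rather than tensors so that it runs without PyTorch. The function names are illustrative; picking row[target] plays the role that gather plays in the tensor version.

```python
import math


def log_sum_exp(xs):
    """log(sum(exp(x))) computed stably by subtracting the maximum first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cross_entropy(logits, targets):
    """Mean over the batch of -log softmax(logits)[target].

    Per example: loss = logsumexp(logits) - logits[target].
    """
    losses = [log_sum_exp(row) - row[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)


print(cross_entropy([[0.0, 0.0]], [0]))     # uniform over two classes
print(cross_entropy([[1000.0, 0.0]], [0]))  # would overflow without the max trick
```

Subtracting the maximum keeps every exponent at or below zero, so exp never overflows, and the reduction is explicit: one loss per row, then a mean over the batch only.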
Problem (learning_rate_tuning): Tuning the learning rate (1 point)
- In the toy SGD example, lr=1, 1e1, 1e2 converge successively faster, while 1e3 diverges.
see /toy/sgd.py
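The behaviour can be reproduced without PyTorch. The sketch below is not the contents of /toy/sgd.py; it assumes the toy setup from the handout (loss = mean of squared weights over 100 parameters, step size lr / sqrt(t + 1)) and, for determinism, starts every weight at 5.0 instead of a random draw.

```python
import math


def toy_sgd_loss(lr, steps=10, w0=5.0, n=100):
    """Run decayed-step SGD on loss = mean(w_i^2) and return the final loss."""
    w = [w0] * n
    for t in range(steps):
        grad = [2.0 * wi / n for wi in w]  # d/dw_i of mean(w^2)
        step = lr / math.sqrt(t + 1)       # step size decays with the iteration
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return sum(wi * wi for wi in w) / n


initial_loss = 5.0 ** 2
losses = {lr: toy_sgd_loss(lr) for lr in (1, 10, 100, 1000)}
for lr, loss in losses.items():
    print(f"lr={lr:g}: loss after 10 steps = {loss:.4g}")
```

Each update multiplies every weight by 1 - 0.02 * lr / sqrt(t + 1). For lr up to 1e2 that factor has magnitude at most 1, so the loss shrinks faster as lr grows; for lr = 1e3 the factor starts at -19, so the weights blow up.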
Problem (adamw): Implement AdamW (2 points)
- test ok
Problem (adamwAccounting): Resource accounting for training with AdamW
- skip
Problem (learning_rate_schedule): Implement cosine learning rate schedule with warmup
- test ok
Problem (gradient_clipping): Implement gradient clipping (1 point)
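- Compute the L2 norm over the gradients of all parameters of the whole network; if it exceeds lim, scale everything uniformly so that this L2 norm becomes lim.
- test ok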
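The clipping rule described above can be sketched in pure Python, with each parameter's gradient represented as a flat list of floats. The function name and the small eps in the denominator are assumptions for illustration; the tested implementation operates on tensors in place.

```python
import math


def clip_gradients(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor if their global L2 norm exceeds max_norm.

    grads: list of flat lists, one per parameter (a stand-in for tensors).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / (total_norm + eps)  # eps guards against division issues
    return [[g * scale for g in tensor] for tensor in grads]


clipped = clip_gradients([[3.0], [4.0]], max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(clipped, clipped_norm)
```

The norm is taken over all gradients jointly, not per parameter, so the direction of the overall update is preserved and only its length changes.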
Problem (data_loading): Implement data loading (2 points)
- test ok
Problem (checkpointing): Implement model checkpointing (1 point)
- test ok
Problem (training_together): Put it together (4 points)
- ok (no test)
Problem (decoding): Decoding (3 points)
- ok (no test)
Problem (experiment_log): Experiment logging (3 points)
- ok (no test)
- Good enough.