Notes on Knowledge Distillation

2024-07-12

Distillation

Model distillation is a method for improving the performance of a small model by transferring the knowledge of a large model (the teacher model) to the small model (the student model). Distillation usually takes one of the following forms:

1. Soft-Label Distillation

The student model is trained on the teacher model's soft labels, so that it learns the teacher's output distribution rather than only the hard class labels.

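import torch
import torch.nn as nn

# Define the teacher and student models (left elided in the original)
teacher_model = ...
student_model = ...

# Not defined in the original: `inputs` is a batch of input tensors and
# `optimizer` is an optimizer over student_model.parameters().
# `temperature` softens both distributions; the value here is an assumed example.
temperature = 2.0

# Define the loss function. KLDivLoss expects log-probabilities as its input
# and probabilities as its target.
criterion = nn.KLDivLoss(reduction='batchmean')

# The teacher model generates the soft labels
teacher_model.eval()
with torch.no_grad():
    teacher_outputs = teacher_model(inputs)
soft_labels = torch.softmax(teacher_outputs / temperature, dim=1)

# The student model makes its prediction
student_outputs = student_model(inputs)
loss = criterion(torch.log_softmax(student_outputs / temperature, dim=1), soft_labels)

# Backpropagation and optimization
optimizer.zero_grad()  # clear gradients left over from the previous step
loss.backward()
optimizer.step()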
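The soft-label step described above can be tried end to end with a small, self-contained sketch. Everything concrete in it is an assumption made for illustration: the two toy models, the random input batch, the temperature of 2.0, the SGD learning rate, and the helper name `soft_label_loss`. It also multiplies the loss by the squared temperature, a common convention that the snippet above does not use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy models: a wider teacher and a narrower student, both 10-class classifiers.
teacher_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
student_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))


def soft_label_loss(student_logits, teacher_logits, temperature):
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_labels = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    # (an assumed convention, not part of the original snippet).
    return F.kl_div(log_student, soft_labels, reduction="batchmean") * temperature ** 2


inputs = torch.randn(8, 20)  # random stand-in for a real batch
temperature = 2.0            # assumed example value
optimizer = torch.optim.SGD(student_model.parameters(), lr=0.1)

# The teacher is frozen, so its logits are computed once without gradients.
teacher_model.eval()
with torch.no_grad():
    teacher_logits = teacher_model(inputs)

# A few optimization steps on the same batch, recording the loss each time.
losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = soft_label_loss(student_model(inputs), teacher_logits, temperature)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

In a real training loop the batch would come from a data loader, and the soft-label loss is often mixed with an ordinary cross-entropy loss on the ground-truth labels.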

2. Feature Distillation

The student model learns the teacher model's intermediate-layer feature representations, which improves the student model's performance.

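class FeatureExtractor(nn.Module):
    """Wraps a model and returns the output of all its children except the last one."""

    def __init__(self, model):
        super().__init__()
        # Drop the final child module (typically the classifier head).
        # This only works when the model's forward pass is a plain sequence of its children.
        self.features = nn.Sequential(*list(model.children())[:-1])

    def forward(self, x):
        return self.features(x)

teacher_feature_extractor = FeatureExtractor(teacher_model)
student_feature_extractor = FeatureExtractor(student_model)

# Get the feature representations; the teacher is frozen, so it needs no gradients
with torch.no_grad():
    teacher_features = teacher_feature_extractor(inputs)
student_features = student_feature_extractor(inputs)

# Define the feature distillation loss (MSELoss requires both tensors to have the same shape)
feature_distillation_loss = nn.MSELoss()(student_features, teacher_features)

# Backpropagation and optimization
optimizer.zero_grad()  # clear gradients left over from the previous step
feature_distillation_loss.backward()
optimizer.step()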
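The feature-matching step above assumes that teacher and student features have the same shape, which is rarely true when the student is smaller. One common workaround, sketched below, is a small learnable projection that maps student features to the teacher's width. The toy models, the `projector` layer, the layer widths, and the random batch are all assumptions made for illustration and do not come from the original snippet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy models; the last child of each is the classifier head.
teacher_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
student_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))

# Same idea as the FeatureExtractor class: keep every child except the last.
# The backbones share their parameters with the full models.
teacher_backbone = nn.Sequential(*list(teacher_model.children())[:-1])
student_backbone = nn.Sequential(*list(student_model.children())[:-1])

# Assumed addition: project student features (width 16) to the teacher width (64),
# because the MSE loss needs both tensors to have the same shape.
projector = nn.Linear(16, 64)

inputs = torch.randn(8, 20)  # random stand-in for a real batch
optimizer = torch.optim.SGD(
    list(student_backbone.parameters()) + list(projector.parameters()), lr=0.1
)

# The teacher is frozen, so its features are computed without gradients.
teacher_backbone.eval()
with torch.no_grad():
    teacher_features = teacher_backbone(inputs)

# One optimization step on the student backbone and the projector.
optimizer.zero_grad()
student_features = projector(student_backbone(inputs))
feature_loss = F.mse_loss(student_features, teacher_features)
feature_loss.backward()
optimizer.step()
```

The projector is only needed during training and can be discarded afterwards, since inference uses the student model alone.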

3. Combined Distillation

Combined distillation uses soft-label distillation and feature distillation together: the student model is trained on both the teacher model's output distribution and its feature representations.

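# Define the loss functions
criterion = nn.KLDivLoss(reduction='batchmean')
mse_loss = nn.MSELoss()

# The teacher model generates the soft labels
teacher_model.eval()
with torch.no_grad():
    teacher_outputs = teacher_model(inputs)
soft_labels = torch.softmax(teacher_outputs / temperature, dim=1)

# The student model makes its prediction
student_outputs = student_model(inputs)
soft_label_loss = criterion(torch.log_softmax(student_outputs / temperature, dim=1), soft_labels)

# Get the feature representations; the teacher is frozen, so it needs no gradients
with torch.no_grad():
    teacher_features = teacher_feature_extractor(inputs)
student_features = student_feature_extractor(inputs)
feature_loss = mse_loss(student_features, teacher_features)

# Combine the losses; `alpha` (not defined in the original) weights the feature term
total_loss = soft_label_loss + alpha * feature_loss

# Backpropagation and optimization
optimizer.zero_grad()  # clear gradients left over from the previous step
total_loss.backward()
optimizer.step()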
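The combined objective above can be wrapped in a single function. The sketch below is one possible arrangement, not a reference implementation. The `TinyNet` class, the function name `combined_distillation_loss`, the temperature of 2.0, and the weight `alpha = 0.5` are all assumed for illustration. Both toy models use the same hidden width so that their features can be compared directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyNet(nn.Module):
    """Hypothetical toy classifier that returns both its hidden features and its logits."""

    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        features = self.backbone(x)
        return features, self.head(features)


def combined_distillation_loss(s_feat, s_logits, t_feat, t_logits, temperature, alpha):
    """Soft-label KL loss plus alpha times the feature MSE loss."""
    soft_labels = F.softmax(t_logits / temperature, dim=1)
    log_student = F.log_softmax(s_logits / temperature, dim=1)
    kd_loss = F.kl_div(log_student, soft_labels, reduction="batchmean")
    return kd_loss + alpha * F.mse_loss(s_feat, t_feat)


# Same hidden width for both models, so no projection layer is needed here.
# The teacher is not actually larger; this only demonstrates the loss.
teacher = TinyNet(20, 32, 10).eval()
student = TinyNet(20, 32, 10)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
inputs = torch.randn(8, 20)    # random stand-in for a real batch
temperature, alpha = 2.0, 0.5  # assumed example values

# The teacher is frozen, so its outputs are computed without gradients.
with torch.no_grad():
    t_feat, t_logits = teacher(inputs)

# One optimization step on the student.
optimizer.zero_grad()
s_feat, s_logits = student(inputs)
total_loss = combined_distillation_loss(s_feat, s_logits, t_feat, t_logits, temperature, alpha)
total_loss.backward()
optimizer.step()
```

The weight `alpha` sets the balance between matching the teacher's outputs and matching its features, and it usually has to be tuned per task because the two terms can differ in scale.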

With the distillation techniques above, the model structure can be optimized effectively: computational overhead drops and inference speed and deployment efficiency improve, while the model's performance is largely preserved.
