2024-07-12
# Datawhale AI Summer Camp
The task of the competition is to determine whether a face image is a Deepfake image and output the probability score of it being a Deepfake image. Participants need to develop and optimize detection models to cope with diverse Deepfake generation technologies and complex application scenarios, thereby improving the accuracy and robustness of Deepfake image detection.
The training set label file train_label.txt is used to train the model, while the validation set label file val_label.txt is only used for model tuning. In both files, each line contains two parts separated by a comma: the first part is the video file name (with an .mp4 suffix), and the second part is the ground-truth label.
A target value of 1 indicates deepfake audio/video, and a target value of 0 indicates real face audio/video.
Here are samples of train_label.txt and val_label.txt:
train_label.txt
video_name,target
96b04c80704f02cb426076b3f624b69e.mp4,0
16fe4cf5ae8b3928c968a5d11e870360.mp4,1
val_label.txt
video_name,target
f859cb3510c69513d5c57c6934bc9968.mp4,0
50ae26b3f3ea85babb2f9dde840830e2.mp4,1
Each line in the submission file contains two parts, separated by a comma. The first part is the video file name, and the second part is the deepfake score predicted by the model (i.e., the probability that the sample is a deepfake video). Please refer to the submission template below:
prediction.csv
video_name,score
658042526e6d0c199adc7bfeb1f7c888.mp4,0.123456
a20cf2d7dea580d0affc4d85c9932479.mp4,0.123456
Phase 2: after Phase 1 ends, the public test set is released. Participants need to submit the test-set prediction file prediction_test.csv through the system, and the test scores are fed back online in real time.
After Phase 2, the top 30 teams advance to Phase 3. In this stage, participants must submit a code Docker image and a technical report. The Docker image must contain the original training code and a test API (the function takes an image path as input and outputs the Deepfake score predicted by the model). The organizer will inspect and rerun the code to reproduce the training process and the test results.
Only a single model may be submitted, and its effective network parameters must not exceed 200M (model parameters are counted with the thop tool).
Only ImageNet1K pre-trained models are allowed. Extended samples generated from the released training set (via data augmentation / Deepfake tools) may be used for training, but these tools must be submitted in Phase 3 for reproduction.
The evaluation mainly uses the AUC (area under the ROC curve) as the metric. AUC usually lies in the range 0.5-1; otherwise we consider it not a good machine learning model. The closer the AUC is to 1, the better the model. If the AUC produces an ambiguous ranking, TPR (True Positive Rate) is used as an auxiliary reference; its counterpart is FPR (False Positive Rate).
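As a quick illustration (a minimal sketch assuming scikit-learn is available; the baseline itself does not use it), AUC can be computed directly from ground-truth labels and predicted scores:
from sklearn.metrics import roc_auc_score
# toy labels and made-up deepfake scores
y_true = [0, 1, 1, 0, 1]
y_score = [0.10, 0.80, 0.65, 0.30, 0.90]
print(roc_auc_score(y_true, y_score))  # 1.0 here, since every positive is ranked above every negative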
F1-Score is another metric we can refer to: it is the harmonic mean of precision and recall.
$F1\_Score = \frac{2TP}{2TP + FN + FP}$
Before going further, we should review two important concepts: precision and recall.
Precision: $Precision = \frac{TP}{TP+FP}$, which measures the model's exactness: the proportion of samples predicted as positive that are truly positive.
Recall: $Recall = \frac{TP}{TP+FN}$, which measures the model's completeness: the proportion of actual positive samples that the model predicts as positive.
True Positive Rate (TPR):
TPR = TP / (TP + FN)
False Positive Rate (FPR):
FPR = FP / (FP + TN)
where:
TP: attack samples correctly identified as attacks;
TN: real samples correctly identified as real;
FP: real samples misidentified as attacks;
FN: attack samples misidentified as real.
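As a quick worked example, with the toy lists used in the script below we get TP = 2, FN = 2, FP = 1, TN = 3, so TPR = 2 / (2 + 2) = 0.5 and FPR = 1 / (1 + 3) = 0.25.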
Reference: Aghajan, H., Augusto, J. C., & Delgado, R. L. C. (Eds.). (2009).
Here is my script for computing these confusion-matrix-based metrics (note that TPR is the same quantity as recall):
l1 = [0,1,1,1,0,0,0,1]
l2 = [0,1,0,1,0,1,0,0]
def accuracy(y_true, y_pred):
    # initialize a simple counter for correct predictions
    correct_counter = 0
    # iterate over all elements of y_true and y_pred
    # zip pairs up the elements of multiple iterables
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            # if the predicted label matches the true label, increment the counter
            correct_counter += 1
    # return the accuracy: number of correct labels / total number of labels
    return correct_counter / len(y_true)
def false_positive(y_true, y_pred):
    # initialize the false-positive counter
    fp = 0
    # iterate over all elements of y_true and y_pred
    for yt, yp in zip(y_true, y_pred):
        # if the true label is negative but the prediction is positive, increment the counter
        if yt == 0 and yp == 1:
            fp += 1
    return fp
def false_negative(y_true, y_pred):
    # initialize the false-negative counter
    fn = 0
    # iterate over all elements of y_true and y_pred
    for yt, yp in zip(y_true, y_pred):
        # if the true label is positive but the prediction is negative, increment the counter
        if yt == 1 and yp == 0:
            fn += 1
    return fn
def true_positive(y_true, y_pred):
    # initialize the true-positive counter
    tp = 0
    # iterate over all elements of y_true and y_pred
    for yt, yp in zip(y_true, y_pred):
        # if both the true label and the prediction are positive, increment the counter
        if yt == 1 and yp == 1:
            tp += 1
    return tp
def true_negative(y_true, y_pred):
    # initialize the true-negative counter
    tn = 0
    # iterate over all elements of y_true and y_pred
    for yt, yp in zip(y_true, y_pred):
        # if both the true label and the prediction are negative, increment the counter
        if yt == 0 and yp == 0:
            tn += 1
    # return the number of true negatives
    return tn
# you can try this alternative way of computing accuracy
def accuracy_v2(y_true, y_pred):
    # number of true positives
    tp = true_positive(y_true, y_pred)
    # number of false positives
    fp = false_positive(y_true, y_pred)
    # number of false negatives
    fn = false_negative(y_true, y_pred)
    # number of true negatives
    tn = true_negative(y_true, y_pred)
    # accuracy
    accuracy_score = (tp + tn) / (tp + tn + fp + fn)
    return accuracy_score
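# Note: the f1 function below calls precision() and recall(), which were not included in the
# original snippet. Here is a minimal sketch of them (my own addition), built on the
# true_positive / false_positive / false_negative counters defined above.
def precision(y_true, y_pred):
    # precision = TP / (TP + FP)
    tp = true_positive(y_true, y_pred)
    fp = false_positive(y_true, y_pred)
    return tp / (tp + fp)
def recall(y_true, y_pred):
    # recall (identical to TPR) = TP / (TP + FN)
    tp = true_positive(y_true, y_pred)
    fn = false_negative(y_true, y_pred)
    return tp / (tp + fn)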
# how the F1-score is computed
def f1(y_true, y_pred):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    score = 2 * p * r / (p + r)
    return score
If a hard classification is required, you need a threshold. It relates the predicted label to the predicted probability as follows:
Prediction = (Probability > Threshold)
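For example (a tiny sketch with made-up probabilities and a 0.5 threshold):
probs = [0.12, 0.87, 0.45, 0.66]
threshold = 0.5
preds = [int(p > threshold) for p in probs]  # [0, 1, 0, 1]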
After learning AUC, another important metric you should learn is log loss. For binary classification problems, we define log loss as:
$LogLoss = -target \cdot \log(p) - (1 - target) \cdot \log(1 - p)$
Here target is 0 or 1, and p is the predicted probability that the sample belongs to class 1. Log loss heavily penalizes predictions that are both very confident and wrong; the smaller the log loss, the closer the predicted probability is to the target.
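A minimal sketch of the formula above (my own helper, not part of the baseline), showing how confident wrong predictions are punished:
import numpy as np

def log_loss_single(target, p, eps=1e-15):
    # clip p so that log(0) never occurs
    p = np.clip(p, eps, 1 - eps)
    return -target * np.log(p) - (1 - target) * np.log(1 - p)

print(log_loss_single(1, 0.9))  # ~0.105: confident and correct -> small loss
print(log_loss_single(1, 0.1))  # ~2.303: confident and wrong -> large loss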
We may also use these indicators in other classification problems. Now, on to the data:
# wc ("word count") is a Unix command for counting; the -l flag counts lines only
!wc -l /kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/train_label.txt
!wc -l /kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/val_label.txt
We just need to count the number of rows, which indicates the number of samples.
from IPython.display import Video
Video("/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/valset/00882a2832edbcab1d3dfc4cc62cfbb9.mp4", embed=True)
Video creates a video object; when embed is set to True, the video player is displayed directly in the notebook cell output.
After running the Kaggle baseline up to this point, you should be able to see the corresponding results. Next, install the dependencies:
!pip install moviepy librosa matplotlib numpy timm
Documentation links for the libraries used:
moviepy
librosa (a powerful third-party Python library for audio signal processing; in this baseline we mainly use it for Mel spectrogram generation and spectrogram conversion)
matplotlib
numpy
timm (an image classification model library for quickly building various SOTA models)
What is SOTA? SOTA stands for State of the Art, i.e. the best model in a given field; SOTA models achieve very high scores on certain benchmark datasets.
Non-end-to-end model (pipeline): first, we need to understand what the "ends" are: the two ends are the input end and the output end. A traditional machine learning pipeline consists of multiple independent modules; each module's result depends on the quality of the previous one, which affects the final training result.
End-to-end model: a prediction is produced directly from the input end to the output end. The prediction is compared with the ground truth (remember that the core task of machine learning is still prediction), and the error is back-propagated through every layer of the neural network, adjusting the model's weights and parameters until the model converges or reaches the expected result. Seen from a control-system perspective, this is a closed-loop control system (e.g. a BP neural network).
Sequence to sequence (seq2seq): a general end-to-end sequence prediction approach with an encoder-decoder structure. If you encode/decode a question-answer dataset, you get a question-answering bot, which is a typical sequence-to-sequence application.
The question comes back to the baseline itself. What is a baseline? A baseline usually refers to a simple, easy-to-implement reference model.
During algorithm and hyperparameter tuning, the baseline is the reference you compare each new version against, so that the model keeps getting better.
Benchmark is another important concept. It usually refers to a standardized method for evaluating and comparing the performance of algorithms, models, or methods, used to measure the differences and relative strengths between models.
You can see benchmarks frequently on model evaluation websites, for example Sinan.
When I ran the baseline, a CUDA configuration error message appeared; if that happens, switch to another accelerator. The code below sets the random seed and the cuDNN options:
import torch
# set the PyTorch random seed
torch.manual_seed(0)
# when deterministic is False, cuDNN may use non-deterministic optimizations for some operations
torch.backends.cudnn.deterministic = False
# when benchmark is True, cuDNN automatically searches, on each forward pass, for the convolution algorithm best suited to the current configuration, improving performance
torch.backends.cudnn.benchmark = True
# import the required libraries; we need cv2, glob, os and PIL
import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data.dataset import Dataset
import timm
import time
import pandas as pd
import numpy as np
import cv2, glob, os
from PIL import Image
The parameters accepted by generate_mel_spectrogram are the video file path, the number of Mel frequency filters, the highest frequency (to control the calculated spectrum range), and the target image size.
import moviepy.editor as mp
import librosa
import numpy as np
import cv2
def generate_mel_spectrogram(video_path, n_mels=128, fmax=8000, target_size=(256, 256)):
    # extract the audio track
    audio_path = 'extracted_audio.wav'
    # video_path is the path of the video file to process; create a VideoFileClip object from it
    video = mp.VideoFileClip(video_path)
    # video.audio accesses the audio track of the video; write_audiofile() writes it to a file.
    # verbose=False suppresses progress output in the console, and logger=None disables the logger;
    # we do not need either for this prediction task, so we save the overhead.
    # Default signature: write_audiofile(self, filename, fps=None, nbytes=2, buffersize=2000, codec=None, bitrate=None, ffmpeg_params=None, write_logfile=False, verbose=True, logger='bar')
    video.audio.write_audiofile(audio_path, verbose=False, logger=None)
    # load the audio file together with its sampling rate
    y, sr = librosa.load(audio_path)
    # generate the Mel spectrogram (as opposed to the Mel cepstrum)
    # Default signature: librosa.feature.melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512, power=2.0, **kwargs)
    # Parameters: y is the audio time series, sr the sampling rate; n_mels is the number of Mel filters
    # used to build the spectrogram, which determines its resolution (i.e. the height of the Mel spectrogram);
    # fmax is passed through here so that the function's fmax argument actually limits the frequency range
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=fmax)
    # convert the spectrogram to dB. S is the input power; ref is the reference value: if it is a scalar,
    # the amplitude abs(S) is scaled relative to it as 10 * log10(S / ref). Here np.max uses the maximum
    # of the spectrogram as the reference, which is a common choice.
    S_dB = librosa.power_to_db(S, ref=np.max)
    # normalize to the 0-255 range; NORM_MINMAX shifts/scales the values linearly into the given range
    S_dB_normalized = cv2.normalize(S_dB, None, 0, 255, cv2.NORM_MINMAX)
    # convert floats to unsigned 8-bit integers
    S_dB_normalized = S_dB_normalized.astype(np.uint8)
    # resize to the target size (256, 256)
    img_resized = cv2.resize(S_dB_normalized, target_size, interpolation=cv2.INTER_LINEAR)
    return img_resized
# usage example
video_path = '/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/trainset/001b0680999447348bc9f89efce0f183.mp4'  # replace with your own video file path
mel_spectrogram_image = generate_mel_spectrogram(video_path)
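If you want to inspect the generated spectrogram yourself, here is a quick matplotlib sketch (my own addition; it simply displays the mel_spectrogram_image array produced above):
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.imshow(mel_spectrogram_image, cmap='magma', origin='lower')
plt.title('Mel spectrogram (dB, normalized to 0-255)')
plt.axis('off')
plt.show()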
!mkdir ffdv_phase1_sample
!mkdir ffdv_phase1_sample/trainset
!mkdir ffdv_phase1_sample/valset
The dataset is too large for me to post one of the generated spectrograms here, so for reference see a generic Mel spectrogram (image source: Simon Fraser University). If you play the corresponding audio, it is a gradually descending tone.
# use glob.glob to find the first 400 .mp4 video files under /kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/trainset/
for video_path in glob.glob('/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/trainset/*.mp4')[:400]:
    # a. call generate_mel_spectrogram(video_path) to create the Mel spectrogram.
    # b. save it with cv2.imwrite as a JPEG in ./ffdv_phase1_sample/trainset/, using the same name
    #    as the original video file but with a .jpg extension.
    mel_spectrogram_image = generate_mel_spectrogram(video_path)
    cv2.imwrite('./ffdv_phase1_sample/trainset/' + video_path.split('/')[-1][:-4] + '.jpg', mel_spectrogram_image)
for video_path in glob.glob('/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/valset/*.mp4'):
    mel_spectrogram_image = generate_mel_spectrogram(video_path)
    cv2.imwrite('./ffdv_phase1_sample/valset/' + video_path.split('/')[-1][:-4] + '.jpg', mel_spectrogram_image)
The AverageMeter class is used to calculate and store the average and current value of a variable.
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
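A quick hypothetical usage example (the loss values below are made up) to show what AverageMeter tracks:
loss_meter = AverageMeter('Loss', ':.4f')
for batch_loss, batch_size in [(0.9, 32), (0.7, 32), (0.5, 16)]:
    loss_meter.update(batch_loss, n=batch_size)
print(loss_meter)  # prints "Loss 0.5000 (0.7400)": the latest value and the running average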
The ProgressMeter class is used to output the current batch information and statistical indicators during the training process.
class ProgressMeter(object):
    def __init__(self, num_batches, *meters):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = ""

    def pr2int(self, batch):
        # print the current batch index followed by every meter, tab-separated
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'
The validate function periodically evaluates the performance of the model on the validation set during training, and calculates and prints the Top-1 accuracy.
def validate(val_loader, model, criterion):
    batch_time = AverageMeter('Time', ':6.3f')  # batch processing time
    losses = AverageMeter('Loss', ':.4e')       # loss
    top1 = AverageMeter('Acc@1', ':6.2f')       # Top-1 accuracy
    progress = ProgressMeter(len(val_loader), batch_time, losses, top1)  # progress output

    # switch to evaluate mode: eval() disables training-specific layers (such as Dropout)
    # and makes Batch Normalization use its running statistics
    model.eval()

    with torch.no_grad():  # temporarily disable gradient tracking to prevent gradient computation and save memory
        end = time.time()
        for i, (input, target) in enumerate(val_loader):
            input = input.cuda()   # move the input and target data to the GPU
            target = target.cuda()

            # compute output
            output = model(input)
            loss = criterion(output, target)  # compute the validation loss

            # measure accuracy and record loss; acc is reported as a percentage
            acc = (output.argmax(1).view(-1) == target.float().view(-1)).float().mean() * 100
            losses.update(loss.item(), input.size(0))
            top1.update(acc, input.size(0))

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

        # TODO: this should also be done with the ProgressMeter
        print(' * Acc@1 {top1.avg:.3f}'.format(top1=top1))

    return top1
The predict function is used to perform inference on the test set. It supports the use of test time augmentation (TTA) to improve the stability of model prediction by making multiple predictions and taking the average.
def predict(test_loader, model, tta=10):
    # switch to evaluate mode
    model.eval()

    # TTA (Test Time Augmentation)
    test_pred_tta = None
    for _ in range(tta):  # run tta rounds; each round can see a slightly different version of the inputs
        test_pred = []
        with torch.no_grad():
            end = time.time()
            for i, (input, target) in enumerate(test_loader):
                input = input.cuda()
                target = target.cuda()

                # compute output
                output = model(input)
                output = F.softmax(output, dim=1)  # softmax-normalize the outputs to obtain class probabilities
                output = output.data.cpu().numpy()

                test_pred.append(output)

        test_pred = np.vstack(test_pred)
        if test_pred_tta is None:
            test_pred_tta = test_pred
        else:
            test_pred_tta += test_pred
    # note: the returned values are the sum over all TTA rounds; dividing by tta would give the average,
    # but the relative ranking of samples (and hence the AUC) is unchanged
    return test_pred_tta
The train function is responsible for training the model by calculating the loss function and accuracy, and performing backpropagation and optimization steps to update the model parameters.
def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter('Time', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    progress = ProgressMeter(len(train_loader), batch_time, losses, top1)

    # switch to train mode
    model.train()

    end = time.time()
    for i, (input, target) in enumerate(train_loader):
        input = input.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        # compute output
        output = model(input)
        loss = criterion(output, target)

        # measure accuracy and record loss
        losses.update(loss.item(), input.size(0))
        acc = (output.argmax(1).view(-1) == target.float().view(-1)).float().mean() * 100
        top1.update(acc, input.size(0))  # update the top1 meter with the accuracy of the current batch

        # compute gradient and do SGD step
        optimizer.zero_grad()  # clear previously accumulated gradients
        loss.backward()        # compute gradients of the loss w.r.t. the model parameters
        optimizer.step()       # update the model parameters using the gradients computed by backward()

        # measure elapsed time
        batch_time.update(time.time() - end)  # update the batch_time meter with the processing time of this batch
        end = time.time()

        if i % 100 == 0:
            progress.pr2int(i)
train_label = pd.read_csv("/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/train_label.txt")
val_label = pd.read_csv("/kaggle/input/ffdv-sample-dataset/ffdv_phase1_sample/val_label.txt")
train_label['path'] = '/kaggle/working/ffdv_phase1_sample/trainset/' + train_label['video_name'].apply(lambda x: x[:-4] + '.jpg')
val_label['path'] = '/kaggle/working/ffdv_phase1_sample/valset/' + val_label['video_name'].apply(lambda x: x[:-4] + '.jpg')
train_label = train_label[train_label['path'].apply(os.path.exists)]
val_label = val_label[val_label['path'].apply(os.path.exists)]
The FFDIDataset class below takes a transform parameter (reserved for later data augmentation, defaulting to None), converts each image to RGB mode, and returns the label as a torch.Tensor.
class FFDIDataset(Dataset):
    def __init__(self, img_path, img_label, transform=None):
        self.img_path = img_path
        self.img_label = img_label
        if transform is not None:
            self.transform = transform
        else:
            self.transform = None

    def __getitem__(self, index):
        img = Image.open(self.img_path[index]).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, torch.from_numpy(np.array(self.img_label[index]))

    def __len__(self):
        return len(self.img_path)
The data loaders below use the FFDIDataset class defined above.
train_loader = torch.utils.data.DataLoader(
    FFDIDataset(train_label['path'].values, train_label['target'].values,
        transforms.Compose([
            transforms.Resize((256, 256)),
            transforms.RandomHorizontalFlip(),
            transforms.RandomVerticalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    ), batch_size=40, shuffle=True, num_workers=12, pin_memory=True
)

val_loader = torch.utils.data.DataLoader(
    FFDIDataset(val_label['path'].values, val_label['target'].values,
        transforms.Compose([
            transforms.Resize((256, 256)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    ), batch_size=40, shuffle=False, num_workers=10, pin_memory=True
)
# key point: here we use the resnet18 model provided by timm. Since the task is a 0/1 classification
# (real video / fake video), this can be improved later, e.g. by switching to a deeper network such as
# ResNet-34, ResNet-50 or another variant
model = timm.create_model('resnet18', pretrained=True, num_classes=2)
model = model.cuda()
# cross-entropy loss, for multi-class classification
criterion = nn.CrossEntropyLoss().cuda()
# Adam optimizer with a learning rate of 0.003
optimizer = torch.optim.Adam(model.parameters(), 0.003)
# every 4 epochs, scale the learning rate by a factor of 0.85
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.85)
# initialize the best accuracy
best_acc = 0.0
for epoch in range(10):
    print('Epoch: ', epoch)

    # call the train function
    train(train_loader, model, criterion, optimizer, epoch)
    # call the validate function
    val_acc = validate(val_loader, model, criterion)
    # step the scheduler once per epoch (after the optimizer updates, as PyTorch recommends)
    scheduler.step()

    if val_acc.avg.item() > best_acc:
        best_acc = round(val_acc.avg.item(), 2)
        torch.save(model.state_dict(), f'./model_{best_acc}.pt')
Output:
Epoch: 0
[ 0/10] Time 6.482 ( 6.482) Loss 7.1626e-01 (7.1626e-01) Acc@1 35.00 ( 35.00)
* Acc@1 64.000
Epoch: 1
[ 0/10] Time 0.819 ( 0.819) Loss 4.6079e-01 (4.6079e-01) Acc@1 80.00 ( 80.00)
* Acc@1 75.500
Epoch: 2
[ 0/10] Time 0.914 ( 0.914) Loss 1.4983e-01 (1.4983e-01) Acc@1 97.50 ( 97.50)
* Acc@1 88.500
Epoch: 3
[ 0/10] Time 0.884 ( 0.884) Loss 2.4681e-01 (2.4681e-01) Acc@1 87.50 ( 87.50)
* Acc@1 84.000
Epoch: 4
[ 0/10] Time 0.854 ( 0.854) Loss 5.3736e-02 (5.3736e-02) Acc@1 100.00 (100.00)
* Acc@1 90.500
Epoch: 5
[ 0/10] Time 0.849 ( 0.849) Loss 5.9881e-02 (5.9881e-02) Acc@1 97.50 ( 97.50)
* Acc@1 89.500
Epoch: 6
[ 0/10] Time 0.715 ( 0.715) Loss 1.6215e-01 (1.6215e-01) Acc@1 92.50 ( 92.50)
* Acc@1 65.000
Epoch: 7
[ 0/10] Time 0.652 ( 0.652) Loss 5.3892e-01 (5.3892e-01) Acc@1 80.00 ( 80.00)
* Acc@1 78.500
Epoch: 8
[ 0/10] Time 0.847 ( 0.847) Loss 6.6098e-02 (6.6098e-02) Acc@1 97.50 ( 97.50)
* Acc@1 81.000
Epoch: 9
[ 0/10] Time 0.844 ( 0.844) Loss 9.4254e-02 (9.4254e-02) Acc@1 97.50 ( 97.50)
* Acc@1 81.500
Deeper networks: If you need higher performance and more sophisticated feature extraction capabilities, you can consider using deeper networks such as ResNet-34, ResNet-50 or even larger ResNet variants (such as ResNet-101 or ResNet-152).
Other pre-trained models: in addition to the ResNet series, there are many other pre-trained models to choose from (see the sketch after this list), such as:
EfficientNet: has excellent performance and parameter efficiency.
DenseNet: A densely connected network structure that helps to better utilize features.
VGG series: A simple and classic architecture suitable for use in resource-constrained environments.
Custom Model: Depending on the specific characteristics of the dataset and task requirements, you may also consider designing and training a custom model architecture, which may require more debugging and experimentation.
Ensemble learning: Consider using ensemble learning methods such as bagging or boosting to combine the predictions of multiple models to further improve performance and stability.
Hyperparameter Tuning: In addition to model selection, model performance can also be optimized by tuning the learning rate, batch size, choice of optimizer, and data augmentation strategy.
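As a minimal sketch of how the model could be swapped out (assuming these model names are available in your timm version; everything else in the baseline stays unchanged):
# deeper ResNet
model = timm.create_model('resnet50', pretrained=True, num_classes=2)
# or an EfficientNet / DenseNet variant:
# model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=2)
# model = timm.create_model('densenet121', pretrained=True, num_classes=2)
model = model.cuda()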
Consider applying Dice Loss to improve the loss function in the future. Dice Loss measures the similarity between the prediction result and the target mask, and works well for binary classification tasks with obvious boundaries. It is a loss function that performs well on pixel-level predictions.
At the same time, we also pay attention to Focal Loss. It is specifically used to solve the problem of class imbalance. It focuses on difficult samples by reducing the weight of easy-to-classify samples, which can further improve the performance of the model on minority categories.
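A rough sketch of how Focal Loss could replace nn.CrossEntropyLoss in this baseline (the gamma and alpha values are common defaults, not tuned for this task):
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction='none')   # per-sample cross-entropy
        pt = torch.exp(-ce)                                       # probability of the true class
        return (self.alpha * (1 - pt) ** self.gamma * ce).mean()  # down-weight easy samples

# usage: criterion = FocalLoss().cuda()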
RAdam is an improvement on Adam, which improves stability and performance by dynamically adjusting the correction of the learning rate.
AdamW is a variant of Adam that introduces weight decay to address performance issues that Adam may introduce in some cases, especially when the number of model parameters is large.
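A minimal sketch of swapping the optimizer (assuming a PyTorch version recent enough to include torch.optim.RAdam; the learning rate is kept at the baseline's 0.003):
optimizer = torch.optim.AdamW(model.parameters(), lr=0.003, weight_decay=1e-2)
# or:
# optimizer = torch.optim.RAdam(model.parameters(), lr=0.003)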
# use the model to predict on the validation set (val_loader); [:, 1] takes the probability of class 1
val_pred = predict(val_loader, model, 1)[:, 1]
# assign the predicted probabilities to the "y_pred" column of the val_label dataframe
val_label["y_pred"] = val_pred

submit = pd.read_csv("/kaggle/input/multi-ffdv/prediction.txt.csv")

# merge the submission file (submit) with the video_name and y_pred columns of the validation labels (val_label)
merged_df = submit.merge(val_label[['video_name', 'y_pred']], on='video_name', suffixes=('', '_df2'), how='left')
# fill the y_pred column with the values of y_pred_df2 (the predictions obtained on the validation set)
merged_df['y_pred'] = merged_df['y_pred_df2'].combine_first(merged_df['y_pred'])

merged_df[['video_name', 'y_pred']].to_csv('submit.csv', index=None)
Running the baseline itself takes less than 10 minutes, although it usually gets stuck for a while at step 4.7, the Mel-spectrogram generation. Working through it patiently, however, took me about 5 hours. Here is a summary, from memory, of the key points:
The task definition of deep learning can actually be summarized as "back propagation", because its core is to use the back propagation algorithm to adjust the model parameters to minimize the defined loss function.
Deep learning is very well suited to audio and video tasks. The first reason, I think, is that the amount of audio and video data is huge and needs to be classified in sophisticated ways; deep learning requires large amounts of data, and, crucially, Deepfake detection is essentially a classification task, an area where deep learning excels when the data is plentiful and the classification fine-grained. A familiar example is the generative adversarial network (GAN).
Second, deep learning can model the distribution of large amounts of statistically similar data, and it can also perform the reverse, generative task.
AIGC should include Deepfake. From a development perspective, the scale and sophistication of Deepfake technology will surely grow with the processing power of AIGC, and we will face a vast amount of real and fake data. Of course, we also need to consider what these videos are used for: if they save effort for story creators, that is undoubtedly beneficial to literature and entertainment; if they are used for deception, they face ethical challenges.
My thought is that, considering the current trend of short dramas, Deepfake can give the public an ultra-low-cost audio-visual entertainment experience, but it also challenges the traditional film and television industry, and even raises the question of what kind of actors and audio-visual stimulation audiences really need. This is a topic worth discussing.