Image classification with VGG16 in PyTorch: detailed steps

2024-07-12

VGG16 for image classification

Here we implement a VGG-16 network to classify the CIFAR-10 dataset.

VGG16 Network Introduction

Preface

"Very Deep Convolutional Networks for Large-Scale Image Recognition"

ICLR 2015

VGG was proposed by the Visual Geometry Group at Oxford (which is where the name VGG comes from). The network is related to the group's work at ILSVRC 2014. Its main contribution is showing that increasing the depth of a network can improve its final performance to a certain extent. VGG has two common variants, VGG16 and VGG19. There is no essential difference between the two; only the network depth differs.

VGG principle

One improvement of VGG16 over AlexNet is that it uses several consecutive 3x3 convolution kernels in place of the larger kernels used in earlier networks (11x11 and 5x5 in AlexNet, 7x7 in ZFNet). For a given receptive field (the region of the input image that influences one output value), stacking small convolution kernels works better than using one large kernel: the extra nonlinear layers increase the depth of the network so that it can learn more complex patterns, and the cost is relatively small (fewer parameters).

Simply put, VGG uses three 3x3 convolution kernels in place of one 7x7 kernel, and two 3x3 kernels in place of one 5x5 kernel. The main purpose is to increase the depth of the network while keeping the same receptive field, which improves the network's performance to a certain extent.

For example, a stack of three 3x3 convolutions with stride 1 has a receptive field of size 7 (that is, three consecutive 3x3 convolutions cover the same input region as one 7x7 convolution), and its total number of parameters is 3x(9xC^2) = 27xC^2. If a 7x7 convolution kernel is used directly, the total number of parameters is 49xC^2, where C is the number of input and output channels. Obviously, 27xC^2 is less than 49xC^2, so the number of parameters is reduced; the 3x3 convolution kernel also helps preserve image properties better.
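The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `stacked_receptive_field` and `conv_weight_count` are made up for illustration, biases are ignored, and C = 64 is an arbitrary example value.

```python
def stacked_receptive_field(kernel_size, num_layers, stride=1):
    """Receptive field of num_layers stacked convolutions that share one kernel size and stride."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump  # each layer widens the field by (k - 1) * current jump
        jump *= stride
    return field


def conv_weight_count(kernel_size, channels, num_layers=1):
    """Weights (no biases) of num_layers convolutions with `channels` input and output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # example channel count (an assumption, any positive value works)

print(stacked_receptive_field(3, 3))  # 7
print(stacked_receptive_field(3, 2))  # 5
print(conv_weight_count(3, C, 3))     # 110592  (27 * C^2)
print(conv_weight_count(7, C))        # 200704  (49 * C^2)
```

The stacked 3x3 layers need 27/49, or roughly 55%, of the weights of the single 7x7 layer for the same receptive field.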

Here is an explanation of why two 3x3 convolution kernels can be used instead of a 5x5 convolution kernel:

A 5x5 convolution can be viewed as a small fully connected network sliding over a 5x5 region. We can first convolve with a 3x3 filter, and then use a fully connected layer to connect the 3x3 grid of outputs from that convolution. This fully connected layer can itself be seen as a 3x3 convolution layer. In this way, two cascaded (stacked) 3x3 convolutions can replace one 5x5 convolution.

The details are shown in the following figure:

[Figure: two stacked 3x3 convolutions replacing one 5x5 convolution]
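A one-dimensional analogue may make the stacking argument more concrete. The sketch below assumes purely linear filters (no activation between the two layers) and uses arbitrary example kernels; the helper names `conv_full` and `conv_valid` are hypothetical. Under those assumptions, applying two 3-tap filters in sequence gives exactly the same result as applying one 5-tap filter.

```python
def conv_full(a, b):
    """Full discrete convolution of two kernels (length len(a) + len(b) - 1)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def conv_valid(x, k):
    """Convolve signal x with kernel k, keeping only positions where k fully overlaps x."""
    n = len(x) - len(k) + 1
    return [sum(x[i + len(k) - 1 - j] * k[j] for j in range(len(k)))
            for i in range(n)]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
k_a = [1, 2, 1]   # first 3-tap kernel (arbitrary example values)
k_b = [1, 0, -1]  # second 3-tap kernel (arbitrary example values)

two_steps = conv_valid(conv_valid(signal, k_a), k_b)  # two stacked 3-tap filters
k_combined = conv_full(k_a, k_b)                      # one equivalent 5-tap filter
one_step = conv_valid(signal, k_combined)

print(len(k_combined))        # 5
print(two_steps == one_step)  # True
```

In a real VGG block a ReLU sits between the two convolutions, so the stack is no longer equal to any single 5x5 convolution; it covers the same 5x5 region but can represent more complex functions.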

VGG network structure


Below is the structure of the VGG network (both VGG16 and VGG19 are included):

[Figure: VGG network structure]

VGG16 contains 16 weight layers (13 convolutional layers and 3 fully connected layers), as shown in column D in the figure above.

VGG19 contains 19 weight layers (16 convolutional layers and 3 fully connected layers), as shown in column E in the figure above.

The structure of the VGG network is very consistent, using 3x3 convolution and 2x2 max pooling from beginning to end.
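The layer counts for the two variants can be reproduced from a compact description of each configuration. The sketch below is an illustration, not code from the original paper: the list notation (`'M'` for a max-pooling layer, a number for the output channels of a 3x3 convolution) and the helper names are assumptions, and 224 is the input size used in the paper.

```python
# 'M' marks a 2x2 max-pooling layer; a number is the output channel count of a 3x3 convolution.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']
VGG19_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
NUM_FC_LAYERS = 3  # both variants end with three fully connected layers


def count_weight_layers(cfg):
    """Convolutional layers in cfg plus the fully connected layers."""
    return sum(1 for v in cfg if v != 'M') + NUM_FC_LAYERS


def spatial_size_after(cfg, input_size):
    """Feature-map side length after all pooling layers (each 2x2 pool halves it)."""
    size = input_size
    for v in cfg:
        if v == 'M':
            size //= 2
    return size


print(count_weight_layers(VGG16_CFG))      # 16
print(count_weight_layers(VGG19_CFG))      # 19
print(spatial_size_after(VGG16_CFG, 224))  # 7
```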

VGG Advantages

The structure of VGGNet is very simple, and the entire network uses the same convolution kernel size (3x3) and maximum pooling size (2x2).

A combination of several convolutional layers with small filters (3x3) is better than one convolutional layer with a large filter (5x5 or 7x7).

It verifies that performance can be improved by continually deepening the network structure.

VGG Disadvantages

VGG consumes more computing resources and uses a large number of parameters (about 140M). Most of them come from the fully connected layers rather than the 3x3 convolutions, and they result in high memory usage.
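To see where the parameters come from, the count can be worked out by hand. The following sketch assumes the standard VGG16 layout from the paper (configuration D, 224x224 RGB input, 1000 output classes, biases included); the list notation and the function name `vgg_param_counts` are illustrative choices.

```python
# 'M' marks a 2x2 max-pooling layer; a number is the output channel count of a 3x3 convolution.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


def vgg_param_counts(cfg, in_channels=3, input_size=224, num_classes=1000):
    """Return (convolution parameters, fully connected parameters), biases included."""
    conv_params = 0
    channels, size = in_channels, input_size
    for v in cfg:
        if v == 'M':
            size //= 2  # pooling halves the feature map, adds no parameters
        else:
            conv_params += 3 * 3 * channels * v + v  # weights + biases
            channels = v
    fc_sizes = [channels * size * size, 4096, 4096, num_classes]
    fc_params = sum(n_in * n_out + n_out for n_in, n_out in zip(fc_sizes, fc_sizes[1:]))
    return conv_params, fc_params


conv_params, fc_params = vgg_param_counts(VGG16_CFG)
print(conv_params)              # 14714688
print(fc_params)                # 123642856
print(conv_params + fc_params)  # 138357544
```

Under these assumptions, the three fully connected layers hold roughly 89% of all parameters, and the first of them (25088 x 4096) alone accounts for more than 100M.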

Dataset processing

Dataset Introduction

The CIFAR (Canadian Institute For Advanced Research) dataset is a small image dataset widely used in the field of computer vision. It is mainly used to train machine learning and computer vision algorithms, especially in tasks such as image recognition and classification. The CIFAR dataset consists of two main parts: CIFAR-10 and CIFAR-100.

CIFAR-10 is a dataset of 60,000 32x32 color images divided into 10 categories, each containing 6,000 images. The 10 categories are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. In the dataset, 50,000 images are used for training and 10,000 images are used for testing. The CIFAR-10 dataset has become one of the most popular datasets in computer vision research and teaching due to its moderate size and rich category information.

Dataset characteristics

Convenient size: The images in the CIFAR dataset are small (32x32), which makes them ideal for quickly training and testing new computer vision algorithms.
Varied categories: CIFAR-10 provides a basic image classification task, while CIFAR-100 further challenges an algorithm's fine-grained classification capabilities.
Widely used: Due to these characteristics, the CIFAR dataset is widely used in research and teaching in fields such as computer vision, machine learning, and deep learning.

Use cases

The CIFAR dataset is commonly used for tasks such as image classification, object recognition, and training and testing of convolutional neural networks (CNNs). Due to its moderate size and rich category information, it is an ideal choice for beginners and researchers to explore image recognition algorithms. In addition, many computer vision and machine learning competitions also use the CIFAR dataset as a benchmark to evaluate the performance of contestants' algorithms.

To prepare the dataset, you can download it from the official website. I have already downloaded it, so if that doesn't work I can send it to you directly.

If you need the data set, please contact email: [email protected]

I originally obtained the dataset through torchvision's built-in download, but I don't want to do that here. Instead, I want to define the Dataset myself and load the data with a DataLoader step by step. Working through this process, including the dataset preprocessing, gives a deeper understanding of deep learning.

The dataset format is as follows:

[Figure: dataset file layout]

Parse all labels of the dataset

The label categories of the dataset are stored in a .meta file, so we need to parse the .meta file to read all the label names. The parsing code is as follows:

# First, look at all the labels. TODO: study the unpickling process in more detail
import pickle


def unpickle(file):
    # The CIFAR files are Python pickles; with encoding='bytes' the dict keys are bytes
    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data


meta_data = unpickle('./dataset_method_1/cifar-10-batches-py/batches.meta')
label_names = meta_data[b'label_names']
# Convert the byte-string labels to str
label_names = [label.decode('utf-8') for label in label_names]
print(label_names)

The parsing result is as follows:

['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
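If the dataset is not at hand, the same parsing steps can be tried on a stand-in file. The sketch below writes a tiny fake `batches.meta` to a temporary directory (the three label names are only a sample, and the file is not the real CIFAR metadata) and then reads it back the same way, showing why the keys and labels come out as bytes and need decoding.

```python
import os
import pickle
import tempfile


def unpickle(file):
    """Load a pickle file, keeping byte strings as bytes."""
    with open(file, 'rb') as fo:
        return pickle.load(fo, encoding='bytes')


# Stand-in for the real metadata: a dict with a bytes key and bytes label names.
fake_meta = {b'label_names': [b'airplane', b'automobile', b'bird']}

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = os.path.join(tmp_dir, 'batches.meta')
    with open(meta_path, 'wb') as fo:
        pickle.dump(fake_meta, fo)
    meta = unpickle(meta_path)

# The key must be given as bytes, and each label must be decoded to get a str.
label_names = [name.decode('utf-8') for name in meta[b'label_names']]
print(label_names)  # ['airplane', 'automobile', 'bird']
```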

Load a single batch of data for simple testing of the data

Our dataset has already been downloaded, so we need to read the contents of the files. Since they are binary (pickle) files, we open them in binary mode.

The reading code is as follows:

# Load a single batch of data
import numpy as np


def load_data_batch(file):
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
    X = batch[b'data']    # uint8 array with one flat row of 3072 values per image
    Y = batch[b'labels']  # list of integer class labels
    # Reshape to (N, 3, 32, 32), then transpose to channels-last (N, 32, 32, 3)
    X = X.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    Y = np.array(Y)
    return X, Y


# Load the first data batch
data_batch_1 = './dataset_method_1/cifar-10-batches-py/data_batch_1'
X1, Y1 = load_data_batch(data_batch_1)

print(f'Data shape: {X1.shape}, label shape: {Y1.shape}')

Result:

Data shape: (10000, 32, 32, 3), label shape: (10000,)
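The reshape and transpose in `load_data_batch` rely on how each image is stored: one flat row of 3072 values, with the 1024 red values first, then the 1024 green values, then the 1024 blue values, each plane in row-major order. The pure-Python sketch below spells out that index arithmetic on a synthetic row; `pixel_at` is a hypothetical helper and the pixel values are made up.

```python
IMAGE_SIZE = 32
PLANE = IMAGE_SIZE * IMAGE_SIZE  # 1024 values per color channel


def pixel_at(flat_row, h, w):
    """Return the (R, G, B) values at row h, column w of one flat 3072-value image."""
    offset = h * IMAGE_SIZE + w
    return tuple(flat_row[c * PLANE + offset] for c in range(3))


# Synthetic image: red plane all 10, green plane all 20, blue plane all 30 ...
flat_row = [10] * PLANE + [20] * PLANE + [30] * PLANE
# ... except the red value at (h=5, w=7), which is set to 99.
flat_row[0 * PLANE + 5 * IMAGE_SIZE + 7] = 99

print(len(flat_row))             # 3072
print(pixel_at(flat_row, 0, 0))  # (10, 20, 30)
print(pixel_at(flat_row, 5, 7))  # (99, 20, 30)
```

Reshaping a row to (3, 32, 32) and moving the channel axis last produces an array whose entry at `[h][w]` is this same (R, G, B) triple, which is the layout PIL and `ToTensor` expect for NumPy input.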

Load all data

After the above test, we know how to load data. Now let's load all the data.

Loading the training set:

# Combine the data from all batches
def load_all_data_batches(batch_files):
    X_list, Y_list = [], []
    for file in batch_files:
        X, Y = load_data_batch(file)
        X_list.append(X)
        Y_list.append(Y)
    X_all = np.concatenate(X_list)
    Y_all = np.concatenate(Y_list)
    return X_all, Y_all


batch_files = [
    './dataset_method_1/cifar-10-batches-py/data_batch_1',
    './dataset_method_1/cifar-10-batches-py/data_batch_2',
    './dataset_method_1/cifar-10-batches-py/data_batch_3',
    './dataset_method_1/cifar-10-batches-py/data_batch_4',
    './dataset_method_1/cifar-10-batches-py/data_batch_5'
]

X_train, Y_train = load_all_data_batches(batch_files)
print(f'Training data shape: {X_train.shape}, training label shape: {Y_train.shape}')
Y_train = Y_train.astype(np.int64)  # CrossEntropyLoss expects int64 class indices

Output:

Training data shape: (50000, 32, 32, 3), training label shape: (50000,)

Loading the test set:

test_batch = './dataset_method_1/cifar-10-batches-py/test_batch'
X_test, Y_test = load_data_batch(test_batch)
Y_test = Y_test.astype(np.int64)
print(f'Test data shape: {X_test.shape}, test label shape: {Y_test.shape}')

Output:

Test data shape: (10000, 32, 32, 3), test label shape: (10000,)

Define a subclass of Dataset

We define a subclass of the Dataset class so that the data can later be passed to a DataLoader for batch training.

A Dataset subclass typically implements three methods:

__init__(): the class constructor
__len__(): returns the length of the dataset
__getitem__(): returns one sample (and its label) for a given index

Here is my implementation:

from torch.utils.data import DataLoader, Dataset


# Define the PyTorch dataset
class CIFARDataset(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]

        if self.transform:
            image = self.transform(image)

        return image, label
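To see why `__len__` and `__getitem__` are enough for batch loading, here is a deliberately simplified stand-in for what a loader does with a map-style dataset. It uses plain Python only; `ListDataset` and `iterate_batches` are hypothetical names, and the real DataLoader additionally handles tensor collation, worker processes, and more.

```python
import random


class ListDataset:
    """Map-style dataset: indexable and with a known length."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield (images, labels) batches using only len(dataset) and dataset[i]."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        images, labels = zip(*samples)
        yield list(images), list(labels)


dataset = ListDataset(['img0', 'img1', 'img2', 'img3', 'img4'], [0, 1, 2, 3, 4])
batches = list(iterate_batches(dataset, batch_size=2))
for images, labels in batches:
    print(images, labels)
# ['img0', 'img1'] [0, 1]
# ['img2', 'img3'] [2, 3]
# ['img4'] [4]
```

The last batch is smaller than the others because 5 is not a multiple of 2, which mirrors the default behavior of a loader that does not drop the final incomplete batch.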

Wrap the dataset in a DataLoader

Define a transform to augment the data. For the training set: pad the image by 4 px on each side, convert it to a tensor and normalize it, randomly flip it horizontally, randomly convert it to grayscale, and finally randomly crop it back to the original size of 32x32.

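from torchvision import transforms

transform_train = transforms.Compose(
    [transforms.Pad(4),  # pad 4 px on every side: 32x32 -> 40x40
     transforms.ToTensor(),
     # per-channel mean and std (the commonly used ImageNet statistics)
     transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
     transforms.RandomHorizontalFlip(),
     transforms.RandomGrayscale(),
     transforms.RandomCrop(32, padding=4),  # pad again, then crop back to 32x32
     ])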
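The size bookkeeping of the padding and cropping steps can be followed with a small pure-Python sketch. It works on a single-channel image stored as nested lists and is not torchvision's implementation; `pad2d` and `random_crop` are hypothetical helpers written for illustration.

```python
import random


def pad2d(image, pad, fill=0):
    """Surround a 2D list with `pad` rows/columns of `fill` on every side."""
    width = len(image[0])
    blank_row = [fill] * (width + 2 * pad)
    padded = [list(blank_row) for _ in range(pad)]
    for row in image:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    padded.extend(list(blank_row) for _ in range(pad))
    return padded


def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


rng = random.Random(0)
image = [[1] * 32 for _ in range(32)]         # stand-in for one 32x32 channel
padded = pad2d(image, 4)                      # like Pad(4): 32x32 -> 40x40
padded_again = pad2d(padded, 4)               # the padding=4 inside RandomCrop: 40x40 -> 48x48
cropped = random_crop(padded_again, 32, rng)  # crop back to 32x32

print(len(padded), len(padded[0]))              # 40 40
print(len(padded_again), len(padded_again[0]))  # 48 48
print(len(cropped), len(cropped[0]))            # 32 32
```

Because the image is padded twice before cropping, the original content occupies only the central 32x32 of a 48x48 canvas, so a crop can be shifted by up to 8 px in each direction and may contain a noticeable border of padding.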

Because the transforms operate on images, and the data we read from the binary files is NumPy data, we convert the NumPy arrays into PIL Image objects to make image processing easier. The conversion is as follows:

# Convert the dataset into a list of PIL Images; otherwise the augmentation does not seem to work
# (this is applied to the training data)
from PIL import Image


def get_PIL_Images(origin_data):
    # Each element of origin_data is one (32, 32, 3) uint8 array
    return [Image.fromarray(img) for img in origin_data]

Create the training DataLoader

train_data = get_PIL_Images(X_train)
train_loader = DataLoader(CIFARDataset(train_data, Y_train, transform_train), batch_size=24, shuffle=True)

The test set does not need much preprocessing, so here is the code for the test DataLoader directly:

# Preprocessing for the test set
transform_test = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]
)
test_loader = DataLoader(CIFARDataset(X_test, Y_test, transform_test), batch_size=24, shuffle=False)
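For a single pixel, the two test-time steps amount to simple arithmetic: `ToTensor` scales 8-bit values to the range [0, 1], and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation. The sketch below reproduces that arithmetic in plain Python; the helper names and the example pixel are made up for illustration.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [v / 255.0 for v in rgb_uint8]
    return [(s - m) / d for s, m, d in zip(scaled, MEAN, STD)]


def denormalize_pixel(normalized):
    """Invert normalize_pixel and round back to 8-bit values."""
    return [round((n * d + m) * 255.0) for n, m, d in zip(normalized, MEAN, STD)]


pixel = (255, 128, 0)  # example pixel (an arbitrary choice)
normalized = normalize_pixel(pixel)
print([round(v, 3) for v in normalized])
print(denormalize_pixel(normalized))  # [255, 128, 0]
```

The inverse function is handy when a normalized tensor needs to be displayed as an image again, and it is a reminder that any image fed to the trained model should go through the same normalization that was used during training.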

Defining the network

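We implement the network in PyTorch based on the VGG16 network described above. Note that this implementation is a VGG16-style variant adapted to 32x32 inputs: it adds batch normalization after each convolution, uses three convolutions in every block (15 convolutional layers in total), and starts with 96 channels.

It is mainly divided into: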

Convolutional Layer
Fully connected layer
Classification layer

The implementation is as follows:

import torch
import torch.nn as nn


class VGG16(nn.Module):
    def __init__(self):
        super(VGG16, self).__init__()
        # Convolutional layers
        self.convolusion = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=3, padding=1),  # with padding=1 the spatial size is unchanged after the convolution
            nn.BatchNorm2d(96),
            nn.ReLU(inplace=True),
            nn.Conv2d(96, 96, kernel_size=3, padding=1),
            nn.BatchNorm2d(96),
            nn.ReLU(inplace=True),
            nn.Conv2d(96, 96, kernel_size=3, padding=1),
            nn.BatchNorm2d(96),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.AvgPool2d(kernel_size=1, stride=1)
        )
        # Fully connected layers
        self.dense = nn.Sequential(
            nn.Linear(512, 4096),  # a 32x32 image is only 1x1 after 5 max-pooling steps, so 512 channels feed the fully connected layer
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
        )
        # Output layer
        self.classifier = nn.Linear(4096, 10)

    def forward(self, x):
        out = self.convolusion(x)
        out = out.view(out.size(0), -1)
        out = self.dense(out)
        out = self.classifier(out)
        return out
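The input size of the first fully connected layer follows from how the five pooling layers shrink the feature map. A minimal sketch of that arithmetic, assuming 2x2 pooling with stride 2 and size-preserving convolutions as in the network above (`feature_map_size` is a hypothetical helper):

```python
def feature_map_size(input_size, num_pools, pool_kernel=2, pool_stride=2):
    """Side length of the feature map after num_pools max-pooling layers."""
    size = input_size
    for _ in range(num_pools):
        size = (size - pool_kernel) // pool_stride + 1
    return size


final_size = feature_map_size(32, 5)        # 32 -> 16 -> 8 -> 4 -> 2 -> 1
flattened = 512 * final_size * final_size   # 512 channels in the last block

print(final_size, flattened)                # 1 512
print(512 * feature_map_size(224, 5) ** 2)  # 25088
```

With 224x224 inputs the same stack would leave a 7x7 map and 25088 flattened features, which is why `nn.Linear(512, 4096)` only fits 32x32 inputs.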

Training and Testing

For training and testing, you only need to instantiate the model, define the optimizer, the loss function, and the learning-rate schedule, and then run training and testing.

The code is as follows:

Hyperparameter definition:

import torch.optim as optim

# Define the model for training
model = VGG16()
# model.load_state_dict(torch.load('./my-VGG16.pth'))
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-3)
loss_func = nn.CrossEntropyLoss()
# Multiply the learning rate by 0.4 every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.4, last_epoch=-1)
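With `step_size=5` and `gamma=0.4`, the step schedule multiplies the learning rate by 0.4 once every 5 epochs. Assuming the scheduler is stepped once per epoch, the rate used in a given epoch follows a simple closed form, sketched here in plain Python (`step_lr` is a hypothetical helper, not part of PyTorch):

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.4):
    """Learning rate used during `epoch` (0-based) under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 4, 5, 10, 39):
    print(epoch, step_lr(0.01, epoch))
```

Over 40 epochs the rate is reduced seven times, ending at about 1.6e-5, so most of the learning happens in the first 15 to 20 epochs.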

Test function:

def test():
    # Uses the globals model, test_loader, device and accuracy_rate defined in the training code
    model.eval()
    correct = 0  # number of correctly predicted images
    total = 0  # total number of images
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            outputs = model(images).cpu()  # move the logits back to the CPU, where the labels are
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    accuracy_rate.append(accuracy)
    print(f'Accuracy: {accuracy}%')

Training epochs:

# Define the training procedure
total_times = 40
accuracy_rate = []
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(total_times):
    model.train()
    running_loss = 0.0
    total_correct = 0
    total_trainset = 0
    print("epoch: ", epoch)
    for i, (data, labels) in enumerate(train_loader):
        data = data.to(device)
        labels = labels.to(device)
        outputs = model(data)
        loss = loss_func(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        _, pred = outputs.max(1)
        total_correct += (pred == labels).sum().item()
        total_trainset += data.shape[0]
        if i % 100 == 0 and i > 0:
            # running_loss is the loss summed over the last 100 iterations
            print(f"Training iteration {i}, running_loss={running_loss}")
            running_loss = 0.0
    print(f"Training accuracy: {100 * total_correct / total_trainset}%")
    test()
    scheduler.step()

Save the trained model and plot the test accuracy per epoch:

import matplotlib.pyplot as plt

torch.save(model.state_dict(), './my-VGG16.pth')
accuracy_rate = np.array(accuracy_rate)
times = np.linspace(1, total_times, total_times)
plt.xlabel('times')
plt.ylabel('accuracy rate')
plt.plot(times, accuracy_rate)
plt.show()
print(accuracy_rate)

Testing

Define the model

model_my_vgg = VGG16()

Load the trained weights

model_my_vgg.load_state_dict(torch.load('./my-VGG16-best.pth',map_location='cpu'))
model_my_vgg.eval()  # switch BatchNorm and Dropout to inference behavior

Preprocess the verification images I found myself

from torchvision import transforms
from PIL import Image

# Define the image preprocessing steps
preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    # transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])

def load_image(image_path):
    image = Image.open(image_path).convert('RGB')  # make sure the image has 3 channels
    image = preprocess(image)
    image = image.unsqueeze(0)  # add the batch dimension
    return image

image_data = load_image('./plane2.jpg')
print(image_data.shape)
output = model_my_vgg(image_data)
# Also check one sample from the first CIFAR batch
verify_data = X1[9]
verify_label = Y1[9]
output_verify = model_my_vgg(transform_test(verify_data).unsqueeze(0))
print(output)
print(output_verify)

Output:

torch.Size([1, 3, 32, 32])
tensor([[ 1.5990, -0.5269,  0.7254,  0.3432, -0.5036, -0.3267, -0.5302, -0.9417,
          0.4186, -0.1213]], grad_fn=<AddmmBackward0>)
tensor([[-0.6541, -2.0759,  0.6308,  1.9791,  0.8525,  1.2313,  0.1856,  0.3243,
         -1.3374, -1.0211]], grad_fn=<AddmmBackward0>)

Print the results

# argmax over the class dimension gives the predicted class index
print(label_names[torch.argmax(output, dim=1).item()])
print(label_names[verify_label])
print("pred:", label_names[torch.argmax(output_verify, dim=1).item()])

airplane
cat
pred: cat
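The printed labels can be reproduced by hand from the logits shown in the output above. The sketch below copies those ten values per image, applies a softmax to turn them into probabilities, and takes the index of the largest one. It uses plain Python only; `softmax` and `predict` are hypothetical helpers, and the probabilities are for illustration rather than values reported by the original run.

```python
import math

label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Logits copied from the printed tensors above.
logits_plane = [1.5990, -0.5269, 0.7254, 0.3432, -0.5036,
                -0.3267, -0.5302, -0.9417, 0.4186, -0.1213]
logits_verify = [-0.6541, -2.0759, 0.6308, 1.9791, 0.8525,
                 1.2313, 0.1856, 0.3243, -1.3374, -1.0211]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_names[best], probs[best]


print(predict(logits_plane)[0])   # airplane
print(predict(logits_verify)[0])  # cat
```

Softmax does not change which class wins, so `argmax` on the raw logits is enough for a label; the probabilities are useful when a confidence value is wanted.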


Verifying a horse image

[Figure: verification result for the horse image]

Verifying a dog image

[Figure: verification result for the dog image]
