Technology Sharing

Exploring K8s GPU resource management: Deploying the AI large model Ollama on KubeSphere

2024-07-12


Author: Xingzhu, operator of "Operation and Maintenance Skills"

With the rapid development of artificial intelligence, machine learning, and large AI models, the demand for computing resources keeps rising. For AI large models in particular, which must process large-scale data and complex algorithms, GPU resources have become crucial. For operations engineers, knowing how to manage and configure GPU resources on a Kubernetes cluster, and how to efficiently deploy applications that rely on them, has become an indispensable skill.

Today, I will walk you through how to use the powerful Kubernetes ecosystem and tooling on the KubeSphere platform to manage GPU resources and deploy applications. This article explores three core topics:

  1. Cluster expansion and GPU node integration: We will use the KubeKey tool to expand the Kubernetes cluster and add Worker nodes with GPU capabilities to provide the necessary hardware support for AI applications.
  2. Kubernetes integration of GPU resources: Use Helm to install and configure the NVIDIA GPU Operator, NVIDIA's official solution that simplifies the use and management of GPU resources in Kubernetes clusters.
  3. Practical deployment: Ollama large model management tool: We will deploy Ollama, a management tool designed for large AI models, on KubeSphere to verify whether GPU resources can be correctly scheduled and used efficiently.

By reading this article, you will gain the knowledge and skills to manage GPU resources on Kubernetes, helping you to make full use of GPU resources in a cloud-native environment and promote the rapid development of AI applications.

The hardware configuration and software information of the experimental environment used in the "KubeSphere Best Practice 2024" series of documents are as follows:

Actual server configuration (the architecture is a 1:1 replica of a small-scale production environment; the specifications differ slightly):

| Host name | IP | CPU (cores) | Memory (GB) | System disk (GB) | Data disk (GB) | Purpose |
| --- | --- | --- | --- | --- | --- | --- |
| ksp-registry | 192.168.9.90 | 4 | 8 | 40 | 200 | Harbor image registry |
| ksp-control-1 | 192.168.9.91 | 4 | 8 | 40 | 100 | KubeSphere/k8s-control-plane |
| ksp-control-2 | 192.168.9.92 | 4 | 8 | 40 | 100 | KubeSphere/k8s-control-plane |
| ksp-control-3 | 192.168.9.93 | 4 | 8 | 40 | 100 | KubeSphere/k8s-control-plane |
| ksp-worker-1 | 192.168.9.94 | 4 | 16 | 40 | 100 | k8s-worker/CI |
| ksp-worker-2 | 192.168.9.95 | 4 | 16 | 40 | 100 | k8s-worker |
| ksp-worker-3 | 192.168.9.96 | 4 | 16 | 40 | 100 | k8s-worker |
| ksp-storage-1 | 192.168.9.97 | 4 | 8 | 40 | 300+ | ElasticSearch/Ceph/Longhorn/NFS |
| ksp-storage-2 | 192.168.9.98 | 4 | 8 | 40 | 300+ | ElasticSearch/Ceph/Longhorn |
| ksp-storage-3 | 192.168.9.99 | 4 | 8 | 40 | 300+ | ElasticSearch/Ceph/Longhorn |
| ksp-gpu-worker-1 | 192.168.9.101 | 4 | 16 | 40 | 100 | k8s-worker (GPU: NVIDIA Tesla M40 24G) |
| ksp-gpu-worker-2 | 192.168.9.102 | 4 | 16 | 40 | 100 | k8s-worker (GPU: NVIDIA Tesla P100 16G) |
| ksp-gateway-1 | 192.168.9.103 | 2 | 4 | 40 | | Self-built application service proxy gateway / VIP: 192.168.9.100 |
| ksp-gateway-2 | 192.168.9.104 | 2 | 4 | 40 | | Self-built application service proxy gateway / VIP: 192.168.9.100 |
| ksp-mid | 192.168.9.105 | 4 | 8 | 40 | 100 | Service node outside the k8s cluster (GitLab, etc.) |
| Total (15 hosts) | | 56 | 152 | 600 | 2000 | |

Software version information for the hands-on environment:

  • Operating system: openEuler 22.03 LTS SP3 x86_64
  • KubeSphere: v3.4.1
  • Kubernetes: v1.28.8
  • KubeKey: v3.1.1
  • Containerd: 1.7.13
  • NVIDIA GPU Operator: v24.3.0
  • NVIDIA graphics driver: 550.54.15

1. Prerequisites

1.1 Prepare a Worker node with a graphics card

Due to resource and cost constraints, I don't have a high-end physical host and graphics card for this experiment, so I can only add two virtual machines, each equipped with an entry-level GPU, as Worker nodes for the cluster.

  • Node 1 is equipped with an NVIDIA Tesla M40 24G graphics card. Its only advantage is the large 24 GB of video memory; its performance is low.
  • Node 2 is equipped with an NVIDIA Tesla P100 16G graphics card. It has less video memory but is faster than cards such as the M40 and P40.

Although these graphics cards are not as good as high-end models in performance, they are sufficient for most learning and development tasks. With limited resources, such a configuration provides me with valuable practical opportunities to explore the management and scheduling strategies of GPU resources in Kubernetes clusters.

1.2 Operating system initialization configuration

Please refer to the "Kubernetes cluster node openEuler 22.03 LTS SP3 system initialization guide" to complete the operating system initialization configuration.

The initialization guide does not cover operating system upgrades. When initializing a system with Internet access, be sure to upgrade the operating system and then reboot the node.
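On openEuler this can be done with dnf, for example (a minimal sketch, assuming Internet access; adjust repositories and timing to your environment):

# Upgrade all packages on the node, then reboot to load the new kernel (run as root)
dnf update -y
reboot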

2. Use KubeKey to expand GPU Worker nodes

Next, we use KubeKey to add the new GPU nodes to the existing Kubernetes cluster. Following the official documentation, the whole process is relatively simple and requires only two steps:

  • Modify the cluster configuration file used when deploying KubeKey
  • Execute the command to add a node

2.1 Modify the cluster configuration file

On the Control-1 node, switch to the directory where KubeKey was used for the original deployment and modify the original cluster configuration file. In this walkthrough the file is named ksp-v341-v1288.yaml; adjust the name to match your environment.

Main modifications:

  • spec.hosts section: Add information about new worker nodes.
  • spec.roleGroups.worker section: Add information about the new worker nodes.

The modified example is as follows:

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: opsxlab
spec:
  hosts:
  ......(unchanged)
  - {name: ksp-gpu-worker-1, address: 192.168.9.101, internalAddress: 192.168.9.101, user: root, password: "OpsXlab@2024"}
  - {name: ksp-gpu-worker-2, address: 192.168.9.102, internalAddress: 192.168.9.102, user: root, password: "OpsXlab@2024"}
  roleGroups:
    ......(unchanged)
    worker:
    ......(unchanged)
    - ksp-gpu-worker-1
    - ksp-gpu-worker-2

# The rest of the file remains unchanged

2.2 Use KubeKey to add nodes

Before adding nodes, let's confirm the node information of the current cluster.

$ kubectl get nodes -o wide
NAME            STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION                       CONTAINER-RUNTIME
ksp-control-1   Ready    control-plane   24h   v1.28.8   192.168.9.91   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13
ksp-control-2   Ready    control-plane   24h   v1.28.8   192.168.9.92   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13
ksp-control-3   Ready    control-plane   24h   v1.28.8   192.168.9.93   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13
ksp-worker-1    Ready    worker          24h   v1.28.8   192.168.9.94   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13
ksp-worker-2    Ready    worker          24h   v1.28.8   192.168.9.95   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13
ksp-worker-3    Ready    worker          24h   v1.28.8   192.168.9.96   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64   containerd://1.7.13

Next, we execute the following command to use the modified configuration file to add the newly added Worker node to the cluster.

export KKZONE=cn
./kk add nodes -f ksp-v341-v1288.yaml

After the above command is executed, KubeKey first checks whether the dependencies and other configurations of the target Kubernetes deployment meet the requirements. After the check passes, you will be prompted to confirm the installation. Type yes and press ENTER to continue with the deployment.

It takes about 5 minutes to complete the deployment, depending on the network speed, machine configuration, and the number of nodes added.
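While KubeKey is working, you can optionally watch the new nodes register from another terminal (a convenience check, not required by the procedure):

# Watch the node list update as the new Workers join; press Ctrl-C to stop
kubectl get nodes -w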

Once the deployment is complete, you should see output similar to the following on your terminal.

......
19:29:26 CST [AutoRenewCertsModule] Generate k8s certs renew script
19:29:27 CST success: [ksp-control-2]
19:29:27 CST success: [ksp-control-1]
19:29:27 CST success: [ksp-control-3]
19:29:27 CST [AutoRenewCertsModule] Generate k8s certs renew service
19:29:29 CST success: [ksp-control-3]
19:29:29 CST success: [ksp-control-2]
19:29:29 CST success: [ksp-control-1]
19:29:29 CST [AutoRenewCertsModule] Generate k8s certs renew timer
19:29:30 CST success: [ksp-control-2]
19:29:30 CST success: [ksp-control-1]
19:29:30 CST success: [ksp-control-3]
19:29:30 CST [AutoRenewCertsModule] Enable k8s certs renew service
19:29:30 CST success: [ksp-control-3]
19:29:30 CST success: [ksp-control-2]
19:29:30 CST success: [ksp-control-1]
19:29:30 CST Pipeline[AddNodesPipeline] execute successfully

3. Verify cluster status after expansion

3.1 KubeSphere Management Console Verifies Cluster Status

Open a browser, access port 30880 on the IP address of the Control-1 node, and log in to the KubeSphere management console.

Enter the cluster management interface, click the "Node" menu on the left, and click "Cluster Node" to view detailed information about the available nodes in the Kubernetes cluster.

3.2 Kubectl command line to verify cluster status

  • View cluster node information

Run the kubectl command on the Control-1 node to obtain the node information of the Kubernetes cluster.

kubectl get nodes -o wide

The output shows that the current Kubernetes cluster has 8 nodes, and lists each node's name, status, role, age, Kubernetes version, internal IP, operating system, kernel version, and container runtime.

    $ kubectl get nodes -o wide
    NAME               STATUS     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION                        CONTAINER-RUNTIME
    ksp-control-1      Ready      control-plane   25h   v1.28.8   192.168.9.91    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13
    ksp-control-2      Ready      control-plane   25h   v1.28.8   192.168.9.92    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13
    ksp-control-3      Ready      control-plane   25h   v1.28.8   192.168.9.93    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13
    ksp-gpu-worker-1   Ready      worker          59m   v1.28.8   192.168.9.101   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-199.0.0.112.oe2203sp3.x86_64   containerd://1.7.13
    ksp-gpu-worker-2   Ready      worker          59m   v1.28.8   192.168.9.102   <none>        openEuler 22.03 (LTS-SP3)   5.10.0-199.0.0.112.oe2203sp3.x86_64   containerd://1.7.13
    ksp-worker-1       Ready      worker          25h   v1.28.8   192.168.9.94    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13
    ksp-worker-2       Ready      worker          25h   v1.28.8   192.168.9.95    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13
    ksp-worker-3       Ready      worker          25h   v1.28.8   192.168.9.96    <none>        openEuler 22.03 (LTS-SP3)   5.10.0-182.0.0.95.oe2203sp3.x86_64    containerd://1.7.13

So far, we have used KubeKey to add two Worker nodes to the existing Kubernetes cluster of three Master nodes and three Worker nodes.

Next, we install the NVIDIA GPU Operator, NVIDIA's official solution, so that K8s can schedule Pods onto GPU resources.

4. Install and configure NVIDIA GPU Operator

4.1 Install the NVIDIA graphics driver

The NVIDIA GPU Operator supports automatic installation of graphics drivers, but only on distributions such as CentOS 7/8 and Ubuntu 20.04/22.04. openEuler is not supported, so the driver must be installed manually.

Please refer to "KubeSphere Best Practice: Installing the NVIDIA graphics driver on openEuler 22.03 LTS SP3" to complete the driver installation.
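After the manual installation, a quick check on each GPU node confirms that the driver and the card are visible (a minimal sanity check; the exact output depends on your driver version and GPU model):

# Run on ksp-gpu-worker-1 and ksp-gpu-worker-2 after installing the driver
nvidia-smi
# The driver version (550.54.15 in this environment) and the GPU model should appear in the header.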

4.2 Prerequisites

Check whether Node Feature Discovery (NFD) is already running in the cluster:

$ kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

If the command returns true, NFD is already running in the cluster. In that case, you must disable the deployment of NFD when installing the Operator (see the sketch after the note below).

Note: By default, NFD is not installed or configured in a K8s cluster deployed with KubeSphere.
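The GPU Operator Helm chart exposes a switch for skipping its bundled NFD; a minimal sketch for clusters where NFD is already present (verify the flag against the chart version you install):

# Only needed when NFD already runs in the cluster
helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set nfd.enabled=false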

4.3 Install NVIDIA GPU Operator

1. Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

2. Install the GPU Operator

Use the default configuration file, disable automatic driver installation, and install the GPU Operator.

helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false

Note: Since the images to be pulled are relatively large, the initial installation may time out. Check whether the images have been pulled successfully. Offline installation can be used to work around this problem.

3. Install the GPU Operator with custom values (optional; used for offline installation or custom configuration)
helm install -f gpu-operator-values.yaml -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
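If you take the custom-values route, gpu-operator-values.yaml holds the chart overrides. A minimal assumed example covering only the options used in this article (a real offline setup would also add image and registry overrides):

# gpu-operator-values.yaml -- minimal assumed example
driver:
  enabled: false    # drivers were installed manually on openEuler
#nfd:
#  enabled: false   # uncomment if NFD already runs in the cluster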

The output of a successful installation (step 2, default configuration) looks like the following:

            $ helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
            NAME: gpu-operator
            LAST DEPLOYED: Tue Jul  2 21:40:29 2024
            NAMESPACE: gpu-operator
            STATUS: deployed
            REVISION: 1
            TEST SUITE: None

4.4 Check GPU Operator deployment status using the command line

After executing the installation command, wait patiently until all images have been pulled and all Pods are in the Running state.

1. Check the Pod status using the command line
            $ kubectl get pods -n gpu-operator
            NAME                                                          READY   STATUS      RESTARTS   AGE
            gpu-feature-discovery-czdf5                                   1/1     Running     0          15m
            gpu-feature-discovery-q9qlm                                   1/1     Running     0          15m
            gpu-operator-67c68ddccf-x29pm                                 1/1     Running     0          15m
            gpu-operator-node-feature-discovery-gc-57457b6d8f-zjqhr       1/1     Running     0          15m
            gpu-operator-node-feature-discovery-master-5fb74ff754-fzbzm   1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-68459              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-74ps5              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-dpmg9              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-jvk4t              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-k5kwq              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-ll4bk              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-p4q5q              1/1     Running     0          15m
            gpu-operator-node-feature-discovery-worker-rmk99              1/1     Running     0          15m
            nvidia-container-toolkit-daemonset-9zcnj                      1/1     Running     0          15m
            nvidia-container-toolkit-daemonset-kcz9g                      1/1     Running     0          15m
            nvidia-cuda-validator-l8vjb                                   0/1     Completed   0          14m
            nvidia-cuda-validator-svn2p                                   0/1     Completed   0          13m
            nvidia-dcgm-exporter-9lq4c                                    1/1     Running     0          15m
            nvidia-dcgm-exporter-qhmkg                                    1/1     Running     0          15m
            nvidia-device-plugin-daemonset-7rvfm                          1/1     Running     0          15m
            nvidia-device-plugin-daemonset-86gx2                          1/1     Running     0          15m
            nvidia-operator-validator-csr2z                               1/1     Running     0          15m
            nvidia-operator-validator-svlc4                               1/1     Running     0          15m
2. View the GPU resources that can be allocated on a node
            $ kubectl describe node ksp-gpu-worker-1 | grep "^Capacity" -A 7
            Capacity:
              cpu:                4
              ephemeral-storage:  35852924Ki
              hugepages-1Gi:      0
              hugepages-2Mi:      0
              memory:             15858668Ki
              nvidia.com/gpu:     1
              pods:               110

Note: Pay attention to the value of the nvidia.com/gpu field.
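To check allocatable GPUs across every node at once, a jq one-liner along the following lines can help (a convenience sketch, not part of the original procedure; nodes without GPUs show 0):

# List allocatable nvidia.com/gpu per node (prints 0 for nodes without a GPU)
kubectl get nodes -o json | jq -r '.items[] | [.metadata.name, (.status.allocatable["nvidia.com/gpu"] // "0")] | @tsv'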

4.5 Check GPU Operator deployment status in the KubeSphere console

The workloads created successfully are as follows:

• Deployments

• DaemonSets

5. GPU functional verification tests

5.1 Test Example 1: CUDA verification test

After the GPU Operator is installed correctly, use the CUDA base image to test whether K8s can correctly create a Pod that uses GPU resources.

1. Create a resource manifest file: vi cuda-ubuntu.yaml
            apiVersion: v1
            kind: Pod
            metadata:
              name: cuda-ubuntu2204
            spec:
              restartPolicy: OnFailure
              containers:
              - name: cuda-ubuntu2204
                image: "nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04"
                resources:
                  limits:
                    nvidia.com/gpu: 1
                command: ["nvidia-smi"]
2. Create the resource
kubectl apply -f cuda-ubuntu.yaml

3. View the created resources

The results show that the Pod was created on the ksp-gpu-worker-2 node (the graphics card on this node is a Tesla P100-PCIE-16GB).

              $ kubectl get pods -o wide
              NAME                      READY   STATUS      RESTARTS   AGE   IP             NODE               NOMINATED NODE   READINESS GATES
              cuda-ubuntu2204           0/1     Completed   0          73s   10.233.99.15   ksp-gpu-worker-2   <none>           <none>
              ollama-79688d46b8-vxmhg   1/1     Running     0          47m   10.233.72.17   ksp-gpu-worker-1   <none>           <none>
4. View the Pod logs
kubectl logs pod/cuda-ubuntu2204

The output of correct execution is as follows:

                $ kubectl logs pod/cuda-ubuntu2204
                Mon Jul  8 11:10:59 2024
                +-----------------------------------------------------------------------------------------+
                | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
                |-----------------------------------------+------------------------+----------------------+
                | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
                | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
                |                                         |                        |               MIG M. |
                |=========================================+========================+======================|
                |   0  Tesla P100-PCIE-16GB           Off |   00000000:00:10.0 Off |                    0 |
                | N/A   40C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
                |                                         |                        |                  N/A |
                +-----------------------------------------+------------------------+----------------------+
                
                +-----------------------------------------------------------------------------------------+
                | Processes:                                                                              |
                |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
                |        ID   ID                                                               Usage      |
                |=========================================================================================|
                |  No running processes found                                                             |
                +-----------------------------------------------------------------------------------------+
5. Clean up the test resources
kubectl delete -f cuda-ubuntu.yaml

5.2 Test Example 2: Official GPU application example

Run a simple official CUDA sample that adds two vectors.

1. Create a resource manifest file: vi cuda-vectoradd.yaml
                  apiVersion: v1
                  kind: Pod
                  metadata:
                    name: cuda-vectoradd
                  spec:
                    restartPolicy: OnFailure
                    containers:
                    - name: cuda-vectoradd
                      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
                      resources:
                        limits:
                          nvidia.com/gpu: 1
2. Execute the command to create the Pod
$ kubectl apply -f cuda-vectoradd.yaml

3. View the Pod execution results

After the Pod starts, it runs the vectorAdd command and then exits. View its logs:

$ kubectl logs pod/cuda-vectoradd

                      The output of correct execution is as follows:

                      [Vector addition of 50000 elements]
                      Copy input data from the host memory to the CUDA device
                      CUDA kernel launch with 196 blocks of 256 threads
                      Copy output data from the CUDA device to the host memory
                      Test PASSED
                      Done
4. Clean up the test resources
kubectl delete -f cuda-vectoradd.yaml

6. Deploy Ollama on KubeSphere

The verification tests above prove that Pods using GPU resources can be created on the K8s cluster. Next, based on actual usage needs, we use KubeSphere to deploy Ollama, a large model management tool, in the K8s cluster.

6.1 Create the deployment resource manifest

This example is a simple test, so the storage uses hostPath; replace it with a StorageClass or another type of persistent storage in real use (a sketch of such a replacement follows below).
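As a hedged illustration of that replacement, the hostPath volume could be swapped for a PersistentVolumeClaim along the following lines (assumes a StorageClass exists in your cluster; the name local is only a placeholder):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ollama-models
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local   # placeholder, use your cluster's StorageClass
  resources:
    requests:
      storage: 50Gi

# In the Deployment below, the ollama-models volume would then become:
#   volumes:
#     - name: ollama-models
#       persistentVolumeClaim:
#         claimName: ollama-models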

1. Create a resource manifest file: vi deploy-ollama.yaml
                        kind: Deployment
                        apiVersion: apps/v1
                        metadata:
                          name: ollama
                          namespace: default
                          labels:
                            app: ollama
                        spec:
                          replicas: 1
                          selector:
                            matchLabels:
                              app: ollama
                          template:
                            metadata:
                              labels:
                                app: ollama
                            spec:
                              volumes:
                                - name: ollama-models
                                  hostPath:
                                    path: /data/openebs/local/ollama
                                    type: ''
                                - name: host-time
                                  hostPath:
                                    path: /etc/localtime
                                    type: ''
                              containers:
                                - name: ollama
                                  image: 'ollama/ollama:latest'
                                  ports:
                                    - name: http-11434
                                      containerPort: 11434
                                      protocol: TCP
                                  resources:
                                    limits:
                                      nvidia.com/gpu: '1'
                                    requests:
                                      nvidia.com/gpu: '1'
                                  volumeMounts:
                                    - name: ollama-models
                                      mountPath: /root/.ollama
                                    - name: host-time
                                      readOnly: true
                                      mountPath: /etc/localtime
                                  imagePullPolicy: IfNotPresent
                              restartPolicy: Always
                        ---
                        kind: Service
                        apiVersion: v1
                        metadata:
                          name: ollama
                          namespace: default
                          labels:
                            app: ollama
                        spec:
                          ports:
                            - name: http-11434
                              protocol: TCP
                              port: 11434
                              targetPort: 11434
                              nodePort: 31434
                          selector:
                            app: ollama
                          type: NodePort

Special note: The KubeSphere management console also supports configuring GPU resources for Deployments and other workloads graphically; interested readers can explore this on their own.

6.2 Deploy the Ollama Service

• Create Ollama
kubectl apply -f deploy-ollama.yaml

• View the Pod creation results

The results show that the Pod was created on the ksp-gpu-worker-1 node (the graphics card on this node is a Tesla M40 24GB).

$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE               NOMINATED NODE   READINESS GATES
ollama-79688d46b8-vxmhg   1/1     Running   0          12s   10.233.72.17   ksp-gpu-worker-1   <none>           <none>
                          • View container log
                          [root@ksp-control-1 ~]# kubectl logs ollama-79688d46b8-vxmhg
                          2024/07/08 18:24:27 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
                          time=2024-07-08T18:24:27.829+08:00 level=INFO source=images.go:730 msg="total blobs: 5"
                          time=2024-07-08T18:24:27.829+08:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
                          time=2024-07-08T18:24:27.830+08:00 level=INFO source=routes.go:1111 msg="Listening on [::]:11434 (version 0.1.48)"
                          time=2024-07-08T18:24:27.830+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2414166698/runners
                          time=2024-07-08T18:24:32.454+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60101]"
                          time=2024-07-08T18:24:32.567+08:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-9e48dc13-f8f1-c6bb-860f-c82c96df22a4 library=cuda compute=5.2 driver=12.4 name="Tesla M40 24GB" total="22.4 GiB" available="22.3 GiB"

6.3 Pull the model used by Ollama

• Pull the model with Ollama

To save time, this example uses Alibaba's open-source qwen2 1.5b small model as the test model.

                          kubectl exec -it ollama-79688d46b8-vxmhg -- ollama pull qwen2:1.5b

                            The output of correct execution is as follows:

                            [root@ksp-control-1 ~]# kubectl exec -it ollama-79688d46b8-vxmhg -- ollama pull qwen2:1.5b
                            pulling manifest
                            pulling 405b56374e02... 100% ▕█████████████████████████████████████████████████████▏ 934 MB
                            pulling 62fbfd9ed093... 100% ▕█████████████████████████████████████████████████████▏  182 B
                            pulling c156170b718e... 100% ▕█████████████████████████████████████████████████████▏  11 KB
                            pulling f02dd72bb242... 100% ▕█████████████████████████████████████████████████████▏   59 B
                            pulling c9f5e9ffbc5f... 100% ▕█████████████████████████████████████████████████████▏  485 B
                            verifying sha256 digest
                            writing manifest
                            removing any unused layers
                            success
• View the model files on disk

On the ksp-gpu-worker-1 node, run the following command:

                            $ ls -R /data/openebs/local/ollama/
                            /data/openebs/local/ollama/:
                            id_ed25519  id_ed25519.pub  models
                            
                            /data/openebs/local/ollama/models:
                            blobs  manifests
                            
                            /data/openebs/local/ollama/models/blobs:
                            sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
                            sha256-62fbfd9ed093d6e5ac83190c86eec5369317919f4b149598d2dbb38900e9faef
                            sha256-c156170b718ec29139d3653d40ed1986fd92fb7e0959b5c71f3c48f62e6636f4
                            sha256-c9f5e9ffbc5f14febb85d242942bd3d674a8e4c762aaab034ec88d6ba839b596
                            sha256-f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216
                            
                            /data/openebs/local/ollama/models/manifests:
                            registry.ollama.ai
                            
                            /data/openebs/local/ollama/models/manifests/registry.ollama.ai:
                            library
                            
                            /data/openebs/local/ollama/models/manifests/registry.ollama.ai/library:
                            qwen2
                            
                            /data/openebs/local/ollama/models/manifests/registry.ollama.ai/library/qwen2:
                            1.5b

6.4 Model capability test

• Test the API
                            curl http://192.168.9.91:31434/api/chat -d '{
                              "model": "qwen2:1.5b",
                              "messages": [
                                { "role": "user", "content": "用20个字,介绍你自己" }
                              ]
                            }'
                            • Test Results
                            $ curl http://192.168.9.91:31434/api/chat -d '{
                              "model": "qwen2:1.5b",
                              "messages": [
                                { "role": "user", "content": "用20个字,介绍你自己" }
                              ]
                            }'
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.011798927Z","message":{"role":"assistant","content":"我"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.035291669Z","message":{"role":"assistant","content":"是一个"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.06360233Z","message":{"role":"assistant","content":"人工智能"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.092411266Z","message":{"role":"assistant","content":"助手"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.12016935Z","message":{"role":"assistant","content":","},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.144921623Z","message":{"role":"assistant","content":"专注于"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.169803961Z","message":{"role":"assistant","content":"提供"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.194796364Z","message":{"role":"assistant","content":"信息"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.21978104Z","message":{"role":"assistant","content":"和"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.244976103Z","message":{"role":"assistant","content":"帮助"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.270233992Z","message":{"role":"assistant","content":"。"},"done":false}
                            {"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.29548561Z","message":{"role":"assistant","content":""},"done_reason":"stop","done":true,"total_duration":454377627,"load_duration":1535754,"prompt_eval_duration":36172000,"eval_count":12,"eval_duration":287565000}

6.5 View GPU allocation information

• View the GPU resources allocated on the Worker node
                            $ kubectl describe node ksp-gpu-worker-1 | grep "Allocated resources" -A 9
                            Allocated resources:
                              (Total limits may be over 100 percent, i.e., overcommitted.)
                              Resource           Requests        Limits
                              --------           --------        ------
                              cpu                487m (13%)      2 (55%)
                              memory             315115520 (2%)  800Mi (5%)
                              ephemeral-storage  0 (0%)          0 (0%)
                              hugepages-1Gi      0 (0%)          0 (0%)
                              hugepages-2Mi      0 (0%)          0 (0%)
                              nvidia.com/gpu     1               1
• Physical GPU usage while Ollama is running

Run nvidia-smi -l on the Worker node to observe GPU usage (see the example below).
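A slightly more focused variant is shown below as an assumed convenience; the query fields are standard nvidia-smi options:

# Refresh every 2 seconds, showing only the fields relevant to model inference
nvidia-smi -l 2 --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total --format=csv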

Disclaimer:

  • The author's knowledge is limited. Although the content has been checked and verified multiple times to ensure its accuracy, omissions may still exist. Advice from industry experts is welcome.
  • The contents of this article have only been verified and tested in the author's hands-on environment. Readers may learn from and refer to them, but it is strictly forbidden to use them directly in a production environment. The author is not responsible for any problems caused by doing so.

This article was published with OpenWrite, a tool for publishing one post to multiple blog platforms.