深入分析kubelet（2）——创建Pod

紧接着上一篇继续学习。上一篇讲到生产者，本篇将介绍消费者。

background

PodManager

k8s.io\kubernetes\pkg\kubelet\pod\pod_manager.go

// Manager stores and manages access to pods, maintaining the mappings
// between static pods and mirror pods.
//
// The kubelet discovers pod updates from 3 sources: file, http, and
// apiserver. Pods from non-apiserver sources are called static pods, and API
// server is not aware of the existence of static pods. In order to monitor
// the status of such pods, the kubelet creates a mirror pod for each static
// pod via the API server.
//
// A mirror pod has the same pod full name (name and namespace) as its static
// counterpart (albeit different metadata such as UID, etc). By leveraging the
// fact that the kubelet reports the pod status using the pod full name, the
// status of the mirror pod always reflects the actual status of the static
// pod. When a static pod gets deleted, the associated orphaned mirror pod
// will also be removed.
type Manager interface {
    // GetPods returns the regular pods bound to the kubelet and their spec.
    GetPods() []*v1.Pod
    // GetPodByFullName returns the (non-mirror) pod that matches full name, as well as
    // whether the pod was found.
    GetPodByFullName(podFullName string) (*v1.Pod, bool)
    // GetPodByName provides the (non-mirror) pod that matches namespace and
    // name, as well as whether the pod was found.
    GetPodByName(namespace, name string) (*v1.Pod, bool)
    // GetPodByUID provides the (non-mirror) pod that matches pod UID, as well as
    // whether the pod is found.
    GetPodByUID(types.UID) (*v1.Pod, bool)
    // GetPodByMirrorPod returns the static pod for the given mirror pod and
    // whether it was known to the pod manger.
    GetPodByMirrorPod(*v1.Pod) (*v1.Pod, bool)
    // GetMirrorPodByPod returns the mirror pod for the given static pod and
    // whether it was known to the pod manager.
    GetMirrorPodByPod(*v1.Pod) (*v1.Pod, bool)
    // GetPodsAndMirrorPods returns the both regular and mirror pods.
    GetPodsAndMirrorPods() ([]*v1.Pod, []*v1.Pod)
    // SetPods replaces the internal pods with the new pods.
    // It is currently only used for testing.
    SetPods(pods []*v1.Pod)
    // AddPod adds the given pod to the manager.
    AddPod(pod *v1.Pod)
    // UpdatePod updates the given pod in the manager.
    UpdatePod(pod *v1.Pod)
    // DeletePod deletes the given pod from the manager.  For mirror pods,
    // this means deleting the mappings related to mirror pods.  For non-
    // mirror pods, this means deleting from indexes for all non-mirror pods.
    DeletePod(pod *v1.Pod)
    // DeleteOrphanedMirrorPods deletes all mirror pods which do not have
    // associated static pods. This method sends deletion requests to the API
    // server, but does NOT modify the internal pod storage in basicManager.
    DeleteOrphanedMirrorPods()
    // TranslatePodUID returns the actual UID of a pod. If the UID belongs to
    // a mirror pod, returns the UID of its static pod. Otherwise, returns the
    // original UID.
    //
    // All public-facing functions should perform this translation for UIDs
    // because user may provide a mirror pod UID, which is not recognized by
    // internal Kubelet functions.
    TranslatePodUID(uid types.UID) kubetypes.ResolvedPodUID
    // GetUIDTranslations returns the mappings of static pod UIDs to mirror pod
    // UIDs and mirror pod UIDs to static pod UIDs.
    GetUIDTranslations() (podToMirror map[kubetypes.ResolvedPodUID]kubetypes.MirrorPodUID, mirrorToPod map[kubetypes.MirrorPodUID]kubetypes.ResolvedPodUID)
    // IsMirrorPodOf returns true if mirrorPod is a correct representation of
    // pod; false otherwise.
    IsMirrorPodOf(mirrorPod, pod *v1.Pod) bool

    MirrorClient
}

上一篇介绍过static pod，与之对应的概念是mirror pod

Pod status

参考k8s.io\api\core\v1\types.go

Pending: Pending状态是指api-server已经接受了Pod的创建请求，但是有容器还没有启动。Pending包括Pod未被调度以及拉取镜像阶段。PodPending means the pod has been accepted by the system, but one or more of the containers has not been started. This includes time before being bound to a node, as well as time spent pulling images onto the host.
Running: Running是指所有的容器都已经运行过了，其中至少一个容器正在运行或者正在重新运行。PodRunning means the pod has been bound to a node and all of the containers have been started. At least one container is still running or is in the process of being restarted.
Succeeded: Succeeded表示所有容器都成功运行结束了，并且K8S不会重新启动这些容器。PodSucceeded means that all containers in the pod have voluntarily terminated with a container exit code of 0, and the system is not going to restart any of these containers.
Failed: Failed表示所有容器都不运行了，并且至少一个容器异常退出。PodFailed means that all containers in the pod have terminated, and at least one container has terminated in a failure (exited with a non-zero exit code or was stopped by the system).
Unknown: Unknown表示因为默写原因无法获知Pod状态，常见于Node失联。PodUnknown means that for some reason the state of the pod could not be obtained, typically due to an error in communicating with the host of the pod.

PodWorkers

k8s.io\kubernetes\pkg\kubelet\pod_workers.go

// PodWorkers is an abstract interface for testability.
type PodWorkers interface {
    UpdatePod(options *UpdatePodOptions)
    ForgetNonExistingPodWorkers(desiredPods map[types.UID]empty)
    ForgetWorker(uid types.UID)
}

type podWorkers struct {
    // Protects all per worker fields.
    podLock sync.Mutex

    // Tracks all running per-pod goroutines - per-pod goroutine will be
    // processing updates received through its corresponding channel.
    podUpdates map[types.UID]chan UpdatePodOptions
    // Track the current state of per-pod goroutines.
    // Currently all update request for a given pod coming when another
    // update of this pod is being processed are ignored.
    isWorking map[types.UID]bool
    // Tracks the last undelivered work item for this pod - a work item is
    // undelivered if it comes in while the worker is working.
    lastUndeliveredWorkUpdate map[types.UID]UpdatePodOptions

    workQueue queue.WorkQueue

    // This function is run to sync the desired stated of pod.
    // NOTE: This function has to be thread-safe - it can be called for
    // different pods at the same time.
    syncPodFn syncPodFnType

    // The EventRecorder to use
    recorder record.EventRecorder

    // backOffPeriod is the duration to back off when there is a sync error.
    backOffPeriod time.Duration

    // resyncInterval is the duration to wait until the next sync.
    resyncInterval time.Duration

    // podCache stores kubecontainer.PodStatus for all pods.
    podCache kubecontainer.Cache
}

code

k8s.io\kubernetes\pkg\kubelet\kubelet.go

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    kl.syncLoop(updates, kl)
}

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    // The resyncTicker wakes up kubelet to checks if there are any pod workers
    // that need to be sync'd. A one-second period is sufficient because the
    // sync interval is defaulted to 10s.
    syncTicker := time.NewTicker(time.Second)
    defer syncTicker.Stop()
    // 2s
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
    defer housekeepingTicker.Stop()
    plegCh := kl.pleg.Watch()
    for {
        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}

可以看到，updates贯穿整个过程，是一个非常重要的概念，所以上一篇整篇都在分析updates的由来。这里我们重点关注syncLoopIteration。

// syncLoopIteration reads from various channels and dispatches pods to the
// given handler.
//
// Arguments:
// 1.  configCh:       a channel to read config events from
// 2.  handler:        the SyncHandler to dispatch pods to
// 3.  syncCh:         a channel to read periodic sync events from
// 4.  houseKeepingCh: a channel to read housekeeping events from
// 5.  plegCh:         a channel to read PLEG updates from
//
// Events are also read from the kubelet liveness manager's update channel.
//
// The workflow is to read from one of the channels, handle that event, and
// update the timestamp in the sync loop monitor.
//
// With that in mind, in truly no particular order, the different channels
// are handled as follows:
//
// * configCh: dispatch the pods for the config change to the appropriate
//             handler callback for the event type
// * plegCh: update the runtime cache; sync pod
// * syncCh: sync all pods waiting for sync
// * houseKeepingCh: trigger cleanup of pods
// * liveness manager: sync pods that have failed or in which one or more
//                     containers have failed liveness checks
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        // Update from a config source; dispatch it to the right handler
        // callback.
        if !open {
            glog.Errorf("Update channel is closed. Exiting the sync loop.")
            return false
        }

        switch u.Op {
        case kubetypes.ADD:
            // After restarting, kubelet will get all existing pods through
            // ADD as if they are new pods. These pods will then go through the
            // admission process and *may* be rejected. This can be resolved
            // once we have checkpointing.
            handler.HandlePodAdditions(u.Pods)
        case kubetypes.UPDATE:
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.REMOVE:
            handler.HandlePodRemoves(u.Pods)
        case kubetypes.RECONCILE:
            handler.HandlePodReconcile(u.Pods)
        case kubetypes.DELETE:
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.RESTORE:
            // These are pods restored from the checkpoint. Treat them as new
            // pods.
            handler.HandlePodAdditions(u.Pods)
        }
    }        
    return true
}

本文只关注syncLoopIteration函数的configCh分支，在可预见的未来，将分析其他分支。

这里只是简单地将Pods分给对应的handler处理。

func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    for _, pod := range pods {
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)

        mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
    }
}

// dispatchWork starts the asynchronous sync of the pod in a pod worker.
// If the pod is terminated, dispatchWork
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
    // Run the sync in an async worker.
    kl.podWorkers.UpdatePod(&UpdatePodOptions{
        Pod:        pod,
        MirrorPod:  mirrorPod,
        UpdateType: syncType,
        OnCompleteFunc: func(err error) {},
    })
}

调用podWorkers.UpdatePod执行操作。

k8s.io\kubernetes\pkg\kubelet\pod_workers.go

// Apply the new setting to the specified pod.
func (p *podWorkers) UpdatePod(options *UpdatePodOptions) {
    pod := options.Pod
    uid := pod.UID

    if podUpdates, exists = p.podUpdates[uid]; !exists {
        podUpdates = make(chan UpdatePodOptions, 1)
        p.podUpdates[uid] = podUpdates

        go func() {
            defer runtime.HandleCrash()
            p.managePodLoop(podUpdates)
        }()
    }
    
    if !p.isWorking[pod.UID] {
        p.isWorking[pod.UID] = true
        podUpdates <- *options
    } 
}

给Pod创建一个goroutine，并创建一个channel管理它。

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
    for update := range podUpdates {
        err := func() error {  
            err = p.syncPodFn(syncPodOptions{
                mirrorPod:      update.MirrorPod,
                pod:            update.Pod,
                podStatus:      status,
                killPodOptions: update.KillPodOptions,
                updateType:     update.UpdateType,
            })
            return err
        }()
    }
}

这里其实就是回调了podWorkers.syncPodFn方法。

k8s.io\kubernetes\pkg\kubelet\kubelet.go

func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
        klet.podWorkers = newPodWorkers(klet.syncPod, kubeDeps.Recorder, klet.workQueue, klet.resyncInterval, backOffPeriod, klet.podCache)   
}

500多行的函数，里面就有我们需要的答案klet.syncPod。这是一个很复杂的函数，包括了所有Pod状态的处理过程，我们慢慢拆解。

// syncPod is the transaction script for the sync of a single pod.
//
// Arguments:
//
// o - the SyncPodOptions for this invocation
//
// The workflow is:
// * If the pod is being created, record pod worker start latency
// * Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
// * If the pod is being seen as running for the first time, record pod
//   start latency
// * Update the status of the pod in the status manager
// * Kill the pod if it should not be running
// * Create a mirror pod if the pod is a static pod, and does not
//   already have a mirror pod
// * Create the data directories for the pod if they do not exist
// * Wait for volumes to attach/mount
// * Fetch the pull secrets for the pod
// * Call the container runtime's SyncPod callback
// * Update the traffic shaping for the pod's ingress and egress limits
//
// If any step of this workflow errors, the error is returned, and is repeated
// on the next syncPod call.
func (kl *Kubelet) syncPod(o syncPodOptions) error {
    pod := o.pod
    mirrorPod := o.mirrorPod
    podStatus := o.podStatus
    updateType := o.updateType
    
    // if we want to kill a pod, do it now!
    if updateType == kubetypes.SyncPodKill {
        return kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride)
    }
    
    apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)
    
    // Create Cgroups for the pod and apply resource parameters
    // to them if cgroups-per-qos flag is enabled.
    pcm := kl.containerManager.NewPodContainerManager()
    // If pod has already been terminated then we need not create
    // or update the pod's cgroup
    if !kl.podIsTerminated(pod) {
        // Create and Update pod's Cgroups
        pcm.EnsureExists(pod)
    }
    
    // Create Mirror Pod for Static Pod if it doesn't already exist
    if kubepod.IsStaticPod(pod) {
        if mirrorPod == nil || deleted {
            kl.podManager.CreateMirrorPod(pod)
        }
    }
    
    // Make data directories for the pod
    kl.makePodDataDirs(pod)
    
    // Volume manager will not mount volumes for terminated pods
    if !kl.podIsTerminated(pod) {
        // Wait for volumes to attach/mount
        kl.volumeManager.WaitForAttachAndMount(pod)
    }
    
    // Fetch the pull secrets for the pod
    pullSecrets := kl.getPullSecretsForPod(pod)
    
    // Call the container runtime's SyncPod callback
    result := kl.containerRuntime.SyncPod(pod, apiPodStatus, podStatus, pullSecrets, kl.backOff)
}

创建Pod过程：

如果需要kill，直接kill
给Pod创建PodStatus对象
创建cgroups
如果是static pod，就创建mirror pod，方便通过apiserver查询 static pod，只能查询，其他操作都不可以
创建数据目录，比如挂载目录
挂载目录
获取ImagePullSecrets
调用CRI创建Pod

k8s.io\kubernetes\pkg\kubelet\kuberuntime\kuberuntime_manager.go

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create init containers.
//  6. Create normal containers.
func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, _ v1.PodStatus, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
    // Step 1: Compute sandbox and container changes.
    podContainerChanges := m.computePodActions(pod, podStatus)
    
    // Step 2: Kill the pod if the sandbox has changed.
    if podContainerChanges.KillPod {
        m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)
    }else {
        // Step 3: kill any running containers in this pod which are not to keep.
        for containerID, containerInfo := range podContainerChanges.ContainersToKill {
            m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil)
        }
    }
    
    // Step 4: Create a sandbox for the pod if necessary.
    podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
    
    // Step 5: start the init container.
    if container := podContainerChanges.NextInitContainerToStart; container != nil {
        m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit)
    }
    
    // Step 6: start containers in podContainerChanges.ContainersToStart.
    for _, idx := range podContainerChanges.ContainersToStart {
        container := &pod.Spec.Containers[idx]
        m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular)
    }
}

这个函数的注释写得很赞，过程如下：

比较网络和容器变化
如果网络有变化，就把之前的容器删掉
创建容器网络
启动init容器
启动容器

启动init容器也是有说法的，全部逻辑都在computePodActions里面，必须先按顺序将init容器全部启动之后，再启动容器，大概过程如下：

第一次只启动pod.Spec.InitContainers[0]，changes.ContainersToStart为空
之后每次启动下一个init容器
init容器启动完，启动容器

k8s.io\kubernetes\pkg\kubelet\kuberuntime\kuberuntime_container.go

// startContainer starts a container and returns a message indicates why it is failed on error.
// It starts the container through the following steps:
// * pull the image
// * create the container
// * start the container
// * run the post start lifecycle hooks (if applicable)
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, container *v1.Container, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, containerType kubecontainer.ContainerType) (string, error) {
    // Step 1: pull the image.
    imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)   
    
    // Step 2: create the container.
    m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
    
    // Step 3: start the container.
    m.runtimeService.StartContainer(containerID)
    
    legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,sandboxMeta.Namespace)
    m.osInterface.Symlink(containerLog, legacySymlink)
}

经过一步步抽丝剥茧，这里终于真相了，

拉镜像
create 容器，runtimeService其实就是通过grpc调用CRI
start 容器
给容器日志创建soft link，增加K8S相关信息，这里日志采集的时候就很有用，详见K8S Fluentd Mongo日志采集

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 205,236评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 87,867评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,715评论 0赞 340
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,899评论 1赞 278
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,895评论 5赞 368
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,733评论 1赞 283
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,085评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,722评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,025评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,696评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,816评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,447评论 4赞 322
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,057评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,009评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,254评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,204评论 2赞 352
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,561评论 2赞 343

深入分析kubelet（2）——创建Pod

深入分析kubelet（2）——创建Pod

background

PodManager

Pod status

PodWorkers

code

推荐阅读更多精彩内容