Scheduling Constraints

nodeSelector, Node/Pod Affinity, Taints & Tolerations, PriorityClass

Overview

Scheduling constraints control which nodes Pods are assigned to. The CKA exam focuses on practical configuration of Taints & Tolerations, Node Affinity, and nodeSelector.


1. nodeSelector

The simplest node selection method, based on node labels.

1.1 Labeling Nodes

kubectl get nodes --show-labels

# Add label
kubectl label nodes <node-name> disktype=ssd
kubectl label nodes <node-name> gpu=true

# Remove label
kubectl label nodes <node-name> disktype-

# Modify label (--overwrite)
kubectl label nodes <node-name> disktype=hdd --overwrite

1.2 Using nodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx

2. Node Affinity

More expressive than nodeSelector: rules are written as match expressions with operators, and can be either hard requirements or weighted preferences.

2.1 Two Types

| Type | Description |
|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard constraint: must be met for scheduling (similar to nodeSelector but supports expressions) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft constraint: best-effort, schedules even if not met |

2.2 Configuration Example

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1
      - weight: 20
        preference:
          matchExpressions:
          - key: gpu
            operator: Exists
  containers:
  - name: nginx
    image: nginx
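
One detail that is easy to miss in required rules: multiple entries under nodeSelectorTerms are ORed, while multiple matchExpressions inside a single term are ANDed. As a minimal sketch, the snippet below writes a manifest that accepts a node that either has disktype=ssd and a gpu label, or has disktype=nvme. The file name and Pod name are illustrative assumptions, not part of the original example.

```shell
# Write an illustrative manifest (file and Pod names are hypothetical).
cat <<'EOF' > node-affinity-terms.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-terms-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: both expressions must match (AND)
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
          - key: gpu
            operator: Exists
        # Term 2: an alternative to term 1 (terms are ORed)
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - nvme
  containers:
  - name: nginx
    image: nginx
EOF
```

Apply it with kubectl apply -f node-affinity-terms.yaml and check the placement with kubectl get pod affinity-terms-pod -o wide.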

2.3 Match Operators

| Operator | Description | Example |
|---|---|---|
| In | Matches any value in values | disktype In [ssd, nvme] |
| NotIn | Does not match any value in values | disktype NotIn [hdd] |
| Exists | Key exists (values ignored) | gpu Exists |
| DoesNotExist | Key does not exist | gpu DoesNotExist |
| Gt | Value greater than (numeric comparison) | memory Gt [32] |
| Lt | Value less than (numeric comparison) | memory Lt [64] |
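
In YAML, the entries under values are strings even for Gt and Lt, which take exactly one value that is interpreted as an integer; quoting it avoids a type error. The sketch below combines Gt, NotIn and DoesNotExist in one term. The memory-gb label key, the file name and the Pod name are assumptions made for illustration.

```shell
# Write an illustrative manifest (memory-gb is a hypothetical node label, e.g. memory-gb=64).
cat <<'EOF' > node-affinity-operators.yaml
apiVersion: v1
kind: Pod
metadata:
  name: operators-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: memory-gb
            operator: Gt
            values:
            - "32"                  # single value, quoted so it stays a string
          - key: disktype
            operator: NotIn
            values:
            - hdd
          - key: gpu
            operator: DoesNotExist  # Exists/DoesNotExist take no values
  containers:
  - name: nginx
    image: nginx
EOF
```

All three expressions sit in the same term, so a node must satisfy every one of them.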

2.4 Imperative Creation of Node Affinity

# Use kubectl run then edit YAML to add the affinity section
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit to add spec.affinity.nodeAffinity

3. Pod Affinity / Anti-Affinity

Schedules a Pod relative to other Pods: affinity places it in the same topology domain as matching Pods, and anti-affinity keeps it in a different one.

3.1 Configuration Example

apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"   # On the same node
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - frontend
          topologyKey: "kubernetes.io/hostname"  # Not on the same node
  containers:
  - name: nginx
    image: nginx

3.2 Topology Keys (topologyKey)

| Key | Description |
|---|---|
| kubernetes.io/hostname | Node level |
| topology.kubernetes.io/zone | Availability zone |
| topology.kubernetes.io/region | Region |
| failure-domain.beta.kubernetes.io/zone | Legacy availability zone |

3.3 Common Use Cases

  • Pod Affinity: Schedule Web and Cache Pods on the same node (reduce network latency)
  • Pod Anti-Affinity: Spread Pods of the same application across different nodes (high availability)
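
As a sketch of the anti-affinity use case, the Deployment below asks the scheduler to keep replicas labeled app=web on different nodes. The names are illustrative assumptions. Because the rule is a required one, a replica stays Pending once every eligible node already runs a matching Pod; a preferred rule would relax this.

```shell
# Write an illustrative Deployment that spreads its replicas across nodes.
cat <<'EOF' > web-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname   # at most one app=web Pod per node
      containers:
      - name: nginx
        image: nginx
EOF
```

After kubectl apply -f web-spread.yaml, kubectl get pods -l app=web -o wide should list each replica on a different node.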

3.4 Notes

  • Pod Affinity/Anti-Affinity increases scheduler computation overhead
  • requiredDuringScheduling may cause Pods to be unschedulable
  • topologyKey cannot be empty

4. Taints & Tolerations

Taints mark a node so that it repels Pods. Tolerations on a Pod allow it to be scheduled onto nodes with matching taints, but do not force it there.

4.1 Taint Operations

# View node Taints
kubectl describe node <node-name> | grep Taints

# Add Taint (kubectl taint nodes <node> <key>=<value>:<effect>)
kubectl taint nodes node1 app=blue:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key1=value1:PreferNoSchedule

# Remove Taint
kubectl taint nodes node1 app=blue:NoSchedule-
kubectl taint nodes node1 key1:NoExecute-    # No need to specify value

# Remove all taints with a given key
kubectl taint nodes node1 app-               # Removes every taint whose key is 'app', whatever its value or effect

4.2 Taint Effect Types

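| Effect | Description |
|---|---|
| NoSchedule | Pods without a matching toleration are not scheduled on the node (existing Pods are not evicted) |
| PreferNoSchedule | Soft constraint: the scheduler tries to avoid the node but may still use it |
| NoExecute | Pods without a matching toleration are not scheduled, and existing ones are evicted |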

4.3 Toleration Configuration

apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Exists"          # Matches any taint with key key1, regardless of its value
    effect: "NoExecute"
    tolerationSeconds: 60       # Tolerate for 60 seconds before eviction
  - operator: "Exists"          # Tolerate all taints (use with caution)
  containers:
  - name: nginx
    image: nginx

4.4 Toleration Operators

| Operator | Description | Example |
|---|---|---|
| Equal | Full match on key + value + effect | key=app, value=blue, effect=NoSchedule |
| Exists | Tolerates as long as key matches | No value needed |
| Exists (no key) | Tolerates all taints | Suitable for DaemonSet |
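
The matching rules in the table can be sketched as a small shell function. This is a toy model written for illustration, not how the scheduler is implemented; the function name tolerates and its argument order are assumptions. It also encodes two rules not shown in the table: a toleration with an empty effect matches every effect, and an omitted operator behaves like Equal.

```shell
# tolerates TOL_KEY TOL_OP TOL_VALUE TOL_EFFECT TAINT_KEY TAINT_VALUE TAINT_EFFECT
# Returns 0 if the toleration matches the taint, 1 otherwise (toy model).
tolerates() {
  tol_key=$1; tol_op=$2; tol_value=$3; tol_effect=$4
  taint_key=$5; taint_value=$6; taint_effect=$7

  # An empty toleration effect matches every taint effect.
  if [ -n "$tol_effect" ] && [ "$tol_effect" != "$taint_effect" ]; then
    return 1
  fi
  # An empty key (only meaningful with Exists) matches every taint key.
  if [ -n "$tol_key" ] && [ "$tol_key" != "$taint_key" ]; then
    return 1
  fi
  case "$tol_op" in
    Exists)    return 0 ;;                          # value is ignored
    Equal|"")  [ "$tol_value" = "$taint_value" ] ;; # value must match exactly
    *)         return 1 ;;
  esac
}

# Example checks against the taints used earlier in this section.
tolerates app Equal blue NoSchedule app blue NoSchedule && echo "Equal: tolerated"
tolerates "" Exists "" "" dedicated gpu NoExecute && echo "Exists without key: tolerated"
tolerates app Equal blue NoSchedule app green NoSchedule || echo "Equal with wrong value: not tolerated"
```

A Pod is only admitted to a node if every NoSchedule or NoExecute taint on that node is matched by at least one of its tolerations.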

4.5 Common Use Cases

# 1. Dedicated node (only run specific Pods)
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule
# Only allow Pods with toleration to be scheduled

# 2. Control plane node default Taint
kubectl describe node controlplane | grep Taints
# Output: node-role.kubernetes.io/control-plane:NoSchedule

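# 3. Node failure handling
# The node controller normally adds this taint automatically when a node becomes unreachable;
# adding it by hand, as here, only simulates a failure
kubectl taint nodes node1 node.kubernetes.io/unreachable:NoExecute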

4.6 Exam Tips

# Taint a node so that Pods without a matching toleration are not scheduled on it
kubectl taint nodes worker1 env=production:NoSchedule

# Create a Pod that tolerates this taint
kubectl run toleration-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit to add spec.tolerations

# Check where the Pod landed (a toleration permits the tainted node but does not force it)
kubectl get pods -o wide | grep toleration-pod

# Running regular Pods on control plane nodes
# Method 1: Remove the control plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# Method 2: Add toleration (recommended, doesn't affect the control plane)
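
A minimal sketch of Method 2, assuming a kubeadm-style cluster where the control plane node carries both the node-role.kubernetes.io/control-plane taint and a label with the same key (confirm with kubectl get nodes --show-labels). The toleration only permits the node; the nodeSelector is what actually pins the Pod there. The file and Pod names are illustrative.

```shell
# Write an illustrative Pod that tolerates the control plane taint and targets that node.
cat <<'EOF' > controlplane-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: controlplane-pod
spec:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""   # assumed label; value is empty on kubeadm clusters
  containers:
  - name: nginx
    image: nginx
EOF
```

Apply it with kubectl apply -f controlplane-pod.yaml; the node's taint stays in place, so other Pods are still kept off the control plane.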

5. PriorityClass

PriorityClass sets Pod priority; higher priority Pods can preempt lower priority Pods.

5.1 Create PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # Higher value means higher priority
globalDefault: false        # Whether this is the default PriorityClass
description: "High priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority Pods"

# Create PriorityClass
kubectl apply -f priorityclass.yaml

# View
kubectl get priorityclass
kubectl get pc

5.2 Using in Pods

apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi

5.3 Common PriorityClass Values

| Name | Value | Description |
|---|---|---|
| system-cluster-critical | 2000000000 | Cluster-critical components |
| system-node-critical | 2000001000 | Node-critical components |
| Custom (high) | 1000000 | High priority applications |
| Custom (low) | 100 | Low priority batch tasks |
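
Preemption can be switched off per class with the preemptionPolicy field. As a sketch, the class below still places its Pods ahead of lower-priority ones in the scheduling queue but never evicts running Pods to make room. The class name and file name are illustrative assumptions.

```shell
# Write an illustrative non-preempting PriorityClass.
cat <<'EOF' > priorityclass-nonpreempting.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never     # the default is PreemptLowerPriority
globalDefault: false
description: "High priority, but never evicts running Pods"
EOF
```

Apply it with kubectl apply -f priorityclass-nonpreempting.yaml and reference it from a Pod through priorityClassName, as in section 5.2.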

6. Pod Scheduling Flow

  1. Node Filtering (Filtering / Predicates):

    • Check if node resources satisfy Pod requests
    • Check nodeSelector, Node Affinity
    • Check Taints & Tolerations
    • Check Pod Affinity/Anti-Affinity
  2. Node Scoring (Scoring / Priorities):

    • Resource utilization (used resources / total resources)
    • Node Affinity weight
    • Pod dispersion (Anti-Affinity)
  3. Binding: The scheduler binds the Pod to the selected node
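
The three phases can be illustrated with a toy model in plain shell. This is a deliberately simplified sketch, not the real scheduler, which runs many filter and score plugins and normalizes their results. The node inventory, the single taint flag and the weights (80 for disktype=ssd, 20 for zone us-east-1) are assumptions chosen to mirror the preferred rules from section 2.2.

```shell
# Hypothetical inventory: name, disktype label, zone label, untolerated taint (yes/no)
nodes='node1 ssd us-east-1 yes
node2 hdd us-east-1 no
node3 ssd us-west-2 no'

best_node=""
best_score=-1
while read -r name disktype zone tainted; do
  # 1. Filtering: drop nodes the Pod cannot run on at all.
  if [ "$tainted" = "yes" ]; then
    echo "filter: $name rejected (untolerated taint)"
    continue
  fi
  # 2. Scoring: add the weight of every preference the node satisfies.
  score=0
  if [ "$disktype" = "ssd" ]; then score=$((score + 80)); fi
  if [ "$zone" = "us-east-1" ]; then score=$((score + 20)); fi
  echo "score:  $name = $score"
  if [ "$score" -gt "$best_score" ]; then
    best_score=$score
    best_node=$name
  fi
done <<EOF
$nodes
EOF

# 3. Binding: the highest-scoring surviving node wins.
echo "bind:   pod -> $best_node"
```

In this model node1 never reaches the scoring phase even though it would have scored highest, which is why a hard constraint or an untolerated taint always outranks a preference.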

# View scheduler logs
kubectl logs -n kube-system kube-scheduler-controlplane

# View scheduling events
kubectl get events --sort-by='.lastTimestamp' | grep -i schedule

# View the node a Pod is scheduled to
kubectl get pods -o wide
kubectl get pod <pod-name> -o wide

7. Useful Exam Commands

# 1. View all node labels
kubectl get nodes --show-labels

# 2. View node Taints
kubectl describe nodes | grep -A 5 Taints

# 3. Allow Pods to be scheduled on the control plane node (removes its taint)
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# 4. Create a Pod with Node Affinity (generate a manifest with --dry-run, then edit it)
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit to add spec.affinity.nodeAffinity

# 5. Add Toleration to DaemonSet (common)
kubectl get ds fluentd -o yaml > ds.yaml
# Add tolerations to spec.template.spec in ds.yaml

# 6. Check if a Pod cannot be scheduled due to Taint
kubectl describe pod <pod-name> | grep -A 10 Events
# Output: 0/3 nodes are available: 1 node(s) had taint, 2 node(s) didn't match pod anti-affinity

# 7. Quickly create a PriorityClass
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
EOF

🧪 Complete Hands-on Example: Controlling Pod Scheduling with Taints/Tolerations and NodeAffinity

Scenario

Apply taints to nodes to control specific Pod scheduling, while configuring nodeAffinity for more fine-grained scheduling constraints.

Prerequisites

  • A multi-node Kubernetes cluster (at least 2 worker nodes)
  • kubectl is configured to connect to the cluster
  • The cluster has node1 and node2 (if node names differ, replace them in the commands)

Steps

Step 1: Label Nodes and Apply Taints

# View nodes
kubectl get nodes
# Expected output:
# NAME           STATUS   ROLES           AGE   VERSION
# controlplane   Ready    control-plane   10m   v1.29
# node1          Ready    <none>          9m    v1.29
# node2          Ready    <none>          9m    v1.29

# Label node1
kubectl label nodes node1 disktype=ssd
# Expected output: node/node1 labeled

# Apply a taint to node1 (only allow Pods tolerating this taint to be scheduled)
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Expected output: node/node1 tainted

Step 2: Create a Deployment with Toleration (Can be Scheduled on node1)

cat <<'EOF' > deploy-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      nodeSelector:
        disktype: ssd             # pin to node1; a toleration alone only permits the tainted node
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-tolerate.yaml
# Expected output: deployment.apps/gpu-workload created

kubectl get pods -l app=gpu-app -o wide
# Expected output: Pods are scheduled to node1 (the toleration allows it, the nodeSelector requires it)
# NAME                            READY   STATUS    RESTARTS   AGE   NODE
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1

Step 3: Create a Deployment without Toleration (Cannot be Scheduled)

cat <<'EOF' > deploy-no-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      nodeSelector:
        disktype: ssd             # only node1 has this label, and node1 is tainted
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-no-tolerate.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal
# Expected output: Pod stays Pending (without the nodeSelector it would simply run on the untainted node2)
# NAME                              READY   STATUS    RESTARTS   AGE
# normal-workload-<hash>-<pod-id>   0/1     Pending   0          <seconds>

Step 4: Analyze the Reason for Scheduling Failure

kubectl describe pod -l app=normal | grep -A 10 Events
# Expected output (exact wording varies by Kubernetes version):
# Events:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: gpu}, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.

Step 5: Add Node Affinity (Soft Constraint)

# First delete pending Pods and update the Deployment
kubectl delete deployment normal-workload
# Expected output: deployment.apps "normal-workload" deleted

cat <<'EOF' > deploy-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-affinity.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal -o wide
# Expected output: Pod is scheduled to node2 (no disktype=ssd label, but soft constraint is not mandatory)
# NAME                              READY   STATUS    RESTARTS   AGE   NODE
# normal-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node2

Verification

# Confirm gpu-workload runs on node1
kubectl get pods -l app=gpu-app -o wide | grep node1
# Expected output: gpu-workload Pod shows node1

# Confirm normal-workload runs on node2
kubectl get pods -l app=normal -o wide | grep node2
# Expected output: normal-workload Pod shows node2

# Cleanup
kubectl delete deployment gpu-workload normal-workload
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
kubectl label nodes node1 disktype-
# Expected output: All resources cleaned up

Exam Tips

  • NoSchedule only affects new Pods, not existing ones; NoExecute also evicts existing Pods
  • Toleration operator: Exists with a key matches all values; operator: Exists without a key matches all taints
  • Node Affinity's requiredDuringScheduling is a hard constraint; Pods cannot be scheduled if not met
  • In the exam, if a Pod is in Pending state, first use kubectl describe pod to check Events to determine the cause
  • Control plane nodes (in kubeadm-built clusters) have the node-role.kubernetes.io/control-plane:NoSchedule taint by default; removing it allows Pods to be scheduled there
  • A DaemonSet that must run on every node, including tainted ones, needs a toleration for all taints: - operator: Exists
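
As a sketch of the last tip, the DaemonSet below tolerates every taint, so it also runs on control plane and dedicated nodes. The names and the placeholder image are assumptions for illustration; a real agent would use its own image and usually narrower tolerations.

```shell
# Write an illustrative DaemonSet that tolerates all taints.
cat <<'EOF' > ds-tolerate-all.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
      - operator: Exists     # no key and no effect: matches every taint
      containers:
      - name: agent
        image: nginx         # placeholder image
EOF
```

After kubectl apply -f ds-tolerate-all.yaml, kubectl get pods -l app=node-agent -o wide should show one Pod per node, tainted nodes included.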