Sunday, 22 March 2026

OCFS2 setup for a shareable (read/write) block volume across multiple OCI compute instances

This blog explains how to create shareable block volumes and mount them across multiple OCI compute nodes. 

Step 1: Create the block volume in OCI
In the OCI console, create a block volume in the same availability domain as your compute instances. Pick a size and performance level to match your workload.

Step 2: Attach the volume to each cluster node

  1. Open the volume (or the instance) and choose Attach block volume.
  2. Set Attachment type to iSCSI (not paravirtualized for this flow).
  3. Set Attachment access to Read/write – shareable.

Attach the same volume to all nodes in the cluster, using the same settings each time.

Step 3: Run the iSCSI commands on each node
After each attachment, OCI shows iSCSI IPv4 commands & information for that attachment.
Open it and copy the full set of `iscsiadm` commands (discover, login, and any optional rescan steps OCI lists).

1. SSH to the node
2. Paste and run those commands as root or with `sudo`, exactly as OCI documents for your image (Oracle Linux / RHEL-style hosts usually use the `iscsiadm` sequence from the console).
Repeat on every node so each host has an active iSCSI session to the same volume.

Check: On each node run `lsblk` (or `fdisk -l`). You should see a new disk (often `/dev/sdb` or similar).
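The console commands generally follow the pattern below (a sketch only; the IQN is a placeholder and `169.254.2.2:3260` is the usual OCI iSCSI portal for the first attachment — always paste the exact commands your console shows):

```shell
# Register the target (IQN and portal come from the OCI console)
sudo iscsiadm -m node -o new -T iqn.2015-12.com.oracleiaas:<volume-uuid> -p 169.254.2.2:3260

# Make the login persist across reboots
sudo iscsiadm -m node -o update -T iqn.2015-12.com.oracleiaas:<volume-uuid> -n node.startup -v automatic

# Log in now to establish the iSCSI session
sudo iscsiadm -m node -T iqn.2015-12.com.oracleiaas:<volume-uuid> -p 169.254.2.2:3260 -l
```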

OCFS2 needs a small cluster layout file. The file must list all nodes in the cluster, and `node_count` must match the number of `node:` blocks you define.

Step 4: Create the config directory
On each node:
sudo mkdir -p /etc/ocfs2

Step 5: Edit `cluster.conf`
sudo vi /etc/ocfs2/cluster.conf
cluster:
    node_count = 2
    name = ocfs2

node:
    number = 0
    cluster = ocfs2
    ip_port = 7777
    ip_address = 10.0.0.94
    name = jay-db-node01

node:
    number = 1
    cluster = ocfs2
    ip_port = 7777
    ip_address = 10.0.0.95
    name = jay-db-node02


Use one cluster name (here `ocfs2`).
Use the private IPs your nodes use to talk to each other (typically the VCN private address).
`ip_port` is commonly 7777 for OCFS2.
`number` must be unique per node (0, 1, 2, …).

Copy the same `cluster.conf` to every node; the full list of nodes and their IPs must match on each machine.
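One way to distribute the file and confirm the copies match (assuming the default `opc` user and SSH access between nodes):

```shell
# Push the file from node 1 to node 2
scp /etc/ocfs2/cluster.conf opc@10.0.0.95:/tmp/cluster.conf
ssh opc@10.0.0.95 'sudo mkdir -p /etc/ocfs2 && sudo mv /tmp/cluster.conf /etc/ocfs2/cluster.conf'

# The checksums should be identical on both nodes
md5sum /etc/ocfs2/cluster.conf
ssh opc@10.0.0.95 md5sum /etc/ocfs2/cluster.conf
```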

Register and configure O2CB
Step 6: Register the cluster
sudo o2cb register-cluster ocfs2

That tells the system which cluster this node belongs to.

Step 7: Configure the driver (one time per node)
[root@jay-db-node01 ~]# sudo /sbin/o2cb.init configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot.  The current values will be shown in brackets ('[]').  Hitting
<ENTER> without typing an answer will keep that current value.  Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [n]: y
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]: ocfs2
Specify heartbeat dead threshold (>=7) [31]: 31
Specify network idle timeout in ms (>=5000) [30000]: 5000
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
checking debugfs...
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "ocfs2": OK
Setting O2CB cluster timeouts : OK


Step 8: Start O2CB and check status
[root@jay-db-node01 ~]# sudo o2cb register-cluster ocfs2
[root@jay-db-node01 ~]# sudo systemctl start o2cb
[root@jay-db-node01 ~]# sudo o2cb cluster-status ocfs2
Cluster 'ocfs2' is online

Create the mount point and format the volume
Step 9: Create the mount directory (all nodes)
sudo mkdir -p /Oradb_data

Step 10: Format the shared disk with OCFS2 (run once, on one node only)
[root@jay-db-node01 ~]# sudo mkfs.ocfs2 -L Oradb_data /dev/sdb -N 8
mkfs.ocfs2 1.8.6
Cluster stack: classic o2cb
Label: Oradb_data
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 2199023255552 (536870912 clusters) (536870912 blocks)
Cluster groups: 16645 (tail covers 2048 clusters, rest cover 32256 clusters)
Extent allocator size: 276824064 (66 groups)
Journal size: 268435456
Node slots: 8
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 6 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful

When you see `mkfs.ocfs2 successful`, the volume is ready. Do not run `mkfs` again on the other nodes.

fstab and mount on every node
Step 11: Add an fstab entry on all nodes
sudo vi /etc/fstab
/dev/sdb /Oradb_data ocfs2     _netdev,defaults   0 0

If the shared disk shows up as a different device name on another node, use a stable name (UUID or `/dev/disk/by-id/...`) so every node points at the same LUN.
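For example, look up the filesystem UUID with `blkid` and reference it in `/etc/fstab` instead of `/dev/sdb` (the UUID below is a placeholder):

```shell
# Print the UUID of the OCFS2 filesystem on the shared disk
sudo blkid /dev/sdb

# Then use that UUID in /etc/fstab, e.g.:
# UUID=<uuid-from-blkid>  /Oradb_data  ocfs2  _netdev,defaults  0 0
```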

Step 12: Mount on all nodes
sudo mount -a
Check with `df -h /Oradb_data` or `mount | grep Oradb_data`.

[root@jay-db-node01 ~]# df -h /Oradb_data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.0T  4.2G  2.0T   1% /Oradb_data
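As a final sanity check that all nodes really share one filesystem, write a file on one node and read it from another:

```shell
# On node 1: create a marker file
echo "hello from node01" | sudo tee /Oradb_data/cluster-test.txt

# On node 2: the file should be visible immediately
cat /Oradb_data/cluster-test.txt

# Clean up (from any node)
sudo rm /Oradb_data/cluster-test.txt
```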

Tuesday, 20 January 2026

What is the Kube-Scheduler?

kube-scheduler is a watchman. Its primary job is to monitor the API Server for newly created pods that have no nodeName assigned (the "Pending" state). Once it finds one, it evaluates every node in your cluster to find the best possible home based on resources, policies, and hardware constraints.

3-Step Core Workflow
1. Scheduling Queue
Whenever a pod is created, it enters a Pending state and is added to the Scheduling Queue. This isn't a simple FIFO (First-In-First-Out) line; it’s a Priority Queue where pods are sorted based on their PriorityClass. High-priority pods, such as system-critical components, jump to the front of the line to be processed first, while lower-priority pods wait their turn. The scheduler then pulls these pods from the queue one by one to begin the placement process.
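For example, a PriorityClass like the following (name and value are illustrative) makes any pod that references it via `priorityClassName` jump ahead of default-priority pods in the queue:

```shell
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-app
value: 1000000
globalDefault: false
description: "Scheduled ahead of default-priority workloads"
EOF

# Pods opt in with:  spec.priorityClassName: critical-app
```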

2. Filtering:
In this phase, the scheduler runs a series of "Predicates." If a node fails even one of these checks, it is disqualified.
Resource Check (PodFitsResources): Does the node have enough free CPU and Memory to meet the Pod’s requests?
Port Check (PodFitsHostPorts): If a pod requires a specific port on the host (HostPort), is that port already taken by another pod on this node?
Taint/Toleration Check: Nodes can have Taints (repellants). Unless the pod has a matching Toleration, it cannot be scheduled there.
Node Selection: Does the node match the nodeSelector or nodeAffinity labels defined in the Pod spec?
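If every node fails a predicate, the pod stays Pending and the scheduler records the reasons as events. `kubectl describe` surfaces those filter results (pod name and output shape are illustrative):

```shell
kubectl describe pod my-pending-pod
# Events:
#   Warning  FailedScheduling  0/3 nodes are available:
#     1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: },
#     2 Insufficient cpu.
```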

3. Scoring:
After filtering, we might have five nodes that could run the pod. The Scoring phase determines which one should run it. Each node is given a score (usually 0–100) based on several factors:
Least Requested: Favors nodes with more free resources to balance the cluster.
Image Locality: Favors nodes that already have the container image downloaded (speeding up start times).
Affinity/Anti-Affinity: Soft preferences, like "I'd prefer not to be on the same node as other pods from this app for high availability."

The node with the highest score is selected as the "Winner."

Binding: Updating the Cluster State
Once the winner is selected, the scheduler doesn't actually "start" the pod. Instead, it completes a Binding request:
Request to API Server: The scheduler sends a "Binding" object to the kube-apiserver.
API Server Updates etcd: The API Server receives this request, validates it, and updates the Pod's definition in etcd (the cluster's database), setting the nodeName field to the winner's name.
Kubelet Takes Over: The Kubelet (the agent on the worker node) is also watching the API Server. It sees that a pod has been assigned to its node, pulls the image, and starts the container.
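You can observe the whole flow with kubectl (pod name is illustrative): the NODE column is empty while the pod waits in the queue, and `spec.nodeName` is filled in once the binding succeeds.

```shell
# NODE shows <none> while the pod is still unscheduled
kubectl get pod my-app -o wide

# After binding, the winner is recorded on the pod spec
kubectl get pod my-app -o jsonpath='{.spec.nodeName}'

# The decision is also visible as a "Scheduled" event
kubectl get events --field-selector reason=Scheduled
```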
    
    

 

Saturday, 17 January 2026

Kubernetes Authorization Modes

In Kubernetes, security is a multi-layered journey. Once a user or service proves their identity—a process known as Authentication—they face a second, more granular challenge: Authorization.

If Authentication asks, "Who are you?", Authorization asks, "What exactly are you allowed to do here?"
In this post, we’ll break down the mechanisms Kubernetes uses to control access and ensure your cluster remains a "Zero Trust" environment.  

1. Node Authorization: 
Node Authorization is a specialized, fixed-purpose authorizer designed specifically for Kubelets. It implements a graph-based check to ensure that a worker node only has access to the resources it absolutely needs to function.

Target: Requests coming from nodes (identified by the system:nodes group and system:node:<nodeName> username).
Technical Logic: It limits a Kubelet's ability to read Secrets, ConfigMaps, and PersistentVolumes. A Kubelet can only access these objects if they are associated with a Pod currently scheduled on that specific node.
Security Impact: This prevents a compromised node from "lateral movement"—it cannot reach out and steal secrets belonging to workloads on other nodes.
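Node authorization is enabled through the API server's `--authorization-mode` flag, usually alongside RBAC and the NodeRestriction admission plugin. A typical kubeadm-style configuration, sketched:

```shell
kube-apiserver \
  --authorization-mode=Node,RBAC \
  --enable-admission-plugins=NodeRestriction
```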

2. RBAC: 
Role-Based Access Control (RBAC) is the most common and recommended authorization mechanism. It allows for dynamic, API-driven permission management without requiring an API server restart.

Objects:
Roles/ClusterRoles: Pure sets of permissions (Verbs + Resources + API Groups).
RoleBindings/ClusterRoleBindings: Mapping objects that attach a Subject (User/Group/ServiceAccount) to a Role.
Technical Nuance: RBAC is additive-only. There are no "Deny" rules in RBAC; if no rule grants access, the request is denied by default. It also supports Aggregation, allowing you to combine multiple ClusterRoles into a single "super-role" dynamically.
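A minimal RBAC setup sketched with kubectl (user and namespace names are illustrative):

```shell
# Role: read-only access to pods in the dev namespace
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev

# RoleBinding: attach the role to user alice
kubectl create rolebinding alice-reads-pods --role=pod-reader --user=alice -n dev

# Verify: the first check should answer yes, the second no
kubectl auth can-i list pods -n dev --as alice
kubectl auth can-i delete pods -n dev --as alice
```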

3. ABAC: Policy-Driven
Attribute-Based Access Control (ABAC) grants access based on a combination of attributes (user, resource, and environment).

Implementation: Unlike RBAC, ABAC policies are defined in a local JSON file on the master node.

Technical Logic: Each line in the policy file is a "Policy Object." For example:
{"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "alice", "namespace": "dev", "resource": "pods", "readonly": true}}

Downside: ABAC is difficult to manage at scale because any change requires a manual update to the file and a restart of the Kube-API server, making it less agile than RBAC.
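ABAC is switched on with two API server flags; the policy file path below is a placeholder:

```shell
kube-apiserver \
  --authorization-mode=ABAC \
  --authorization-policy-file=/etc/kubernetes/abac-policy.jsonl
```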

4. Webhook Authorization: 
Webhook authorization allows Kubernetes to delegate the "Yes/No" decision to a remote HTTP service. This is the ultimate tool for integrating Kubernetes with enterprise-wide security policies.

Flow: When a request arrives, the API server sends a SubjectAccessReview (a JSON-serialized object) to an external REST endpoint.
Technical Payload: The payload includes the username, groups, and the specific resource/verb requested. The remote service responds with an allowed: true or false status.

Use Cases: Integrating with Open Policy Agent (OPA) for complex logic.
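The exchange looks roughly like this: the API server POSTs a SubjectAccessReview, and the webhook answers by filling in `status` (payload abridged):

```json
{
  "apiVersion": "authorization.k8s.io/v1",
  "kind": "SubjectAccessReview",
  "spec": {
    "user": "alice",
    "groups": ["developers"],
    "resourceAttributes": {
      "namespace": "dev",
      "verb": "get",
      "resource": "pods"
    }
  }
}
```

The webhook responds with the same object plus `"status": {"allowed": true}` (or `false`, optionally with a reason).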

5. AlwaysAllow:
As the name suggests, the AlwaysAllow mode grants every request, regardless of who is asking or what they are trying to do. It completely bypasses all security checks.

Technical Logic: It returns allowed: true for every single API call.
Use Cases:
Local Development: Used in very restricted, single-node local environments (like early-stage minikube setups) where security isn't a concern.
Unit Testing: Used by developers testing API server extensions where they want to isolate the logic from authorization interference.
Risk: Enabling this in a production cluster is a critical security failure. It effectively turns off the cluster's "immune system," allowing any unauthenticated "system:anonymous" user to delete the entire cluster.

6. AlwaysDeny:
The AlwaysDeny mode does exactly the opposite: it rejects every single request.

Technical Logic: It returns allowed: false for everything.
Use Cases:
Security Hardening: It is often used at the very end of a list of authorization modes. If the request doesn't match a Node rule, an RBAC rule, or a Webhook rule, it hits the "final wall" and is rejected.
Emergency Lockdown: In extreme scenarios, an administrator could theoretically set this to prevent any further changes to the cluster state during an active breach investigation.
Nuance: Even with AlwaysDeny, the API server may still allow certain "discovery" endpoints (like /healthz) depending on the version and configuration, but for all intents and purposes, the cluster becomes a "read-only/no-access" vault.
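Modes are combined as a comma-separated list and consulted left to right; the first authorizer that returns a definite allow or deny wins, so AlwaysDeny at the end rejects only what nothing earlier allowed:

```shell
kube-apiserver \
  --authorization-mode=Node,RBAC,Webhook,AlwaysDeny
```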
    
        

Saturday, 10 January 2026

Kubernetes API: Understanding apiVersion

The apiVersion field is the first line of every manifest. While it may seem like a static piece of boilerplate, it is actually the most important instruction you give to the API server. It defines the schema, the validation rules, and the stability of the resource you are about to create.

Kubernetes organizes its many resource types into API Groups. The apiVersion string tells the cluster which "folder" and "version" of the API to look in.

There are two distinct patterns for these values:
1. The Core Group
These are the foundational objects of Kubernetes. Because they have existed since the beginning, they do not belong to a named group.
    Format: v1
    Resources: Pod, Service, Namespace, Node, ConfigMap, Secret.
    Example: 
    YAML
    apiVersion: v1
    kind: Service

2. Named Groups
As Kubernetes evolved, new functionality was added via specialized groups. These follow a "Group/Version" structure.
    Format: <group>/<version> (e.g. apps/v1 or networking.k8s.io/v1)
    Resources: Deployments, Ingress, CronJobs.
    Example:
    YAML
    apiVersion: apps/v1
    kind: Deployment

The Stability Lifecycle
Version     Stability           Description
v1alpha1    Experimental        May contain bugs. Can be dropped in future releases without warning.
v1beta1     Prerelease          Feature-complete and tested. Safe for non-critical environments.
v1          Stable              Production-ready.   

 
Many resources have migrated from "Beta" to "Stable" over the last few years. Here is the current standard mapping for common resources:
    Workloads: apps/v1 (Deployment, StatefulSet, DaemonSet)
    Batch: batch/v1 (Job, CronJob)
    Networking: networking.k8s.io/v1 (Ingress, NetworkPolicy)
    RBAC: rbac.authorization.k8s.io/v1 (Role, ClusterRole)

# See all resources and their associated API versions
kubectl api-resources

# List all enabled API versions on the server
kubectl api-versions