# K3s Cluster Setup Plan (Enterprise Ready)
## 1. Architecture Overview
We will deploy a high-availability (HA) K3s cluster consisting of 3 control-plane nodes running embedded etcd. Because etcd only stays writable while a majority of its members (2 of 3) is healthy, this setup tolerates the failure of a single node.
* **Topology:** 3 nodes, each acting as both server (control plane) and agent (workloads).
* **Operating System:** Ubuntu 24.04 (provisioned via Terraform/Cloud-Init).
* **Networking:**
  * VLAN 40 (IP range: `10.100.40.0/24`).
  * **VIPs (Virtual IPs):** Floating IPs managed by `kube-vip`: one for the API server (`10.100.40.5`) and one for the Ingress controller (`10.100.40.6`).
* **Ingress Flow:**
  * `Internet` -> `Traefik Edge (in the K3s cluster, VIP 10.100.40.6)` -> `Traefik Ingress (K3s)` -> `Pod`.
* **GitOps:**
  * **Tool:** FluxCD.
  * **Repository Structure:**
    * `stabify-infra` (current): Bootstraps the nodes, installs K3s and the Flux binary.
    * `stabify-gitops` (new): Watched by Flux. Contains system workloads (cert-manager, internal Traefik) and user apps.
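
The single-failure tolerance follows from etcd's majority quorum: a cluster of *n* members needs ⌊n/2⌋ + 1 of them to accept writes. A minimal Python sketch of that arithmetic (the function names are illustrative, not from any library):

```python
# Sketch: etcd (Raft) quorum arithmetic for an embedded-etcd K3s cluster.
# Assumption: every K3s server node runs one voting etcd member.


def quorum(members: int) -> int:
    """Smallest majority of voting members required for etcd to accept writes."""
    if members < 1:
        raise ValueError("need at least one etcd member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Running it shows why 3 is the smallest useful HA size: 2 members tolerate no failures, and a 4th member raises the quorum to 3 without adding any failure tolerance.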
## 2. Terraform Changes (`terraform/`)
We will update the existing `locals.tf` to reflect the 3-node HA structure.
* **`terraform/locals.tf`**:
  * Refactor the `vms` map:
    * `vm-k3s-master-400` (`10.100.40.10`)
    * `vm-k3s-master-401` (`10.100.40.11`)
    * `vm-k3s-master-402` (`10.100.40.12`)
  * Define VIPs:
    * `k3s-api-vip`: `10.100.40.5` (endpoint for `kubectl` and joining nodes).
    * `k3s-ingress-vip`: `10.100.40.6` (endpoint for Traefik Edge).
* **`terraform/main.tf`**:
  * Add `opnsense_unbound_host_override` resources for the VIPs to ensure internal DNS resolution.
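
Before applying Terraform, the address plan can be sanity-checked. The following is a hypothetical pre-flight script (not part of the existing Terraform code) built on Python's standard `ipaddress` module. The gateway address `10.100.40.1` is an assumption about the OPNsense VLAN 40 interface and should be replaced with the real value:

```python
# Sketch: hypothetical pre-flight check of the K3s address plan.
# Node and VIP values mirror the plan; GATEWAY is an ASSUMPTION.
import ipaddress

SUBNET = ipaddress.ip_network("10.100.40.0/24")

NODES = {
    "vm-k3s-master-400": "10.100.40.10",
    "vm-k3s-master-401": "10.100.40.11",
    "vm-k3s-master-402": "10.100.40.12",
}

VIPS = {
    "k3s-api-vip": "10.100.40.5",
    "k3s-ingress-vip": "10.100.40.6",
}

# Assumption: the OPNsense gateway for VLAN 40 uses the first host address.
GATEWAY = ipaddress.ip_address("10.100.40.1")


def validate_plan(subnet, nodes, vips, gateway):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    seen = {}
    for name, raw in {**nodes, **vips}.items():
        ip = ipaddress.ip_address(raw)
        if ip not in subnet:
            problems.append(f"{name}: {ip} is outside {subnet}")
        if ip in (subnet.network_address, subnet.broadcast_address):
            problems.append(f"{name}: {ip} is the network/broadcast address")
        if ip == gateway:
            problems.append(f"{name}: {ip} collides with the gateway")
        if ip in seen:
            problems.append(f"{name}: {ip} already used by {seen[ip]}")
        seen[ip] = name
    return problems


if __name__ == "__main__":
    issues = validate_plan(SUBNET, NODES, VIPS, GATEWAY)
    print("plan OK" if not issues else "\n".join(issues))
```

Placing the API VIP on `10.100.40.1` would, under the gateway assumption, be reported as a collision.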
## 3. Ansible Role Design (`infrastructure/ansible/`)
We will create a new role `k3s` and a corresponding playbook.
* **Inventory (`inventory.ini`)**:
  * Add a `[k3s_masters]` group.
* **Role: `k3s`**:
  * **Task: System Prep:** Install `open-iscsi`, `nfs-common`, and `curl`. Configure sysctl for bridged traffic.
  * **Task: Install K3s (First Node):**
    * Exec: `curl -sfL https://get.k3s.io | sh -s - server <Args>` (`sh -s -` passes the words after `-` to the installer script; a plain `sh -` would pass none).
    * Args: `--cluster-init --token <Secret> --disable traefik --disable servicelb --tls-san k3s-api.stabify.de --tls-san 10.100.40.5`
  * **Task: Install K3s (Other Nodes):**
    * Exec: the same installer invocation in `server` mode.
    * Args: `--server https://<First-Node-IP>:6443 --token <Secret>`, plus the same `--disable` and `--tls-san` flags as the first node.
  * **Task: Install Kube-VIP:**
    * Deploy manifest for control-plane HA (ARP mode).
    * Deploy manifest for Service `LoadBalancer` support (ARP mode), taking over from the disabled ServiceLB.
  * **Task: Bootstrap Flux:**
    * Install the Flux CLI.
    * Run `flux bootstrap git ...`.
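
The two install tasks differ only in how each server joins the cluster. A minimal Python sketch that renders the per-node commands; the helper `install_command` and the token values are hypothetical, in the actual role this logic would live in Ansible task templates, and the API VIP `10.100.40.5` in the SAN list is taken from the DNS plan:

```python
# Sketch: render the K3s server install pipeline for each node (hypothetical helper).
import shlex

INSTALL_URL = "https://get.k3s.io"

# Flags shared by all servers, kept identical so that no server
# re-enables the bundled Traefik or ServiceLB.
COMMON_ARGS = [
    "--disable", "traefik",
    "--disable", "servicelb",
    "--tls-san", "k3s-api.stabify.de",
    "--tls-san", "10.100.40.5",  # API VIP (value from the plan)
]


def install_command(node_index: int, first_node_ip: str, token: str) -> str:
    """Build the shell pipeline that installs K3s in server mode on one node."""
    args = ["server", "--token", token]
    if node_index == 0:
        # The first server initialises the embedded etcd cluster.
        args.append("--cluster-init")
    else:
        # Joining servers use the first node's IP; the API VIP only
        # exists once kube-vip has been deployed.
        args += ["--server", f"https://{first_node_ip}:6443"]
    args += COMMON_ARGS
    # `sh -s -` reads the script from stdin and hands the remaining words
    # to it as arguments, which the installer appends to the k3s command.
    return f"curl -sfL {INSTALL_URL} | sh -s - {shlex.join(args)}"


if __name__ == "__main__":
    for i in range(3):
        print(install_command(i, "10.100.40.10", "example-token"))
```

`shlex.join` quotes any argument containing shell metacharacters, so an unusual token cannot break the pipeline.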
## 4. Network & DNS Strategy
* **DNS Records (OPNsense):**
  * `vm-k3s-master-*.stabify.de` -> node IPs (managed by Terraform).
  * `k3s-api.stabify.de` -> `10.100.40.5` (API VIP).
  * `*.k3s.stabify.de` -> `10.100.40.6` (Ingress VIP).
* **Traefik Edge Config (in the K3s cluster):**
  * File provider for TLS passthrough to K3s services.
  * ConfigMap: `traefik-edge-dynamic-k3s`
  * Rule: ``HostSNIRegexp(`^.+\.k3s\.stabify\.de$`)`` (Traefik rule strings are quoted with backticks or double quotes, not single quotes).
  * Target: `10.100.40.6:443` (TLS passthrough).
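
The SNI pattern can be checked against sample hostnames before it is deployed. This Python sketch reuses the pattern verbatim; Python's `re` and Go's `regexp` (which Traefik uses) treat this simple pattern the same way, and `re.fullmatch` ensures that, as in Go, `$` only matches at the true end of the input. The sample hostnames are illustrative:

```python
# Sketch: which SNI hostnames would the Traefik Edge rule pass through?
# The pattern is copied from the HostSNIRegexp rule.
import re

K3S_SNI_PATTERN = re.compile(r"^.+\.k3s\.stabify\.de$")


def routed_to_k3s(server_name: str) -> bool:
    """Return True if a TLS ClientHello SNI value matches the passthrough rule."""
    # fullmatch: a trailing newline cannot slip past `$`.
    return K3S_SNI_PATTERN.fullmatch(server_name) is not None


if __name__ == "__main__":
    samples = [
        "grafana.k3s.stabify.de",  # app host: matches
        "a.b.k3s.stabify.de",      # nested subdomain: matches
        "k3s.stabify.de",          # bare zone: no label before ".k3s"
        "k3s-api.stabify.de",      # API name: not routed through the edge
        "evilk3s.stabify.de",      # lookalike without the dot: no match
    ]
    for host in samples:
        print(f"{host}: {routed_to_k3s(host)}")
```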
## 5. Next Steps for Implementation
1. **Refactor Terraform:** Update `locals.tf` to define 3 masters. Apply to create the VMs.
2. **DNS Update:** Verify OPNsense records.
3. **Ansible Development:** Create `k3s` role.
4. **Execute Ansible:** Deploy Cluster.
5. **Flux Bootstrap:** Link cluster to GitOps repo.
6. **Traefik Edge:** Configure routing.

This plan ensures a clean separation of concerns: Terraform provisions the VMs (with Cloud-Init), Ansible prepares the OS and installs the cluster software, and Flux manages the workloads.