Talos Kubernetes

This repository includes all of the configuration and documentation for my home lab cluster.

This cluster consists of four Intel NUCs running Talos Linux, with a Synology NAS for large data. More information about this repository and cluster is available at https://infrastructure.btkostner.io.

Features

  • Talos Linux cluster with NVMe as a boot drive and SSD for data
  • Argo CD autopilot for cluster bootstrapping
  • Cilium as a kube proxy replacement and sidecar-less networking
  • Rook Ceph for stateful replicated storage for all nodes
  • Velero for offsite cluster backup

Provisioning

This page goes over how to set up a whole new cluster, starting with basic networking, then generating and applying the Talos configuration, and finally bootstrapping Argo CD.

My home cluster is made up of 4x Intel 11th gen i5 NUCs with 32 GB of RAM, an NVMe boot drive, and a 1TB SSD for Ceph. All of the nodes are nearly identical to make things easier, though this isn't a hard requirement. The main reason I chose this configuration is its low power usage, small footprint, and Intel Xe graphics.

I also have a PiKVM connected to a TESmart 4K UHD 16-port HDMI KVM switch. This allows easy BIOS control and booting of images, which makes installing Talos on a cluster much easier.

Node Setup

Node setup is mostly ensuring all of the hardware works and has up-to-date firmware. This is especially important for my Crucial MX500 SSDs, as older firmware versions can randomly disconnect and cause Ceph to go into an unhealthy state.

Once all of the hardware is confirmed working, I set a static IP for the node. This starts at 192.168.3.11 and continues for each node. Talos will set a virtual IP at 192.168.3.10 for the control plane nodes.

One important thing to note is that I use cluster.btkostner.network to point to my cluster virtual IP of 192.168.3.10. Yes, this is a local address, so it will only resolve correctly on my network, but that should be fine.

Generating Talos Configuration

Talos configuration resides in the provision/talos/ directory. It includes a folder of configuration patches as well as a generate.sh script to run the needed talosctl commands.

If you plan to run this for your own cluster, ensure all of the patches in provision/talos/patches are relevant to you.

Once you are ready, you can run the generate.sh script. This will generate your controlplane.yaml, talosconfig, and worker.yaml files. Ensure you back these files up somewhere safe.
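For reference, the heart of a script like generate.sh is a single talosctl gen config invocation. This is a sketch, not the actual script: the cluster name and patch file names are illustrative, and the endpoint reuses the control plane DNS name mentioned above.

```shell
# Sketch of what generate.sh boils down to; patch file names are illustrative.
# Writes controlplane.yaml, worker.yaml, and talosconfig to the current directory.
talosctl gen config btkostner https://cluster.btkostner.network:6443 \
  --output-dir . \
  --config-patch @patches/common.yaml \
  --config-patch-control-plane @patches/controlplane.yaml \
  --config-patch-worker @patches/worker.yaml
```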

Installing Talos

For installing Talos, I grab the latest Talos ISO from GitHub, upload it to my PiKVM, and boot the node into it. This brings up the Talos instance in maintenance mode. I then run this command for the node:

talosctl apply-config --file ./controlplane.yaml --nodes "192.168.3.11" --insecure

This sets up the first control plane node. Then run this command to bootstrap the cluster:

talosctl bootstrap --nodes "192.168.3.11"

This does the initial etcd and Kubernetes resource creation. At this point, 192.168.3.10 should point to the control plane node.
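Before moving on, it's worth confirming the node actually came up. talosctl ships a built-in health check that can run against the freshly generated talosconfig; note that some of the Kubernetes checks will not pass until the CNI is installed later.

```shell
# Point talosctl at the generated credentials and the first node.
talosctl --talosconfig ./talosconfig config endpoint 192.168.3.11
talosctl --talosconfig ./talosconfig config node 192.168.3.11

# Waits for etcd, the API server, and core components to report healthy.
# Expect node-readiness checks to fail until Cilium is installed.
talosctl --talosconfig ./talosconfig health
```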

At this point, I can run the same talosctl apply-config command for all of the other control plane nodes. Then I apply the worker configuration with:

talosctl apply-config --file ./worker.yaml --nodes "192.168.3.14" --insecure

With all of that done, there is now a fresh Kubernetes cluster. Note that because the default CNI is not installed, none of the node networking will work and every Kubernetes node will have a taint on it. They will also reboot every 10 minutes until the Cilium CNI is installed.

Finally I run this command to generate the needed kubectl entry:

talosctl kubeconfig ~/.kube/config --nodes talos.btkostner.network --force
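With the kubeconfig in place, a quick sanity check should show the state described above: all nodes registered, but NotReady and tainted until Cilium is installed.

```shell
# Nodes should all be listed; they will report NotReady until a CNI exists.
kubectl get nodes -o wide

# The taints mentioned above are visible per node.
kubectl describe nodes | grep -i taint
```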

Installing Core Resources

Once the Kubernetes cluster is up, we can start installing software on it. To simplify this process I just run the provision/core/install.sh script, which installs all of the resources in the cluster/argocd directory. Once this initial run completes, Argo CD takes care of syncing all resources, so you should never need to run this script again.
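I haven't reproduced the script here, but a minimal version of what such a bootstrap needs to do is apply the cluster/argocd directory once and then let Argo CD manage itself. Whether that directory is a kustomization is my assumption; check the real script.

```shell
# One-time bootstrap: apply the Argo CD resources, then let Argo CD sync itself.
# Assumes cluster/argocd is a kustomize directory; plain manifests would use -f.
kubectl apply -k cluster/argocd

# Wait for Argo CD to come up before expecting any syncing to happen.
kubectl -n argocd rollout status deploy/argocd-repo-server
```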

Bootstrapping 1Password credentials

The last step of provisioning a cluster is setting up 1Password Connect to handle secrets in Kubernetes. Luckily, Argo will install the 1Password Connect service and the External Secrets Operator, but we have to add the required 1Password secret for your vault ourselves. To do this, I have an install script at provision/1password/install.sh. It uses the 1Password CLI to verify that the 1Password Connect password exists in the 1Password vault (creating it if it doesn't), then copies the password to Kubernetes. If you are running it yourself, verify the variables in the script before running it; it makes heavy assumptions based on my own 1Password setup.
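The rough shape of that bootstrap, heavily simplified, looks like this. All of the vault, item, namespace, and secret names here are hypothetical stand-ins, not the values the real script uses.

```shell
# Hypothetical vault/item/secret names; adjust for your own 1Password layout.
# Read the Connect credentials file out of 1Password with the op CLI.
op read "op://Infrastructure/1password-connect/1password-credentials.json" \
  > 1password-credentials.json

# Mirror the credentials and access token into Kubernetes for the
# Connect server and External Secrets Operator to consume.
kubectl -n external-secrets create secret generic op-credentials \
  --from-file=1password-credentials.json

kubectl -n external-secrets create secret generic onepassword-connect-token \
  --from-literal=token="$(op read 'op://Infrastructure/1password-connect/token')"

# Don't leave credentials sitting on disk.
rm 1password-credentials.json
```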

Networking

This page covers networking for individual nodes, our private internal DNS and gateway, and our public DNS and gateways.

Local Network

My Intel NUC cluster runs on a separate VLAN powered by Ubiquiti.

| Kind | Name | IP |
| --- | --- | --- |
| PiKVM | PiKVM | 192.168.3.2 |
| SnapAV WB-800VPS-IPVM-18 | WattBox | 192.168.3.3 |
| Synology RS1221+ | Behemoth | 192.168.3.4 |
| TESmart 4K UHD 16 Ports HDMI KVM Switch | KVM | 192.168.3.5 |
| Kubernetes control plane | VIP | 192.168.3.10 |
| Intel NUC11PAHI5 | NUC 1 | 192.168.3.11 |
| Intel NUC11PAHI5 | NUC 2 | 192.168.3.12 |
| Intel NUC11PAHI5 | NUC 3 | 192.168.3.13 |
| Intel NUC11PAHI5 | NUC 4 | 192.168.3.14 |
| Intel NUC11PAHI5 | NUC 5 | 192.168.3.15 |
| Kubernetes ingress VIP | cilium-ingress | 192.168.3.50 |
| Kubernetes ingress VIP | cilium-gateway-external-gateway | 192.168.3.51 |

Private Network

To make things easier, I have a Tailscale network for everything. This makes it easy for all of my devices to access private services on the cluster. To make it even easier, I have a full DNS setup with Cloudflare at btkostner.network. All IPs in that zone point to private local network IPs or Tailscale IPs.

Currently the PiKVM and Synology NAS have built-in Tailscale support, so they just work™. This allows me to access my KVM from any device with Tailscale set up by opening a browser and going to https://kvm.btkostner.network, and similarly my NAS at https://behemoth.btkostner.network.

For the Kubernetes cluster.... TODO....

Public Network

For public networking it's a pretty standard Kubernetes setup. One abnormal thing about my setup is that I use the new (and totally awesome) Kubernetes Gateway API. There is a single external-gateway resource that uses MetalLB to allocate the 192.168.3.50 IP address to it. Port forwarding with Ubiquiti lets my house's public IP address accept traffic and route it to my cluster services.

To make this fully work, I also have some ddns jobs on the cluster that set the required Cloudflare records pointing to my house public IP address.
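A ddns job of this sort reduces to a couple of Cloudflare API calls. This is an illustrative standalone version, not the cluster's actual job: the token, zone ID, record ID, and record name are all placeholders you would supply yourself.

```shell
# Placeholders: CF_API_TOKEN, CF_ZONE_ID, and CF_RECORD_ID must be set in the
# environment. Look up the current public IP, then update the A record to match.
IP="$(curl -fsS https://ipv4.icanhazip.com)"

curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data "{\"type\":\"A\",\"name\":\"btkostner.network\",\"content\":\"${IP}\",\"proxied\":false}"
```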

Secrets

As you may have noticed, nothing in this repository is a secret. There are no encrypted files. This is because we use 1Password to manage all of our secrets in the cluster, via 1Password Connect and the External Secrets Operator.

In order to get this working as intended, there is one provisioning step that needs to run in order to add the required 1Password authentication into the cluster. Once that is done, everything else should be automated and ready to go.

Storage

This cluster uses Rook Ceph to provide distributed storage across all nodes. Each Intel NUC contributes its SSD to a shared Ceph cluster, which is then exposed to workloads via Kubernetes StorageClasses. An external Synology NAS provides additional NFS storage for large media files.

Ceph Cluster

The Ceph cluster is deployed via the rook-ceph Helm chart (v1.19.3) in the rook-ceph namespace. It uses all available nodes and devices automatically.

| Component | Count | Resources |
| --- | --- | --- |
| Monitors (mon) | 3 | 500m CPU, 512Mi memory |
| Managers (mgr) | 2 | 500m CPU, 512Mi memory |
| OSDs | auto (all devices) | 1000m CPU, 1024Mi memory |

Key features enabled:

  • Dashboard — accessible via HTTP route for cluster monitoring
  • Monitoring — Prometheus metrics are exported
  • Disk prediction — the diskprediction_local module is enabled for drive health forecasting
  • PG autoscaler — automatically tunes placement group counts
  • CSI read affinity — reads are served from the closest OSD based on topology
  • Discovery daemon — automatically detects new devices

The Ceph cluster data directory is stored at /var/lib/rook on each node.
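If the Rook toolbox deployment is enabled (an optional extra that I'm not certain this repository deploys), the raw Ceph cluster state can be inspected directly:

```shell
# Requires the optional rook-ceph-tools toolbox deployment to be running.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
```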

Block Pools

Block pools define how data is replicated across the Ceph cluster. Two pools exist with different durability guarantees.

block-pool (Non-Replicated)

| Property | Value |
| --- | --- |
| Failure domain | host |
| Replication size | 1 (no redundancy) |
| RBD mirroring | enabled (image mode) |
| RBD stats | enabled |

Warning: This pool has no data redundancy. If a single OSD or node is lost, data in this pool is lost. Use only for data that is easily recreatable or non-critical.

safe-block-pool (Replicated)

| Property | Value |
| --- | --- |
| Failure domain | host |
| Replication size | 3 (full redundancy) |
| RBD mirroring | enabled (image mode) |
| RBD stats | enabled |

This pool stores three copies of every block across different hosts, surviving up to two simultaneous host failures.

Object Store

object-store-replicated (S3-Compatible)

A Ceph Object Store providing S3-compatible storage with the following configuration:

| Component | Failure Domain | Strategy | Details |
| --- | --- | --- | --- |
| Metadata pool | host | Replicated | 3 copies |
| Data pool | host | Erasure coded | 2 data + 1 coding chunk |
| Gateway | | | 2 instances, port 80 |

Pool preservation is enabled (preservePoolsOnDelete: true), so data pools are retained even if the object store resource is deleted.

Storage Classes

Storage classes are the primary interface for workloads to request storage. The cluster defines three storage classes.

Block Storage Classes

All block storage classes use the rook-ceph.rbd.csi.ceph.com provisioner with ext4 filesystem, Retain reclaim policy, Immediate volume binding, and volume expansion enabled.

| Storage Class | Pool | Replication | Default |
| --- | --- | --- | --- |
| ceph-block | block-pool | 1× | ✅ Yes |
| ceph-block-replicated | safe-block-pool | 3× | No |

Object Storage Classes

| Storage Class | Object Store | Provisioner |
| --- | --- | --- |
| ceph-object-replicated | safe-object-store | rook-ceph.ceph.rook.io/bucket |

Choosing a Storage Class

  • ceph-block — Use for non-critical, easily recreatable data where performance matters more than durability (1× replication).
  • ceph-block-replicated — Use for important application data that must survive node failures (3× replication).
  • ceph-object-replicated — Use for S3-compatible object/bucket storage with erasure coding.
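Requesting storage is then just a matter of naming the class in a PVC. A minimal replicated-block example looks like this; the PVC name and namespace are made up for illustration.

```shell
# Illustrative PVC; the name and namespace are hypothetical.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-config
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block-replicated
  resources:
    requests:
      storage: 10Gi
EOF
```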

NFS Storage

The Synology NAS provides NFS volumes for large media and download directories. NFS PVCs use the nfs storage class with ReadWriteMany access mode, allowing multiple pods to mount the same volume simultaneously. These are typically sized at 1Mi as nominal placeholders since the actual storage is managed by the NAS.
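Those nominal 1Mi placeholders look like this in practice (again, the PVC name here is illustrative, not one of the actual volumes listed below):

```shell
# 1Mi is a nominal request; actual capacity is governed by the NAS.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-media
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs
  resources:
    requests:
      storage: 1Mi
EOF
```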

Volume Inventory

By Namespace

developer

| PVC Name | Storage Class | Size | Access Mode | Application |
| --- | --- | --- | --- | --- |
| opencode-data | ceph-block-replicated | 50Gi | ReadWriteOnce | OpenCode |

download

| PVC Name | Storage Class | Size | Access Mode | Application |
| --- | --- | --- | --- | --- |
| lidarr-config | ceph-block-replicated | 20Gi | ReadWriteOnce | Lidarr |
| lidarr-download | nfs | 1Mi | ReadWriteMany | Lidarr |
| lidarr-media | nfs | 1Mi | ReadWriteMany | Lidarr |
| prowlarr-config | ceph-block-replicated | 20Gi | ReadWriteOnce | Prowlarr |
| radarr-config | ceph-block-replicated | 20Gi | ReadWriteOnce | Radarr |
| radarr-download | nfs | 1Mi | ReadWriteMany | Radarr |
| radarr-media | nfs | 1Mi | ReadWriteMany | Radarr |
| sabnzbd-config | ceph-block-replicated | 20Gi | ReadWriteOnce | SABnzbd |
| sabnzbd-download | nfs | 1Mi | ReadWriteMany | SABnzbd |
| seerr-config | ceph-block-replicated | 10Gi | ReadWriteOnce | Seerr |
| sonarr-config | ceph-block-replicated | 20Gi | ReadWriteOnce | Sonarr |
| sonarr-download | nfs | 1Mi | ReadWriteMany | Sonarr |
| sonarr-media | nfs | 1Mi | ReadWriteMany | Sonarr |

home

| PVC Name | Storage Class | Size | Access Mode | Application |
| --- | --- | --- | --- | --- |
| n8n-config | ceph-block-replicated | 10Gi | ReadWriteOnce | n8n |

media

| PVC Name | Storage Class | Size | Access Mode | Application |
| --- | --- | --- | --- | --- |
| autoscan-media | nfs | 1Mi | ReadWriteMany | Autoscan |
| ersatztv-config | ceph-block-replicated | 5Gi | ReadWriteOnce | ErsatzTV |
| ersatztv-media | nfs | 1Mi | ReadWriteMany | ErsatzTV |
| plex-config | ceph-block-replicated | 300Gi | ReadWriteOnce | Plex |
| plex-media | nfs | 1Mi | ReadWriteMany | Plex |
| tautulli-config | ceph-block-replicated | 20Gi | ReadWriteOnce | Tautulli |

rook-ceph (via Helm)

| Component | Storage Class | Size | Application |
| --- | --- | --- | --- |
| ClickHouse persistence | ceph-block-replicated | 50Gi | SigNoz |

Summary by Storage Class

| Storage Class | Consumers | Total Ceph Storage |
| --- | --- | --- |
| ceph-block-replicated | 12 | 555Gi |
| nfs | 11 | N/A (NAS-managed) |

Largest Consumers

| Application | PVC | Size |
| --- | --- | --- |
| Plex | plex-config | 300Gi |
| OpenCode | opencode-data | 50Gi |
| SigNoz | ClickHouse persistence | 50Gi |
| Lidarr | lidarr-config | 20Gi |
| Prowlarr | prowlarr-config | 20Gi |
| Radarr | radarr-config | 20Gi |
| SABnzbd | sabnzbd-config | 20Gi |
| Sonarr | sonarr-config | 20Gi |
| Tautulli | tautulli-config | 20Gi |

Configuration Reference

All Rook Ceph resources are defined in:

cluster/rook-ceph/rook-ceph/
├── kustomization.yaml
├── values.yaml
└── resources/
    ├── ceph-block-pool.yaml
    ├── ceph-block-pool-replicated.yaml
    ├── ceph-cluster.yaml
    ├── ceph-object-store-replicated.yaml
    ├── storage-class-ceph-block.yaml
    ├── storage-class-ceph-block-replicated.yaml
    └── storage-class-ceph-object-replicated.yaml

Individual application PVCs are defined alongside their deployments in:

cluster/<namespace>/<application>/resources/pvc-*.yaml

Cilium Upgrades

Upgrading Cilium is... painful. Because it runs as the networking backbone, it's very easy to do something wrong and completely break your cluster. As always, read the official Cilium upgrade documentation, but this is how I've been upgrading it so far:

  1. Run the preflight checks with helm:
helm --kube-context admin@btkostner install cilium-preflight cilium/cilium --version 1.15.0 --namespace=kube-system --set preflight.enabled=true --set agent=false --set operator.enabled=false

This will pull all of the container images for each node to reduce issues.

  2. Cordon all of the nodes. This will prevent them from trying to upgrade in place and causing a huge cluster-wide outage.

  3. Apply the upgrade. This should be done via Argo by merging in a helm version upgrade. Make sure the helm values file is updated for the new version.

  4. You should see the cilium operator container start up on the cluster. Make sure no errors are present.

  5. Start rebooting nodes. I usually start with the head controller node, which is 192.168.3.11 on my cluster. You can accomplish a standard reboot with the talosctl utility like so:

talosctl reboot --debug --wait --timeout 30s -n 192.168.3.11

Note: if you run into issues, it's helpful to use the KVM console to see what's going on. You can also force a reboot with Ctrl+Alt+Del.

This should reboot the node safely and bring it back into the cluster. Because it's the same host, it will still be cordoned on startup. You can uncordon it and check the status of the cilium agent on the node. It should be running and healthy.

  6. Continue rebooting all of the other nodes. Same as above, though now that the controller node is back up, you should not get any weird dashboard outages or issues.

  7. Once all nodes are rebooted, you can uncordon them all and the cluster should be back to normal.
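The cordon/reboot/uncordon cycle above can be scripted per node. A sketch of that loop, assuming the Kubernetes node names match the IPs in this document (substitute the names from kubectl get nodes if they don't):

```shell
# Cordon everything first so rebooted nodes don't take on work mid-upgrade.
# Node names are assumed to match these IPs; adjust to your kubectl node names.
NODES="192.168.3.11 192.168.3.12 192.168.3.13 192.168.3.14"
for node in $NODES; do kubectl cordon "$node"; done

# Reboot one node at a time, waiting for it to rejoin before moving on.
for node in $NODES; do
  talosctl reboot --debug --wait --timeout 30s -n "$node"
  kubectl uncordon "$node"
done

# Confirm every Cilium agent pod is healthy afterwards.
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
```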