Homelab & Nerding

Goodbye Azure Kubernetes, Hello Hetzner?

On to plan D!

After a few good months of experimentation since my last post, I’ve come to the conclusion that Azure is not the place to meet my goals.

Let's recap the goals, and how well Azure and my deployment skills were able to meet each:

  1. Migrate OhanaAlpine workloads – Azure: 3, My skills: 5

    I got my static sites migrated (including fixing the build pipeline on one), SSL cert generation working via LetsEncrypt, and MariaDB and WordPress running as single-pod deployments. I learned so much doing this – which was one of the primary goals – but ultimately ran into some Azure limitations that were showstoppers. More on those below.
  2. Single node and ability to scale – Azure: 5, My skills: 4

    Single node was decent enough, and while I know Azure could scale out capably, I never got the single-node deployment working well enough to want to take it to multi-node. Some Azure performance limitations stopped that.
  3. Less than $50/mo, and lower is better – Azure: 2, My skills: 3

    When I bailed, my bill was running close to $70/mo for 1x Standard_B2s worker node, a load balancer, Azure Disk and Azure File storage, and very minimal egress charges.
  4. No single point of failure / self-restoring pods – Azure: 4, My skills: 4

    It's hard to get to this goal with a single node, so I worked to make the workloads self-healing in the event of a node rebuild. Using the GitLab Agent for Kubernetes – a good tool and a carryover from the Linode days – made deployments super-simple. If a node gets blown away, the agent pulls all the manifests from a GitLab repo and rebuilds the workloads. Works like a charm. The only catch is that it doesn't put data back – that is, databases will be empty and WordPress will be stock, out-of-the-box new. Using Azure Files as a read-write-many (RWX) filesystem was going to be key to this, and it worked for manual reloads, but the performance wasn't there to pursue the automation further, nor was it nearly good enough to use as the backing filesystem for WordPress's wp-content directory.
  5. Flexible for other deployments – Azure: n/a, My skills: n/a

    I never got all of my core workloads going in a way I was ready to call production-capable, so I never branched out into other workloads.

As alluded to above, WordPress with Azure Files as ANY part of the backing store is a no-go performance-wise. This was enough to put me off Azure, as it's near impossible to fix without seriously scaling up and spending $$$. With an Azure Files share mapped into wp-content, I was getting 30-second page load times, and that's with images failing to load. Mind you, that's with Azure's HDD "Transaction Optimized" tier. It's a known problem with Azure Files and large numbers of small files.
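For reference, the setup that fell over looked roughly like this: a ReadWriteMany claim against AKS's built-in azurefile storage class, mounted over wp-content. This is a simplified sketch from memory – the names are illustrative, not my actual manifests.

```yaml
# Sketch of the wp-content-on-Azure-Files setup (illustrative names).
# "azurefile" is AKS's built-in standard file share class; ReadWriteMany
# is what made it attractive in the first place.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wp-content-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
        - name: wordpress
          image: wordpress:6
          volumeMounts:
            - name: wp-content
              mountPath: /var/www/html/wp-content   # every page load hits the file share
      volumes:
        - name: wp-content
          persistentVolumeClaim:
            claimName: wp-content-pvc
```

Functionally it worked – the killer was purely the latency of all those small-file reads going over the share.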

Plenty of other folks have reported the same Azure Files performance problem with large numbers of small files under WordPress, so this isn't just me.

The path forward is either to go with Azure Disk (which can't be shared across nodes, so it doesn't scale out), scale up to configurations way out of budget, or move to a platform other than Azure.

And with that, I cancelled my Azure subscription and deleted all my data there. 

I put a question out to the collective wisdom of Reddit, and they came back with a few good options for providers that would meet my needs:

  • Hetzner – super-affordable and highly regarded, but no managed Kubernetes.
  • Digital Ocean – affordable with managed Kubernetes
  • Civo
  • Vultr
  • Symbiosis.host
  • Linode

Right now, I’m leaning strongly towards Hetzner, even if that means spinning up my own Kubernetes cluster. Their pricing is such that I can do 3 control nodes, 3 worker nodes, a load balancer, and a mix of storage for $48/mo or so.
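If I do end up rolling my own on Hetzner, one likely option is k3s with embedded etcd across the three control nodes – the same k3s I tinkered with back in the Nanode experiment, just with real resources this time. A rough, untested sketch of the per-node config (k3s reads /etc/rancher/k3s/config.yaml; the token and IPs here are placeholders):

```yaml
# /etc/rancher/k3s/config.yaml on the first control-plane node
# (untested sketch; token and IPs are placeholders)
cluster-init: true                  # bootstrap a new embedded-etcd cluster
token: "REPLACE_WITH_SHARED_SECRET"
tls-san:
  - "10.0.0.100"                    # e.g. a Hetzner load balancer in front of the API

# The second and third control-plane nodes reuse the same token but swap
# `cluster-init` for a `server:` entry pointing at the first node, e.g.
#   server: "https://10.0.0.11:6443"
# Worker nodes join as agents with the same `server:` and `token:` values.
```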

Folks have pointed out that I could pay for static hosting and WordPress hosting for less cost and hassle than I’m going through with this, but I point out that the end product is only a small part of the goal. I’ve learned so much about Kubernetes and automation doing this so far, and I’m not done yet!

Homelab & Nerding

Goodbye Linode Kubernetes, Hello… Azure Kubernetes?

On to plan C!

After a fair amount of googling and noodling, I've come to the conclusion that Linode's LKE Kubernetes service can't do what I want it to, at least not in a way that doesn't feel hacky or get expensive. My goals – which working on this migration has helped sharpen:

  1. Migrate my OhanaAlpine VPS docker workloads over to Kubernetes.
  2. Do so in a way that can run comfortably on a single node, or scale up to 3-5 for testing  / research / upgrades.
  3. Not be cost-prohibitive. << $50/mo, the lower the better.
  4. Not have a single point of failure, even in the single node config. That means that if the single node got recycled that it’d be able to reconstitute itself including data (DB, app data, files) from backup as part of the rebuild if necessary.
  5. Be flexible for other deployments.

LKE hit all but #4, and I could not for the life of me figure out a way to get there that wasn't kludgey. Here's why:

  • Persistent storage doesn't persist across node recycles. That is, if I put MariaDB tables out on a PV/PVC block storage volume, it doesn't get re-attached if the node is recycled and built from scratch.
  • Linode doesn't offer RWX access for PVs. That is, a block storage volume can only be attached to one node at a time (see the sketch after this list).
  • Relatedly, there's no easy, obvious way to do shared storage across nodes / pods. I looked into Longhorn, which might do the trick, but it depends on at least one node in the cluster staying up. I know that should be the norm, but it violates #4.
  • I thought about S3 object storage, either as primary shared storage (I don't think RWX is required for that) or as backup storage for Longhorn to bootstrap from. It all felt overly complicated and rickety to set up. Yandex S3 was the lead option, and while it kinda got close, it wasn't really a proven option. I may circle back to this one day.
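To make the RWX limitation concrete: a claim like the one below (class name as I recall it from the Linode CSI driver; the claim name is hypothetical) only works with ReadWriteOnce. Ask for ReadWriteMany and, as far as I could tell, it just sits Pending.

```yaml
# Hypothetical claim illustrating the limitation: Linode block storage is
# single-attach, so a ReadWriteMany request never binds.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany                 # only ReadWriteOnce is supported
  storageClassName: linode-block-storage-retain
  resources:
    requests:
      storage: 10Gi
```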

What I really wanted was a file storage service from Linode, sorta like EFS from AWS. If I could reliably and securely mount an NFS share in a pod or across pods, that would have solved most of my problems, or at least been a non-hacky way to achieve my goals. Why doesn’t Linode offer this?  Oh, I could have spun up my own, but that’s more cost and complexity. Not out of the question in the future but feels like too heavy of a lift for now.
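To be clear, Kubernetes itself is happy to mount NFS straight into a pod spec – what's missing is a managed service on the other end of that mount. Something like this (server address and export path are hypothetical) is all it would take:

```yaml
# What a managed EFS/NFS-style mount would look like from the pod's side
# (server address and export path are hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: nfs-client-example
spec:
  containers:
    - name: web
      image: nginx:stable
      volumeMounts:
        - name: shared
          mountPath: /usr/share/nginx/html
  volumes:
    - name: shared
      nfs:
        server: nfs.example.internal   # the managed file service Linode doesn't offer
        path: /exports/web
```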

So, what's a hacker to do? Without changing my requirements (I'm looking at you, item #4…), the next step is looking at alternative Kubernetes hosting. After surveying Digital Ocean, AWS, GCP, and others, it seems like Azure (AKS) is the best way forward. I think it's a super-capable platform, but I'm not totally crazy about it because even a minimal cluster looks expensive. It comes with a $200 credit to use in the first month and a bunch of free services for the first 12 months, which should give me time to get built and see what steady-state costs are going to be. I might yet fail on #3, but at least I'm learning, right?

Homelab & Nerding

Hello, Kubernetes!

As a matter of learning, and to get my personal sites off of the cobbled VPS where they’ve happily lived for a while, I took on migrating them all to a Kubernetes cluster. How hard could it be, right? Or rather, how many learning opportunities could there be in this endeavor? Let’s discuss a few.

K3S on a Linode Nanode will work, right? I figured I'd try it and see. On a train ride from NYC to NC, I built out K3S on 2x 1GB/1CPU VPSes, and it was… alright. I didn't end up with enough useful capacity afterwards to actually deploy much, but it built. I tore that down and moved on to plan B – if I have to move up to a (slightly) more expensive VPS, for the same price why not have Linode do the Kubernetes control plane for me?

So far, Linode Kubernetes Engine (or LKE) has been solid. I configured it with the GitLab Agent for Kubernetes (aka agentk) and made that pull-based tool the core of my CI configuration. Once set up – and it's straightforward to set up – I check a manifest YAML into the agent's repository, and the agent pulls it down and applies it.
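For anyone curious, the pull behavior is driven by a small config file in the agent's own project, roughly like this. (The project path and glob are placeholders, and GitLab has been evolving this GitOps workflow, so double-check the docs for your version.)

```yaml
# .gitlab/agents/<agent-name>/config.yaml (sketch; placeholder project path)
# The agent watches the listed project and applies any manifests that
# match the glob whenever they change.
gitops:
  manifest_projects:
    - id: "mygroup/k8s-manifests"
      paths:
        - glob: "/manifests/**/*.yaml"
```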

All was good until I kept failing to pull a container image from a private repository of mine. They say you learn by failing fast and often, and I did learn.

  1. A container image pulled from a private registry with no namespace descriptors in the manifest failed consistently with an error:

"Failed to pull image <registry URL>: rpc error: code = Unknown desc = Error response from daemon: Head <registry URL /w tag>: denied: access forbidden

This failed no matter how I created the container registry secret, and regardless of where I configured its use.

  2. When I made that same project public in GitLab, after a few hours' pause, it pulled and provisioned successfully. This made me think it might be an auth problem and not a connectivity problem between my LKE nodes and GitLab. Win. I deleted this and made the project private again.
  3. I created a new namespace, added the GitLab project token as a secret in the new namespace, and tried the same private project in that namespace, using namespace directives in the manifest YAML – and it worked. I had connection and authorization in an explicit namespace. Win.
  4. I created a newly named secret in the default namespace, took the same project definition, and changed the namespace descriptors in the manifest YAML from newNamespace to default (along with the imagePullSecrets), and it worked. Win.

So, I got it working in a sane and reproducible way, but I’m still not sure why it failed in the first place. It’s like agentk wasn’t looking in default for the imagePullSecrets until default was explicitly declared in the manifest. It’s not obvious to me where it was looking though.
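For anyone hitting the same wall, the shape that finally worked for me looks roughly like this – the names are placeholders, not my actual project.

```yaml
# Explicitly declare the namespace (even when it's "default") and reference
# the registry secret that lives in that same namespace. Placeholder names.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-private-app
  namespace: default                    # explicit, even though it's the default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-private-app
  template:
    metadata:
      labels:
        app: my-private-app
    spec:
      imagePullSecrets:
        - name: gitlab-registry-creds   # docker-registry secret in "default"
      containers:
        - name: app
          image: registry.gitlab.com/mygroup/myproject/app:latest
```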

Manifests are now triggering the successful pull and deployment of both public and private packages, and I’ve learned an amazing amount about deployments and secrets and namespaces and private registry auth and Kubernetes details.

Next up, Ingress with Nginx!

(Parts of this were re-used from a bug comment I made on this thread. In retrospect, I don’t think the problem I was having was exactly the one in the bug description, but it’s close, and this thread was helpful in me figuring out what was going on with my private container auth.)