Homelab & Nerding

Goodbye Azure Kubernetes, Hello Hetzner?

On to plan D!

After a few good months of experimentation since my last post, I’ve come to the conclusion that Azure is not the place to meet my goals.

Let’s recap the goals, and how well Azure and my deployment skills were able to meet each:

  1. Migrate OhanaAlpine workloads – Azure: 3, My skills: 5

    I was able to get my static sites migrated – including fixing the build pipeline on one – plus SSL cert generation via Let’s Encrypt, and MariaDB and WordPress running as single-pod deployments. I learned so much doing this – which was one of the primary goals of doing it – but ultimately ran into some Azure limitations that were showstoppers. More on those below.
  2. Single node and ability to scale – Azure: 5, My skills: 4

    Single node was decent enough, and while I know Azure could scale out capably, I never got the single-node deployment working well enough to want to take it to multi-node. Some Azure performance limitations stopped that.
  3. Less than $50/mo, and lower is better – Azure: 2, My skills: 3

    When I bailed, my bill was running close to $70/mo for 1x Standard_B2s worker node, a load balancer, Azure Disk and Azure File storage, and very minimal egress charges.
  4. No single point of failure / self-restoring pods – Azure: 4, My skills: 4

    It’s hard to get to this goal with a single node, so I worked to make the workloads self-healing in the event of a node rebuild. The GitLab Agent for Kubernetes – a good tool and a carryover from the Linode days – made deployments super-simple. If a node gets blown away, the agent pulls all the manifests from a GitLab repo and rebuilds the workloads. Works like a charm. The only catch is that it doesn’t put data back – that is, databases will be empty and WordPress will be stock out-of-the-box new. Using Azure Files as a read-write-many (RWX) filesystem was going to be key to this, and it worked for manual reloads, but the performance wasn’t there to pursue the automation further, nor was it nearly good enough to use as the backing filesystem for WordPress’s wp-content directory.
  5. Flexible for other deployments – Azure: n/a, My skills: n/a

    I never got all of my core workloads going in a way I was ready to call production-capable, so I never branched out into other workloads.

As alluded to above, WordPress with Azure Files as ANY part of the backing store is a no-go performance-wise. This was enough to put me off of Azure, as it’s near impossible to fix without seriously scaling up and spending $$$. With a Files share mapped into wp-content, I was getting 30-second page load times, and that’s with images failing to load. Mind you, that’s with Azure’s HDD “Transaction Optimized” tier. It’s a known problem with Azure Files handling large numbers of small files.
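
For reference, the shape of what I was testing looked roughly like the sketch below – the names, sizes, and image tag are illustrative, not my exact manifests. The key parts are the ReadWriteMany claim on AKS’s built-in azurefile storage class and the mount over wp-content.

```yaml
# Rough sketch, not my exact manifests: an RWX Azure Files claim
# mounted over wp-content. Names, sizes, and image tag are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wp-content-files
spec:
  accessModes:
    - ReadWriteMany             # Azure Files supports RWX; Azure Disk does not
  storageClassName: azurefile   # AKS built-in standard-tier ("Transaction Optimized") class
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      containers:
        - name: wordpress
          image: wordpress:6
          volumeMounts:
            - name: wp-content
              mountPath: /var/www/html/wp-content   # this mount is where the slowness showed up
      volumes:
        - name: wp-content
          persistentVolumeClaim:
            claimName: wp-content-files
```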

The path forward is either to go with un-scalable Azure Disk, scale up to way-out-of-budget configurations, or move to a platform other than Azure.

And with that, I cancelled my Azure subscription and deleted all my data there. 

I put a question out to the collective wisdom of Reddit, and they came back with a few good options for providers that would meet my needs:

  • Hetzner – super-affordable and highly regarded, but no managed Kubernetes.
  • Digital Ocean – affordable with managed Kubernetes
  • Civo
  • Vultr
  • Symbiosis.host
  • Linode

Right now, I’m leaning strongly towards Hetzner, even if that means spinning up my own Kubernetes cluster. Their pricing is such that I can do 3 control nodes, 3 worker nodes, a load balancer, and a mix of storage for $48/mo or so.
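
If I go the self-managed route, the rough plan would be kubeadm with a stacked-etcd control plane and the Hetzner load balancer fronting the API server. A minimal sketch of the kubeadm config – the endpoint name, version, and pod subnet are placeholders, not a finished build:

```yaml
# Minimal kubeadm sketch for an HA control plane behind a load balancer.
# Endpoint name, Kubernetes version, and subnet are placeholders.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.26.0"
controlPlaneEndpoint: "k8s-api.example.internal:6443"   # Hetzner LB in front of the 3 control nodes
networking:
  podSubnet: "10.244.0.0/16"
```

Each of the three control nodes would join through that shared endpoint (kubeadm join with --control-plane), and the three workers would join normally. K3s with embedded etcd is the other candidate for the same layout.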

Folks have pointed out that I could pay for static hosting and WordPress hosting for less cost and hassle than I’m going through with this, but I point out that the end product is only a small part of the goal. I’ve learned so much about Kubernetes and automation doing this so far, and I’m not done yet!

Homelab & Nerding

Goodbye Linode Kubernetes, Hello… Azure Kubernetes?

On to plan C!

After a fair amount of googling and noodling, I’ve come to the conclusion that Linode’s LKE Kubernetes service can’t do what I want it to, at least not in a way that doesn’t feel hacky or get expensive. My goals – and working on this migration has helped sharpen these – are:

  1. Migrate my OhanaAlpine VPS docker workloads over to Kubernetes.
  2. Do so in a way that can run comfortably on a single node, or scale up to 3-5 nodes for testing / research / upgrades.
  3. Not be cost-prohibitive – well under $50/mo, the lower the better.
  4. Not have a single point of failure, even in the single-node config. That means that if the single node got recycled, it’d be able to reconstitute itself – including data (DB, app data, files) from backup – as part of the rebuild if necessary.
  5. Be flexible for other deployments.

LKE hit all but #4, and I could not for the life of me figure out a way to do it that wasn’t kludgey. Here’s why:

  • Persistent storage doesn’t persist across node recycles. That is, if I put MariaDB tables out on a PV/PVC block storage volume, it doesn’t get re-attached if the node is recycled and built from scratch.
  • Linode doesn’t offer RWX access for PVs. That is, a block storage volume can only be attached to one node (and therefore one pod) at a time – see the sketch after this list.
  • Relatedly, there’s no easy, obvious way to do shared storage across nodes / pods. I looked into Longhorn, which might do the trick, but it depends on at least one node in the cluster staying up. I know that should be the norm, but it violates #4.
  • I thought about S3 object storage, either as primary shared storage (I don’t think RWX is required for that) or as backup storage for Longhorn to bootstrap with. It all felt overly complicated and rickety to set up. Yandex S3 was the lead option, and while it kinda got close, it wasn’t really a proven option. I may circle back to this one day.
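
To make the RWX point concrete, here’s a sketch of the two kinds of claims – the first is what Linode Block Storage can satisfy, the second is what goal #4 really wants. The names are made up, and the storage class name is from memory, so check the CSI driver docs.

```yaml
# What Linode Block Storage can satisfy: single-attach (RWO).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mariadb-data
spec:
  accessModes:
    - ReadWriteOnce               # attachable to one node (and its pods) at a time
  storageClassName: linode-block-storage-retain   # name from memory; check the CSI driver docs
  resources:
    requests:
      storage: 10Gi
---
# What goal #4 really wants: a shared RWX claim. With no Linode-provided
# backend offering RWX, a claim like this has nothing to bind to.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-app-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```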

What I really wanted was a file storage service from Linode, sorta like EFS from AWS. If I could reliably and securely mount an NFS share in a pod or across pods, that would have solved most of my problems, or at least been a non-hacky way to achieve my goals. Why doesn’t Linode offer this? Oh, I could have spun up my own, but that’s more cost and complexity. Not out of the question in the future, but it feels like too heavy a lift for now.
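
For illustration, this is the kind of thing an EFS-like service would make trivial: a static NFS-backed PersistentVolume that multiple pods can claim ReadWriteMany. The server address and export path here are hypothetical stand-ins for the managed file service I wish Linode offered.

```yaml
# Hypothetical NFS-backed PV/PVC pair -- server and export path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.example.internal
    path: /exports/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # bind to the static PV above rather than a provisioner
  resources:
    requests:
      storage: 20Gi
```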

So, what’s a hacker to do? Without changing my requirements (I’m looking at you, item #4…), looking at alternative Kubernetes hosting is the next step. Looking at Digital Ocean, AWS, GCP, and others, it seems like Azure (AKS) is the best way forward. I think it’s a super-capable platform, but I’m not totally crazy about it because it looks expensive for even a minimal cluster. It comes with a $200 credit to use in the first month and a bunch of free services for the first 12 months, and that should give me time to get built and see what steady-state costs are going to be. I might yet fail on #3, but at least I’m learning, right?

Homelab & Nerding

Hello, Kubernetes!

As a matter of learning, and to get my personal sites off of the cobbled VPS where they’ve happily lived for a while, I took on migrating them all to a Kubernetes cluster. How hard could it be, right? Or rather, how many learning opportunities could there be in this endeavor? Let’s discuss a few.

K3S on a Linode Nanode will work, right? I figured I’d try it and see. On a train ride from NYC to NC, I built out K3S on two 1GB/1CPU VPSes, and it was… alright. I didn’t end up with enough useful capacity afterwards to actually deploy much, but it built. I tore that down and moved on to plan B – if I have to move up to a (slightly) more expensive VPS, for the same price why not have Linode do the Kubernetes control plane for me?
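
For anyone curious what squeezing K3S onto a 1GB node looks like, most of the headroom comes from disabling the bundled extras. This is roughly the server config I’d use – a sketch from memory, not the exact config from the train ride:

```yaml
# /etc/rancher/k3s/config.yaml -- sketch of a slimmed-down k3s server
# for a tiny node; not the exact config I ran.
write-kubeconfig-mode: "0644"
disable:
  - traefik      # skip the bundled ingress controller
  - servicelb    # skip the bundled service load balancer
```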

So far, Linode Kubernetes Engine (or LKE) has been solid. I configured it with the GitLab Agent for Kubernetes (aka agentk), and made that pull-based tool the core of my CI configuration. Once set up – and it’s straightforward to set up – I check a manifest YAML into the agent’s repository, and the agent pulls it down and applies it.
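
For the curious, the agent’s GitOps config is just another file checked into its configuration project, at .gitlab/agents/&lt;agent-name&gt;/config.yaml. This is roughly the shape it had when I set mine up (GitLab has been evolving this workflow, so check the docs for your version); the project path and glob are placeholders:

```yaml
# .gitlab/agents/<agent-name>/config.yaml -- roughly the GitOps config
# shape I used; project path and glob are placeholders.
gitops:
  manifest_projects:
    - id: "mygroup/k8s-manifests"       # repo the agent watches for manifests
      paths:
        - glob: "/manifests/**/*.yaml"  # which files it applies to the cluster
```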

All was good until I kept failing to pull a container image from a private repository of mine. They say you learn by failing fast and often, and I did learn:

  1. A container image pulled from a private registry, with no namespace descriptors in the manifest, failed consistently with an error:

"Failed to pull image <registry URL>: rpc error: code = Unknown desc = Error response from daemon: Head <registry URL /w tag>: denied: access forbidden"

This failed no matter how I created the container registry secret, and regardless of where I configured its use.

  2. When I made that same project public in GitLab, after a few hours’ pause, it pulled and provisioned successfully. This made me think it might be an auth problem and not a connectivity problem between my LKE nodes and GitLab. Win. I deleted this and made the project private again.
  3. I created a new namespace, added the GitLab project token as a secret in the new namespace, and tried the same private project there, using namespace directives in the manifest YAML – and it worked. So I had connectivity and authorization in an explicit namespace. Win.
  4. I created a newly named secret in the default namespace, took the same project definition, and changed the namespace descriptors in the manifest YAML from newNamespace to default (along with the imagePullSecrets) – and it worked. Win. The working shape is sketched below.
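
For reference, the shape that finally worked looked roughly like this – the registry URL, secret name, and image are placeholders rather than my real project paths:

```yaml
# Rough sketch of the working manifest: namespace declared explicitly
# (even though it's default) and imagePullSecrets pointing at the
# docker-registry secret that holds the project token. Names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysite
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysite
  template:
    metadata:
      labels:
        app: mysite
    spec:
      imagePullSecrets:
        - name: gitlab-registry-creds
      containers:
        - name: mysite
          image: registry.gitlab.com/mygroup/mysite:latest
```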

So, I got it working in a sane and reproducible way, but I’m still not sure why it failed in the first place. It’s like agentk wasn’t looking in default for the imagePullSecrets until default was explicitly declared in the manifest. It’s not obvious to me where it was looking though.

Manifests are now triggering the successful pull and deployment of both public and private packages, and I’ve learned an amazing amount about deployments and secrets and namespaces and private registry auth and Kubernetes details.

Next up, Ingress with Nginx!

(Parts of this were re-used from a bug comment I made on this thread. In retrospect, I don’t think the problem I was having was exactly the one in the bug description, but it’s close, and the thread was helpful in figuring out what was going on with my private container auth.)

Homelab & Nerding

Goodbye, CoreOS

For my now-retired at-home hosting, and for my first migration to the cloud, I used CoreOS as my hosting OS. It’s optimized to be a container server and for super-efficient automated deployments. It worked well, and I loved the elegance behind the design, even though I never did multi-node automated CI/CD deployments with it. I’d had a problem with it since the cloud migration, though: every 11 days after a reboot, it would spike CPU and memory and become unresponsive for a few minutes at a time. I could never entirely nail down what was causing it, but best I can tell, logging was getting sticky trying to log failed SSH login attempts. That I had MariaDB/MySQL, a couple of low-activity WordPress sites, and automated Let’s Encrypt cert management running in 2GB of memory wasn’t helping either.

Since I built that version eight months or so ago, the CoreOS people – since acquired by Red Hat – announced the upcoming end-of-life of CoreOS on 26MAY2020. That, combined with my ongoing minor problems, led me to port everything to a new cloud host.

This time I went with a more conventional Linux, and ported everything over in a few hours, along with DR / rebuild documentation. In fact, this post is the first on the new platform. Here’s hoping it’s still happily running in 12 days – and thanks for your service, CoreOS.

Cancer

Personal Disaster Recovery – part one

About this time last year, Susie and I were knee-jerking to the fresh news of my cancer, and the first bucket-list thing we jumped on was a long-discussed trip to Hawaii. It was an utterly amazing trip, and I’m so glad we were able to go. Thanks, too, to Delta for offering me very generous flight credits to be bumped – credits that paid for most of the airfare for both of us.

Susie and I at Waipio Valley, north of Hilo, HI. Life is good. #hansleythuglife

We skipped town just as Hurricane Florence was hitting NC, and in preparation for the storm, I had powered down most of the tech in the house before we left, with instructions for the house-sitter on what to turn on first to get the house back online. In “normal people’s” houses, they turn their wireless router back on and they’re good to go. Not so in a nerd’s home. I had been playing with Windows Server and had it hosting all of the DNS roles for both internal and external hosts, and it had worked pretty seamlessly up until this. When the house-sitter fired everything up, some glitch – either in the Xen server hosting everything, or in the Windows Server startup – kept it from completely starting, which meant that while the network was working, the router was handing out a dead address for DNS, which meant that nothing – Apple TV, the house-sitter’s laptop, etc. – was connecting to the outside world. So, Susie’s driving down from some gorgeous waterfall hike north of Hilo, HI, and I’m on the phone with the house-sitter, trying to talk her through how to set a manual DNS entry on all her devices so she could do work while she was here. We got her going – mostly – and this started a conversation with Susie about how we could simplify tech around the house so that if / when something ever happens to me, she won’t have to call in the techie friends to have them decode home tech and get it working again.

We’d talked about this before, but a little more abstractly, when the husband of a good grad school friend died suddenly in a one-car accident. She had techie friends come over and rip out most of the late husband’s custom home build and install a stock wireless router just so she and her kid could get on the internet. Susie didn’t want to be in that position if at all possible.

This time it wasn’t abstract. The goal was – without killing off technical tinkering, innovation, and learning – to simplify and document things at home so that, should something ever happen to me, friends could step in and easily help keep her running. In the business world, companies talk and think about disaster recovery plans, and if they’re smart, they practice them occasionally – fail over to backup servers, restore things from backup, switch to alternative network feeds. It was time to think like that for all of our home systems, but first we had to make sure that what we were running was set up in the best way.

Roughly speaking, what we had running at home that we had to analyze was:

1. FreeNAS – running 4 fairly old 4T drives in a ZFS array, storing pics, movies, backups, and other files
2. Xen Server – running VMs for Windows Server, Win10 running Blue Iris for the home IP cameras, and two CoreOS instances hosting a bunch of Docker containers – WordPress, MariaDB, Ubiquiti’s UniFi controller, among others.
3. Home Networking – AT&T and Ubiquiti, with a mix of APs in the house and garage.
4. Physical Security – Exterior IP cameras talking to a Blue Iris VM.

More on this saga in the next segment.

Homelab & Nerding

No time for you…

Ever since we made the move to AT&T Uverse Fiber – and happily said “See ya!” to Time Warner / Spectrum – we’ve been a household with clocks adrift. NTP (Network Time Protocol) set up on servers didn’t work (although I blamed myself for botched configurations), NTP on the security cameras didn’t work, and network time sync from Windows and macOS routinely failed. As you might expect, none of the system clocks were in sync, and some were out enough to cause occasional problems. Then I ran across this forum post on AT&T’s forums, which detailed the services / ports that AT&T blocks by default for its residential customers – and port 123, both inbound and outbound, which NTP uses, was on the list. The forum post has the complete list.

So, I get why they do this – a misconfigured NTP server or botnet node can hammer a target server to its knees in an accidental or intentional DDoS. Protecting people from themselves makes sense for NTP and for the other ports they regularly block.

Identifying the problem was half the battle. Getting AT&T to fix it was its own effort. What started as a chat session from the Uverse site went like this:

  1. Chat session with the billing / business group, who kindly said he couldn’t help, and transferred my chat to a tech support group.
  2. The first tech support group changed this from a chat into a call, couldn’t handle my request, and bounced me to a higher-tier support group.
  3. This upper tier support group turned out to be a premium internet support group who would only talk to me if I was paying for premium internet support – which he kindly offered to sign me up for on the call. Annoyed at the prospect of paying for something which I thought I already had, I declined. This guy transferred me to a different internet support group who he said could help.
  4. Internet support group #2 – a different group from item 2 above – couldn’t help me either, but instead transferred me to the Fiber Support group.
  5. A fine support rep from the Fiber Support group knew what I was asking (“Can you unblock port 123 for me?”) and how to do it! Once I agreed that I was taking on risk by unblocking this, she proceeded, and by the end of the 40 minute call, I had at least one of my machines able to set its clock from an NTP server. Success!

The total time on the phone / chat with AT&T to get this resolved was just under an hour, and that was once I knew what the problem was and that it was fixable.

As collateral damage, when the rep made the NTP unblock change on my fiber gateway, they ended up breaking all other inbound ports, disconnecting several of my servers from the internet. No amount of reconfiguring and resetting on my side was able to resolve this. Another call to AT&T fiber support and this was resolved on the first call, but that’s another hour or two of my life spent on troubleshooting and resolving something that wasn’t part of the problem to start with.

So, kudos to the right person at AT&T for being able to fix this, but it took far too much work to realize that this was intentional on AT&T’s part and that there was a fix for it. Total time spent on configuration, troubleshooting, research, support calls, and testing was easily 6-8 hours, but at least I can happily report that all our home systems now believe it’s the same time.