Homelab & Nerding

Goodbye Azure Kubernetes, Hello Hetzner?

On to plan D!

After a few good months of experimentation since my last post, I’ve come to the conclusion that Azure is not the place to meet my goals.

Let’s recap the goals, and how well Azure and my deployment skills were able to meet each:

  1. Migrate OhanaAlpine workloads – Azure: 3, My skills: 5

    I was able to get my static sites migrated – including fixing the build pipeline on one – plus SSL cert generation via Let’s Encrypt, and MariaDB and WordPress running as single-pod deployments. I learned so much doing this – which was one of the primary goals – but ultimately ran into some Azure limitations that were showstoppers. More on those below.
  2. Single node and ability to scale – Azure: 5, My skills: 4

    Single node was decent enough, and while I know Azure could scale out capably, I never got the single-node deployment working well enough to want to take it multi-node. There were some Azure performance limitations that stopped that.
  3. Less than $50/mo, and lower is better – Azure: 2, My skills: 3

    When I bailed, my bill was running close to $70/mo for 1x Standard_B2s worker node, a load balancer, Azure Disk and Azure File storage, and very minimal egress charges.
  4. No single point of failure / self-restoring pods – Azure: 4, My skills: 4

    It’s hard to get to this goal with a single node, so I worked to make the workloads self-healing in the event of a node rebuild. The GitLab Agent for Kubernetes – a good tool and a carryover from the Linode days – made deployments super-simple. If a node gets blown away, the agent pulls all the manifests from a GitLab repo and rebuilds the workloads. Works like a charm. The only catch is that it doesn’t put data back – that is, databases will be empty and WordPress will be stock out-of-the-box new. Using Azure Files as a read-write-many (RWX) filesystem was going to be key to this, and it worked for manual reloads, but the performance wasn’t there to pursue the automation further, nor was it nearly good enough to use as the backing filesystem for WordPress’s wp-content directory.
  5. Flexible for other deployments – Azure: n/a, My skills: n/a

    I never got all of my core workloads going in a way that I was ready to call production-capable, so I never branched out into other workloads.
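For reference, the GitOps behavior from #4 is driven by a config file in the agent’s project. A minimal sketch of what mine looked like – agent name, project path, and glob here are all hypothetical:

```yaml
# .gitlab/agents/ohana-agent/config.yaml  (agent name is hypothetical)
gitops:
  manifest_projects:
    # The repository the agent watches. It applies any matching
    # manifests it finds, and re-applies them after a node rebuild.
    - id: "mygroup/k8s-manifests"      # hypothetical project path
      paths:
        - glob: "/manifests/**/*.yaml"
```

Commit a manifest under that glob and the agent pulls and applies it – which is exactly why a rebuilt node reconstitutes its workloads (but not its data) on its own.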

As alluded to above, WordPress with Azure Files as ANY part of the backing store is a no-go performance-wise. This was enough to put me off of Azure, as it’s near impossible to fix without seriously scaling up and spending $$$. With a Files share mapped in to wp-content, I was getting 30-second page load times – and that’s with images failing to load. Mind you, that’s with Azure’s HDD-backed “Transaction Optimized” tier. It’s a known problem with Azure Files and large numbers of small files.
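For context, the setup itself was straightforward – a sketch with hypothetical names, using AKS’s built-in azurefile storage class:

```yaml
# Sketch only – claim name and size are hypothetical. azurefile is
# AKS’s built-in SMB-backed storage class, and it supports RWX.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wp-content-pvc
spec:
  accessModes:
    - ReadWriteMany        # RWX – the reason to use Files over Disk
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```

The claim then gets mounted into the WordPress container at /var/www/html/wp-content – and that’s exactly the mount that crawled: thousands of small PHP files read over SMB on every page load.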

Plenty of other sites corroborate the Azure Files performance issues with large numbers of small files in WordPress.

The path forward is either to go with un-scalable Azure Disk, scale up to configurations way out of budget, or move to a platform other than Azure.

And with that, I cancelled my Azure subscription and deleted all my data there. 

I put a question out to the collective wisdom of Reddit, and they came back with a few good options for providers that would meet my needs:

  • Hetzner – super-affordable and highly regarded, but no managed Kubernetes.
  • Digital Ocean – affordable with managed Kubernetes
  • Civo
  • Vultr
  • Symbiosis.host
  • Linode

Right now, I’m leaning strongly towards Hetzner, even if that means spinning up my own Kubernetes cluster. Their pricing is such that I can do 3 control nodes, 3 worker nodes, a load balancer, and a mix of storage for $48/mo or so.

Folks have pointed out that I could pay for static hosting and WordPress hosting for less cost and hassle than I’m going through with this, but I point out that the end product is only a small part of the goal. I’ve learned so much about Kubernetes and automation doing this so far, and I’m not done yet!

Homelab & Nerding

Goodbye Linode Kubernetes, Hello… Azure Kubernetes?

On to plan C!

After a fair amount of googling and noodling, I’ve come to the conclusion that Linode’s LKE Kubernetes service can’t do what I want it to, at least not in a way that doesn’t feel hacky and get expensive. My goals – and working on this migration has helped sharpen these:

  1. Migrate my OhanaAlpine VPS docker workloads over to Kubernetes.
  2. Do so in a way that can run comfortably on a single node, or scale up to 3-5 for testing  / research / upgrades.
  3. Not be cost-prohibitive. << $50/mo, the lower the better.
  4. Not have a single point of failure, even in the single node config. That means that if the single node got recycled that it’d be able to reconstitute itself including data (DB, app data, files) from backup as part of the rebuild if necessary.
  5. Be flexible for other deployments.

LKE hit all but #4, and I could not for the life of me figure out a way to do that that wasn’t kludgey. Here’s why:

  • Persistent storage doesn’t persist for node recycles. That is, if I put MariaDB tables out on a PV/PVC block storage volume, it doesn’t get re-attached if the node is recycled and built from scratch.
  • Linode doesn’t offer RWX access for PVs. That is, a block storage volume can only be attached to one node at a time.
  • Related: there’s no easy, obvious way to do shared storage across nodes / pods. I looked into Longhorn, which might do the trick, but it depends on at least one node in the cluster staying up. I know that should be the norm, but it violates #4.
  • I thought about S3 object storage, either as primary shared storage (I don’t think RWX is required for that) or as backup storage for Longhorn to bootstrap with. It all felt overly complicated and rickety to set up. Yandex S3 was the lead option, and while S3 kinda got close, it wasn’t really a proven option. I may circle back to this one day.
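To make the RWX limitation concrete, here’s a sketch (claim name and size hypothetical) of the kind of claim that just sits Pending on LKE, because Linode block storage only supports ReadWriteOnce:

```yaml
# Hypothetical PVC. On LKE this never binds: the linode-block-storage
# provisioner can only attach a volume to one node at a time (RWO),
# so a ReadWriteMany request can't be satisfied.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: linode-block-storage-retain
  resources:
    requests:
      storage: 10Gi
```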

What I really wanted was a file storage service from Linode, sorta like EFS from AWS. If I could reliably and securely mount an NFS share in a pod or across pods, that would have solved most of my problems, or at least been a non-hacky way to achieve my goals. Why doesn’t Linode offer this? Oh, I could have spun up my own, but that’s more cost and complexity – not out of the question in the future, but too heavy a lift for now.
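For what it’s worth, Kubernetes can mount NFS natively, so an EFS-like service would have slotted in cleanly. A sketch of what that would look like – server address and export path are placeholders:

```yaml
# Sketch: a pod mounting an NFS export directly. Kubernetes has a
# built-in nfs volume type; no CSI driver needed for a basic mount.
apiVersion: v1
kind: Pod
metadata:
  name: nfs-demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: shared
          mountPath: /usr/share/nginx/html
  volumes:
    - name: shared
      nfs:
        server: 192.168.1.50   # hypothetical NFS server address
        path: /exports/shared  # hypothetical export path
```

The same volume could be mounted RWX across pods and nodes – which is exactly the #4-friendly behavior block storage couldn’t give me.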

So, what’s a hacker to do? Without changing my requirements (I’m looking at you, item #4…), looking at alternative Kubernetes hosting is the next step. Looking at Digital Ocean, AWS, GCP, and others, it seems like Azure (AKS) is the best way forward. I think it’s a super-capable platform, but I’m not totally crazy about it because it looks expensive for even a minimal cluster. It comes with a $200 credit to use in the first month and a bunch of free services for the first 12 months, and that should give me time to get built and see what steady-state costs are going to be. I might yet fail on #3, but at least I’m learning, right?

Homelab & Nerding

Hello, Kubernetes!

As a matter of learning, and to get my personal sites off of the cobbled VPS where they’ve happily lived for a while, I took on migrating them all to a Kubernetes cluster. How hard could it be, right? Or rather, how many learning opportunities could there be in this endeavor? Let’s discuss a few.

K3S on a Linode Nanode will work, right? I figured I’d try it and see. On a train ride from NYC to NC, I built out K3S on 2x 1GB/1CPU VPSes, and it was… alright. I didn’t end up with enough useful capacity to actually deploy much, but it built. I tore that down and moved on to plan B – if I have to move up to a (slightly) more expensive VPS, why not have Linode run the Kubernetes control plane for me for the same price?
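For anyone curious, the K3S bootstrap really is about this simple – these are the one-liners from the k3s.io quick-start, with the server IP and token as placeholders:

```shell
# On the server (control-plane) node:
curl -sfL https://get.k3s.io | sh -

# On each agent node, joining with the server's node token
# (found at /var/lib/rancher/k3s/server/node-token on the server):
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -
```

The install is the easy part; on 1GB nodes it’s the leftover headroom for workloads that’s the problem.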

So far, Linode Kubernetes Engine (or LKE) has been solid. I configured it with the GitLab Agent for Kubernetes (aka agentk) and made that pull-based tool the core of my CI configuration. Once set up – and it’s straightforward to set up – I check a manifest YAML into the agent’s repository, and the agent pulls it down and applies it.

All was good until I kept failing to pull a container image from a private repository of mine. They say you learn by failing fast and often, and I did learn:

  1. A container image pulled from a private registry with no namespace descriptors in the manifest failed consistently with an error:

"Failed to pull image <registry URL>: rpc error: code = Unknown desc = Error response from daemon: Head <registry URL /w tag>: denied: access forbidden"

This failed no matter how I created the container registry secret, and regardless of where I configured its use.

  2. When I made that same project public in GitLab, after a few hours’ pause, it pulled and provisioned successfully. This made me think it might be an auth problem and not a connectivity problem between my LKE nodes and GitLab. Win. I deleted this and made the project private again.
  3. I created a new namespace, added the GitLab project token as a secret in the new namespace, and tried the same private project there, using namespace directives in the manifest YAML – and it worked. I have connection and authorization in an explicit namespace. Win.
  4. I created a newly named secret in the default namespace, took the same project definition, and changed the namespace descriptors (and the imagePullSecrets) in the manifest YAML from newNamespace to default – and it worked. Win.

So, I got it working in a sane and reproducible way, but I’m still not sure why it failed in the first place. It’s like agentk wasn’t looking in default for the imagePullSecrets until default was explicitly declared in the manifest. It’s not obvious to me where it was looking though.
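For the record, the working combination looked roughly like this – registry secret and workload both pinned to an explicit namespace (all names here are hypothetical):

```yaml
# Hypothetical names throughout. The registry secret was created with
# something like:
#   kubectl create secret docker-registry gitlab-pull \
#     --docker-server=registry.gitlab.com \
#     --docker-username=<token-name> --docker-password=<token> \
#     --namespace default
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysite
  namespace: default           # the explicit declaration that made it work
spec:
  replicas: 1
  selector:
    matchLabels: { app: mysite }
  template:
    metadata:
      labels: { app: mysite }
    spec:
      imagePullSecrets:
        - name: gitlab-pull    # must exist in the same namespace
      containers:
        - name: mysite
          image: registry.gitlab.com/<group>/<project>:latest
```

The key detail: imagePullSecrets are namespace-scoped, so the secret has to live in the same namespace as the pod that references it.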

Manifests are now triggering the successful pull and deployment of both public and private packages, and I’ve learned an amazing amount about deployments and secrets and namespaces and private registry auth and Kubernetes details.

Next up, Ingress with Nginx!

(Parts of this were re-used from a bug comment I made on this thread. In retrospect, I don’t think the problem I was having was exactly the one in the bug description, but it’s close, and the thread was helpful in figuring out what was going on with my private container auth.)


How I Work – An Update on Capture

In my last How I Work post, I was much enamored with Bullet Journaling – or BuJo for short – for note capture. As a recap: I had never used a note-taking system that worked the way my brain did, and as such, nothing ever stuck until Bullet Journaling. It’s solid, simple, flexible capture that just works. I never got the hang of organizing and tracking for retrieval after the fact, so my system ended up being easy-in, serial page-flipping search to get anything out. Still, it was a major improvement over living in my head.

GoodNotes – Redefining Capture

My wife Susie turned me on to an iPad journaling app that she’d found through her Life Coach School and NoBS coaching communities called GoodNotes. GoodNotes is much like a paper journal in that you’re capturing notes and sketches, but rather than ink on paper, the app captures into digital ink. One of the fantastic features was being able to use PDF templates for pages. I had an effective Daily template with some standard gratitude and goals questions that I found helpful to start the day with – many thanks to Robert Terekedis for his [article on effective Digital BuJo](https://blog.euc-rt.me/2020-03-06-organizing-adapted-digital-bullet-journal-ipad-goodnotes/#gsc.tab=0). Robert’s work was a helpful foundation to get me started. Moving to GoodNotes on the iPad with Apple Pencil was a solid upgrade.

Importing PDFs to use as immutable templates is GoodNotes’ superpower.

Feature Rundown

  • Capture of text was every bit as good as the paper BuJo.
  • Having easy access to more (digital) ink colors and line widths was a plus.
  • Erasing easily was a huge plus.
  • Copy and paste of inked text was fantastic for rearranging a day’s notes and for daily and monthly migrations. With a paper Bujo, rewriting everything at the first of a new month as part of migration always struck me as more hassle than benefit. With GoodNotes, it’s lasso/copy/paste and you’re done. That made migrations much easier.
  • Pasting pictures was very helpful. Often, after I’d captured notes from a meeting, I’d go back on the Mac GoodNotes client and paste in useful images and screencaps from the meeting. It was super useful to have notes and images together – not easily doable with the paper BuJo.
  • One of the big expected benefits I was looking forward to with GoodNotes was Search. Since GN converts inked writing to text behind the scenes, it’s all searchable. Unfortunately, the search is disappointingly basic. Sure, you could find the word "lasagne", and it’d show you all occurrences of that word, but you couldn’t search for "lasagne AND vegetable" to find those words on the same page, or "lasagne AND date in JAN2022" to find pages within date ranges. The final disappointment with search was a lack of hashtag searching. I tried tagging my notes with #ALMOND, and it wouldn’t search for the hash symbol – it just ignored it, making hashtagging notes not so useful.

The Verdict

The final verdict: good for capture, but not for long-term storage and retrieval. In my next How Do I Work post, I’ll cover what came next – Obsidian – and the GoodNotes feature that made the transition relatively painless.


How Do I Work – v4.0

I’ve been taking a hard look at how I do work. By that, I mean how I choose what work to do and how I arrange my day to make it happen. There have been a few iterations over the last year, and I keep getting better and better at it.

All In My Head – v1.0

In the beginning, there was my not-great memory and my “feels”. My todos lived in my head (or my inbox), and my calendar had a scatter of hard-scheduled meetings and open time. Project time had to happen in my scattered free time, and was largely done based on what I felt like doing in the moment, or whatever urgent thing had been dropped on me. That was good enough in some ways, but I wasn’t nearly as effective or productive as I could have been. Some days, it was hard to name a solid thing I had done that day, but I certainly stayed busy doing emails and the like.

I dabbled in GTD (Getting Things Done), in both paper and digital versions, and it never stuck. Capturing and organizing open items and tracking contexts never worked for me, and took more effort than the benefit was worth.

Effective Capture – v2.0

Then I discovered Bullet Journal. (Site / Book) It’s a light-process, flexibly structured system for capturing things – todos, notes of different kinds, events to be organized, and whatever else comes up. All it takes is a notebook and a pen. It clicked for me in really useful ways. My biggest win was finally capturing all the todos that used to (poorly) exist just in my head and getting them on paper. That alone has been life-changing.

Homelab & Nerding

Goodbye, CoreOS

For my now-retired at-home hosting, and for my first migration to the cloud, I used CoreOS as my hosting OS. It’s optimized to be a container server and for super-efficient automated deployments. It worked well, and I loved the elegance of the design, even though I never did multi-node automated CI/CD deployments with it. I’d had a problem with it since the cloud migration, though: every 11 days after a reboot, it would spike CPU and memory and get unresponsive for a few minutes at a time. I could never entirely nail down the cause, but best I can tell, logging was getting bogged down recording failed SSH login attempts. That I had MariaDB/MySQL, a couple of low-activity WordPress sites, and automated Let’s Encrypt cert management running in 2G of memory wasn’t helping either.

Since I built that version 8 months or so ago, the CoreOS people – since acquired by Red Hat – announced the upcoming end-of-life of CoreOS on 26MAY2020. That, combined with my ongoing minor problems, led me to port everything to a new cloud host.

This time I went with a more conventional Linux, and ported everything over in a few hours, along with DR / rebuild documentation. In fact, this post is the first on the new platform. Here’s hoping it’s still happily running in 12 days – and thanks for your service, CoreOS.


Personal Disaster Recovery – part one

About this time last year, Susie and I were knee-jerking to the fresh news of my cancer, and the first bucket-list thing we jumped on was a long-discussed trip to Hawaii. It was an utterly amazing trip, and I’m so glad we were able to go. Thanks, too, to Delta for offering me very generous flight credits to be bumped – credits that paid for most of the airfare for both of us.

Susie and I at Waipio Valley, north of Hilo, HI. Life is good. #hansleythuglife

We skipped town just as Hurricane Florence was hitting NC, and in preparation for the storm, I had powered down most of the tech in the house before we left, with instructions for the house-sitter on what to turn on first to get the house back online. In “normal people’s” houses, you turn the wireless router back on and you’re good to go. Not so in a nerd’s home. I had been playing with Windows Server and had it hosting all of the DNS roles for both internal and external hosts, and it had worked pretty seamlessly up until this. When the house-sitter fired everything up, some glitch – either in the Xen server hosting everything, or in the Windows Server startup – kept it from completely starting. The network was up, but the router was handing out a dead address for DNS, which meant that nothing – Apple TV, the house-sitter’s laptop, etc. – was connecting to the outside world. So, Susie’s driving down from some gorgeous waterfall hike north of Hilo, HI, and I’m on the phone with the house-sitter, trying to talk her through setting a manual DNS entry on all her devices so she could do work while she was there. We got her going – mostly – and this started a conversation with Susie about how we could simplify the tech around the house so that if / when something ever happens to me, she won’t have to call in the techie friends to decode the home tech and get it working again.

We’d talked about this before, but a little more abstractly, when the husband of a good grad school friend died suddenly in a one-car accident. She had techie friends come over and rip out most of the late husband’s custom home build and install a stock wireless router just so she and her kid could get on the internet. Susie didn’t want to be in that position if at all possible.

This time it wasn’t abstract. The goal was – without killing off technical tinkering, innovation, and learning – to simplify and document things at home so that, should something ever happen to me, friends could step in and easily help keep her running. In the business world, companies talk and think about disaster recovery plans, and if they’re smart, they practice them occasionally – fail over to backup servers, restore things from backup, switch to alternative network feeds. It was time to think like that for all of our home systems. But first, we had to make sure that what we were running was set up in the best way.

Roughly speaking, what we had running at home that we had to analyze was:

1. FreeNAS – running 4 fairly old 4T drives in a ZFS array, storing pics, movies, backups, and other files
2. Xen Server – running VMs for Windows Server, Win10 running Blue Iris for the home IP cameras, and two CoreOS instances hosting a bunch of Docker containers – WordPress, MariaDB, Ubiquiti’s UniFi controller, among others.
3. Home Networking – AT&T and Ubiquiti, with a mix of APs in the house and garage.
4. Physical Security – Exterior IP cameras talking to a Blue Iris VM.

More on this saga in the next segment.

Homelab & Nerding

No time for you…

Ever since we made the move to AT&T Uverse Fiber – and happily said “See ya!” to TimeWarner / Spectrum – we’ve been a household with clocks adrift. NTP (Network Time Protocol) set up on servers didn’t work (though I blamed myself for botched configurations), NTP on security cameras didn’t work, and network time sync from Windows and macOS routinely failed. As you might expect, none of the system clocks were in sync, and some were out enough to cause occasional problems. Then I ran across this forum post on AT&T’s forums, which detailed the services / ports that AT&T blocks by default for its residential customers – and port 123, which NTP uses, was on the list, blocked both inbound and outbound.

So, I get why they do this – a misconfigured NTP server or a botnet node can hammer a target server to its knees in an accidental or intentional DDoS. Protecting people from themselves makes sense for NTP and for the other ports they regularly block.

Identifying the problem was half the battle. Getting AT&T to fix it was its own effort. What started as a chat session on the Uverse site went like this:

  1. Chat session with the billing / business group, who kindly said he couldn’t help, and transferred my chat to a tech support group.
  2. The first tech support group changed this from a chat into a call, and couldn’t handle my request, and bounced me to a higher tier support group.
  3. This upper tier support group turned out to be a premium internet support group who would only talk to me if I was paying for premium internet support – which he kindly offered to sign me up for on the call. Annoyed at the prospect of paying for something which I thought I already had, I declined. This guy transferred me to a different internet support group who he said could help.
  4. Internet support group #2 – a different group from item 2 above – couldn’t help me either, but instead transferred me to the Fiber Support group.
  5. A fine support rep from the Fiber Support group knew what I was asking (“Can you unblock port 123 for me?”) and how to do it! Once I agreed that I was taking on risk by unblocking this, she proceeded, and by the end of the 40 minute call, I had at least one of my machines able to set its clock from an NTP server. Success!

The total time on the phone / chat with AT&T to get this resolved was just under an hour – and that was once I knew what the problem was and that it was fixable.

As collateral damage, when the rep made the NTP unblock change on my fiber gateway, they ended up breaking all other inbound ports, disconnecting several of my servers from the internet. No amount of reconfiguring and resetting on my side was able to resolve this. Another call to AT&T fiber support and this was resolved on the first call, but that’s another hour or two of my life spent on troubleshooting and resolving something that wasn’t part of the problem to start with.

So, kudos for the right person at AT&T being able to fix this, but it took far too much work on my part to realize that this was an intentional thing on AT&T’s part and that there was a fix for it.  Total time spent on configuration, troubleshooting, research, support call, and testing was easily 6-8 hours, but at least I can happily report that all our home systems now believe it’s the same time.



WordPress Works!

Finally, after much wrangling with Apache, PHP, WordPress and its multisite configuration, and SSL certificates, I think I have our little network of sites up, usable, and secure. Time will tell.