Goodbye, CoreOS

For my now-retired at-home hosting, and for my first migration to the cloud, I used CoreOS as my hosting OS. It’s optimized to be a container server and for super-efficient automated deployments. It worked well, and I loved the elegance behind the design, even though I never did multi-node automated CI/CD deployments with it. I had had a problem with it since the cloud migration, where every 11 days after reboot, it would spike CPU and memory and get unresponsive for a few minutes at a time. I could never entirely nail down what was causing it, but best I can tell, logging was getting sticky trying to log failed login attempts from SSH. That I had MariaDB/MySQL, a couple of low-activity wordpress sites, and automated Let’sEncrypt cert management running in 2G of memory wasn’t helping either.

Since I built that version 8mos or so ago, the CoreOS people, since acquired by RedHat, announced upcoming the end-of-life of CoreOS on 26MAY2020. That, combined with my ongoing minor problems led me to port everything to a new cloud host.

This time I went with a more conventional linux, and ported everything over in a few hours, along with DR / rebuild documentation. In fact, this post is the first on the new platform. Here’s hope it’s happily running in 12 days, and thanks for your service CoreOS.


Personal Disaster Recovery – part one

About last year at this time, Susie and I were knee-jerking to the fresh news of my cancer, and the first bucket list thing we jumped on was a long-discussed trip to Hawaii. It was an utterly amazing trip, and I’m so glad we were able to go. Thanks, too, to Delta for offering me very generous flight credits to be bumped – credits that paid for most of the airfare for both of us.

Susie and I at Waipio Valley, north of Hilo, HI. Life is good. #hansleythuglife

We skipped town just as hurricane Florence was hitting NC, and in preparation for the storm, I had powered down most of the tech in the house before we left, with instructions for the house-sitter on what to turn on first to get the house back online. In “normal people’s” houses, they turn their wireless router back on and they’re good to go. Not so in a nerd’s home. I had been playing with Windows Server, and had it hosting all of the DNS roles for both internal and external hosts, and it had worked pretty seamlessly up until this. When the house-sitter fired everything up, some glitch, either in the Xen server hosting everything, or in the Windows server startup kept it from completely starting, which meant that while the network was working, the router was handing out a dead address for DNS, which meant that nothing – Apple TV, the house-sitter’s laptop, etc. – was connecting to the outside world. So, Susie’s driving down from some gorgeous waterfall hike north of Hilo, HI, and I’m on the phone with the house-sitter, trying to talk her through how to set a manual DNS entry on all her devices so she could do work while she was here. We got her going – mostly – and this started a conversation with Susie about how we could simplify tech around the house so that if / when something ever happens to me, she won’t have to call in the techie friends to have them decode home tech and get it working again.

We’d talked about this before, but a little more abstractly, when the husband of a good grad school friend died suddenly in a one-car accident. She had techie friends come over and rip out most of the late husband’s custom home build and install a stock wireless router just so she and her kid could get on the internet. Susie didn’t want to be in that position if at all possible.

This time it wasn’t abstract. The goal was – without killing off technical tinkering, innovation, and learning – how to simplify and document things at home so that should something ever happen to me, that friends could step in and easily help keep her running. In the business world, companies talk and think about disaster recovery plans, and if they’re smart, the practice them occasionally – fail over to backup servers, restore things from backup, switch to alternative network feeds. It was time to think like that for all of our home systems, but first, we had to make sure that what we were running was set up in the best way.

Roughly speaking, what we had running at home that we had to analyze was:

1. FreeNAS – running 4 fairly old 4T drives in a ZFS array, storing pics, movies, backups, and other files
2. Xen Server – running VMs for Windows Server, Win10 running Blue Iris for the home IP cameras, and two CoreOS instances hosting a bunch of docker containers – WordPress, MariaDB, Ubiquity’s unifi controller, among others.
3. Home Networking – AT&T and Ubiquity with a mix of APs in the house and garage.
4. Physical Security – Exterior IP cameras talking to a Blue Iris VM.

More on this saga in the next segment.


No time for you…

Ever since we made the move to AT&T Uverse Fiber – and happily said “See ya!” to TimeWarner / Spectrum – we’ve been a household with clocks adrift. NTP (Network Time Protocol) set up on servers didn’t work – although I blamed myself for botched configurations, NTP on security cameras didn’t work, and network time sync from Windows and MacOS routinely failed. As you might expect, none of the system clocks were in sync, and some were out enough to cause the occasional problems. Then I ran across this forum post on AT&T’s forums, which detailed out what services / ports that AT&T blocks by default for its residential customers – and port 123, both inbound and outbound, which NTP uses was on the list. The complete list is:

So, I get why they do this – a misconfigured NTP server or botnet node can hammer a target server to its knees in a accidental or intentional DDOS. Protecting people from themselves makes sense for NTP and for the other ports they regularly block.

Identifying the problem was half the battle. Getting AT&T to fix it was it’s own effort. What started with a chat session from the Uverse site went like this:

  1. Chat session with the billing / business group, who kindly said he couldn’t help, and transferred my chat to a tech support group.
  2. The first tech support group changed this from a chat into a call, and couldn’t handle my request, and bounced me to a higher tier support group.
  3. This upper tier support group turned out to be a premium internet support group who would only talk to me if I was paying for premium internet support – which he kindly offered to sign me up for on the call. Annoyed at the prospect of paying for something which I thought I already had, I declined. This guy transferred me to a different internet support group who he said could help.
  4. Internet support group #2 – a different group from item 2 above – couldn’t help me either, but instead transferred me to the Fiber Support group.
  5. A fine support rep from the Fiber Support group knew what I was asking (“Can you unblock port 123 for me?”) and how to do it! Once I agreed that I was taking on risk by unblocking this, she proceeded, and by the end of the 40 minute call, I had at least one of my machines able to set its clock from an NTP server. Success!

The total time on the phone / chat with AT&T to get this resolved as just under an hour, and that was once I knew what the problem was and that it was fixable.

As collateral damage, when the rep made the NTP unblock change on my fiber gateway, they ended up breaking all other inbound ports, disconnecting several of my servers from the internet. No amount of reconfiguring and resetting on my side was able to resolve this. Another call to AT&T fiber support and this was resolved on the first call, but that’s another hour or two of my life spent on troubleshooting and resolving something that wasn’t part of the problem to start with.

So, kudos for the right person at AT&T being able to fix this, but it took far too much work on my part to realize that this was an intentional thing on AT&T’s part and that there was a fix for it.  Total time spent on configuration, troubleshooting, research, support call, and testing was easily 6-8 hours, but at least I can happily report that all our home systems now believe it’s the same time.