K3S sqlite data file corruption
Apr 07 11:04:39 hk129 sshd[10472]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=47.76.35.183 user=postfix
Apr 07 11:04:39 hk129 sshd[10472]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "postfix"
Apr 07 11:04:40 hk129 sshd[10472]: Failed password for postfix from 47.76.35.183 port 34040 ssh2
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.450737765+08:00" level=error msg="error while range on /registry/ingress/ /registry/ingress/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: E0407 11:13:11.450982 11743 cacher.go:419] cacher (*networking.Ingress): unexpected ListAndWatch error: failed to list *networking.Ingress: rpc error: code = Unknown desc = database disk image is malformed; reinitializing...
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.451692678+08:00" level=error msg="error while range on /registry/ingressclasses/ /registry/ingressclasses/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: E0407 11:13:11.451935 11743 cacher.go:419] cacher (*networking.IngressClass): unexpected ListAndWatch error: failed to list *networking.IngressClass: rpc error: code = Unknown desc = database disk image is malformed; reinitializing...
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.452367238+08:00" level=error msg="error while range on /registry/ingressclasses/ /registry/ingressclasses/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.452912258+08:00" level=error msg="error while range on /registry/runtimeclasses/ /registry/runtimeclasses/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.453322547+08:00" level=error msg="error while range on /registry/runtimeclasses/ /registry/runtimeclasses/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.453749222+08:00" level=error msg="error while range on /registry/poddisruptionbudgets/ /registry/poddisruptionbudgets/: database disk image is malformed"
Apr 07 11:13:11 hk129 k3s[11743]: E0407 11:13:11.453846 11743 cacher.go:419] cacher (*node.RuntimeClass): unexpected ListAndWatch error: failed to list *node.RuntimeClass: rpc error: code = Unknown desc = database disk image is malformed; reinitializing...
Apr 07 11:13:11 hk129 k3s[11743]: time="2024-04-07T11:13:11.454290558+08:00" level=error msg="error while range on /registry/poddisruptionbudgets/ /registry/poddisruptionbudgets/: database disk image is malformed"
Recently I ran into K3S sqlite3 data file corruption. Only part of the data could still be queried, and in that state there is no way to recover the data from the file; the only fix is to stop k3s and start it again from a backup file.
Backing this file up properly really matters, otherwise one slip and the whole k3s cluster is gone.
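If you want to confirm the damage before giving up on the file, sqlite3 itself can inspect the database. A minimal sketch, assuming the default k3s data path /var/lib/rancher/k3s/server/db/state.db and the sqlite3 CLI available on the node:

# Stop k3s first so nothing keeps writing to the database
systemctl stop k3s
# Verify the on-disk structure; a damaged file reports errors such as
# "database disk image is malformed"
sqlite3 /var/lib/rancher/k3s/server/db/state.db "PRAGMA integrity_check;"
# .recover dumps whatever is still readable as SQL, but with this kind of
# damage it typically only salvages part of the data
sqlite3 /var/lib/rancher/k3s/server/db/state.db ".recover" > /tmp/state-recovered.sql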
Backup and restore
If k3s uses the default storage, i.e. sqlite, then backing up and restoring the cluster is simple: just back up the sqlite db file. Data persisted locally on the node (for example through local-path volumes) is not covered by that file and has to be handled separately.
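In practice the backup and restore boils down to copying the db directory while k3s is stopped. A minimal sketch, assuming the default paths of a single-node k3s server; the destination /backup/k3s-db is just a placeholder:

# Back up: stop k3s, copy the sqlite files, start it again
systemctl stop k3s
cp -a /var/lib/rancher/k3s/server/db /backup/k3s-db
systemctl start k3s
# Restore: stop k3s, put the backed-up files back, start it again
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db
cp -a /backup/k3s-db /var/lib/rancher/k3s/server/db
systemctl start k3s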
At the moment I lean on this capability heavily in my test scenarios: for each experiment I restore a cluster in exactly the state I backed up, and the backup itself is pushed into a container image with oras.
That way I can directly create a ready-made Kubernetes cluster for my research every time. Without backup and restore I always had to deploy a stock K3S, then deploy my research project, for example karmada, and then wait for karmada to finish deploying, which takes a while; with backup and restore I can immediately create a Kubernetes cluster that already has karmada deployed, which saves a lot of time.
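Pushing the backup to a registry with oras just treats the db file as an OCI artifact. A minimal sketch; the registry path registry.example.com/backups/k3s and the tag are placeholders for wherever you keep the images:

# Push the backed-up db file to a registry as an OCI artifact
oras push registry.example.com/backups/k3s:karmada-ready ./state.db
# Pull it back out later, before restoring the cluster
oras pull registry.example.com/backups/k3s:karmada-ready -o ./restore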
In the external Etcd scenario, starting a new K3S from the Etcd backup data failed with: bootstrap data already found and encrypted with different token. The reason is that K3S stores a token in the datastore; with Etcd it lives under the /bootstrap/ path. Delete that data and start K3S again, and it comes up normally.
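Finding and removing that data can be done directly against Etcd. A minimal sketch, assuming the etcdctl v3 CLI; the endpoint and certificate paths are placeholders to adjust to your Etcd deployment:

# List the bootstrap keys K3S wrote into Etcd
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=ca.crt --cert=client.crt --key=client.key \
  get /bootstrap --prefix --keys-only
# Delete them, then start K3S with the new token
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=ca.crt --cert=client.crt --key=client.key \
  del /bootstrap --prefix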
Cluster upgrade
lank8s.cn has been running stably on K3S for a few years now. This week (20240307) I upgraded K3S, and it went more smoothly than expected, which is of course partly because it is a single-node cluster.
There are two ways to upgrade K3S: manually or via automated upgrades; I used the automated upgrade.
Step 1: install the upgrade controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
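Before applying a plan it is worth checking that the controller came up; the manifest above deploys it into the system-upgrade namespace:

kubectl -n system-upgrade get pods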
Step 2: write and apply the upgrade plan
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.28.7+k3s1
Normally you would also apply a second Plan to upgrade the k3s agent nodes (a sketch of one follows below), but since this is a single-node cluster, a single resource file for the k3s server is enough.
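For reference, this is roughly what the agent plan in the upstream system-upgrade-controller example looks like; I did not apply it here, so treat it as a sketch to adapt:

# Agent plan (not applied in my single-node setup)
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  prepare:
    # wait for the server plan to complete before upgrading agents
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.28.7+k3s1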
I upgraded several times, so I also applied the server plan several times, going from 1.26 -> 1.27 -> 1.28 -> 1.29.
I was lucky and did not run into anything like an API resource having been removed along the way.
Certificate expired: You must be logged in to the server (Unauthorized)
Today I logged into the server to look at some pod information and found that kubectl get pod failed with the error You must be logged in to the server (Unauthorized).
Checking the logs with journalctl -r -u k3s shows that the k3s certificates have expired:
May 15 17:49:17 hk129 k3s[13497]: E0515 17:49:17.925174 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:17+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:17 hk129 k3s[13497]: E0515 17:49:17.691508 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:17+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:17 hk129 k3s[13497]: time="2025-05-15T17:49:17.237466258+08:00" level=info msg="Cluster-Http-Server 2025/05/15 17:49:17 http: TLS handshake error from 172.17.50.31:56682: remote error: tls: bad certificate"
May 15 17:49:16 hk129 k3s[13497]: E0515 17:49:16.949277 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:16+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:15 hk129 k3s[13497]: time="2025-05-15T17:49:15.231693990+08:00" level=info msg="Cluster-Http-Server 2025/05/15 17:49:15 http: TLS handshake error from 172.17.50.31:56674: remote error: tls: bad certificate"
May 15 17:49:13 hk129 k3s[13497]: time="2025-05-15T17:49:13.226010352+08:00" level=info msg="Cluster-Http-Server 2025/05/15 17:49:13 http: TLS handshake error from 172.17.50.31:56666: remote error: tls: bad certificate"
May 15 17:49:12 hk129 k3s[13497]: E0515 17:49:12.922365 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:12+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:12 hk129 k3s[13497]: E0515 17:49:12.709164 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:12+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:11 hk129 k3s[13497]: time="2025-05-15T17:49:11.220006250+08:00" level=info msg="Cluster-Http-Server 2025/05/15 17:49:11 http: TLS handshake error from 172.17.50.31:56656: remote error: tls: bad certificate"
May 15 17:49:10 hk129 k3s[13497]: E0515 17:49:10.607783 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:10+08:00 is after 2025-04-07T10:20:08Z
May 15 17:49:09 hk129 k3s[13497]: time="2025-05-15T17:49:09.214272408+08:00" level=info msg="Cluster-Http-Server 2025/05/15 17:49:09 http: TLS handshake error from 172.17.50.31:56646: remote error: tls: bad certificate"
May 15 17:49:07 hk129 k3s[13497]: E0515 17:49:07.919848 13497 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2025-05-15T17:49:07+08:00 is after 2025-04-07T10:20:08Z
This means the K3S certificates have expired, and regenerating them solves it. K3S renews certificates that are expired or close to expiring when the service starts, so a restart is all it takes:
systemctl restart k3s
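To confirm the fix you can read the expiry date straight off the serving certificate; a small check, assuming the default k3s data directory:

# After the restart above, this date should have moved into the future
openssl x509 -enddate -noout \
  -in /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt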