Crashes when running in Kubernetes #376
Comments
I'm not sure what is wrong with the StatefulSet, however the 'minimal' pod has a small mistake with the
---
apiVersion: v1
kind: Pod
metadata:
  name: thingsdb
  labels:
    app: thingsdb
spec:
  containers:
    - name: thingsdb
      image: ghcr.io/thingsdb/node:latest
      args:
        - "--init"
        - "--log-level"
        - "debug"
      ports:
        - containerPort: 9200
I would however expect a different log: something like
If it still doesn't work after fixing the 'minimal' pod, can you verify the output of the |
Thanks for the quick response. I've updated the args for the deployment but I'm still running into the same issue. Here's the output from the kubectl describe command:
kubectl describe pod thingsdb
Name: thingsdb
Namespace: default
Priority: 0
Service Account: default
Node: node01/192.168.88.5
Start Time: Mon, 22 Apr 2024 16:36:37 +0200
Labels: app=thingsdb
Annotations: <none>
Status: Running
IP: 10.244.0.83
IPs:
IP: 10.244.0.83
Containers:
thingsdb:
Container ID: containerd://16ce8c55d159af73cc68aa9b2aa1b2cc798bfa58a3eab219dfe3d89a1825f8b9
Image: ghcr.io/thingsdb/node:latest
Image ID: ghcr.io/thingsdb/node@sha256:01aa77d067ffce69887f83d9b1ac129bb4bd5abadcf06d1277a412da36084e0e
Port: 9200/TCP
Host Port: 0/TCP
Args:
--init
--log-level
debug
State: Terminated
Reason: Error
Exit Code: 132
Started: Mon, 22 Apr 2024 16:37:18 +0200
Finished: Mon, 22 Apr 2024 16:37:19 +0200
Last State: Terminated
Reason: Error
Exit Code: 132
Started: Mon, 22 Apr 2024 16:36:54 +0200
Finished: Mon, 22 Apr 2024 16:36:55 +0200
Ready: False
Restart Count: 3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-plfgr (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-plfgr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 52s default-scheduler Successfully assigned default/thingsdb to node01
Normal Pulled 51s kubelet Successfully pulled image "ghcr.io/thingsdb/node:latest" in 389ms (389ms including waiting)
Normal Pulled 49s kubelet Successfully pulled image "ghcr.io/thingsdb/node:latest" in 794ms (794ms including waiting)
Normal Pulled 35s kubelet Successfully pulled image "ghcr.io/thingsdb/node:latest" in 485ms (485ms including waiting)
Normal Pulling 11s (x4 over 52s) kubelet Pulling image "ghcr.io/thingsdb/node:latest"
Normal Created 11s (x4 over 51s) kubelet Created container thingsdb
Normal Started 11s (x4 over 51s) kubelet Started container thingsdb
Normal Pulled 11s kubelet Successfully pulled image "ghcr.io/thingsdb/node:latest" in 506ms (506ms including waiting)
Warning BackOff 10s (x5 over 48s) kubelet Back-off restarting failed container thingsdb in pod thingsdb_default(2cc92bb4-00c0-4459-9e68-99c861e0c3cb) |
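A note on the `Exit Code: 132` in the output above: by the common shell and container-runtime convention, an exit code above 128 means the process was terminated by signal number (code minus 128). The sketch below decodes such a code with Python's standard `signal` module. The helper name `signal_from_exit_code` is made up for this illustration, and the mapping assumes the usual Linux signal numbering.

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal name for a '128 + N' style exit code, else None.

    Assumes the common convention that a process killed by signal N
    is reported with exit code 128 + N.
    """
    if exit_code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


# Exit code taken from the kubectl describe output above.
print(signal_from_exit_code(132))
```

Under that convention, 132 maps to signal 4, which is consistent with the `Illegal instruction` error that shows up later in this thread.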
I tried running the container locally with the following command:
docker run \
  --name thingsdb \
  -d \
  -p 9200:9200 \
  ghcr.io/thingsdb/node --init
It gives me an error that I'm running on the wrong platform. This is understandable, as I'm running on Apple Silicon. But the container does start and shows the same behaviour as the container on K8s (which is running on Linux/AMD64): the ThingsDB 'logo' appears, then it crashes. I further tried troubleshooting the issue on a Windows AMD64 machine. I installed minikube and the Pod deployment ran fine. So there is something wrong with my environment that makes ThingsDB crash. But I can't figure out what it is... |
Another quick update. I compiled ThingsDB from source on my Apple Silicon machine. Ran like a dream. So maybe it's an idea to make an ARM64 container available. I can create a fork and start working on that. |
@rickmoonex , I'd be happy to send you an email with a pre-release copy of the ThingsDB book. Just confirm if you'd like it sent to the email address associated with your GitHub account. |
@joente That would be great! The email associated with my GitHub account is fine. |
I've just sent you an email with the pre-release copy of the ThingsDB book! |
Great, thank you! I've created PR #377 for the ARM container |
I have managed to do some more debugging. I have deployed the container as follows:
---
apiVersion: v1
kind: Pod
metadata:
  name: thingsdb
  labels:
    app: thingsdb
spec:
  containers:
    - name: thingsdb
      image: ghcr.io/thingsdb/node:latest
      command: ["sh", "-c"]
      args: ["while true; do echo 'yo' && sleep 5; done;"]
      ports:
        - containerPort: 9200
I then manually ran ThingsDB and it gave the following error:
/usr/local/bin # thingsdb --version
_____ _ _ ____ _____
|_ _| |_|_|___ ___ ___| \| __ |
| | | | | | . |_ -| | | __ -|
|_| |_|_|_|_|_|_ |___|____/|_____| version: 1.6.0
|___|
Illegal instruction (core dumped)
Further debugged this with GDB and came across the following:
/usr/local/bin # gdb thingsdb
GNU gdb (GDB) 14.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from thingsdb...
(gdb) run
Starting program: /usr/local/bin/thingsdb
_____ _ _ ____ _____
|_ _| |_|_|___ ___ ___| \| __ |
| | | | | | . |_ -| | | __ -|
|_| |_|_|_|_|_|_ |___|____/|_____| version: 1.6.0
|___|
Program received signal SIGILL, Illegal instruction.
0x0000555555597a00 in ti_create ()
(gdb) bt full
#0 0x0000555555597a00 in ti_create ()
No symbol table info available.
#1 0x0000555555593c68 in main ()
No symbol table info available.
(gdb)
Now, I'm no C wizard, so I don't know why these instructions are not available on my CPU. The node that this pod is running on has an Intel Celeron N5105. |
v1.6.1-alpha1 has been built (with an ARM64 image included):
@rickmoonex , can you try this image? |
@joente The image works great on my Mac. But I still have the issue on my K8s node. (Just for clarity, that machine is not an ARM machine). I have done some more debugging and added it above. |
@rickmoonex , A debug build might help to troubleshoot the problem. To create a debug build of ThingsDB from the source code, run the following command:
This build prioritizes debugging information over optimization. It uses the |
@joente, I did some more debugging and came across some strange behaviour. I did a I then changed I then modified the Dockerfile and /usr/local/bin # gdb ./thingsdb
GNU gdb (GDB) 14.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./thingsdb...
(gdb) run
Starting program: /usr/local/bin/thingsdb
_____ _ _ ____ _____
|_ _| |_|_|___ ___ ___| \| __ |
| | | | | | . |_ -| | | __ -|
|_| |_|_|_|_|_|_ |___|____/|_____| version: 1.6.1-alpha0+debug
|___|
Program received signal SIGILL, Illegal instruction.
ti_counters_reset () at /tmp/thingsdb/src/ti/counters.c:47
warning: 47 /tmp/thingsdb/src/ti/counters.c: No such file or directory
(gdb) bt full
#0 ti_counters_reset () at /tmp/thingsdb/src/ti/counters.c:47
No locals.
#1 0x0000555555622f3a in ti_counters_create () at /tmp/thingsdb/src/ti/counters.c:18
No locals.
#2 0x00005555555aa5b9 in ti_create () at /tmp/thingsdb/src/ti.c:101
No locals.
#3 0x00005555555a2c99 in main (argc=1, argv=0x7fffffffe648) at /tmp/thingsdb/main.c:95
seed = -855361564
fd = 3
rc = 0
(gdb)
So these findings hint that there is something wrong with the way the container is built, and not so much with ThingsDB itself. |
Looking at where it fails, it might be something related to the atomic counters. I've created a branch natomic where this is disabled (Using
|
Once again running into the same issue. If I build the container locally it runs great; if I let GitHub Actions build it, it breaks. Here is the backtrace, same error as last time:
/usr/local/bin # gdb thingsdb
GNU gdb (GDB) 14.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from thingsdb...
(gdb) run
Starting program: /usr/local/bin/thingsdb
_____ _ _ ____ _____
|_ _| |_|_|___ ___ ___| \| __ |
| | | | | | . |_ -| | | __ -|
|_| |_|_|_|_|_|_ |___|____/|_____| version: 1.6.1-alpha1+debug
|___|
Program received signal SIGILL, Illegal instruction.
ti_counters_reset () at /tmp/thingsdb/src/ti/counters.c:47
warning: 47 /tmp/thingsdb/src/ti/counters.c: No such file or directory
(gdb) bt full
#0 ti_counters_reset () at /tmp/thingsdb/src/ti/counters.c:47
No locals.
#1 0x0000555555622f43 in ti_counters_create () at /tmp/thingsdb/src/ti/counters.c:18
No locals.
#2 0x00005555555aa5b9 in ti_create () at /tmp/thingsdb/src/ti.c:101
No locals.
#3 0x00005555555a2c99 in main (argc=1, argv=0x7fffffffe648) at /tmp/thingsdb/main.c:95
seed = -1041532550
fd = 3
rc = 0
(gdb) |
@rickmoonex , can you provide the output of |
@joente, here you go: kubectl version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3 |
Sorry, can you provide the full output? |
{
"clientVersion": {
"major": "1",
"minor": "29",
"gitVersion": "v1.29.2",
"gitCommit": "4b8e819355d791d96b7e9d9efe4cbafae2311c88",
"gitTreeState": "clean",
"buildDate": "2024-02-14T10:32:39Z",
"goVersion": "go1.21.7",
"compiler": "gc",
"platform": "darwin/arm64"
},
"kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
"serverVersion": {
"major": "1",
"minor": "29",
"gitVersion": "v1.29.3",
"gitCommit": "6813625b7cd706db5bc7388921be03071e1a492d",
"gitTreeState": "clean",
"buildDate": "2024-03-14T23:58:36Z",
"goVersion": "go1.21.8",
"compiler": "gc",
"platform": "linux/amd64"
}
} |
On which platform did you build the container successfully, was this a native AMD64 host? Or is your |
@riklempens The actual architecture I built the container on is docker build --platform linux/amd64 --file docker/Dockerfile -t rickmoonen/thingsdb-test:dev --push .
I just tried building the container on a native AMD64 machine, and it also built and ran totally fine. |
The |
So to summarize:
|
Correct! |
I'm not sure how to continue with this issue. If I could reproduce the problem, solving it would be easier. It seems to be related to older hardware, and compiling ThingsDB instead of relying on the pre-built images seems to work. @rickmoonex , do you have any suggestions on how to continue? |
I have once again done some more troubleshooting. Here is the summary: I first tried to edit the GitHub Actions pipeline to run a normal I then wanted to find out whether the issue lay with the pre-built GitHub Actions supplied by Docker. So I ran the pipeline locally using act. The pipeline ran fine and the resulting image worked on my K8s cluster. This leads me to believe that there is an underlying problem with the GitHub runners that are used for the pipelines. Maybe switching the Ubuntu version will help. |
I'm really at a dead end here. Did some research on the hardware that I'm running on and found this in the datasheet:
So I don't think the problem is the hardware and any unavailable instructions. I did a docker build of the image. One locally and one on GH Actions, same command, same Dockerfile, same Docker version. I then ran these containers and extracted the binaries. I tried switching to different versions of Alpine for the container, this yielded no results. |
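One quick way to confirm that two extracted binaries really differ is to compare their checksums. The snippet below is only an illustrative sketch: the helper names `sha256_of` and `same_binary` are invented here, and the paths in the usage note are placeholders, not the paths used in this issue.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until an empty chunk (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b):
    """True when both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_binary("local/thingsdb", "ci/thingsdb")` (placeholder paths) returns `False` when the two builds produced different machine code. A differing hash only shows that the builds differ, not why; a disassembly diff would be needed to see which instructions changed.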
Alright, I have some good news: I found and solved one problem, but then another showed up, haha. It turns out that the Intel processor I'm using does not support AVX instructions. Neither do my Mac and the other machine I was building on; that's why those containers ran fine, but the one built on GitHub did not. So I disabled AVX in the compiler with the following:
That solved that problem. Now ThingsDB still gives an illegal instruction error, but way later in the program. See logging below: (gdb) run
Starting program: /usr/local/bin/thingsdb
_____ _ _ ____ _____
|_ _| |_|_|___ ___ ___| \| __ |
| | | | | | . |_ -| | | __ -|
|_| |_|_|_|_|_|_ |___|____/|_____| version: 1.6.1-alpha0+debug
|___|
[I 2024-04-26 10:14:46] running on: linux/amd64
[W 2024-04-26 10:14:46] path is successfully locked but a lock file existed which indicates that the process was not closed correctly last time (/data/)
[D 2024-04-26 10:14:46] found node id `0` in file: `/data/.node`
[W 2024-04-26 10:14:46] store path not found: `/data/store/`
[I 2024-04-26 10:14:46] start listening for HTTP status requests on TCP port 8080
[D 2024-04-26 10:14:46] known committed on all nodes: `change:0`
[D 2024-04-26 10:14:46] known stored on all nodes: `change:0`
[D 2024-04-26 10:14:46] loading archive files from `/data/archive/`
[I 2024-04-26 10:14:46] changing status from SYNCHRONIZING to READY
[I 2024-04-26 10:14:46] start listening for node connections on TCP port 9220
[I 2024-04-26 10:14:46] start listening for client connections on TCP port 9200
[I 2024-04-26 10:14:46] start listening for HTTP API requests on TCP port 9210
[I 2024-04-26 10:14:46] start listening for WebSocket connections on TCP port 9270
Program received signal SIGILL, Illegal instruction.
0x00005555558e8b5c in lwsl_timestamp (level=16, p=0x555555ab1040 <buf> "", len=256) at /tmp/thingsdb/libwebsockets/lib/core/logs.c:228
warning: 228 /tmp/thingsdb/libwebsockets/lib/core/logs.c: No such file or directory
(gdb) bt full
#0 0x00005555558e8b5c in lwsl_timestamp (level=16, p=0x555555ab1040 <buf> "", len=256) at /tmp/thingsdb/libwebsockets/lib/core/logs.c:228
o_now = 1714126486
now = 17141264861705
tv = {tv_sec = 1714126486, tv_usec = 170531}
ptm = 0x7fffffffde40
tm = {tm_sec = 46, tm_min = 14, tm_hour = 10, tm_mday = 26, tm_mon = 3, tm_year = 124, tm_wday = 5, tm_yday = 116, tm_isdst = 0, tm_gmtoff = 0, tm_zone = 0x7ffff76db068 "UTC"}
n = 0 |
Second error was due to BMI2 instructions not being supported. Disabled with:
Now it works like a charm! @joente, what would be a logical next step? Are these instruction sets crucial to ThingsDB? If not, I suppose they can be disabled. |
They are not crucial, it is just to tell the compiler what instructions can be used. It might have some performance impact, but probably not that much. @rickmoonex , I'll build an alpha version with the flags set as suggested. |
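To check in advance which of these instruction sets a Linux x86 host reports, the `flags` line of `/proc/cpuinfo` can be inspected. The following is a minimal sketch under some assumptions: the helper names `cpu_flags` and `missing_flags` are invented for this example, the flag spellings `avx`, `bmi1` and `bmi2` are the ones the Linux kernel uses on x86, and `SAMPLE` is made-up text rather than output from the machine in this issue.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style text.

    Only the first 'flags' line is used; on x86 Linux every core
    normally reports the same flags.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=("avx", "bmi1", "bmi2")):
    """List the wanted flags that the CPU does not report."""
    present = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in present]


# Hypothetical sample text, not taken from the node in this issue.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
print(missing_flags(SAMPLE))
```

On a real node the text would come from `open("/proc/cpuinfo").read()`. Any flag listed as missing is an instruction set that a binary compiled with it enabled may trip over with SIGILL on that host.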
Works like a charm! Thanks for all the help. |
@joente, just ran into another instruction error when joining nodes. BMI1 also needs to be disabled:
|
Almost, I think I need to move the lines to keep the ARM build working
I've added an environment var
|
@rickmoonex , the images |
Hi @rickmoonex, did you have time to do some testing? If everything works as expected then the issue can be closed and I'll create a release version. |
Hi @joente, had a busy week but managed to do some testing yesterday evening and this morning. I haven't experienced any issues, so we can close the issue. |
V1.6.1 released: https://github.com/thingsdb/ThingsDB/releases/tag/v1.6.1 |
Describe the bug
I'm trying to move my docker deployment of ThingsDB to Kubernetes. I have modified the StatefulSet documented under the GKE documentation. But the container becomes trapped in a crash loop with no useable logs. Even after stripping it down to a minimal Pod deployment it experiences the same issue.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The ThingsDB container should start and run as expected. Or at least show some debugging information.
Screenshots
Machine/OS: