Wifi Stability has gotten worse since move from Arduino to PlatformIO #115

Closed
bitrot-alpha opened this issue Jan 29, 2025 · 20 comments

@bitrot-alpha
Contributor

Hardware: Elekstube v1
FW version: a337352

I'm using WPS so I don't have to reflash the entire firmware just to change Wi-Fi settings. I noticed the new firmware has Wi-Fi amnesia after a few hours. The ESP32 seems to "forget" the PSK (my OpenWRT router says failed WPA 4-way handshake, "possibly wrong PSK?"). After it forgets the key, it gets stuck on "connecting.." after a power cycle and never times out. My router doesn't see any more connection attempts after power-cycling the ESP32.

I'm working on a PR to resolve this, but I don't have an ETA right now. I will open the PR as a draft after posting this.

@aly-fly
Owner

aly-fly commented Jan 29, 2025

Interesting. Arduino and PlatformIO should not have anything to do with this issue. WiFi is handled by Espressif's core. Maybe you had an older core installed in Arduino; maybe the IDE didn't update the core; maybe the old IDE didn't even support newer core versions.

My SOHO WiFi setup (TP-Link Omada) doesn't even offer WPS anymore, as it was deemed a security risk. So I can't test it.

Another thing that comes to mind is the partition table. I remember that WiFi needs a special slot in flash memory where it stores the key. Maybe we forgot to include that section in the .csv file when migrating from Arduino to PlatformIO.

@bitrot-alpha
Contributor Author

You may be on to something there. I tried reprogramming the ESP32 with the old Arduino IDE firmware but it doesn't recognize the SPIFFS partition generated by PlatformIO.

@Martinius79
Collaborator

Really interesting ;)
I only tested that the WPS code is working in general, but never used it "long term"...

I already worked on the "never leaving the connecting loop" part and implemented a timeout in one of my branches... The problem is that the real-time clock fallback also seems not to work if the WiFi is not working at all... The clock shows only zeros then... I never found the time to continue with that fix.
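
As a rough illustration of the timeout idea (not the code from the branch mentioned above), the decision "keep waiting or give up" can be kept separate from the polling loop. The names `ConnectWait`, `checkConnectWait` and `waitForWifi`, and any timeout value passed in, are hypothetical; only `WiFi.status()`, `WL_CONNECTED`, `millis()` and `delay()` come from the Arduino core.

```cpp
#include <cstdint>

// Possible results of one polling step while waiting for the WiFi link.
enum class ConnectWait { Connected, StillWaiting, TimedOut };

// Decide what to do next while "connecting.." is shown.
// elapsedMs is the time since the attempt started; timeoutMs is a limit
// chosen by the caller (hypothetical, not a value from the firmware).
constexpr ConnectWait checkConnectWait(bool linkUp, uint32_t elapsedMs, uint32_t timeoutMs) {
    return linkUp ? ConnectWait::Connected
                  : (elapsedMs >= timeoutMs ? ConnectWait::TimedOut
                                            : ConnectWait::StillWaiting);
}

#if defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>

// Poll the link until it comes up or the timeout expires, instead of looping forever.
// Returns true when connected; on false the caller can fall back to the RTC time.
bool waitForWifi(uint32_t timeoutMs) {
    const uint32_t start = millis();
    for (;;) {
        const ConnectWait state =
            checkConnectWait(WiFi.status() == WL_CONNECTED, millis() - start, timeoutMs);
        if (state != ConnectWait::StillWaiting) {
            return state == ConnectWait::Connected;
        }
        delay(100);  // let the WiFi task run between polls
    }
}
#endif
```

Keeping the decision in a small pure function means it can be checked on a PC, while the hardware-specific loop stays trivial.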

We have an NVS partition with a size of 0x7000 (28672) bytes, and all settings are stored in and read from there.

I checked the NVS stuff a bit further, and it seems we need to initialize NVS flash usage before using it anyway... especially before using the higher-level WiFi functions.
Under the hood, the WiFi driver is initialized with the default values (esp_wifi.h), and they say "nvs_enable = WIFI_NVS_ENABLED". WIFI_NVS_ENABLED seems to be 1, because CONFIG_ESP32_WIFI_NVS_ENABLED is defined by the SDK.
Those are my findings so far.
Maybe I am wrong.
Anyway, the WiFi driver SHOULD use the NVS partition to store stuff (that is what the ESP32 documentation says). So we should define "CONFIG_ESP32_WIFI_NVS_ENABLED" (just in case) and call the "nvs_flash_init" function before doing anything with config reads or writes.
Maybe this will do the trick.
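
A minimal sketch of the "initialize NVS first" idea, assuming the usual ESP-IDF pattern of retrying once after an erase. The function name `initNvsFirst` and the constants are made up for illustration; `nvs_flash_init`, `nvs_flash_erase` and the two error codes are real ESP-IDF symbols. Whether the Arduino core already performs this step at startup is not established in this thread.

```cpp
#include <cstdint>

// One NVS page occupies one 4 KiB flash sector.
constexpr uint32_t kNvsPageBytes = 4096;
// Partition size quoted in this thread: 0x7000 bytes.
constexpr uint32_t kNvsPartitionBytes = 0x7000;

// Number of whole NVS pages that fit into a partition of the given size.
constexpr uint32_t nvsPageCount(uint32_t partitionBytes) {
    return partitionBytes / kNvsPageBytes;
}

#if defined(ARDUINO_ARCH_ESP32) || defined(ESP_PLATFORM)
#include "nvs_flash.h"

// Initialize NVS before any config read/write or WiFi call.
// CAUTION: the erase-and-retry branch wipes the whole NVS partition,
// which here would include the clock settings and stored WiFi credentials.
esp_err_t initNvsFirst() {
    esp_err_t err = nvs_flash_init();
    if (err == ESP_ERR_NVS_NO_FREE_PAGES || err == ESP_ERR_NVS_NEW_VERSION_FOUND) {
        err = nvs_flash_erase();
        if (err == ESP_OK) {
            err = nvs_flash_init();
        }
    }
    return err;
}
#endif
```

With the 0x7000 partition this works out to 7 pages. Logging the return value of the first `nvs_flash_init()` call would at least show whether NVS is in a bad state when the credentials go missing.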

On the other hand, if there is a general problem that the clock is unable to read the config values from NVS after a while, we need to check this further anyway... I have never noticed anything like that on any of my clocks yet.

The WPS structure (esp_wps_config_t) can be initialized with a macro.

static esp_wps_config_t wps_config = WPS_CONFIG_INIT_DEFAULT(ESP_WPS_MODE);

It should never be zeroed out, but it is all the same init in the end.

The idea to initialize "our" WiFi structure completely with zeros on an initial WPS connect and in the error case seems valid to me, but I am not sure why this should fix the PSK apparently being deleted after some time... The PSK is stored in RAM only if NVS is not used, and in the WiFi driver's own NVS values (namespace) if NVS is active.

After how long does the problem show up for you? Hours? Days?
Can multiple clock restarts not force the behavior? Or a forced disconnect from the router side, such as rebooting it?

I will add some code snippets as well soon ;)

Bye
Martinius

@bitrot-alpha
Contributor Author

See below for terminal output from my OpenWRT (Gargoyle) router. The output is from hostapd and I can simulate pushing the WPS button on the router by using a command.

I can force the clock to trigger the bug by calling the wifi command on my router, which briefly turns off the WiFi (it restarts the daemons responsible for WiFi and turns the radios off and on). This is a problem for me because a script runs on my router in the early morning and late at night to change the radio transmission power, and that script calls the wifi command.

[Image: hostapd terminal output from the router]

@bitrot-alpha
Contributor Author

I'll attempt to get the debug logs from my clock soon.

@bitrot-alpha
Contributor Author

See attached

minicom.log

@bitrot-alpha
Contributor Author

@Martinius79

> On the other hand, if there is a general problem that the clock is unable to read the config values from NVS after a while, we need to check this further anyway... I have never noticed anything like that on any of my clocks yet.

My v1 Elekstube clock has been running for over a year at least without the 3.3V mod, so my ESP32 may be slightly toasted. I did do the mod today just to see if the behavior of the clock would change, but it hasn't. I haven't had any problems reflashing the ESP32 or with the clock image files getting corrupted.

@Martinius79
Collaborator

I created another PR with your changes and my idea :)

#117

Maybe you can give it a try and tell us if there is a change in behavior...

And I will have to look at the logs tomorrow... thx!

@aly-fly
Owner

aly-fly commented Jan 30, 2025

> I tried reprogramming the ESP32 with the old Arduino IDE firmware but it doesn't recognize the SPIFFS partition generated by PlatformIO.

I think this might be because the partition table we have in the .csv (and in the flash) differs from the default selection(s) in the Arduino IDE, so it wants to overwrite the whole flash with a different layout.

> I only tested that the WPS code is working in general, but never used it "long term"...

In my old WiFi setup (a simple router with WPS), with the clock programmed from Arduino, it worked for many months and across many clock and WiFi restarts.

> the real-time clock fallback also seems not to work if the WiFi is not working at all... The clock shows only zeros then.

This is weird. Did you check the battery of your RTC? On my v1 clock it works fine.

@bitrot-alpha
Contributor Author

When my v1 Elekstube has a dead RTC battery, it always thinks it's 16:00 when it can't reach the internet after a power cycle (it still ticks up as normal). All zeroes would indicate some problem communicating with the DS3231, since that's what the library returns when there's a problem.

Just for fun I tried updating my router to the latest stable OpenWRT. Before it was on a fork with a different web UI called Gargoyle. If I use the "Disconnect" button in the router web UI the clock will reconnect successfully, but it still is unable to reconnect if I run the wifi command.

@aly-fly
Owner

aly-fly commented Jan 30, 2025

Huh, so maybe the router is the culprit that forgets the authentication key :)

@bitrot-alpha
Contributor Author

Well I am unable to reflash the old Arduino firmware because the SPIFFS plugin doesn't work on my machine (it doesn't generate the partition correctly and errors out). I'm still not convinced yet that it's a bug in OpenWRT, because the old Arduino firmware worked fine.

I wish I had dumped the flash before writing the new firmware to it.

@Martinius79
Collaborator

> I wish I had dumped the flash before writing the new firmware to it.

The last version with the Arduino dir present is from last February.

use

git checkout --no-track -b Branch_f7b4f47

to get it...

@Martinius79
Collaborator

minicom.log

I have looked through the logs so far, and the first disconnect (Reason 2) could indicate that the signal strength is too weak to maintain the connection...
However, I don’t see any reason why the handshake fails on the reconnect attempt (Reason 15).

Which authentication methods are selected on the router? WPA2 only, or WPA2 and WPA3?
Does the problem persist if you reboot the clock after the first Reason 15 failure?

Can you also try adding a printout of the Wi-Fi signal strength?
Right now, we only log it in the MQTT status messages, but that might not be frequent enough.

Maybe log it in the main loop every second, and also while reconnecting (not sure if this works after a disconnect has already happened).

Serial.println(WiFi.RSSI());

And then tell us the results here, ideally with timestamps for both logs. The Serial Monitor extension for VS Code from Microsoft is my favorite for this.
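
The once-per-second logging suggested above could be sketched roughly like this. The helper names `intervalElapsed` and `logRssiEverySecond` are hypothetical; `WiFi.RSSI()`, `millis()` and `Serial` are the usual Arduino core calls. The uptime in milliseconds is printed in front of each value so the lines can be matched against the router log.

```cpp
#include <cstdint>

// True once at least intervalMs have passed since lastMs.
// Unsigned subtraction keeps this correct when the millisecond counter wraps around.
constexpr bool intervalElapsed(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) >= intervalMs;
}

#if defined(ARDUINO_ARCH_ESP32)
#include <WiFi.h>

// Call from loop(): prints "<uptime ms> RSSI <dBm>" about once per second
// without blocking the rest of the loop.
void logRssiEverySecond() {
    static uint32_t lastLogMs = 0;
    const uint32_t nowMs = millis();
    if (intervalElapsed(nowMs, lastLogMs, 1000)) {
        lastLogMs = nowMs;
        Serial.print(nowMs);
        Serial.print(" RSSI ");
        Serial.println(WiFi.RSSI());
    }
}
#endif
```

Comparing timestamps this way, rather than calling `delay(1000)`, keeps the display and button handling responsive while logging.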

I will also try to reduce the signal strength of my router while using my clocks, maybe one of them will have the same issues.

Another question: Does the red "NO WIFI" show up when this happens, or does it go to the WPS push connection mode again?

Bye
Martinius

@Martinius79
Collaborator

I tried my best to simulate your case, but no luck so far.
I reduced my router’s signal strength to 6% and then wrapped my EleksTube Original Version in aluminum foil.
After the third layer, it started failing the beacon checks (Reason 200) and reporting no AP (201), but it recovered flawlessly after removing just one layer.
So I have no idea where this Reason 2 is coming from...
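
For reference, the four reason codes mentioned in this thread carry the following names in ESP-IDF's `wifi_err_reason_t`. The small lookup below is only a sketch for making log lines easier to read: the enum and function names are made up here, and the split into "authentication related" versus "link related" is an assumption for illustration, not something the driver reports.

```cpp
#include <cstdint>

// Subset of ESP-IDF disconnect reason codes that appear in this thread.
enum class DisconnectReason : uint8_t {
    AuthExpire = 2,                // WIFI_REASON_AUTH_EXPIRE
    FourWayHandshakeTimeout = 15,  // WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT
    BeaconTimeout = 200,           // WIFI_REASON_BEACON_TIMEOUT
    NoApFound = 201                // WIFI_REASON_NO_AP_FOUND
};

// Short label for a raw reason code, for use in log output.
constexpr const char* reasonLabel(uint8_t code) {
    return code == 2   ? "AUTH_EXPIRE"
         : code == 15  ? "4WAY_HANDSHAKE_TIMEOUT"
         : code == 200 ? "BEACON_TIMEOUT"
         : code == 201 ? "NO_AP_FOUND"
                       : "OTHER";
}

// Rough grouping (assumption): 2 and 15 concern authentication and key exchange,
// while 200 and 201 mean the access point could not be heard or found.
constexpr bool isAuthRelated(uint8_t code) {
    return code == 2 || code == 15;
}
```

Seen this way, the foil test produced only the link-related codes (200, 201), while the original log shows the two authentication-related ones (2, 15).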

see my logs:

COM7_2025_01_30.23.36.03.792.txt

Modified branch so far at:
https://github.com/Martinius79/EleksTubeHAX/tree/FixWiFiReconnectProblemsTryOut

relevant settings:
HARDWARE_Elekstube_CLOCK
WIFI_USE_WPS
no mqtt

@bitrot-alpha
Contributor Author

Certain devices in my house didn't like having WPA3 enabled, so the router is WPA2 only. NO WIFI! shows as soon as I run the wifi command on the router, and upon rebooting the clock, it attempts to reconnect to the saved wifi again but times out. I will try your suggestions soon.

It seems to have been a bug in the previous OpenWRT 22.03 release. I have tried to trigger the bug (assumed to be with the ESP32) again by following the same steps I had done on the previous OpenWRT firmware and did not observe the same results with the new OpenWRT today.

Last night (on the new OpenWRT firmware..) I was able to trigger the bug but I'm assuming it's something really hard to fix/find with the PHY driver on my specific router and probably involves some sort of race condition. The clock did show up in the web UI after running the wifi command, but I had the same symptoms of the router thinking that the ESP had the wrong PSK. I'm wondering if the new firmware needed some time to "settle."

I will reopen the issue if I am able to get the bug to come back and I'm sure it's not actually a problem with my router.

Thank you for your time and effort!

@bitrot-alpha
Contributor Author

well, I was able to get the bug to trigger again just now :(
I'm fairly convinced it's a bug with OpenWRT though. I think maybe I'm misusing the wifi command as well. The default behavior of wifi is to enable interfaces, not reload the configuration. I think I got the idea to just run wifi and not wifi reload from an outdated wiki article.

@bitrot-alpha
Contributor Author

I tried the debug firmware from @Martinius79's branch and I haven't really gotten any more info than what I've already gathered. The signal does indeed cut out for a few seconds after running the wifi command, but I already knew that from the WiFi indicator on my PC.

@Martinius79
Collaborator

Martinius79 commented Jan 31, 2025

The ESP32 has had similar problems in the past with some other well-known routers. Here in Germany, the "Fritz!Box" series from AVM is very popular, and they had issues with authentication and the handshake some years ago with older firmware versions.

See: espressif/arduino-esp32#2501

So, I guess it is a combination of the Espressif/Arduino framework's WiFi implementation and the router firmware you are using.
Maybe you can open an issue in the repository, as we have very limited control over "internal" settings for the WPA functionality in the WiFi driver.
Experimenting with different framework versions might also yield some results.

OpenWRT issues on the GitHub mirror: https://github.com/openwrt/openwrt/issues

I didn't find anything useful in the recent issues related to the keyword "ESP32," but maybe you have more insight and can find additional information.

@Martinius79
Collaborator

Martinius79 commented Jan 31, 2025

Two more things:

  1. Change your Abstract API Key! It was posted in the log file.
  2. I guess this problem will also occur with WPS disabled and the SSID and password hardcoded. Maybe you can give it a try.
