Replies: 19 comments 2 replies
-
Hi, thanks for the report.
-
Thank you. I had already updated it, but the error still occurs.
-
Hi @holiday01, thanks for the reply. Can you try whether "poc" mode works for you? https://nvidia.github.io/NVFlare/quickstart.html#setting-up-the-application-environment-in-poc-mode Try to run an example app using the POC setup. If that works, we can narrow the problem down to the provisioned setup.
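For reference, a minimal sketch of the POC setup from the linked quickstart; the `poc` command, client count, and startup paths follow the 2.0-era docs and may differ in other versions:

```
pip install nvflare
poc -n 2                          # generate a POC workspace with 2 clients
./poc/server/startup/start.sh     # on the server machine
./poc/site-1/startup/start.sh     # on client machine 1
./poc/site-2/startup/start.sh     # on client machine 2
./poc/admin/startup/fl_admin.sh   # admin console, connects to the server
```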
-
The server worked using POC mode; the client and admin connect to it from different machines. When I run the "brats18" example, the clients produce the model training results (e.g., model.pt), but I get the error and model.pt is not sent to the server and admin. Hence, I guess I should run it using provision, which has the secure settings.
-
The error message shows that the client could not connect to the server at "103.124.23.130:8002". Please change the server to use a hostname instead of an IP address.
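One way to locate and change this, assuming a 2.x POC workspace layout (the JSON key name may differ in your version, so check your generated files):

```
# Find where the server address appears in the generated startup kits:
grep -rn "8002" poc/*/startup/
# In each client's fed_client.json, replace the raw IP with the server hostname,
# e.g. "target": "103.124.23.130:8002" -> "target": "my-fl-server:8002"
# ("target" is the 2.0-era key; verify against your own file).
```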
-
Actually, the clients had connected to the server. After the clients got local_model.pt, the error was displayed.
-
Are you using the POC mode or the provisioned secure mode? The provisioned secure mode can only work with hostnames. Also make sure both the server and client sides use the same hostname.
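A quick way to verify the hostname on both sides (the server name below is a placeholder for your own):

```
# On the server machine, get its hostname:
hostname -f
# On each client machine, confirm that name resolves to the server's IP:
getent hosts nvflare-server.example.com
# If it does not resolve, add an entry to /etc/hosts on the client machines.
```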
-
I used the POC mode and got the gRPC error, following this: https://nvidia.github.io/NVFlare/quickstart.html#quickstart
-
Can you try using POC mode: https://nvidia.github.io/NVFlare/quickstart.html#quickstart and follow the hello-numpy-sag example first: https://nvidia.github.io/NVFlare/examples/hello_numpy.html See if you can get this to finish with 1 server and 2 clients (on different machines).
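For context, the 2.0-era admin console flow to run an app looked roughly like this (newer releases replace these commands with submit_job, so check your version's docs):

```
upload_app hello-numpy-sag
set_run_number 1
deploy_app hello-numpy-sag all
start_app all
check_status server
check_status client
```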
-
I tried it.
-
Thanks, it looks like hello-numpy-sag is running correctly in POC mode on your systems. This is a good start.

If you want to try the SECURE mode, you will need to follow this: https://nvidia.github.io/NVFlare/user_guide/overview.html#provisioned-setup Please also try the hello-numpy-sag example first and check the results. If things are good, then try hello-pt and see if it completes. I am sorry this is not a short process; we are working on improving the user experience in the next release.

If you want to try the BRATS example, you will need to install the required packages on all the machines. The required packages are listed here: https://github.com/NVIDIA/NVFlare/blob/main/examples/brats18/virtualenv/min-requirements.txt And you need to split the BRATS dataset into the number of clients you have. @holgerroth or @ZiyueXu77 can help from there if you have any questions.
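A rough outline of those steps in shell form; the paths are illustrative, and `provision -p project.yml` is the command used elsewhere in this thread:

```
# Secure (provisioned) setup, per the linked user guide:
provision -p project.yml                 # generates per-site startup kits
# Copy each site's kit to its machine, then on each machine:
bash <site>/startup/start.sh
# For BRATS, install the requirements on every machine:
pip install -r examples/brats18/virtualenv/min-requirements.txt
```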
-
Thank you for your response and help. I know the debugging will take a lot of effort. This issue occurred when I ran "provision -p project.yml" with a project.yml I had edited, hence I am asking here. I ran hello-pt and got the cross_site_val folder on the server (cross_site_val: cross_val_results.json, model_shareables, result_shareables). I had finished the BRATS example, but I could not get the model on the server, whereas I got local_model.pt and best_local_model.pt on the clients.
-
Hi @holiday01, thanks for the additional information and your experiment. From "I ran the hello-pt. And got the cross_site_val folder in the server (cross_site_val - cross_val_results.json, model_shareables, result_shareables)", it seems that your system finishes the hello-pt example correctly, which is a good sign. Can you elaborate on what steps you took to run the BRATS example?

You said "I got local_model.pt and best_local_model.pt in the clients." => this means each client is doing its work. Note that the config provided in the BRATS example folder, https://github.com/NVIDIA/NVFlare/blob/main/examples/brats18/configs/brats18_fedavg/config/config_fed_server.json#L3, specifies min_clients as 13. So if you have fewer client sites than this number, the server side will not proceed (it is waiting for all clients to submit results back). You should change this number to the number of clients you actually run.
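For example, with two clients, a one-line edit along these lines (the sed pattern assumes the key appears as "min_clients": 13 in that file; adjust the path to your checkout):

```
sed -i 's/"min_clients": *13/"min_clients": 2/' \
    examples/brats18/configs/brats18_fedavg/config/config_fed_server.json
```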
-
I used only two clients. I edited the data paths in the clients' configs and the image paths in the image JSON files, then ran: set_run_number 1
-
Hi @holiday01, thanks for the additional information. Can you change this line to False? I think if it is False, the server-side model will not be deleted.
-
Thank you, I tried it. The clients still got model.pt, whereas the server did not.
-
Hi @holiday01, thanks for getting back. You said that in the end "check_status client" shows stopped. What about "check_status server": what did it show in the end? If possible, can you attach your server and client logs? They can be found in the startup folder as log.txt at each site.
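Concretely, something like this (the log location follows the note above; it can vary with version and workspace layout):

```
# In the admin console:
check_status server
check_status client
# On each machine, grab the site's log:
cat server/startup/log.txt
cat site-1/startup/log.txt
```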
-
I just ran it with 1 server and 2 clients, and it is working on my machine. Looking at your "check_status server" output at the end, I see "Run number has not been set." Did your server die? Can you remove your old workspace, start fresh with 1 server and 2 clients, and share the whole log file from both the server and the clients here?
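To start fresh in POC mode, roughly (assuming the quickstart's poc workspace layout):

```
rm -rf poc                        # remove the old workspace
poc -n 2                          # regenerate with 2 clients
./poc/server/startup/start.sh     # server machine
./poc/site-1/startup/start.sh     # client machine 1
./poc/site-2/startup/start.sh     # client machine 2
```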
-
I got a new error while running bash server/startup/start.sh:
E0602 10:38:02.562101740 608351 ssl_transport_security.cc:1495] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
Originally posted by @holiday01 in #631 (comment)
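For anyone hitting this: WRONG_VERSION_NUMBER during the TLS handshake generally means one side is speaking plaintext while the other expects TLS, e.g. mixing POC (non-secure) startup kits with provisioned (secure) ones, or a proxy in the path. A quick probe from a client machine (hostname and port are placeholders):

```
# Check whether the server port actually speaks TLS:
openssl s_client -connect nvflare-server.example.com:8002
# A certificate dump means TLS is up; an immediate handshake error suggests
# a plaintext listener or an intermediary terminating the connection.
```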