request latency #82
Elapsed time on a test cluster provided by Jack:

$ time curl http://xxx.xxx.xxx.xxx:5555/cluster | wc

real 0m6.461s
Interestingly, two containers in this cluster contain an extra-long list of ExecIDs (about 100,000 entries in each). These lists should be eliminated by the cluster-insight minion. This may explain some of the increase in the elapsed time.
Also, both of these containers are skydns containers, the same ones that had the extra-long ExecIDs in the Cassandra cluster project.
Before the cluster-insight minion deletes the values of ExecIDs:

real 0m8.405s
real 0m5.354s
After the cluster-insight minion omits the value of ExecIDs:

$ time curl http://xxx.xxx.xxx.xxx:5555/cluster | wc

real 0m3.726s
real 0m0.358s

*** important observations ***
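For illustration, a minimal sketch of the kind of filtering the minion can apply before returning the inspect data; the function name is made up, while 'ExecIDs' is the field name as it appears in the "docker inspect" output:

```python
def strip_exec_ids(inspect_output):
    """Drops the potentially huge ExecIDs lists from parsed "docker inspect" data.

    'inspect_output' is the list of container dictionaries produced by parsing
    the JSON that "docker inspect" prints.
    """
    for container in inspect_output:
        if 'ExecIDs' in container:
            # The two skydns containers in this cluster accumulated about
            # 100,000 entries each, which inflates the /cluster response.
            container['ExecIDs'] = None
    return inspect_output
```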
It takes about 1 second for Docker to return the details of this container. This time varies from 0.6 seconds to 1.1 seconds. Note that the "docker inspect" command returns over 8 MB of data. Caching at the cluster-insight minion should eliminate the long latency of reading the raw data from Docker.

$ time sudo docker inspect 5e9a552724da | wc

real 0m0.594s
real 0m1.104s
real 0m0.949s
real 0m0.648s
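A rough sketch of what minion-side caching of the inspect results could look like, assuming a simple per-container TTL cache; the TTL value and helper names are assumptions, not the actual cluster-insight implementation:

```python
import json
import subprocess
import time

_TTL_SECONDS = 10      # assumed freshness window
_inspect_cache = {}    # container ID -> (timestamp, parsed inspect output)

def cached_inspect(container_id):
    """Returns parsed "docker inspect" output, reusing a recent result."""
    now = time.time()
    entry = _inspect_cache.get(container_id)
    if entry is not None and now - entry[0] < _TTL_SECONDS:
        return entry[1]
    # Cache miss: pay the 0.6 to 1.1 second cost of reading ~8 MB from Docker.
    raw = subprocess.check_output(['docker', 'inspect', container_id])
    parsed = json.loads(raw)
    _inspect_cache[container_id] = (now, parsed)
    return parsed
```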
Current implementation: the performance now on a cluster running the Guestbook app.

real 0m0.532s

Hot collector cache:
real 0m0.131s
Still need to do:
I will close this issue after trying to increase the number of worker threads and writing a unit test.
Measuring the collector running on a Cassandra test cluster with 4 minion nodes using the following shell command:

$ for i in $(seq 1 5); do sleep 11; time curl http://localhost:5555/cluster | wc; done

Each result is the arithmetic average of 5 requests to the /cluster endpoint. The requests are spaced 11 seconds apart to force a cache miss. The access is from the node that runs the master collector.

1 worker thread: 4.284 seconds

Summary: creating the same number of worker threads as the number of minion nodes produced the lowest latency. This is also the current default.
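As a sketch of why one worker thread per minion is a natural default, here is roughly how the master can fan out to the minions with a pool sized to the number of minion nodes; the node names, port, and endpoint path are made-up placeholders, not the collector's actual configuration:

```python
import concurrent.futures
import json
import urllib.request

MINION_NODES = ['node-1', 'node-2', 'node-3', 'node-4']  # placeholder names

def fetch_from_minion(node):
    # Placeholder URL; stands in for whatever the minion actually exports.
    url = 'http://%s:4243/containers' % node
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# One worker thread per minion node lets every minion be queried in
# parallel without leaving idle threads, which matches the lowest
# latency observed above.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(MINION_NODES)) as pool:
    minion_data = list(pool.map(fetch_from_minion, MINION_NODES))
```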
When I measured the collector running on a Cassandra test cluster with 4 minion nodes again today, I got the following results:

$ for i in $(seq 1 5); do sleep 11; time curl http://localhost:5555/debug | wc; done

This does not make sense, because the average latency of the /debug endpoint should be less than the average latency of /cluster, since the output of /debug is about 1/10 the size of the output of /cluster.
Repeating the measurements:

$ for i in $(seq 1 5); do sleep 11; time curl http://localhost:5555/cluster | wc; done
$ for i in $(seq 1 5); do sleep 11; time curl http://localhost:5555/debug | wc; done

This makes more sense. However, the variability of the measurements is worrisome.
The logic for /debug does everything that /cluster does, plus some additional work, so /debug should not be faster than /cluster.
Performance is still too slow. The master collector should respond in about 1 second for /cluster or /debug requests.
…calls. This endpoint should help to find performance problems which increase the overall response time. See issue google#82.
The output of the '/elapsed' endpoint shows the low-level requests by decreasing elapsed time.

$ time curl http://localhost:5555/cluster | wc

The top low-level requests by decreasing elapsed times:
Note that there are 4 concurrent calls to "api/v1/pods". The reason is that all four requests start at about the same time and each one sees a cache miss.
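A simplified sketch of how this happens with a plain TTL cache and several worker threads (the cache layout and TTL here are assumptions): every thread that checks the entry before the first fetch completes sees the same miss and issues its own request.

```python
import time

_CACHE_TTL = 10   # assumed time-to-live in seconds
_cache = {}       # key -> (timestamp, value)

def cached_get(key, fetch_fn):
    """Naive cached lookup; not protected against concurrent misses."""
    entry = _cache.get(key)
    if entry is not None and time.time() - entry[0] < _CACHE_TTL:
        return entry[1]
    # All worker threads that reach this point at roughly the same moment see
    # the same expired or missing entry, so each issues its own request.
    # That is why 4 concurrent calls to "api/v1/pods" show up in /elapsed.
    value = fetch_fn()
    _cache[key] = (time.time(), value)
    return value
```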
More elapsed time measurements:

$ time curl http://localhost:5555/cluster | wc
$ curl http://localhost:5555/elapsed

Thread 140262325315328 had 44 calls.
Observations:
The first 10 calls are:

The first 5 calls made by all other threads are:
More confusing evidence: looking at the log file after one day, I could not find any double entries. This may have been an artifact of the "less" or "more" commands.
Another problem: the only expensive operation that JSONCache.lookup() does in addition to StringCache.lookup() is a json.loads() of the cached string. However, json.loads() is called all over the place. Maybe the problem is caused by memory management during concurrent processing.
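To make the comparison concrete, a rough sketch of the relationship described above (these are stand-ins, not the actual cluster-insight classes): StringCache.lookup() hands back the cached string as-is, while JSONCache.lookup() additionally pays for one json.loads() per hit.

```python
import json
import time


class StringCache(object):
    """Caches raw strings by key."""

    def __init__(self):
        self._data = {}  # key -> (timestamp, string)

    def update(self, key, value):
        self._data[key] = (time.time(), value)

    def lookup(self, key):
        entry = self._data.get(key)
        return entry[1] if entry is not None else None


class JSONCache(StringCache):
    """Same storage, but lookup() returns the parsed object."""

    def lookup(self, key):
        cached = super(JSONCache, self).lookup(key)
        # The only extra work per hit is this json.loads() call.
        return json.loads(cached) if cached is not None else None
```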
Running the cluster-insight collector with 1 worker thread and 100% cache hits for containers, images, and processes yields a response time between 0.4 seconds and 4 seconds. Possible reasons:
Fresh measurement of the Cluster-Insight collector with aggressive caching (all Kubernetes requests except getting process information are cached). "n1-standard-1" machines with 4 worker threads:

Output of "curl http://localhost:5555/cluster | wc" was consistently
Same measurements of the Cluster-Insight collector with aggressive caching (all Kubernetes requests except getting process information are cached) on a different cluster (thanks Vas!). "n1-standard-1" machines with 4 worker threads:

Output of "curl http://localhost:5555/cluster | wc" was consistently

These measurements indicate that the slowness and high variability of the Cluster-Insight collector are caused by the VM and cluster, and not by the code. Aggressive caching is good in any case.
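For reference, a sketch of the caching policy described here, with the source names as illustrative placeholders: every request goes through the cache except process information, which is always fetched live.

```python
# Illustrative policy: which data sources go through the cache under
# "aggressive caching". Only process information bypasses the cache.
_CACHED_SOURCES = frozenset([
    'nodes', 'pods', 'services', 'replication_controllers',
    'containers', 'images',
])
_cache = {}

def get_data(source, fetch_fn):
    """Fetches 'source' through the cache when the policy allows it."""
    if source in _CACHED_SOURCES and source in _cache:
        return _cache[source]
    value = fetch_fn()
    if source in _CACHED_SOURCES:
        _cache[source] = value
    return value
```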
The elapsed time of the collector fetching data for the first time from a cluster is more than 10 seconds, which seems like an eternity.
Note that the collector should use parallelism to extract the data, so this may be a bug somewhere.