Performance regression with 2.17.x #672
Thanks @plokhotnyuk. The switch away from ThreadLocal caching is the biggest change in 2.17; it can be switched back on via config. FYI @cowtowncoder
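For readers looking for that config switch, here is a minimal sketch (in Scala, assuming the JsonRecyclerPools and JsonFactoryBuilder APIs added in jackson-core 2.16/2.17; not code taken from this thread) of selecting the ThreadLocal-based pool explicitly:

import com.fasterxml.jackson.core.{JsonFactory, JsonFactoryBuilder}
import com.fasterxml.jackson.core.util.JsonRecyclerPools
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object ThreadLocalPoolConfig {
  // JsonFactory wired back to the pre-2.17 ThreadLocal-based buffer recycling
  // instead of the 2.17.0 per-factory default pool.
  val factory: JsonFactory = new JsonFactoryBuilder()
    .recyclerPool(JsonRecyclerPools.threadLocalPool())
    .build()

  // Build the mapper on top of that factory and reuse it application-wide.
  val mapper: JsonMapper = JsonMapper.builder(factory)
    .addModule(DefaultScalaModule)
    .build()
}

Any ObjectMapper built on top of such a factory should then recycle buffers much as 2.16 did.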
@pjfanning Is it …?
Interesting. My tests with …
@plokhotnyuk I'm on a day trip so I only have my phone. That looks right, though. @cowtowncoder would it be fair to say that users should create as few ObjectMappers as possible and reuse them as much as possible? With the new default recycler, having a single ObjectMapper shared across threads might be the best option for high throughput.
Yes. And for safety, reusing … Having said that, it sounds like reuse is being done as expected here, so wasteful non-reuse would not be the issue.
FWIW, the json4s and play-json case is the most worrying: it almost suggests that buffer recycling might be growing without bound, as if something was able to "return" buffers to recyclers but not actually recycle those buffer instances. Otherwise, lack of recycling in itself should not cause increasingly bad performance.
And one more thing: I will be off for about a week so may not be able to follow up quickly here -- but I am very interested in learning what exactly might be going on to cause this problem, and how to resolve it.
I haven't looked at the play-json/json4s code, but one possibility is that they create ObjectMappers, use them once, and throw them away. They possibly don't use a JsonFactory instance to create the ObjectMappers, so every ObjectMapper could have its own JsonFactory, and every JsonFactory has its own BufferRecycler instance - which ends up hardly seeing any reuse of arrays. With the old ThreadLocal buffer recycler, the cache is not specific to a JsonFactory instance, so you can get away with not trying to reuse a specific buffer-recycler instance.
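To make the two usage patterns concrete, an illustrative sketch (hypothetical code, not the actual json4s/play-json internals; assumes jackson-module-scala's DefaultScalaModule):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

final case class Point(x: Int, y: Int)

object MapperReuseSketch {
  // Wasteful pattern: a brand-new ObjectMapper per call means a brand-new JsonFactory
  // and, with the 2.17.0 default, a brand-new per-factory recycler pool, so buffers
  // returned to that pool are never picked up again.
  def parseWasteful(json: String): Point = {
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
    mapper.readValue(json, classOf[Point])
  }

  // Reuse pattern: one shared mapper (thread-safe once configured) means one
  // JsonFactory and one recycler pool serving all calls and threads.
  private val sharedMapper: ObjectMapper =
    new ObjectMapper().registerModule(DefaultScalaModule)

  def parseReusing(json: String): Point =
    sharedMapper.readValue(json, classOf[Point])
}

Whether the affected libraries actually follow the wasteful pattern is exactly the open question in this thread.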
Yes, but that wouldn't explain gradually worsening behavior… it'd just be an overall no-recycling case. I guess it'd be possible to try with a no-op recycler and compare numbers.
I fear that a lot of Jackson users just create lots of ObjectMappers and are unaware of JsonFactory. These users do not get optimal performance because they do not reuse ObjectMappers. In Jackson 2.16, they got ThreadLocal buffer recycling that spanned all ObjectMappers. In Jackson 2.17, they get a new BufferRecycler instance for every ObjectMapper instance - and this is not much better than switching off the BufferRecycler, especially with ObjectMappers that are used to parse or write just one small file.
@cowtowncoder would it be possible to consider changing the default in JsonRecyclerPools to the sharedLockFreePool implementation, or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.
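For comparison with the ThreadLocal sketch above, opting into the shared lock-free pool would look roughly like this (again assuming the 2.17 JsonRecyclerPools API; names here are illustrative):

import com.fasterxml.jackson.core.{JsonFactory, JsonFactoryBuilder}
import com.fasterxml.jackson.core.util.JsonRecyclerPools

object SharedLockFreePoolConfig {
  // Even throwaway ObjectMappers built from this factory draw buffers from one
  // JVM-wide shared pool instead of a fresh per-factory pool.
  val factory: JsonFactory = new JsonFactoryBuilder()
    .recyclerPool(JsonRecyclerPools.sharedLockFreePool())
    .build()
}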
While a revert to the previous default is possible, I would really like to understand the mechanics here. I feel like I am missing some piece of the puzzle; something more seems to be going on.
@cowtowncoder the ThreadLocal buffer recycler instance is shared. It doesn't matter if users reuse ObjectMappers or not - they get a shared buffer recycler. With the new Jackson 2.17 default, unless a user knows about creating JsonFactories and reusing those - and lots of users have no idea about this - each ObjectMapper gets its own JsonFactory and its own buffer recycler instance, which does little to avoid allocating new arrays. There is a good chance that with small docs few if any arrays will be reused. I'm afraid I'm busy with setting up some new Apache projects, so I don't really have any time to debug this.
@plokhotnyuk I notice that you changed the Json4s benchmark to try to use the ThreadLocal buffer recycler. Does this help get the performance back to where it was before Jackson 2.17?
@pjfanning Yes, I fully understand that the ThreadLocal-based pool is shared regardless of reuse of … My point is that I am not convinced this reuse in itself would explain all of the observed difference, even for small content. It definitely would help in such a case. So: yes, for the non-reuse case the ThreadLocal pool will work better. But does that really explain the drastic difference? These are just simple … So I do not want to reflexively assume that this is the main or only problem here, revert, and call it good. I am not against changing the default pool for 2.x version(s) if this seems necessary, FWIW.
@plokhotnyuk I was wondering if you were able to test out using … to configure the pool back to the pre-2.17 one? Also: I was not clear on whether … but would want to have confidence it would make a positive difference here. Unfortunately I haven't really been able to decipher the flame graphs to get an idea of the exact details of the performance degradation.
One further data point: Trino issue trinodb/trino#21356 does suggest that the issue would indeed be the retention of a huge number of buffers, and not (I think) so much that there is no recycling. But how could this happen?
Currently all jackson-based benchmarks from … show no more 1000x performance regressions when running parsing or serialization using 16 threads. Here is a comparison of benchmark results using JDK 21 before the upgrade to 2.17.0 and after, with the mitigation applied; see …
@plokhotnyuk OK, so it seems fair to say that switching back to the ThreadLocal recycler restored performance to the pre-2.17 state. Good. Thank you for confirming.
@plokhotnyuk OK, based on investigation into reproducing the issue, there is some suspicious behavior by … My thinking now is to revert back to … Would it be possible to test what use of … would do here?
@cowtowncoder I've prepared a branch with the pool type replaced for all JSON parsers that use …
While full tests on my desktop are still running, I tested with:
sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -wi 5 -i 5 -t 16 jacksonScala'
For small (less than 100 bytes) messages, up to 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers. I've highlighted them in the following flame graph:
I ran the same ADTWriting.jacksonScala test with a few different Jackson versions, and tried a few different pool implementations with Jackson 2.17.0.
@cowtowncoder @pjfanning Here are comparisons of benchmark results between GraalVM JDK 22 …
@pjfanning Differences in your runs seem much more modest than what @plokhotnyuk reports. I wonder if that is due to a lower thread count (less contention) or to the relatively bigger output size? @plokhotnyuk I am still struggling to reconcile your results with an understanding of what the specific reason for the drastically lower performance is. It seems to me there are 3 possible causes: …
If it is (1) that explains the problem, it would suggest there is not much to do, and the default for the traditional thread case should remain ThreadLocalPool. And since Jackson should ideally not be in the business of auto-detecting the runtime platform (or users' thread-model choices), it probably should be THE default pool, only to be overridden by developers/frameworks that know better. But if there is any chance that (2) could be the case, it'd be good to figure out why reuse was not working.
@pjfanning I am a bit surprised by there being any difference between "shared" and "new" pools -- when a single …
@plokhotnyuk Thank you for the links! Unfortunately I am not quite sure how to read them, like, say, https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17).
Switch back to older version as there is a known performance regression in 2.17.0 ([#672]). [#672]: FasterXML/jackson-module-scala#672
Both benchmark runs are for the latest jackson-core release. The baseline is using … It is compared with benchmark results for replacing that mitigation by … To see the up-to-20x slowdown shown in the screenshot below, please scroll down the page to the …
@cowtowncoder @pjfanning Thanks for your support! I'm closing this issue as fixed in 2.17.1.
Thank you for reporting the issue, @plokhotnyuk!
Both 2.17.0 and 2.17.0-rc1 are affected when small messages (less than 100 bytes) are parsed or serialized in multiple threads using the same object mapper instance.
The problem exists when parsing bigger messages but the performance difference is smaller.
All libraries that use the latest version of the Jackson library under the hood are affected: json4s, play-json, weePickle, etc.
Here are results of benchmarks that parse and serialize 62-byte and 1185-byte messages (ADTReading/ADTWriting and ArrayOfEnumADTsReading/ArrayOfEnumADTsWriting, respectively) on Intel(R) Core(TM) i7-11700 @ 2.50GHz (max 4.90 GHz) with DDR4-3200, using 16 threads:
2.16.2:
2.17.x:
Steps to reproduce (assumes async-profiler is installed under /opt/async-profiler):
git clone https://github.com/plokhotnyuk/jsoniter-scala
cd jsoniter-scala
sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -t 16 -wi 10 -i 10 ADT'
The issue also exists when tested on other JVMs (GraalVM Community, Oracle GraalVM) and with other garbage collectors (G1GC, ZGC).
Here are flame graphs of async-profiler reports for the ADTWriting.jacksonScala benchmark:
2.16.2:
2.17.0:
For the json4s and play-json libraries, performance could dynamically slow down by about 1000x:
Could using ThreadLocal for caching of object mappers be an acceptable workaround for this issue? Is there any workaround where we can explicitly define the initial size of the underlying buffers, suitable for use with other libraries like play-json and weePickle?
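As a rough illustration of the first question, a ThreadLocal cache of mappers might look like this (hypothetical helper, assuming jackson-module-scala's DefaultScalaModule; whether it is an acceptable workaround is exactly what is being asked):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// One mapper per thread: each thread then keeps hitting the same JsonFactory and
// therefore the same per-factory buffer recycler, mimicking pre-2.17 reuse behavior.
object PerThreadMapper {
  private val cache: ThreadLocal[ObjectMapper] =
    ThreadLocal.withInitial(() => new ObjectMapper().registerModule(DefaultScalaModule))

  def get: ObjectMapper = cache.get()
}

// Usage (with a hypothetical payload type): PerThreadMapper.get.readValue(json, classOf[SomePayload])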