
Performance regression with 2.17.x #672

Closed

plokhotnyuk opened this issue Mar 13, 2024 · 32 comments
@plokhotnyuk

plokhotnyuk commented Mar 13, 2024

Both 2.17.0 and 2.17.0-rc1 are affected when small messages (less than 100 bytes) are parsed or serialized in multiple threads using the same object mapper instance.

The problem also exists when parsing bigger messages, but the performance difference is smaller.

All libraries that use the latest version of the Jackson library under the hood are affected: json4s, play-json, weePickle, etc.

Here are the results of benchmarks that parse and serialize 62-byte and 1185-byte messages (ADTReading/ADTWriting and ArrayOfEnumADTsReading/ArrayOfEnumADTsWriting, respectively) on an Intel(R) Core(TM) i7-11700 @ 2.50GHz (max 4.90 GHz) with DDR4-3200, using 16 threads:

2.16.2

[info] Benchmark                                           (size)   Mode  Cnt          Score         Error  Units
[info] ADTReading.borer                                       N/A  thrpt   10    6794141.947 ±   28943.355  ops/s
[info] ADTReading.circe                                       N/A  thrpt   10    5733802.213 ±   20142.997  ops/s
[info] ADTReading.circeJsoniter                               N/A  thrpt   10    6033807.062 ±   39206.180  ops/s
[info] ADTReading.jacksonScala                                N/A  thrpt   10    8600750.285 ±   29585.508  ops/s
[info] ADTReading.json4sJackson                               N/A  thrpt   10    1986617.545 ±   11937.239  ops/s
[info] ADTReading.json4sNative                                N/A  thrpt   10     728301.651 ±    3179.717  ops/s
[info] ADTReading.jsoniterScala                               N/A  thrpt   10   31056253.119 ±  162615.821  ops/s
[info] ADTReading.playJson                                    N/A  thrpt   10    2854140.465 ±   12044.125  ops/s
[info] ADTReading.playJsonJsoniter                            N/A  thrpt   10    5930774.513 ±   12228.253  ops/s
[info] ADTReading.smithy4sJson                                N/A  thrpt   10    8047920.054 ±   57498.071  ops/s
[info] ADTReading.sprayJson                                   N/A  thrpt   10    7559747.663 ±   37500.787  ops/s
[info] ADTReading.uPickle                                     N/A  thrpt   10   10820366.906 ±   38588.185  ops/s
[info] ADTReading.weePickle                                   N/A  thrpt   10    8520949.161 ±   38999.046  ops/s
[info] ADTReading.zioJson                                     N/A  thrpt   10    6516965.272 ±   30056.348  ops/s
[info] ADTWriting.borer                                       N/A  thrpt   10   12725701.358 ±   64383.508  ops/s
[info] ADTWriting.circe                                       N/A  thrpt   10    5702372.141 ±  123333.038  ops/s
[info] ADTWriting.circeJsoniter                               N/A  thrpt   10    6136183.702 ±   31911.935  ops/s
[info] ADTWriting.jacksonScala                                N/A  thrpt   10   20213415.218 ±  154415.084  ops/s
[info] ADTWriting.json4sJackson                               N/A  thrpt   10    1970762.580 ±    5888.709  ops/s
[info] ADTWriting.json4sNative                                N/A  thrpt   10    2129475.567 ±   10664.320  ops/s
[info] ADTWriting.jsoniterScala                               N/A  thrpt   10   86348750.403 ± 3923732.627  ops/s
[info] ADTWriting.jsoniterScalaPrealloc                       N/A  thrpt   10  126135815.779 ±  753875.520  ops/s
[info] ADTWriting.playJson                                    N/A  thrpt   10    7066647.975 ±   38110.136  ops/s
[info] ADTWriting.playJsonJsoniter                            N/A  thrpt   10   10034627.023 ±   45579.652  ops/s
[info] ADTWriting.smithy4sJson                                N/A  thrpt   10   19544469.284 ±  206715.422  ops/s
[info] ADTWriting.sprayJson                                   N/A  thrpt   10    8817363.222 ±   29355.150  ops/s
[info] ADTWriting.uPickle                                     N/A  thrpt   10   12346407.504 ±   41903.003  ops/s
[info] ADTWriting.weePickle                                   N/A  thrpt   10   15216077.687 ±   71823.446  ops/s
[info] ADTWriting.zioJson                                     N/A  thrpt   10    6696186.925 ±   37137.477  ops/s
[info] ArrayOfEnumADTsReading.borer                           128  thrpt   10    1485188.033 ±   11824.058  ops/s
[info] ArrayOfEnumADTsReading.circe                           128  thrpt   10    1060544.252 ±    5714.295  ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter                   128  thrpt   10    1078942.307 ±    2906.939  ops/s
[info] ArrayOfEnumADTsReading.jacksonScala                    128  thrpt   10    1399993.550 ±    5345.969  ops/s
[info] ArrayOfEnumADTsReading.json4sJackson                   128  thrpt   10     386204.100 ±    1860.096  ops/s
[info] ArrayOfEnumADTsReading.json4sNative                    128  thrpt   10     312858.047 ±    3125.767  ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala                   128  thrpt   10    4289813.648 ±   15066.858  ops/s
[info] ArrayOfEnumADTsReading.playJson                        128  thrpt   10     478973.090 ±    2376.637  ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter                128  thrpt   10     515684.604 ±    1902.051  ops/s
[info] ArrayOfEnumADTsReading.sprayJson                       128  thrpt   10     755587.471 ±    4383.413  ops/s
[info] ArrayOfEnumADTsReading.uPickle                         128  thrpt   10    1242681.165 ±    2746.319  ops/s
[info] ArrayOfEnumADTsReading.weePickle                       128  thrpt   10    1169063.048 ±    4494.799  ops/s
[info] ArrayOfEnumADTsReading.zioJson                         128  thrpt   10     563521.111 ±    1445.949  ops/s
[info] ArrayOfEnumADTsWriting.borer                           128  thrpt   10    2054275.748 ±    9991.515  ops/s
[info] ArrayOfEnumADTsWriting.circe                           128  thrpt   10    1241981.451 ±  112432.709  ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter                   128  thrpt   10    2522632.787 ±   16045.668  ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala                    128  thrpt   10    3174526.420 ±   17247.360  ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson                   128  thrpt   10    1069988.662 ±    5451.828  ops/s
[info] ArrayOfEnumADTsWriting.json4sNative                    128  thrpt   10     387662.462 ±    1569.061  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala                   128  thrpt   10   13584173.183 ±   65176.029  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc           128  thrpt   10   15338486.834 ±   73143.637  ops/s
[info] ArrayOfEnumADTsWriting.playJson                        128  thrpt   10    2497807.549 ±   15178.768  ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter                128  thrpt   10    3063237.303 ±   25286.111  ops/s
[info] ArrayOfEnumADTsWriting.sprayJson                       128  thrpt   10    1816238.727 ±    8867.954  ops/s
[info] ArrayOfEnumADTsWriting.uPickle                         128  thrpt   10    1595086.586 ±    8474.119  ops/s
[info] ArrayOfEnumADTsWriting.weePickle                       128  thrpt   10    2396877.337 ±    8272.100  ops/s
[info] ArrayOfEnumADTsWriting.zioJson                         128  thrpt   10    2754854.478 ±   10817.622  ops/s

2.17.x

[info] Benchmark                                           (size)   Mode  Cnt          Score         Error  Units
[info] ADTReading.borer                                       N/A  thrpt   10    6808060.010 ±   24121.820  ops/s
[info] ADTReading.circe                                       N/A  thrpt   10    5504991.218 ±   28924.908  ops/s
[info] ADTReading.circeJsoniter                               N/A  thrpt   10    5965473.392 ±   24541.654  ops/s
[info] ADTReading.jacksonScala                                N/A  thrpt   10    3880829.241 ±   14779.470  ops/s
[info] ADTReading.json4sJackson                               N/A  thrpt   10    1684115.890 ±   17922.496  ops/s
[info] ADTReading.json4sNative                                N/A  thrpt   10     781783.122 ±    9209.704  ops/s
[info] ADTReading.jsoniterScala                               N/A  thrpt   10   28193676.278 ±  170866.780  ops/s
[info] ADTReading.playJson                                    N/A  thrpt   10    1861673.096 ±   32396.971  ops/s
[info] ADTReading.playJsonJsoniter                            N/A  thrpt   10    5936651.174 ±   28715.564  ops/s
[info] ADTReading.smithy4sJson                                N/A  thrpt   10    8183263.456 ±   78205.821  ops/s
[info] ADTReading.sprayJson                                   N/A  thrpt   10    7725596.379 ±   25148.636  ops/s
[info] ADTReading.uPickle                                     N/A  thrpt   10   10532982.352 ±   34026.274  ops/s
[info] ADTReading.weePickle                                   N/A  thrpt   10    3662160.317 ±   10805.963  ops/s
[info] ADTReading.zioJson                                     N/A  thrpt   10    6552279.264 ±   26565.487  ops/s
[info] ADTWriting.borer                                       N/A  thrpt   10   14804007.272 ±   66986.481  ops/s
[info] ADTWriting.circe                                       N/A  thrpt   10    5555829.627 ±   33717.558  ops/s
[info] ADTWriting.circeJsoniter                               N/A  thrpt   10    6289535.722 ±   24806.934  ops/s
[info] ADTWriting.jacksonScala                                N/A  thrpt   10    3815278.282 ±   44918.845  ops/s
[info] ADTWriting.json4sJackson                               N/A  thrpt   10       6895.294 ±    4815.743  ops/s
[info] ADTWriting.json4sNative                                N/A  thrpt   10    2123757.624 ±   10590.700  ops/s
[info] ADTWriting.jsoniterScala                               N/A  thrpt   10   82359727.429 ± 3453467.534  ops/s
[info] ADTWriting.jsoniterScalaPrealloc                       N/A  thrpt   10  134212123.633 ± 1162472.156  ops/s
[info] ADTWriting.playJson                                    N/A  thrpt   10     586570.259 ± 1309910.167  ops/s
[info] ADTWriting.playJsonJsoniter                            N/A  thrpt   10    9828872.088 ±   63874.635  ops/s
[info] ADTWriting.smithy4sJson                                N/A  thrpt   10   28487361.435 ±  130873.054  ops/s
[info] ADTWriting.sprayJson                                   N/A  thrpt   10   12135052.269 ±   49628.928  ops/s
[info] ADTWriting.uPickle                                     N/A  thrpt   10   10269863.568 ±   68308.485  ops/s
[info] ADTWriting.weePickle                                   N/A  thrpt   10    3263834.084 ±    8549.350  ops/s
[info] ADTWriting.zioJson                                     N/A  thrpt   10    6440516.792 ±   20713.803  ops/s
[info] ArrayOfEnumADTsReading.borer                           128  thrpt   10    1267431.783 ±    6201.075  ops/s
[info] ArrayOfEnumADTsReading.circe                           128  thrpt   10    1061247.664 ±    7164.785  ops/s
[info] ArrayOfEnumADTsReading.circeJsoniter                   128  thrpt   10    1079132.139 ±    5715.793  ops/s
[info] ArrayOfEnumADTsReading.jacksonScala                    128  thrpt   10    1396218.772 ±    6514.442  ops/s
[info] ArrayOfEnumADTsReading.json4sJackson                   128  thrpt   10     384574.257 ±    2804.130  ops/s
[info] ArrayOfEnumADTsReading.json4sNative                    128  thrpt   10     314105.580 ±    2377.774  ops/s
[info] ArrayOfEnumADTsReading.jsoniterScala                   128  thrpt   10    4304162.519 ±   14939.710  ops/s
[info] ArrayOfEnumADTsReading.playJson                        128  thrpt   10     394406.502 ±    1457.237  ops/s
[info] ArrayOfEnumADTsReading.playJsonJsoniter                128  thrpt   10     667298.395 ±    2485.742  ops/s
[info] ArrayOfEnumADTsReading.sprayJson                       128  thrpt   10     764851.351 ±    4757.110  ops/s
[info] ArrayOfEnumADTsReading.uPickle                         128  thrpt   10    1249665.477 ±    7025.275  ops/s
[info] ArrayOfEnumADTsReading.weePickle                       128  thrpt   10    1127465.787 ±    7466.347  ops/s
[info] ArrayOfEnumADTsReading.zioJson                         128  thrpt   10     563303.187 ±    2571.587  ops/s
[info] ArrayOfEnumADTsWriting.borer                           128  thrpt   10    2058803.540 ±   12165.516  ops/s
[info] ArrayOfEnumADTsWriting.circe                           128  thrpt   10    2011009.990 ±   12085.455  ops/s
[info] ArrayOfEnumADTsWriting.circeJsoniter                   128  thrpt   10    2667427.734 ±   13161.756  ops/s
[info] ArrayOfEnumADTsWriting.jacksonScala                    128  thrpt   10    2969747.276 ±   13888.207  ops/s
[info] ArrayOfEnumADTsWriting.json4sJackson                   128  thrpt   10     970464.315 ±    6924.823  ops/s
[info] ArrayOfEnumADTsWriting.json4sNative                    128  thrpt   10     388122.507 ±    1894.329  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScala                   128  thrpt   10   13694337.067 ±   54600.403  ops/s
[info] ArrayOfEnumADTsWriting.jsoniterScalaPrealloc           128  thrpt   10   15482028.886 ±   63118.277  ops/s
[info] ArrayOfEnumADTsWriting.playJson                        128  thrpt   10    2181855.381 ±   11481.359  ops/s
[info] ArrayOfEnumADTsWriting.playJsonJsoniter                128  thrpt   10    3063457.946 ±   22866.313  ops/s
[info] ArrayOfEnumADTsWriting.sprayJson                       128  thrpt   10    1822167.265 ±   11824.459  ops/s
[info] ArrayOfEnumADTsWriting.uPickle                         128  thrpt   10    2018705.050 ±   10173.067  ops/s
[info] ArrayOfEnumADTsWriting.weePickle                       128  thrpt   10    2164800.857 ±   22721.277  ops/s
[info] ArrayOfEnumADTsWriting.zioJson                         128  thrpt   10    2720774.650 ±   12951.514  ops/s

Steps to reproduce:

  1. Download binaries of async-profiler and unpack them into /opt/async-profiler
  2. git clone https://github.com/plokhotnyuk/jsoniter-scala
  3. cd jsoniter-scala
  4. sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -t 16 -wi 10 -i 10 ADT'

The issue also exists when tested on other JVMs (GraalVM Community, Oracle GraalVM) and with other garbage collectors (G1GC, ZGC).

Here are flame graphs of async-profiler reports for the ADTWriting.jacksonScala benchmark:

2.16.2

[flame-graph image]

2.17.0

[flame-graph image]

For the json4s and play-json libraries, performance can dynamically slow down by about 1000x:

[info] # Warmup Iteration   1: 498097.603 ops/s
[info] # Warmup Iteration   2: 1396602.529 ops/s
[info] # Warmup Iteration   3: 1390047.351 ops/s
[info] # Warmup Iteration   4: 1397460.036 ops/s
[info] # Warmup Iteration   5: 548693.538 ops/s
[info] # Warmup Iteration   6: 467155.766 ops/s
[info] # Warmup Iteration   7: 308437.551 ops/s
[info] # Warmup Iteration   8: 299814.921 ops/s
[info] # Warmup Iteration   9: 120140.861 ops/s
[info] # Warmup Iteration  10: 16015.216 ops/s
[info] Iteration   1: 6210.625 ops/s
[info] Iteration   2: 3199.072 ops/s
[info] Iteration   3: 9214.890 ops/s
[info] Iteration   4: 6565.308 ops/s
[info] Iteration   5: 1545.670 ops/s
[info] Iteration   6: 9597.542 ops/s
[info] Iteration   7: 6781.583 ops/s
[info] Iteration   8: 12569.237 ops/s
[info] Iteration   9: 5324.412 ops/s
[info] Iteration  10: 7944.598 ops/s

Could using a ThreadLocal for caching of object mappers be an acceptable workaround for this issue?

Is there any workaround where we can explicitly define the initial size of the underlying buffers that would be suitable for use with other libraries like play-json and weePickle?

plokhotnyuk pushed a commit to plokhotnyuk/jsoniter-scala that referenced this issue Mar 13, 2024
@pjfanning
Member

Thanks @plokhotnyuk

The switch away from ThreadLocal caching is the biggest change in 2.17. It can be switched back on via config.

Fyi @cowtowncoder

@plokhotnyuk
Author

plokhotnyuk commented Mar 13, 2024

It can be switched back on via config.

@pjfanning Is it .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?

@cowtowncoder
Member

Interesting. My tests with jackson-benchmarks didn't show a regression for rc1 (I think I did run tests) -- possibly since they do not collect output as byte[] (writeValueAsBytes()). This was with JDK 8 though, in case that matters.

plokhotnyuk pushed a commit to plokhotnyuk/jsoniter-scala that referenced this issue Mar 13, 2024
@pjfanning
Member

It can be switched back on via config.

@pjfanning Is it .recyclerPool(JsonRecyclerPools.threadLocalPool()) call for the factory builder?

@plokhotnyuk I'm on a day trip so only have my phone. That looks right though.

@cowtowncoder would it be fair to say that users should try to create as few ObjectMappers as possible and reuse them as much as possible? With the new default recycler, having a single ObjectMapper shared across threads might be the best option for high throughput.
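For reference, a minimal sketch (in Scala, assuming jackson-core 2.17.x and jackson-module-scala are on the classpath) of what that configuration could look like -- building the factory once with the pre-2.17 ThreadLocal-based recycler pool and sharing a single mapper across threads:

import com.fasterxml.jackson.core.JsonFactory
import com.fasterxml.jackson.core.util.JsonRecyclerPools
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Build the factory once, with the ThreadLocal-based recycler pool (pre-2.17 behavior).
val factory: JsonFactory = JsonFactory.builder()
  .recyclerPool(JsonRecyclerPools.threadLocalPool())
  .build()

// Share a single mapper instance (with the Scala module) across all threads.
val mapper: JsonMapper = JsonMapper.builder(factory)
  .addModule(DefaultScalaModule)
  .build()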

plokhotnyuk added a commit to plokhotnyuk/jsoniter-scala that referenced this issue Mar 13, 2024
@cowtowncoder
Member

cowtowncoder commented Mar 13, 2024

Yes, ObjectMappers should always be reused, regardless of other aspects: they (or rather, (de)serializer construction) are very heavy objects, and the overhead of throwing them away negates any benefits of other caching and recycling.

And for safety, reusing ObjectReaders and ObjectWriters (constructed via ObjectMapper) is recommended: their creation is very cheap as well (instances created by the same ObjectMapper still share (de)serializer caching and reuse, in a thread-safe manner).

Having said that, it sounds like here reuse is done as expected, so wasteful non-reuse would not be an issue.
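A short illustrative sketch of that reuse pattern (the Reading case class and helper names are made up for the example; assumes jackson-module-scala):

import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical payload type, for illustration only.
case class Reading(device: String, value: Double)

// One shared mapper; ObjectReader/ObjectWriter built from it are immutable,
// cheap to create, and still share the mapper's (de)serializer caches.
val mapper = JsonMapper.builder().addModule(DefaultScalaModule).build()
val reader = mapper.readerFor(classOf[Reading])
val writer = mapper.writerFor(classOf[Reading])

def parse(bytes: Array[Byte]): Reading = reader.readValue[Reading](bytes)
def serialize(r: Reading): Array[Byte] = writer.writeValueAsBytes(r)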

@cowtowncoder
Member

FWIW, the json4s and play-json case is the most worrying: it almost suggests that buffer recycling might be growing without bound? As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise a lack of recycling in itself should not cause increasingly bad performance.

@cowtowncoder
Member

And one more thing: I will be off for about a week so may not be able to follow up quickly here -- but I am very interested in learning what exactly might be going on to cause this problem, and how to resolve it.

@pjfanning
Member

FWIW, the json4s and play-json case is the most worrying: it almost suggests that buffer recycling might be growing without bound? As if something was able to "return" buffers to recyclers, but not to actually recycle such buffer instances. Otherwise a lack of recycling in itself should not cause increasingly bad performance.

I haven't looked at the play-json/json4s code, but one possibility is that they create ObjectMappers, use them once, and throw them away. They possibly don't pass a JsonFactory instance when creating the ObjectMappers. So every ObjectMapper could have its own JsonFactory, and every JsonFactory has its own BufferRecycler instance - and this ends up hardly seeing any reuse of arrays.

With the old ThreadLocal buffer-recycler the cache is not specific to a JsonFactory instance - so you can get away with not trying to reuse the specific buffer-recycler instance.
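A sketch of the suspected anti-pattern versus the intended reuse (hypothetical helper names, assuming jackson-module-scala):

import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Suspected anti-pattern: a fresh mapper per call implies a fresh JsonFactory, and
// with the 2.17 default each factory carries its own recycler pool, so released
// buffers land in a pool that is discarded before anything can be reused.
def serializeWasteful(value: AnyRef): Array[Byte] = {
  val throwawayMapper = JsonMapper.builder().addModule(DefaultScalaModule).build()
  throwawayMapper.writeValueAsBytes(value)
}

// Intended usage: one mapper (hence one JsonFactory and one recycler pool) shared
// across the whole application and all threads.
object SharedJson {
  val mapper: JsonMapper = JsonMapper.builder().addModule(DefaultScalaModule).build()
}

def serializeReusing(value: AnyRef): Array[Byte] =
  SharedJson.mapper.writeValueAsBytes(value)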

@cowtowncoder
Member

cowtowncoder commented Mar 13, 2024 via email

@pjfanning
Member

I fear that a lot of Jackson users just create lots of ObjectMappers and are unaware of JsonFactory. These users already do not get optimal performance because they do not reuse ObjectMappers. In Jackson 2.16, they got ThreadLocal buffer recycling that spanned all ObjectMappers. In Jackson 2.17, they will be getting new BufferRecycler instances for every ObjectMapper instance - and this will not be much better than switching off the BufferRecycler - especially with ObjectMappers that are used to parse or write just 1 small file.

@pjfanning
Member

@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools pool to be the sharedLockFreePool implementation, or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.

https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30

@plokhotnyuk
Author

The strange thing happening with json4s and play-json is that after 5-10 seconds of working under load, the allocation rate drops together with throughput, while GC time reaches 1.5 seconds for a 1-second iteration:

[info] # Benchmark: com.github.plokhotnyuk.jsoniter_scala.benchmark.ADTWriting.playJson
[info] # Run progress: 1.45% complete, ETA 04:17:51
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 3013960.320 ops/s
[info] # Warmup Iteration   2: 789488.790 ops/s
[info] # Warmup Iteration   3: 139809.193 ops/s
[info] # Warmup Iteration   4: 32370.404 ops/s
[info] # Warmup Iteration   5: 11278.130 ops/s
[info] Iteration   1: 3290.532 ops/s
[info]                  gc.alloc.rate:      15.960 MB/sec
[info]                  gc.alloc.rate.norm: 6066.932 B/op
[info]                  gc.count:           36.000 counts
[info]                  gc.time:            1541.000 ms
[info] Iteration   2: 2347.943 ops/s
[info]                  gc.alloc.rate:      13.090 MB/sec
[info]                  gc.alloc.rate.norm: 6566.901 B/op
[info]                  gc.count:           38.000 counts
[info]                  gc.time:            1644.000 ms
[info] Iteration   3: 4864.039 ops/s
[info]                  gc.alloc.rate:      27.263 MB/sec
[info]                  gc.alloc.rate.norm: 6985.689 B/op
[info]                  gc.count:           37.000 counts
[info]                  gc.time:            1568.000 ms
[info] Iteration   4: 5223.979 ops/s
[info]                  gc.alloc.rate:      28.205 MB/sec
[info]                  gc.alloc.rate.norm: 7294.109 B/op
[info]                  gc.count:           37.000 counts
[info]                  gc.time:            1607.000 ms
[info] Iteration   5: 3428.775 ops/s
[info]                  gc.alloc.rate:      15.688 MB/sec
[info]                  gc.alloc.rate.norm: 6146.888 B/op
[info]                  gc.count:           36.000 counts
[info]                  gc.time:            1581.000 ms

Here is how async-profiler's flame graph looks for that case:
[flame-graph image]

@cowtowncoder
Member

@cowtowncoder would it be possible to consider changing the default JsonRecyclerPools pool to be the sharedLockFreePool implementation, or possibly even to revert back to the threadLocal implementation? I think there will be too many users who never thought about reusing JsonFactory instances and/or ObjectMapper instances.

https://github.com/FasterXML/jackson-core/blob/5e109b2a52ec85b737f16071e31d925af8b07dbd/src/main/java/com/fasterxml/jackson/core/util/JsonRecyclerPools.java#L20-L30

While reverting to the previous default is possible, I would really like to understand the mechanics here.
Not reusing JsonFactory should not be catastrophic even if sub-optimal: it should be similar to just not having buffer recycling (pool instances are quite light-weight).
Conversely, not reusing ObjectMapper should be so inefficient that recycling via ThreadLocal could not make things fast no matter what; it should already have been woefully inefficient
(much more so than not reusing JsonFactory).

So I feel like I am missing some piece of the puzzle here; something more seems to be going on.

@pjfanning
Member

@cowtowncoder the ThreadLocal buffer recycler instance is shared. It doesn't matter if users reuse ObjectMappers or not - they get a shared buffer recycler. With the new Jackson 2.17 default, unless a user knows about creating JsonFactories and reusing those - and lots of users have no idea about this - the ObjectMapper gets its own JsonFactory and its own buffer recycler instance. That buffer recycler instance will not avoid the creation of new arrays. There is a good chance that with small docs few, if any, arrays will be reused.

I'm afraid I'm busy with setting up some new Apache projects, so I don't really have any time to debug this.

@pjfanning
Member

pjfanning commented Mar 21, 2024

@plokhotnyuk I notice that you changed the Json4s benchmark to try to use the ThreadLocal buffer recycler. Does this help to get the performance back to where it was before Jackson 2.17?

https://github.com/plokhotnyuk/jsoniter-scala/blob/58f2ef44fe6549bfeff60160d5094f2282d640ac/jsoniter-scala-benchmark/shared/src/main/scala/com/github/plokhotnyuk/jsoniter_scala/benchmark/Json4sFormats.scala#L166

@cowtowncoder
Member

cowtowncoder commented Mar 21, 2024

@pjfanning Yes, I fully understand the ThreadLocal-based pool is shared regardless of reuse of JsonFactory.

My point is that I am not convinced this reuse in itself would explain all of the observed difference, even for small content. It definitely would help in such a case.
Because... if ObjectMappers are not reused, performance can't be good -- even in the "least bad" case where the ObjectMapper is only used to read a JsonNode (mapper.readTree()), for smallish payloads throughput is at least 50% lower than with reuse.

So: yes, for the non-reuse case, the ThreadLocal pool will work better. But does that really explain the drastic difference? These are just simple byte[] and/or char[] instances, not heavy-weight objects or processing.

So I do not want to reflexively assume that this is the main or only problem here, revert, and call it good.
Maybe tests by @plokhotnyuk can shed more light on this.

I am not against changing the default pool for 2.x version(s) if this seems necessary, FWIW.

@cowtowncoder
Member

@plokhotnyuk I was wondering if you were able to test out using

.recyclerPool(JsonRecyclerPools.threadLocalPool())

to configure pool back to pre-2.17 one?

Also: I was not clear on whether ObjectMapper instances were reused or not (or, if using JsonFactory level, JsonFactory instances)?
I am seriously considering switching recycling pool default back to ThreadLocal-based one for 2.17.1:

FasterXML/jackson-core#1256

but would want to have confidence it would make positive difference here.

Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of performance degradation.

@cowtowncoder
Member

One further data point: Trino issue trinodb/trino#21356 does suggest that the issue would indeed be the retention of a huge number of buffers, and not (I think) so much that there is no JsonFactory reuse -- how else would it be possible for a pool to keep on growing, unless somehow reuse failed (so a new buffer is allocated) but release to the pool succeeded?
This is what bothers me most: the possibility of unbounded growth.

But how could this happen?

@plokhotnyuk
Author

@plokhotnyuk I was wondering if you were able to test out using

.recyclerPool(JsonRecyclerPools.threadLocalPool())

to configure pool back to pre-2.17 one?

Also: I was not clear on whether ObjectMapper instances were reused or not (or, if using JsonFactory level, JsonFactory instances)? I am seriously considering switching recycling pool default back to ThreadLocal-based one for 2.17.1:

FasterXML/jackson-core#1256

but would want to have confidence it would make positive difference here.

Unfortunately I haven't really been able to decipher flame graphs to help have an idea of exact details of performance degradation.

Currently all Jackson-based benchmarks from the jsoniter-scala-benchmark module have been migrated to use JsonRecyclerPools.threadLocalPool() and one instance of ObjectMapper. That fixes the speed of parsing small messages using multiple threads.

No more 1000x performance regressions when running parsing or serialization using 16 threads. Here is a comparison of benchmark results using JDK 21 before the upgrade to 2.17.0 and after, with the mitigation applied; see jacksonScala, json4sJackson, playJson, weePickle.

@cowtowncoder
Member

@plokhotnyuk Ok, so it seems fair to say that switching back to the ThreadLocal recycler restored performance to its pre-2.17 level. Good. Thank you for confirming.

@cowtowncoder
Member

@plokhotnyuk Ok, based on investigation into reproducing the issue, there is some suspicious behavior by JsonRecyclerPools.newLockFreePool() -- oddly, it seems most likely to occur very early in a test run, and in my tests the situation also stabilizes. But there definitely is something. So I think it is necessary to use another implementation.

My thinking now is to revert back to threadLocalPool() for 2.17.x, but for 2.18 I think we'd switch the default to JsonRecyclerPools.newConcurrentDequePool().

Would it be possible to test what use of JsonRecyclerPools.newConcurrentDequePool() would do with your benchmark? It would be a good sanity check, as I haven't seen anomalous behavior with it in my tests -- but my test's behavior itself is different from the benchmark here, so that's not sufficient proof.
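For that sanity check, the only change relative to the threadLocalPool() mitigation should be the pool passed to the factory builder -- a hedged sketch, assuming jackson-core 2.17.x and jackson-module-scala:

import com.fasterxml.jackson.core.JsonFactory
import com.fasterxml.jackson.core.util.JsonRecyclerPools
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Same wiring as the ThreadLocal mitigation, with only the pool implementation swapped.
val dequePoolFactory = JsonFactory.builder()
  .recyclerPool(JsonRecyclerPools.newConcurrentDequePool())
  .build()

val dequePoolMapper = JsonMapper.builder(dequePoolFactory)
  .addModule(DefaultScalaModule)
  .build()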

@plokhotnyuk
Author

plokhotnyuk commented Apr 19, 2024

@cowtowncoder I've prepared a branch with the pool type replacement for all JSON parsers that use jackson-core under the hood and will run the same benchmarks on the same environment this weekend: plokhotnyuk/jsoniter-scala@master...new-concurrent-deque-pool

@plokhotnyuk
Author

While the full tests on the desktop are still running, I tested JsonRecyclerPools.newConcurrentDequePool() for jackson-module-scala benchmarks on my laptop (Intel® Core™ i7-11800H CPU @ 2.3GHz (max 4.6GHz), RAM 64Gb DDR4-3200) using 16 threads and async-profiler:

sbt -java-home /usr/lib/jvm/jdk-21 jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof "async:dir=target/async-reports;interval=1000000;output=flamegraph;libPath=/opt/async-profiler/lib/libasyncProfiler.so" -jvmArgsAppend "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" --p size=128 -wi 5 -i 5 -t 16 jacksonScala'

For small (less than 100 bytes) messages up to 90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers.

I've highlighted them in the following flamegraph:

[flame-graph image]

@pjfanning
Member

pjfanning commented Apr 20, 2024

I ran the same ADTWriting.jacksonScala test with a few different Jackson versions and tried a few different pool implementations with Jackson 2.17.0.

  • 2.17.1-SNAPSHOT already has some changes that cause a small slowdown in ADTWriting (with ThreadLocal buffer recycler).
  • 2.18.0-SNAPSHOT is faster than 2.17.x and 2.16.x (when using ThreadLocal buffer recycler).
  • Among the different pool implementations, ThreadLocal is faster than SharedConcurrentDeque which is faster than NewConcurrentDeque.
  • I later ran with '-t 3' (3 threads). The pool order was the same, but ThreadLocal was much better than the 2 ConcurrentDeque implementations with multiple threads.

sbt jsoniter-scala-benchmarkJVM/clean 'jsoniter-scala-benchmarkJVM/jmh:run -prof gc .*ADTWriting.jacksonScala'

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2290252.179 ± 107995.529   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1222.806 ±     57.697  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        8.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000                   ms

2024-04-20 jsoniter-scala changed to use Jackson 2.16.2 (with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2225989.506 ± 160118.600   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1188.489 ±     85.712  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000                   ms

2024-04-20 jsoniter-scala changed to use Jackson 2.17.1-SNAPSHOT (with ThreadLocal buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  2099028.720 ± 156226.482   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1120.687 ±     83.382  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      560.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        8.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000                   ms

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with NewConcurrentDeque buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  1686594.004 ± 213058.648   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1093.464 ±    137.984  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      680.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        5.000                   ms

2024-04-20 Default jsoniter-scala (Jackson 2.17.0 with SharedConcurrentDeque buffer recycler)

[info] Benchmark                                    Mode  Cnt        Score        Error   Units
[info] ADTWriting.jacksonScala                     thrpt    5  1981627.720 ± 242523.449   ops/s
[info] ADTWriting.jacksonScala:gc.alloc.rate       thrpt    5     1103.358 ±    135.033  MB/sec
[info] ADTWriting.jacksonScala:gc.alloc.rate.norm  thrpt    5      584.000 ±      0.001    B/op
[info] ADTWriting.jacksonScala:gc.count            thrpt    5        7.000               counts
[info] ADTWriting.jacksonScala:gc.time             thrpt    5        7.000                   ms

@plokhotnyuk
Author

@cowtowncoder @pjfanning The JsonRecyclerPools.newConcurrentDequePool() pool is too expensive for small (less than 1Kb) messages, especially in a multithreaded environment.

Here are comparisons of benchmark results between the threadLocalPool() and newConcurrentDequePool() mitigations for the recycler-pool performance regression in jackson-core 2.17.0 on different JVMs, using 16 threads (move the "Ignoring deviations below 0%" slider to 0 and scroll down to the "Declined benchmarks" section):

GraalVM JDK 22
GraalVM JDK 21
GraalVM JDK 17
GraalVM Community JDK 23
GraalVM Community JDK 21
JDK 23
JDK 21
JDK 17

@cowtowncoder
Member

@pjfanning The differences in your runs seem much more modest than what @plokhotnyuk reports. I wonder if that is due to the lower thread count (less contention) or to the output size being relatively bigger?

@plokhotnyuk I am still struggling to reconcile your results with an understanding of what the specific reason for the drastically lower performance is. It seems to me there are 3 possible causes:

  1. Contention (locking/blocking) of threads when acquiring and releasing buffers. I could see this being possible with really high thread counts and small content (because time between acquire/release calls is very short)
  2. Some implementation flaw that prevents successful reuse. If so, we'd see more allocations with concurrentDequePool implementation compared to thread local. And numbers would be similar to non-reusing "pool" implementation
  3. If there is no reuse of underlying JsonFactory -- my understanding is this has been ruled out as possibility.

If it is (1) that explains the problem, it would suggest there is not much to do, and the default for the traditional thread case should remain ThreadLocalPool. And since Jackson should ideally not be in the business of auto-detecting the runtime platform (or users' thread model choices), that probably should be THE default pool, only to be overridden by developers/frameworks that know better.

But if there is any chance (2) could be the case it'd be good to figure out why reuse was not working.

@cowtowncoder
Member

@pjfanning I am a bit surprised by there being any difference between "shared" and "new" pools -- when a single JsonFactory is reused there should be no difference at all (there's just a single pool used by the factory).

@cowtowncoder
Member

@plokhotnyuk Unfortunately I can't see the expanded flamegraph: link to https://private-user-images.githubusercontent.com/890289/324165548-d40280ad-ed4d-4468-b4ab-c5f2de528fce.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQxMDY2NDYsIm5iZiI6MTcxNDEwNjM0NiwicGF0aCI6Ii84OTAyODkvMzI0MTY1NTQ4LWQ0MDI4MGFkLWVkNGQtNDQ2OC1iNGFiLWM1ZjJkZTUyOGZjZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNDI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDQyNlQwNDM5MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0yMWEzN2FmYjJlZjUzNzIwYjJiOWE0Mjk0ZTM5OTAyZmExZTU5ZTNiOGNmM2JkYWM5MDk3MTQ2M2M2YTZkZDdmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.4v64TVkhq9yp5Mgo2zqNc5hBCtcaa7ZDk_fPBTP16KA shows up empty.

But on "90% of CPU cycles are spent on acquiring, releasing and allocating byte buffers" (on 16 threads, with a 100-byte write (or read?)): I would assume that the "allocating" part should be negligible, since if not, it would mean that recycling did not actually work: with a reused JsonFactory, the pool size would eventually settle at the larger number of buffers used and no new allocations should occur.
So all the overhead should be due to the other 2 things (acquiring and releasing), due to contention. Is this what you see?

If not, I wonder how and why recycling itself did not work.

@cowtowncoder
Member

@plokhotnyuk Thank you for links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17).
There are many entries, differences -- so what should I be looking at? And what is the baseline against which changes are?

brharrington added a commit to brharrington/atlas that referenced this issue Apr 29, 2024
Switch back to older version as there is a known performance
regression in 2.17.0 ([Netflix#672]).

[Netflix#672]: FasterXML/jackson-module-scala#672
brharrington added a commit to Netflix/atlas that referenced this issue Apr 29, 2024
Switch back to older version as there is a known performance
regression in 2.17.0 ([#672]).

[#672]: FasterXML/jackson-module-scala#672
@plokhotnyuk
Author

plokhotnyuk commented Apr 30, 2024

@plokhotnyuk Thank you for links! Unfortunately I am not quite sure how to read them, like say https://jmh.morethan.io/?sources=https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages/jdk-17-t16.json,https://raw.githubusercontent.com/plokhotnyuk/jsoniter-scala/gh-pages-jackson-recycler/jdk-17-t16.json (JDK 17). There are many entries, differences -- so what should I be looking at? And what is the baseline against which changes are?

Both benchmark runs are for the latest jackson-core release.

The baseline uses the .recyclerPool(JsonRecyclerPools.threadLocalPool()) mitigation for all affected JSON parsers (jackson-module-scala, play-json, json4s, weePickle).

It is compared with benchmark results after replacing that mitigation with .recyclerPool(JsonRecyclerPools.newConcurrentDequePool()), as committed here.

To see up to a 20x slowdown, as in the screenshot below, please scroll down the page to the Declined Benchmarks section:

[screenshot of the Declined Benchmarks section]

@plokhotnyuk
Author

plokhotnyuk commented May 6, 2024

@cowtowncoder @pjfanning Thanks for your support!

I'm closing this issue as fixed in 2.17.1

@cowtowncoder
Member

Thank you for reporting the issue, @plokhotnyuk!

manolama pushed a commit to manolama/atlas that referenced this issue May 22, 2024
Switch back to older version as there is a known performance
regression in 2.17.0 ([Netflix#672]).

[Netflix#672]: FasterXML/jackson-module-scala#672