Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

block clone and bulk deletions with regards to files generated by Veeam. #16680

Open
ashleyw-gh opened this issue Oct 23, 2024 · 7 comments
Open
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ashleyw-gh
Copy link

ashleyw-gh commented Oct 23, 2024

Problem:

We are testing ZFS as backend storage to Veeam.
We are running Rocky9 on x64 on a Dell Poweredge server with 23x16TB spindles striping across 4x raid1 vdevs wih 192GB ram.
We notice that the system seems to deadlock/hang when deleting Veeam backup files where Veeam has block cloning enabled.
For example a deletion of the files below with a simple rm command hangs and deadlocks all subsequent transactions.

System information

# zfs version
zfs-2.2.6-1
zfs-kmod-2.2.6-1
# cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.4 (Blue Onyx)"
# uname -r
5.14.0-427.40.1.el9_4.x86_64
# zpool status
  pool: VeeamBackup
 state: ONLINE
config:

        NAME                                         STATE     READ WRITE CKSUM
        VeeamBackup                                  ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            sdc                                      ONLINE       0     0     0
            scsi-35000039ae873d3e5                   ONLINE       0     0     0
            scsi-35000039ae873c119                   ONLINE       0     0     0
            scsi-35000039ae873d3c9                   ONLINE       0     0     0
            scsi-35000039ae878d31d                   ONLINE       0     0     0
            sdh                                      ONLINE       0     0     0
          raidz1-1                                   ONLINE       0     0     0
            sdi                                      ONLINE       0     0     0
            scsi-STOSHIBA_MG06SCA800EY_51X0A13AF1GF  ONLINE       0     0     0
            sdk                                      ONLINE       0     0     0
            sdl                                      ONLINE       0     0     0
            sdm                                      ONLINE       0     0     0
            sdb                                      ONLINE       0     0     0
          raidz1-2                                   ONLINE       0     0     0
            sdn                                      ONLINE       0     0     0
            sdo                                      ONLINE       0     0     0
            sdp                                      ONLINE       0     0     0
            sdq                                      ONLINE       0     0     0
            sdr                                      ONLINE       0     0     0
            sds                                      ONLINE       0     0     0
          raidz1-3                                   ONLINE       0     0     0
            sdt                                      ONLINE       0     0     0
            sdu                                      ONLINE       0     0     0
            sdv                                      ONLINE       0     0     0
            sdw                                      ONLINE       0     0     0
        spares
          scsi-STOSHIBA_MG06SCA800EY_51X0A13GF1GF    AVAIL

view of subset of file system.

# ls -l
-rw-r--r--. 1 root root  13026721792 Oct  7 06:18 zvma-portal-vc-d.vm-633444D2024-10-07T061239_1C14.vbk
-rw-r--r--. 1 root root   1456865280 Oct  8 01:24 zvma-portal-vc-d.vm-633444D2024-10-08T010000_04A1.vib
-rw-r--r--. 1 root root   1229193216 Oct  9 01:23 zvma-portal-vc-d.vm-633444D2024-10-09T010006_905E.vib
-rw-r--r--. 1 root root   1257635840 Oct 10 01:21 zvma-portal-vc-d.vm-633444D2024-10-10T010023_3AE2.vib
-rw-r--r--. 1 root root   1246494720 Oct 11 01:20 zvma-portal-vc-d.vm-633444D2024-10-11T010000_1FB6.vib
-rw-r--r--. 1 root root  13095272448 Oct 14 08:34 zvma-portal-vc-d.vm-633444D2024-10-14T082823_73C3.vbk
-rw-r--r--. 1 root root   1525153792 Oct 15 01:26 zvma-portal-vc-d.vm-633444D2024-10-15T010009_AF3F.vib
-rw-r--r--. 1 root root   1396441088 Oct 16 01:10 zvma-portal-vc-d.vm-633444D2024-10-16T010004_A2B2.vib
-rw-r--r--. 1 root root   1441529856 Oct 16 20:27 zvma-portal-vc-d.vm-633444D2024-10-16T190202_71D5.vib
-rw-r--r--. 1 root root   1417674752 Oct 17 20:18 zvma-portal-vc-d.vm-633444D2024-10-17T190201_EA4C.vib
-rw-r--r--. 1 root root   1441529856 Oct 18 20:20 zvma-portal-vc-d.vm-633444D2024-10-18T190255_D9C5.vib
-rw-r--r--. 1 root root   1398931456 Oct 20 00:14 zvma-portal-vc-d.vm-633444D2024-10-19T190353_E0D3.vib
-rw-r--r--. 1 root root  12690915328 Oct 22 00:04 zvma-portal-vc-d.vm-633444D2024-10-22T000228_8E22.vbk
-rw-r--r--. 1 root root   1352531968 Oct 22 20:33 zvma-portal-vc-d.vm-633444D2024-10-22T190223_08ED.vib
-rw-r--r--. 1 root root       174721 Oct 23 11:54 zvma-portal-vc-t_01B3E.vbm
-rw-r--r--. 1 root root  14080671744 Oct  7 05:58 zvma-portal-vc-t.vm-632724D2024-10-07T055316_BCAA.vbk
-rw-r--r--. 1 root root   1624899584 Oct  8 01:24 zvma-portal-vc-t.vm-632724D2024-10-08T010000_E5BF.vib
-rw-r--r--. 1 root root   1263534080 Oct  9 01:23 zvma-portal-vc-t.vm-632724D2024-10-09T010006_5EA3.vib
-rw-r--r--. 1 root root   1289224192 Oct 10 01:21 zvma-portal-vc-t.vm-632724D2024-10-10T010023_8A53.vib
-rw-r--r--. 1 root root   1276116992 Oct 11 01:20 zvma-portal-vc-t.vm-632724D2024-10-11T010000_9BBC.vib
-rw-r--r--. 1 root root  14259847168 Oct 14 08:32 zvma-portal-vc-t.vm-632724D2024-10-14T082514_E772.vbk
-rw-r--r--. 1 root root   1694236672 Oct 15 01:26 zvma-portal-vc-t.vm-632724D2024-10-15T010009_78B7.vib
-rw-r--r--. 1 root root   1411252224 Oct 16 01:10 zvma-portal-vc-t.vm-632724D2024-10-16T010004_8C65.vib
-rw-r--r--. 1 root root   1479016448 Oct 16 20:27 zvma-portal-vc-t.vm-632724D2024-10-16T190202_D57D.vib
-rw-r--r--. 1 root root   1554382848 Oct 17 20:18 zvma-portal-vc-t.vm-632724D2024-10-17T190201_B46B.vib
-rw-r--r--. 1 root root   1504968704 Oct 18 20:25 zvma-portal-vc-t.vm-632724D2024-10-18T190255_E049.vib
-rw-r--r--. 1 root root   1446641664 Oct 20 00:17 zvma-portal-vc-t.vm-632724D2024-10-19T190353_D929.vib
-rw-r--r--. 1 root root  14511112192 Oct 22 00:08 zvma-portal-vc-t.vm-632724D2024-10-22T000528_2F84.vbk
-rw-r--r--. 1 root root   1348599808 Oct 22 20:35 zvma-portal-vc-t.vm-632724D2024-10-22T190223_E796.vib
# rm -f zvma*

In the /var/log/messages file and console we see repetitions of messages such as;

Oct 23 16:56:51 test-server kernel: INFO: task z_zrele:2129 blocked for more than 368 seconds.
Oct 23 16:56:51 test-server kernel:      Tainted: P           OE     -------  ---  5.14.0-427.40.1.el9_4.x86_64 #1
Oct 23 16:56:51 test-server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 16:56:51 test-server kernel: task:z_zrele         state:D stack:0     pid:2129  ppid:2      flags:0x00004000
Oct 23 16:56:51 test-server kernel: Call Trace:
Oct 23 16:56:51 test-server kernel: <TASK>
Oct 23 16:56:51 test-server kernel: __schedule+0x21b/0x550
Oct 23 16:56:51 test-server kernel: schedule+0x2d/0x70
Oct 23 16:56:51 test-server kernel: io_schedule+0x42/0x70
Oct 23 16:56:51 test-server kernel: cv_wait_common+0xaa/0x130 [spl]
Oct 23 16:56:51 test-server kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct 23 16:56:51 test-server kernel: txg_wait_open+0x91/0xe0 [zfs]
Oct 23 16:56:51 test-server kernel: dmu_free_long_range_impl+0x2c6/0x480 [zfs]
Oct 23 16:56:51 test-server kernel: dmu_free_long_range+0x72/0xb0 [zfs]
Oct 23 16:56:51 test-server kernel: zfs_rmnode+0x329/0x430 [zfs]
Oct 23 16:56:51 test-server kernel: zfs_inactive+0x8a/0x2a0 [zfs]
Oct 23 16:56:51 test-server kernel: ? unmap_mapping_range+0x7e/0x140
Oct 23 16:56:51 test-server kernel: zpl_evict_inode+0x3e/0x60 [zfs]
Oct 23 16:56:51 test-server kernel: evict+0xcc/0x1d0
Oct 23 16:56:51 test-server kernel: taskq_thread+0x28f/0x4d0 [spl]
Oct 23 16:56:51 test-server kernel: ? __pfx_default_wake_function+0x10/0x10
Oct 23 16:56:51 test-server kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Oct 23 16:56:51 test-server kernel: kthread+0xdd/0x100
Oct 23 16:56:51 test-server kernel: ? __pfx_kthread+0x10/0x10
Oct 23 16:56:51 test-server kernel: ret_from_fork+0x29/0x50
Oct 23 16:56:51 test-server kernel: </TASK>

Describe how to reproduce the problem

Deadlock can easily be reproduced by deletion of many large files at an OS level with high degree of cloned blocks in them.
When a deadlock occurs any subsequent activities from another shell session hang.
It appears that the transactions eventually complete after about 20 minutes or longer (depending on how large the files are and how many are in the batch).
I suspect similar deadlocking issues are causing various Veeam errors I've been seeing (due to the threading architecture within Veeam).
We have sync disabled for performance.
Is file deletion when block cloning is in the mix an extremely slow process and should this block all other activity on the zvol?
Is there anything to improve the reliability and throuput

and the zpool options for completeness.

# zpool list
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
VeeamBackup   160T  62.8T  97.2T        -         -    18%    39%  1.00x    ONLINE  -
# zpool get all
NAME         PROPERTY                       VALUE                          SOURCE
VeeamBackup  size                           160T                           -
VeeamBackup  capacity                       39%                            -
VeeamBackup  altroot                        -                              default
VeeamBackup  health                         ONLINE                         -
VeeamBackup  guid                           2745054902736870025            -
VeeamBackup  version                        -                              default
VeeamBackup  bootfs                         -                              default
VeeamBackup  delegation                     on                             default
VeeamBackup  autoreplace                    off                            default
VeeamBackup  cachefile                      -                              default
VeeamBackup  failmode                       wait                           default
VeeamBackup  listsnapshots                  off                            default
VeeamBackup  autoexpand                     off                            default
VeeamBackup  dedupratio                     1.00x                          -
VeeamBackup  free                           97.2T                          -
VeeamBackup  allocated                      62.8T                          -
VeeamBackup  readonly                       off                            -
VeeamBackup  ashift                         0                              default
VeeamBackup  comment                        -                              default
VeeamBackup  expandsize                     -                              -
VeeamBackup  freeing                        0                              -
VeeamBackup  fragmentation                  18%                            -
VeeamBackup  leaked                         0                              -
VeeamBackup  multihost                      off                            default
VeeamBackup  checkpoint                     -                              -
VeeamBackup  load_guid                      16119869783397281463           -
VeeamBackup  autotrim                       off                            default
VeeamBackup  compatibility                  off                            default
VeeamBackup  bcloneused                     18.9T                          -
VeeamBackup  bclonesaved                    32.2T                          -
VeeamBackup  bcloneratio                    2.70x                          -
VeeamBackup  feature@async_destroy          enabled                        local
VeeamBackup  feature@empty_bpobj            enabled                        local
VeeamBackup  feature@lz4_compress           active                         local
VeeamBackup  feature@multi_vdev_crash_dump  enabled                        local
VeeamBackup  feature@spacemap_histogram     active                         local
VeeamBackup  feature@enabled_txg            active                         local
VeeamBackup  feature@hole_birth             active                         local
VeeamBackup  feature@extensible_dataset     active                         local
VeeamBackup  feature@embedded_data          active                         local
VeeamBackup  feature@bookmarks              enabled                        local
VeeamBackup  feature@filesystem_limits      enabled                        local
VeeamBackup  feature@large_blocks           enabled                        local
VeeamBackup  feature@large_dnode            enabled                        local
VeeamBackup  feature@sha512                 enabled                        local
VeeamBackup  feature@skein                  enabled                        local
VeeamBackup  feature@edonr                  enabled                        local
VeeamBackup  feature@userobj_accounting     active                         local
VeeamBackup  feature@encryption             enabled                        local
VeeamBackup  feature@project_quota          active                         local
VeeamBackup  feature@device_removal         enabled                        local
VeeamBackup  feature@obsolete_counts        enabled                        local
VeeamBackup  feature@zpool_checkpoint       enabled                        local
VeeamBackup  feature@spacemap_v2            active                         local
VeeamBackup  feature@allocation_classes     enabled                        local
VeeamBackup  feature@resilver_defer         enabled                        local
VeeamBackup  feature@bookmark_v2            enabled                        local
VeeamBackup  feature@redaction_bookmarks    enabled                        local
VeeamBackup  feature@redacted_datasets      enabled                        local
VeeamBackup  feature@bookmark_written       enabled                        local
VeeamBackup  feature@log_spacemap           active                         local
VeeamBackup  feature@livelist               enabled                        local
VeeamBackup  feature@device_rebuild         enabled                        local
VeeamBackup  feature@zstd_compress          enabled                        local
VeeamBackup  feature@draid                  enabled                        local
VeeamBackup  feature@zilsaxattr             enabled                        local
VeeamBackup  feature@head_errlog            active                         local
VeeamBackup  feature@blake3                 enabled                        local
VeeamBackup  feature@block_cloning          active                         local
VeeamBackup  feature@vdev_zaps_v2           active                         local

@ashleyw-gh ashleyw-gh added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 23, 2024
@ashleyw-gh ashleyw-gh changed the title block clone and bulk deletions with regards to files genrated by Veeam. block clone and bulk deletions with regards to files generated by Veeam. Oct 23, 2024
@amotin
Copy link
Member

amotin commented Oct 23, 2024

Just on terminology, deadlock is called dead because it never recovers. You are saying your system recovers after ~20 minutes, so it is not a deadlock. The question though is what the system doing all that time.

The backtrace you've quoted shows waiting for a new transaction open. It likely means that current one is already full and waiting to be synced. The question may be what the sync thread is doing. Please take a look on what your system is doing (kernel/zfs processes, disks, etc). The sync threads and free requests it produces should be visible in top. I slightly suspected that RAIDZ pool of HDDs might be slow on reading the BRT tables if they somehow evicted from ARC, and was thinking to propose trying addition of special vdev, but I can not reproduce it making a noticeable problem on my tests, so it might be not the cause.

Waiting for the new transaction we see likely stop all the writes to the pool, since they all should stuck waiting for it. In addition to that I see that the wait you quoted seems to be going from inode reclaim of the kernel, which makes me wonder if it can block even more of the system. This code is OS-specific, and I am not very good in Linux VFS layer to say what it means. I wonder if your system has many other open files besides the ones you are deleting here, that might overflow the limit or memory and trigger the eviction and something/everything else to wait for recycling.

I've tried to reproduce this on FreeBSD with latest ZFS 2.3 (just what I have on my test system) and so far can't, even turning it to the max. What I do is creating similar HDD RAIDZ pool, writing there 6 10GB files, clones of each-other, and also 10 regular unique 1GB files and then delete them after exporting/importing pool to wipe the ARC caches. Even after I reduced recordsize to insane 4KB to blow out BRT tables, I still see the deletion process completes within may be 10 seconds. Profiler shows some lock contention related to cloning, which I might to look at, but nothing even close to the minutes you are talking about. So it could help to see more of what your system is doing during the process.

@amotin
Copy link
Member

amotin commented Oct 31, 2024

While running some of my own tests with equivalent of ~30TB of cloned data I was able to reproduce some slowdowns, especially when BRT table is not in RAM. So would like to quote here some updates.

While deleting cloned files, ZFS have to update per-block reference counters in BRT table. Roughly that table takes about 72+ bytes per block in RAM (plus overheads) and twice of that on disk. From zpool get all output above I see that you have ~19TB of cloned blocks. If your blocks have default size of 128KB, it means ~148 million blocks. It means at least 10GB of BRT table in RAM or twice of that on disk (compression can make is smaller, but ZAP overheads can make it larger). Due to random allocation nature of ZAP massive deletion will affect random subset of blocks from those 10GB. I am not ready to say what subset will it be, but in worst case you may expect ~20GB of write traffic to handle some massive delete. You would definitely see that in disk stats. But writes are only a part of the problem. If those 10GB are not in ARC, ZFS will need to read it, that might take time, especially since it will be quite a random read, despite ZFS should do massive parallel prefetch. And those reads you should see in disk stats also, possibly switching between reads and writes there and back at intervals.

To reduce/avoid those reads you might want to give ARC more RAM, which was improved in ZFS 2.3. May be some ARC tuning could help to keep the BRTs in ARC. To speedup the BRT reads you may use some good SSD(s) and special vdev, just consider that it will need to handle all the BRT writes, not only the reads. ZFS 2.3 got a command to manually preload the DDT tables (dedup has the similar problems, just several times worse), that should be more efficient than doing it on demand, but it still waits to be implemented for BRT tables. Also I create a couple of PRs to reduce per-block overheads, that might slightly help BRT memory usage with its 4KB blocks: #16684 and #16694 .

@ashleyw-gh
Copy link
Author

thanks Alex, great work. We've changed our infrastructure a little now so we are now running on tin with 192GB RAM on Rocky9 (previously 64GB and running as a VM) This will certainly help to improve the overall efficiency. I've also switched to 2.3rc2 to help eliminate potential issues.
The root issue we are trying to chase is the reliability of synthetic full behavior of Veeam backups - and while researching this issue I noticed that mass deletions of large files on OpenZFS (involving cloned blocks) were resulting in longer than expected stalls on the storage device (which I thought might be related).

@amotin
Copy link
Member

amotin commented Oct 31, 2024

@ashleyw-gh Here #16708 (comment) @robn is looking on the problem from a different side.

@amotin
Copy link
Member

amotin commented Nov 12, 2024

#16740 should improve CPU-bound side of block cloning performance.

@amotin
Copy link
Member

amotin commented Nov 19, 2024

#16773 should do even more for block cloning performance.

@amotin
Copy link
Member

amotin commented Nov 27, 2024

#16814 should improve performance of large block cloning requests on HDD pools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

2 participants