Database backup-fetch failing with GCS storage #733

lgastako · 2017-02-10T01:09:07Z

Hi,

I have set up a deis cluster with a default values.yml and everything works fine. When I configure global/storage to be gcs (and fill in the gcs section with key_json and the three bucket names) I get

2017-02-09T22:55:00.099230147Z wal_e.worker.upload INFO     MSG: begin archiving a file
2017-02-09T22:55:00.099264214Z         DETAIL: Uploading "pg_xlog/000000010000000000000001" to "gs://spc-deis-database/wal_005/000000010000000000000001.lzo".
2017-02-09T22:55:00.099287610Z         STRUCTURED: time=2017-02-09T22:55:00.098818-00 pid=105 action=push-wal key=gs://spc-deis-database/wal_005/000000010000000000000001.lzo prefix= seg=000000010000000000000001 state=begin
2017-02-09T22:55:00.209521074Z oauth2client.transport INFO     Attempting refresh to obtain initial access_token
2017-02-09T22:55:00.249670161Z oauth2client.client INFO     Refreshing access_token
2017-02-09T22:55:00.695326891Z .wal_e.worker.upload INFO     MSG: completed archiving to a file
2017-02-09T22:55:00.695366530Z         DETAIL: Archiving to "gs://spc-deis-database/wal_005/000000010000000000000001.lzo" complete at 5375.99KiB/s.
2017-02-09T22:55:00.695374784Z         STRUCTURED: time=2017-02-09T22:55:00.694966-00 pid=105 action=push-wal key=gs://spc-deis-database/wal_005/000000010000000000000001.lzo prefix= rate=5375.99 seg=000000010000000000000001 state=complete
2017-02-09T22:55:01.654531754Z  done
2017-02-09T22:55:01.654579668Z server stopped
2017-02-09T22:55:01.975884613Z wal_e.main   INFO     MSG: starting WAL-E
2017-02-09T22:55:01.975933927Z         DETAIL: The subcommand is "backup-fetch".
2017-02-09T22:55:01.975941906Z         STRUCTURED: time=2017-02-09T22:55:01.975502-00 pid=112
2017-02-09T22:55:02.062573765Z oauth2client.transport INFO     Attempting refresh to obtain initial access_token
2017-02-09T22:55:02.107529520Z oauth2client.client INFO     Refreshing access_token
2017-02-09T22:55:04.857593378Z wal_e.worker.gs.gs_worker INFO     MSG: beginning partition download
2017-02-09T22:55:04.857642536Z         DETAIL: The partition being downloaded is part_00000000.tar.lzo.
2017-02-09T22:55:04.857650636Z         HINT: The absolute GCS object is basebackups_005/base_000000040000000000000040_00000040/tar_partitions/part_00000000.tar.lzo.
2017-02-09T22:55:04.857657121Z         STRUCTURED: time=2017-02-09T22:55:04.856135-00 pid=112
2017-02-09T22:55:04.857663708Z oauth2client.transport INFO     Attempting refresh to obtain initial access_token
2017-02-09T22:55:04.895442331Z oauth2client.client INFO     Refreshing access_token
2017-02-09T22:55:05.175146189Z wal_e.retries WARNING  MSG: retrying after encountering exception
2017-02-09T22:55:05.175192036Z         DETAIL: Exception information dump:
2017-02-09T22:55:05.175199382Z         Traceback (most recent call last):
2017-02-09T22:55:05.175205036Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/retries.py", line 62, in shim
 return f(*args, **kwargs)
2017-02-09T22:55:05.175215212Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/worker/gs/gs_worker.py", line 79, in fetch_partition
2017-02-09T22:55:05.175219762Z             with get_download_pipeline(PIPE, PIPE, self.decrypt) as pl:
2017-02-09T22:55:05.175224619Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/pipeline.py", line 92, in __enter__
2017-02-09T22:55:05.175245736Z             self.stdin = pipebuf.NonBlockBufferedWriter(stdin)
2017-02-09T22:55:05.175251562Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 225, in __init__
2017-02-09T22:55:05.175256967Z             _setup_fd(self._fd)
2017-02-09T22:55:05.175261263Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 62, in _setup_fd
2017-02-09T22:55:05.175265647Z             set_buf_size(fd)
2017-02-09T22:55:05.175269601Z           File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 53, in set_buf_size
2017-02-09T22:55:05.175274496Z             fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, OS_PIPE_SZ)
2017-02-09T22:55:05.175279181Z         PermissionError: [Errno 1] Operation not permitted
2017-02-09T22:55:05.175284987Z
2017-02-09T22:55:05.175298376Z         HINT: A better error message should be written to handle this exception.  Please report this output and, if possible, the situation under which it arises.
2017-02-09T22:55:05.175303122Z         STRUCTURED: time=2017-02-09T22:55:05.174619-00 pid=112
2017-02-09T22:55:05.276069255Z wal_e.worker.gs.gs_worker INFO     MSG: beginning partition download
2017-02-09T22:55:05.276105566Z         DETAIL: The partition being downloaded is part_00000000.tar.lzo.
2017-02-09T22:55:05.276112358Z         HINT: The absolute GCS object is basebackups_005/base_000000040000000000000040_00000040/tar_partitions/part_00000000.tar.lzo.
2017-02-09T22:55:05.276118373Z         STRUCTURED: time=2017-02-09T22:55:05.275685-00 pid=112
2017-02-09T22:55:05.373191229Z wal_e.retries WARNING  MSG: retrying after encountering exception

...with the same error repeated ad infinitum...

It appears that the service account has access to the bucket as I see data being written to the bucket (and I see the successful write in the log above) but the backup-fetch process appears to be failing.

I also tried setting WALE_GS_PREFIX and GOOGLE_APPLICATION_CREDENTIALS and running strace -f /usr/bin/python3 /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/data LATEST in kubectl exec and the relevant bit seems to be:

[pid  7560] close(19)                   = 0
[pid  7560] read(20, strace: Process 7583 attached
 <unfinished ...>
[pid  7583] set_robust_list(0x7f70153359e0, 24) = 0
[pid  7583] close(17)                   = 0
[pid  7583] close(18)                   = 0
[pid  7583] close(20)                   = 0
[pid  7583] dup2(8, 0)                  = 0
[pid  7583] dup2(19, 1)                 = 1
[pid  7583] close(8)                    = 0
[pid  7583] close(19)                   = 0
[pid  7583] rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER, 0x7f7014f0d3e0}, {SIG_IGN, [], SA_RESTORER, 0x7f7014f0d3e0}, 8) = 0
[pid  7583] rt_sigaction(SIGXFSZ, {SIG_DFL, [], SA_RESTORER, 0x7f7014f0d3e0}, {SIG_IGN, [], SA_RESTORER, 0x7f7014f0d3e0}, 8) = 0
[pid  7583] open("/proc/self/fd", O_RDONLY|O_CLOEXEC) = 8
[pid  7583] getdents64(8, /* 11 entries */, 280) = 264
[pid  7583] close(3)                    = 0
[pid  7583] close(4)                    = 0
[pid  7583] close(5)                    = 0
[pid  7583] close(6)                    = 0
[pid  7583] close(7)                    = 0
[pid  7583] getdents64(8, /* 9 entries */, 280) = 216
[pid  7583] close(9)                    = 0
[pid  7583] close(10)                   = 0
[pid  7583] close(11)                   = 0
[pid  7583] close(12)                   = 0
[pid  7583] close(13)                   = 0
[pid  7583] close(14)                   = 0
[pid  7583] close(15)                   = 0
[pid  7583] close(16)                   = 0
[pid  7583] getdents64(8, /* 0 entries */, 280) = 0
[pid  7583] close(8)                    = 0
[pid  7583] execve("/usr/lib/postgresql/9.4/bin/lzop", ["lzop", "-d", "-c", "-"], [/* 139 vars */]) = -1 ENOENT (No such file or directory)
[pid  7583] execve("/usr/local/sbin/lzop", ["lzop", "-d", "-c", "-"], [/* 139 vars */]) = -1 ENOENT (No such file or directory)
[pid  7583] execve("/usr/local/bin/lzop", ["lzop", "-d", "-c", "-"], [/* 139 vars */]) = -1 ENOENT (No such file or directory)
[pid  7583] execve("/usr/sbin/lzop", ["lzop", "-d", "-c", "-"], [/* 139 vars */]) = -1 ENOENT (No such file or directory)
[pid  7583] execve("/usr/bin/lzop", ["lzop", "-d", "-c", "-"], [/* 139 vars */] <unfinished ...>
[pid  7560] <... read resumed> "", 50000) = 0
[pid  7560] close(20)                   = 0
[pid  7560] fcntl(17, F_GETFL)          = 0x1 (flags O_WRONLY)
[pid  7560] fcntl(17, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid  7560] fcntl(17, F_SETPIPE_SZ, 1048576) = -1 EPERM (Operation not permitted)
[pid  7560] stat("/usr/local/lib/python3.5/dist-packages/wal_e/retries.py", {st_mode=S_IFREG|0644, st_size=3769, ...}) = 0
[pid  7560] stat("/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", {st_mode=S_IFREG|0644, st_size=8804, ...}) = 0
[pid  7560] stat("/usr/local/lib/python3.5/dist-packages/wal_e/worker/gs/gs_worker.py", {st_mode=S_IFREG|0644, st_size=3881, ...}) = 0
[pid  7560] stat("/usr/local/lib/python3.5/dist-packages/wal_e/pipeline.py", {st_mode=S_IFREG|0644, st_size=7729, ...}) = 0
[pid  7560] clock_gettime(CLOCK_REALTIME, {1486688614, 850810903}) = 0
[pid  7560] clock_gettime(CLOCK_REALTIME, {1486688614, 851316969}) = 0
[pid  7560] write(2, "wal_e.retries WARNING  MSG: retr"..., 1335wal_e.retries WARNING  MSG: retrying after encountering exception
        DETAIL: Exception information dump:
        Traceback (most recent call last):
          File "/usr/local/lib/python3.5/dist-packages/wal_e/retries.py", line 62, in shim
            return f(*args, **kwargs)
          File "/usr/local/lib/python3.5/dist-packages/wal_e/worker/gs/gs_worker.py", line 79, in fetch_partition
            with get_download_pipeline(PIPE, PIPE, self.decrypt) as pl:
          File "/usr/local/lib/python3.5/dist-packages/wal_e/pipeline.py", line 92, in __enter__
            self.stdin = pipebuf.NonBlockBufferedWriter(stdin)
          File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 225, in __init__
            _setup_fd(self._fd)
          File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 62, in _setup_fd
            set_buf_size(fd)
          File "/usr/local/lib/python3.5/dist-packages/wal_e/pipebuf.py", line 53, in set_buf_size
            fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, OS_PIPE_SZ)
        PermissionError: [Errno 1] Operation not permitted

        HINT: A better error message should be written to handle this exception.  Please report this output and, if possible, the situation under which it arises.
        STRUCTURED: time=2017-02-10T01:03:34.850810-00 pid=7560) = 1335
[pid  7560] write(2, "\n", 1
)           = 1
[pid  7560] wait4(7583, 0x7ffd56de25f4, WNOHANG, NULL) = 0

It looks as if perhaps file descriptor 17 was closed and then and attempt is made to F_SETPIPE_SZ which causes the problem. It looks like this is just the pipe to lzop and in fact I see ~5000 lzop processes running in the pod where I've been experimenting with this:

root@deis-database-540367895-x6r6d:/# ps auxw  | wc -l
5059

they all look the same, FWIW:

root      7932  0.1  0.0   6768  1592 ?        S    01:07   0:00 lzop -d -c -
root      7933  0.1  0.0   6768  1592 ?        S    01:07   0:00 lzop -d -c -
root      7942  0.1  0.0   6768  1648 ?        S    01:07   0:00 lzop -d -c -
root      7943  0.1  0.0   6768  1568 ?        S    01:07   0:00 lzop -d -c -
root      7944  0.2  0.0   6768  1656 ?        S    01:07   0:00 lzop -d -c -
root      7945  0.1  0.0   6768  1592 ?        S    01:07   0:00 lzop -d -c -
root      7946  0.1  0.0   6768  1592 ?        S    01:07   0:00 lzop -d -c -
root      7947  0.2  0.0   6768  1600 ?        S    01:07   0:00 lzop -d -c -
root      7948  0.2  0.0   6768  1592 ?        S    01:07   0:00 lzop -d -c -

Any pointers would be greatly appreciated.

The text was updated successfully, but these errors were encountered:

bacongobbler · 2017-02-10T01:15:32Z

Just to confirm which version of Workflow are you running? I know there was some GCS-specific fixes in wal-e v1.0.1 which landed in v2.10.0 IIRC.

bacongobbler · 2017-02-10T01:17:15Z

This issue looks related to wal-e/wal-e#270. Can you try wal-e/wal-e#270 (comment) and see if that resolves the issue for you?

lgastako · 2017-02-10T01:24:24Z

Re. the version of Workflow I'm on, I'm not sure how to tell -- I installed it via the suggested path using helm which I assume installs the latest?

When I tried setting pipe-user-pages-soft ala wal-e/wal-e#270 (comment) I get this:

root@deis-database-540367895-x6r6d:/# echo 0 > /proc/sys/fs/pipe-user-pages-soft
bash: /proc/sys/fs/pipe-user-pages-soft: Read-only file system

I'm looking around for a solution to that now.

lgastako · 2017-02-10T01:25:33Z

wal-e appears to be v1.0.1 though:

root@deis-database-540367895-x6r6d:/# wal-e version
1.0.1

bacongobbler · 2017-02-10T01:32:19Z

The issue I linked to seems to remain as an open issue so my assumption of it being fixed in v1.0.1 is incorrect.

I'm going to close this as a duplicate of deis/postgres#154 so we can track efforts in one ticket. Thank you for the report though!

lgastako · 2017-02-10T01:35:00Z

Thanks.

bacongobbler closed this as completed Feb 10, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database backup-fetch failing with GCS storage #733

Database backup-fetch failing with GCS storage #733

lgastako commented Feb 10, 2017

bacongobbler commented Feb 10, 2017 •

edited

Loading

bacongobbler commented Feb 10, 2017

lgastako commented Feb 10, 2017

lgastako commented Feb 10, 2017

bacongobbler commented Feb 10, 2017

lgastako commented Feb 10, 2017

Database backup-fetch failing with GCS storage #733

Database backup-fetch failing with GCS storage #733

Comments

lgastako commented Feb 10, 2017

bacongobbler commented Feb 10, 2017 • edited Loading

bacongobbler commented Feb 10, 2017

lgastako commented Feb 10, 2017

lgastako commented Feb 10, 2017

bacongobbler commented Feb 10, 2017

lgastako commented Feb 10, 2017

bacongobbler commented Feb 10, 2017 •

edited

Loading