
[BUG] Logs unable to reach OpenSearch #5275

Closed
zandernelson opened this issue Nov 16, 2022 · 5 comments
Labels
bug Something isn't working distributed framework

Comments

@zandernelson

Bug Report

Describe the bug
We have Fluent Bit sidecars whose logs are not reaching OpenSearch. The Fluent Bit container logs show no errors, so we cannot tell what is failing.

Fluent Bit sidecar log output:

Fluent Bit v1.9.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/11/16 00:34:42] [ info] Configuration:
[2022/11/16 00:34:42] [ info]  flush time     | 1.000000 seconds
[2022/11/16 00:34:42] [ info]  grace          | 5 seconds
[2022/11/16 00:34:42] [ info]  daemon         | 0
[2022/11/16 00:34:42] [ info] ___________
[2022/11/16 00:34:42] [ info]  inputs:
[2022/11/16 00:34:42] [ info]      tail
[2022/11/16 00:34:42] [ info]      tail
[2022/11/16 00:34:42] [ info] ___________
[2022/11/16 00:34:42] [ info]  filters:
[2022/11/16 00:34:42] [ info]      modify.0
[2022/11/16 00:34:42] [ info]      modify.1
[2022/11/16 00:34:42] [ info]      modify.2
[2022/11/16 00:34:42] [ info] ___________
[2022/11/16 00:34:42] [ info]  outputs:
[2022/11/16 00:34:42] [ info]      es.0
[2022/11/16 00:34:42] [ info]      splunk.1
[2022/11/16 00:34:42] [ info] ___________
[2022/11/16 00:34:42] [ info]  collectors:
[2022/11/16 00:34:42] [ info] [fluent bit] version=1.9.9, commit=, pid=89
[2022/11/16 00:34:42] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2022/11/16 00:34:42] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/11/16 00:34:42] [ info] [cmetrics] version=0.3.7
[2022/11/16 00:34:42] [debug] [tail:tail.0] created event channels: read=21 write=22
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] inotify watch fd=27
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] scanning path /var/log/iserver/ServerAudit*.log
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] cannot read info from: /var/log/iserver/ServerAudit*.log
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/iserver/ServerAudit*.log'
[2022/11/16 00:34:42] [debug] [tail:tail.1] created event channels: read=28 write=29
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] flb_tail_fs_inotify_init() initializing inotify tail input
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] inotify watch fd=34
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] scanning path /var/log/iserver/DSSErrors*.log
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] cannot read info from: /var/log/iserver/DSSErrors*.log
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] 0 new files found on path '/var/log/iserver/DSSErrors*.log'
[2022/11/16 00:34:42] [debug] [filter:modify:modify.0] Initialized modify filter with 0 conditions and 3 rules
[2022/11/16 00:34:42] [debug] [filter:modify:modify.1] Initialized modify filter with 0 conditions and 1 rules
[2022/11/16 00:34:42] [debug] [filter:modify:modify.2] Initialized modify filter with 0 conditions and 1 rules
[2022/11/16 00:34:42] [debug] [es:es.0] created event channels: read=35 write=36
[2022/11/16 00:34:42] [debug] [output:es:es.0] host=vpc-elasticsearchcl-redacted.es.amazonaws.com port=443 uri=/_bulk index=iserver-logs type=doc
[2022/11/16 00:34:42] [debug] [splunk:splunk.1] created event channels: read=47 write=48
[2022/11/16 00:34:42] [ info] [output:es:es.0] worker #0 started
[2022/11/16 00:34:42] [ info] [output:splunk:splunk.1] worker #1 started
[2022/11/16 00:34:42] [ info] [output:splunk:splunk.1] worker #0 started
[2022/11/16 00:34:42] [ info] [output:es:es.0] worker #1 started
[2022/11/16 00:34:42] [debug] [router] match rule tail.0:es.0
[2022/11/16 00:34:42] [debug] [router] match rule tail.0:splunk.1
[2022/11/16 00:34:42] [debug] [router] match rule tail.1:es.0
[2022/11/16 00:34:42] [ info] [sp] stream processor started
[2022/11/16 00:34:42] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2022/11/16 00:34:42] [debug] [input:tail:tail.1] [static files] processed 0b, done
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] scanning path /var/log/iserver/ServerAudit*.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] inode=5342925928760031785 with offset=0 appended as /var/log/iserver/ServerAudit-2022-11-16.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] scan_glob add(): /var/log/iserver/ServerAudit-2022-11-16.log, inode 5342925928760031785
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] 1 new files found on path '/var/log/iserver/ServerAudit*.log'
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] scanning path /var/log/iserver/DSSErrors*.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] inode=10747040000057398530 with offset=0 appended as /var/log/iserver/DSSErrors.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] scan_glob add(): /var/log/iserver/DSSErrors.log, inode -7699704073652153086
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] 1 new files found on path '/var/log/iserver/DSSErrors*.log'
[2022/11/16 00:35:42] [debug] [input chunk] update output instances with new chunk size diff=10262
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] [static files] processed 6.8K
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] inode=5342925928760031785 file=/var/log/iserver/ServerAudit-2022-11-16.log promote to TAIL_EVENT
[2022/11/16 00:35:42] [ info] [input:tail:tail.0] inotify_fs_add(): inode=5342925928760031785 watch_fd=1 name=/var/log/iserver/ServerAudit-2022-11-16.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2022/11/16 00:35:42] [debug] [input chunk] update output instances with new chunk size diff=53586
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] [static files] processed 31.8K
[2022/11/16 00:35:42] [debug] [input chunk] update output instances with new chunk size diff=53269
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] [static files] processed 31.9K
[2022/11/16 00:35:42] [debug] [input chunk] update output instances with new chunk size diff=3941
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] [static files] processed 2.5K
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] inode=10747040000057398530 file=/var/log/iserver/DSSErrors.log promote to TAIL_EVENT
[2022/11/16 00:35:42] [ info] [input:tail:tail.1] inotify_fs_add(): inode=10747040000057398530 watch_fd=1 name=/var/log/iserver/DSSErrors.log
[2022/11/16 00:35:42] [debug] [input:tail:tail.1] [static files] processed 0b, done
[2022/11/16 00:35:43] [debug] [task] created task=0x7fddc8e42d20 id=0 OK
[2022/11/16 00:35:43] [debug] [output:es:es.0] task_id=0 assigned to thread #0
[2022/11/16 00:35:43] [debug] [output:splunk:splunk.1] task_id=0 assigned to thread #0
[2022/11/16 00:35:43] [debug] [task] created task=0x7fddc8e42d90 id=1 OK
[2022/11/16 00:35:43] [debug] [output:es:es.0] task_id=1 assigned to thread #1
[2022/11/16 00:35:43] [debug] [http_client] not using http_proxy for header
[2022/11/16 00:35:43] [debug] [http_client] not using http_proxy for header
[2022/11/16 00:35:43] [debug] [upstream] KA connection #86 to ip-10-0-0-0.redacted.compute.internal:8088 is now available
[2022/11/16 00:35:43] [debug] [out flush] cb_destroy coro_id=0
[2022/11/16 00:35:43] [debug] [http_client] not using http_proxy for header
[2022/11/16 00:35:43] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2022/11/16 00:35:43] [debug] [upstream] KA connection #87 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 is now available
[2022/11/16 00:35:43] [debug] [out flush] cb_destroy coro_id=0
[2022/11/16 00:35:43] [debug] [retry] new retry created for task_id=0 attempts=1
[2022/11/16 00:35:43] [ warn] [engine] failed to flush chunk '89-1668558942.65916959.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2022/11/16 00:35:43] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2022/11/16 00:35:43] [debug] [upstream] KA connection #88 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 is now available
[2022/11/16 00:35:43] [debug] [out flush] cb_destroy coro_id=0
[2022/11/16 00:35:43] [debug] [retry] new retry created for task_id=1 attempts=1
[2022/11/16 00:35:43] [ warn] [engine] failed to flush chunk '89-1668558942.70974100.flb', retry in 7 seconds: task_id=1, input=tail.1 > output=es.0 (out_id=0)
[2022/11/16 00:35:50] [debug] [output:es:es.0] task_id=1 assigned to thread #0
[2022/11/16 00:35:50] [debug] [upstream] KA connection #87 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 has been assigned (recycled)
[2022/11/16 00:35:50] [debug] [http_client] not using http_proxy for header
[2022/11/16 00:35:50] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2022/11/16 00:35:50] [debug] [upstream] KA connection #87 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 is now available
[2022/11/16 00:35:50] [debug] [out flush] cb_destroy coro_id=1
[2022/11/16 00:35:50] [debug] [task] task_id=1 reached retry-attempts limit 1/1
[2022/11/16 00:35:50] [ warn] [engine] chunk '89-1668558942.70974100.flb' cannot be retried: task_id=1, input=tail.1 > output=es.0
[2022/11/16 00:35:50] [debug] [task] destroy task=0x7fddc8e42d90 (task_id=1)
[2022/11/16 00:35:54] [debug] [output:es:es.0] task_id=0 assigned to thread #1
[2022/11/16 00:35:54] [debug] [upstream] KA connection #88 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 has been assigned (recycled)
[2022/11/16 00:35:54] [debug] [http_client] not using http_proxy for header
[2022/11/16 00:35:54] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2022/11/16 00:35:54] [debug] [upstream] KA connection #88 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 is now available
[2022/11/16 00:35:54] [debug] [out flush] cb_destroy coro_id=1
[2022/11/16 00:35:54] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2022/11/16 00:35:54] [ warn] [engine] chunk '89-1668558942.65916959.flb' cannot be retried: task_id=0, input=tail.0 > output=es.0
[2022/11/16 00:35:54] [debug] [task] destroy task=0x7fddc8e42d20 (task_id=0)
[2022/11/16 00:36:13] [debug] [upstream] drop keepalive connection #-1 to ip-10-0-0-0.redacted.compute.internal:8088 (keepalive idle timeout)
[2022/11/16 00:36:13] [debug] [socket] could not validate socket status for #86 (don't worry)
[2022/11/16 00:36:20] [debug] [upstream] drop keepalive connection #-1 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 (keepalive idle timeout)
[2022/11/16 00:36:20] [debug] [socket] could not validate socket status for #87 (don't worry)
[2022/11/16 00:36:25] [debug] [upstream] drop keepalive connection #-1 to vpc-elasticsearchcl-redacted.es.amazonaws.com:443 (keepalive idle timeout)
[2022/11/16 00:36:25] [debug] [socket] could not validate socket status for #88 (don't worry)

Fluent Bit ConfigMap:

apiVersion: v1
metadata:
  name: iserver-fluentbit
data:
  fluent-bit.conf: """
  
    [SERVICE]
        Flush         1
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name tail
        Tag  iserver.audit.log
        Multiline Off
        Path /var/log/iserver/ServerAudit*.log

    [INPUT]
        Name tail
        Tag  iserver.dsserrors.log
        Multiline Off
        Path /var/log/iserver/DSSErrors*.log

    [FILTER]
        Name      modify
        Match     *
        Add       POD_NAME      {{ POD NAME }}
        Add       ENV_NAME      {{ RELEASE NAME }}
        Add       CLUSTER_NAME  {{ CLUSTER NAME }}

    [FILTER]
        Name modify
        Match iserver.dsserrors.log       
        Add    LOGNAME      dsserror

    [OUTPUT]
        Name    es
        Match   iserver.*.*
        Host    {{ OPENSEARCH HOST }}
        Port    {{ OPENSEARCH PORT }}
        Logstash_Format  True
        Logstash_Prefix  iserver-logs
        Index   iserver-logs
        Type    doc
        AWS_Auth Off
        AWS_Region {{ AWS REGION }}
        tls     On
        tls.verify Off

  """

  parsers.conf: |2

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>.*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex  ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S

    [PARSER]
        Name    kube-custom
        Format  regex
        Regex   (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$

    [PARSER]
        Name dsserror
        Format  regex
        Time_key  time
        Regex  ^(?<time>[^\[]+)\s(\[HOST:(?<host>[^\]]+)\]\[SERVER:(?<server>[^\]]+)\]\[PID:(?<processid>[^\]]+)\]\[THR:(?<threadid>[^\]]+)\]\[([^\]]+)\]\[([^\]]+)\](\[([^\]]+)\])?\[UID:([^\]]+)\]\[SID:([^\]]+)\]\[OID:([^\]]+)\])\s*(?<message>.*)\s*(?<message>.*)\s*(?<message>.*)\s*(?<message>.*)\s*(?<message>.*)\s*(?<message>.*)\s*(?<message>.*)$
        Time_Format %Y-%m-%d %H:%M:%S.%L
        Time_Keep On

Expected behavior
Logs should be delivered to OpenSearch, and errors should be produced when they cannot be delivered.

Your Environment

  • Version used: fluent-bit:1.9.9
  • Configuration: config map provided
  • Environment name and version: Kubernetes 1.22
@zandernelson zandernelson added bug Something isn't working untriaged labels Nov 16, 2022
@kmdrn7

kmdrn7 commented Nov 23, 2022

Hi @zander312, it seems like your Fluent Bit is having trouble pushing logs to your es output. For better debugging, you can try enabling error tracing on the es output plugin like this:

    [OUTPUT]
        Name    es
        Match   iserver.*.*
        Host    {{ OPENSEARCH HOST }}
        Port    {{ OPENSEARCH PORT }}
        Logstash_Format  True
        Logstash_Prefix  iserver-logs
        Index   iserver-logs
        Type    doc
        AWS_Auth Off
        AWS_Region {{ AWS REGION }}
        tls     On
        tls.verify Off
        Trace_Error On

@andrross
Member

Enabling error tracing is probably the right way to tackle this problem. Please reopen if you have more details and think there might be a problem in OpenSearch.

@zandernelson
Author

@andrross @kmdrn7 we were able to solve the issue: the OpenSearch cluster was unhealthy and had run out of space for indexes.

However, one thing that concerned me is that we were receiving 200 responses from the OpenSearch cluster, as the attached logs show (HTTP Status=200 URI=/_bulk). We did not see the usual Fluent Bit errors that appear when this kind of thing happens.

We received these 200 responses even though the actual operation was failing on the OpenSearch side. I would have expected a response code that indicates a failure. The AWS Elasticsearch team pointed out that it is a bulk API and that is why we received 200 responses, but I am still a bit puzzled by that behavior.
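This is the documented behavior of the `_bulk` endpoint: the HTTP status reflects only whether the request itself was processed, while item-level failures are reported in the body via the top-level `errors` flag. A minimal sketch of what such a response looks like (the field values below are made up for illustration; the `cluster_block_exception` is one typical symptom of a disk-full cluster):

```python
# Hypothetical _bulk response body: HTTP status was 200, yet one item failed.
response = {
    "took": 30,
    "errors": True,  # top-level flag: at least one item in the batch failed
    "items": [
        {"index": {"_index": "iserver-logs", "status": 201}},
        {"index": {"_index": "iserver-logs", "status": 403,
                   "error": {"type": "cluster_block_exception",
                             "reason": "index [iserver-logs] blocked by: "
                                       "[FORBIDDEN/12/index read-only];"}}},
    ],
}

# A client must check the body's "errors" flag, not the HTTP status code.
failed = [item["index"] for item in response["items"]
          if item["index"]["status"] >= 300]
for f in failed:
    print(f["status"], f["error"]["reason"])
```

So a 200 with `"errors": true` means "the batch was accepted, but some documents were rejected," which matches what the logs above show.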

@andrross
Member

Check out the documentation for the bulk API. Since it is a batch-style request, you have to look at the per-item responses since partial failures are possible. More detailed status codes are available for each item.
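A short sketch of inspecting those per-item results. The response shape follows the bulk API (each item is keyed by its action: `index`, `create`, `update`, or `delete`); the sample data is hypothetical:

```python
def bulk_item_errors(bulk_response):
    """Return (action, status, error_reason) for every failed item."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item dict has exactly one key: the action that was attempted.
        for action, result in item.items():
            if result.get("status", 0) >= 300:
                reason = result.get("error", {}).get("reason", "unknown")
                failures.append((action, result["status"], reason))
    return failures

sample = {
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 429,
                   "error": {"type": "es_rejected_execution_exception",
                             "reason": "queue capacity reached"}}},
    ],
}
print(bulk_item_errors(sample))  # → [('index', 429, 'queue capacity reached')]
```

A 429 per-item status like the one above indicates backpressure, whereas 403 with a cluster block typically indicates the disk-watermark read-only block that caused this issue.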

@dblock
Member

dblock commented Nov 30, 2022

Short of transactional support, I think there's a feature request here: a parameter to bulk that aborts on the first failure. We also have #4562 and #3625. @zander312, if you believe that would help, please open an issue?
