4paradigm · vagetablechicken · Nov 10, 2023 · Aug 18, 2023 · Sep 28, 2023 · Sep 28, 2023
diff --git a/.github/workflows/udf-doc.yml b/.github/workflows/udf-doc.yml
@@ -54,8 +54,8 @@ jobs:
         if: github.event_name != 'pull_request'
         with:
           add-paths: |
-            docs/en/reference/sql/functions_and_operators/Files/udfs_8h.md
-            docs/zh/openmldb_sql/functions_and_operators/Files/udfs_8h.md
+            docs/en/reference/sql/udfs_8h.md
+            docs/zh/openmldb_sql/udfs_8h.md
           labels: |
             udf
           branch: docs-udf-patch

diff --git a/docs/en/developer/built_in_function_develop_guide.md b/docs/en/developer/built_in_function_develop_guide.md
@@ -792,7 +792,7 @@ select date(timestamp(1590115420000)) as dt;
 
 ## 5. Document Management
 
-Documents for all built-in functions can be found in [Built-in Functions](http://4paradigm.github.io/OpenMLDB/zh/main/reference/sql/functions_and_operators/Files/udfs_8h.html). It is a markdown file automatically generated from source, so please do not edit it directly.
+Documents for all built-in functions can be found in [Built-in Functions](http://4paradigm.github.io/OpenMLDB/zh/main/reference/sql/udfs_8h.html). It is a markdown file automatically generated from source, so please do not edit it directly.
 
 - If you are adding a document for a new function, please refer to [2.2.4 Documenting Function](#224-documenting-function). 
 - If you are trying to revise a document of an existing function, you can find source code in the files of `hybridse/src/udf/default_udf_library.cc` or `hybridse/src/udf/default_defs/*_def.cc` .

diff --git a/docs/en/developer/udf_develop_guide.md b/docs/en/developer/udf_develop_guide.md
@@ -9,7 +9,7 @@ SQL functions can be categorised into scalar functions and aggregate functions.
 #### 2.1.1 Naming Specification of C++ Built-in Function
 - The naming of C++ built-in function should follow the [snake_case](https://en.wikipedia.org/wiki/Snake_case) style.
 - The name should clearly express the function's purpose.
-- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../reference/sql/functions_and_operators/Files/udfs_8h.md).
+- The name of a function should not be the same as the name of a built-in function or other custom functions. The list of all built-in functions can be seen [here](../reference/sql/udfs_8h.md).
 
 #### 2.1.2 
 The types of the built-in C++ functions' parameters should be BOOL, NUMBER, TIMESTAMP, DATE, or STRING.

diff --git a/docs/en/reference/sql/dql/WINDOW_CLAUSE.md b/docs/en/reference/sql/dql/WINDOW_CLAUSE.md
@@ -320,5 +320,5 @@ WINDOW w1 AS (PARTITION BY col1 ORDER BY col5 ROWS_RANGE BETWEEN 10s PRECEDING A
 ```
 
 ```{seealso}
-Please refer to [Built-in Functions](../functions_and_operators/Files/udfs_8h.md) for aggregate functions that can be used in window computation.
+Please refer to [Built-in Functions](../udfs_8h.md) for aggregate functions that can be used in window computation.
 ````
diff --git a/docs/en/reference/sql/index.rst b/docs/en/reference/sql/index.rst
@@ -9,6 +9,7 @@ SQL
     language_structure/index 
     data_types/index
     functions_and_operators/index
+    udfs_8h
     dql/index
     dml/index
     ddl/index

diff --git a/...nce/sql/functions_and_operators/index.rst → docs/en/reference/sql/operators/index.rst b/...nce/sql/functions_and_operators/index.rst → docs/en/reference/sql/operators/index.rst
@@ -1,10 +1,9 @@
 =============================
-Expressions, Functions, and Operations
+Expressions and Operations
 =============================
 
 
 .. toctree::
     :maxdepth: 1
 
     operators
-    Files/udfs_8h
diff --git a/.../sql/functions_and_operators/operators.md → docs/en/reference/sql/operators/operators.md b/.../sql/functions_and_operators/operators.md → docs/en/reference/sql/operators/operators.md
diff --git a/.../functions_and_operators/Files/udfs_8h.md → docs/en/reference/sql/udfs_8h.md b/.../functions_and_operators/Files/udfs_8h.md → docs/en/reference/sql/udfs_8h.md
diff --git a/docs/zh/deploy/index.rst b/docs/zh/deploy/index.rst
@@ -8,6 +8,5 @@
     install_deploy
     conf
     compile
-    integrate_hadoop
     offline_integrate_kubernetes
     [Alpha] Deploy the Online Engine on Kubernetes <https://github.com/4paradigm/openmldb-k8s>
diff --git a/docs/zh/developer/built_in_function_develop_guide.md b/docs/zh/developer/built_in_function_develop_guide.md
@@ -1034,10 +1034,9 @@ RegisterUdafTemplate<DistinctCountDef>("distinct_count")
 
 ## 6. Document Management
 
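-The built-in function documentation is available at [Built-in Functions](https://openmldb.ai/docs/zh/main/openmldb_sql/functions_and_operators/Files/udfs_8h.html). It is a markdown file generated from code, so please do not edit it directly.
+The built-in function documentation is available at [Built-in Functions](../openmldb_sql/udfs_8h.md). It is a markdown file generated from code, so please do not edit it directly.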
 
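-- If you need to add documentation for a newly added function, refer to section 2.2.4 (Documenting Function), which explains that built-in function documentation is managed in the C++ source code. A series of later steps generates the more readable documentation shown on the web page above, that is, the content under the `docs/*/openmldb_sql/functions_and_operators/` directory.
+- If you need to add documentation for a newly added function, refer to section 2.2.4 (Documenting Function), which explains that built-in function documentation is managed in the C++ source code. A series of later steps generates the more readable documentation shown on the web page above, that is, the content under the `docs/*/openmldb_sql/` directory.
 - If you need to revise the documentation of an existing function, find its documentation in `hybridse/src/udf/default_udf_library.cc` or `hybridse/src/udf/default_defs/*_def.cc` and edit it there.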
 
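 The OpenMLDB project runs a daily scheduled GitHub Workflow job that updates these documents. Changes to built-in function documentation therefore only require editing the corresponding source code as described above; the `docs` directory and the website are then updated periodically. For the detailed generation process, see [udf_doxygen](https://github.com/4paradigm/OpenMLDB/tree/main/hybridse/tools/documentation/udf_doxygen) in the source tree.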
-
diff --git a/docs/zh/faq/client_faq.md b/docs/zh/faq/client_faq.md
@@ -0,0 +1,88 @@
+# Client FAQ
+
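+## Error log: fail to get tablet ...
+
+First check whether a tablet server in the cluster has gone offline unexpectedly, or whether an online table is not readable or writable. We recommend diagnosing with [openmldb_tool](../maintain/diagnose.md), using the two check commands `status` (status --diff) and `inspect online`.
+TODO: when the diag tool detects that an offline or online table is abnormal, will it print a warning and the next steps to take?
+If you can only check manually, two steps are needed:
+- `show components`: check whether each server is in the list (a TaskManager that has gone offline will not be in the list; a Tablet that has gone offline will be in the list with status offline), and whether the servers in the list are online. If any server is offline, **first restart it so it rejoins the cluster**.
+- `show table status like '%'` (if an older version does not support like, query the system db and the user dbs separately): check whether the "Warnings" of each table report an error.
+
+You will typically get errors such as `real replica number X does not match the configured replicanum X`; see [SHOW TABLE STATUS](../openmldb_sql/ddl/SHOW_TABLE_STATUS.md) for the specific error messages. These errors all indicate that the table currently has a problem and cannot serve reads and writes normally, usually because of Tablet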
+
+## Why do I get a Reached timeout warning log?
+```
+rpc_client.h:xxx] request error. [E1008] Reached timeout=xxxms
+```
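+This happens because the timeout of the rpc request sent by the client is set too small, so the client itself disconnects. Note that this is an rpc timeout. Change the general `request_timeout` configuration.
+1. CLI: set `--request_timeout_ms` at startup
+2. JAVA/Python SDK: adjust `SdkOption.requestTimeout` in the Option or url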
+```{note}
+Synchronous offline commands usually do not hit this error, because their timeout is set to the longest time TaskManager accepts.
+```
+
+## Why do I get a Got EOF of Socket warning log?
+```
+rpc_client.h:xxx] request error. [E1014]Got EOF of Socket{id=x fd=x addr=xxx} (xx)
+```
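+This happens because the `addr` side closed the connection, and `addr` is most likely the TaskManager. It does not mean the TaskManager is abnormal; rather, the TaskManager considered the connection inactive for longer than keepAliveTime and closed the communication channel.
+In version 0.5.0 and later, you can increase TaskManager's `server.channel_keep_alive_time` to tolerate inactive channels for longer. The default is 1800s (0.5h); when using synchronous offline commands in particular, this value may need to be increased.
+In versions before 0.5.0 this setting cannot be changed; please upgrade TaskManager.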
+
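+## Why are Chinese characters garbled in offline query results?
+
+When running offline queries, results that contain Chinese characters may be garbled. This mainly depends on the system default encoding and the encoding parameters of the Spark job.
+
+If this happens, add the Spark advanced parameters `spark.driver.extraJavaOptions=-Dfile.encoding=utf-8` and `spark.executor.extraJavaOptions=-Dfile.encoding=utf-8` to resolve it.
+
+For client-side configuration, see [Client Spark configuration file](../reference/client_config/client_spark_config.md); you can also add this setting to the TaskManager configuration file.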
+
+```
+spark.default.conf=spark.driver.extraJavaOptions=-Dfile.encoding=utf-8;spark.executor.extraJavaOptions=-Dfile.encoding=utf-8
+```
+
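+## How do I configure TaskManager to access a Kerberos-enabled Yarn cluster?
+
+If the Yarn cluster has Kerberos authentication enabled, TaskManager can access it by adding the following configuration. Adjust the keytab path and the principal account to match your actual setup.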
+
+```
+spark.default.conf=spark.yarn.keytab=/tmp/test.keytab;[email protected]
+```
+
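+## How do I configure the client's core logs?
+
+There are two main kinds of client core logs, zk logs and sdk logs (glog logs), and they are independent of each other.
+
+zk logs:
+1. CLI: at startup, use `--zk_log_level` to set the level and `--zk_log_file` to set the file the log is saved to.
+2. JAVA/Python SDK: in the Option or url, use `zkLogLevel` to set the level and `zkLogFile` to set the file the log is saved to.
+
+- `zk_log_level` (int, default=0, i.e. DISABLE_LOGGING):
+Prints logs at this level and **below**. 0: disable all zk logs, 1: error, 2: warn, 3: info, 4: debug.
+
+sdk logs (glog logs):
+1. CLI: at startup, use `--glog_level` to set the level and `--glog_dir` to set where logs are saved.
+2. JAVA/Python SDK: in the Option or url, use `glogLevel` to set the level and `glogDir` to set where logs are saved.
+
+- `glog_level` (int, default=1, i.e. WARNING):
+Prints logs at this level and **above**. INFO, WARNING, ERROR, and FATAL correspond to 0, 1, 2, and 3 respectively.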
+
+
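+## Insert fails and the log shows `please use getInsertRow with ... first`
+
+When you insert with InsertPreparedStatement in the JAVA client, or with sql and parameters in Python, a cache inside the client comes into play. The first step, `getInsertRow`, creates a sql cache entry and returns the parameter information the sql still needs; the second step actually executes the insert, and it needs the sql cache entry created in the first step. So when multiple threads share one client, frequent inserts and queries can update the cache table often enough to evict the cache entry of the insert sql you want to run, which makes it look as if the first step, `getInsertRow`, was never executed.
+
+For now, the error can be avoided by increasing the `maxSqlCacheSize` option. Only the JAVA/Python SDKs support this option.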
+
+## Spark errors from offline commands
+
+`java.lang.OutOfMemoryError: Java heap space`
+
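+The Spark configuration for offline commands defaults to `local[*]`; with high concurrency an OutOfMemoryError may occur. Adjust the two spark options `spark.driver.memory` and `spark.executor.memory`. You can set them in `spark.default.conf` in `conf/taskmanager.properties` under the TaskManager working directory and restart TaskManager, or configure them through the CLI client; see [Client Spark configuration file](../reference/client_config/client_spark_config.md).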
+```
+spark.default.conf=spark.driver.memory=16g;spark.executor.memory=16g
+```
+
+Container killed by YARN for exceeding memory limits. 5 GB of 5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
+
+In local mode: driver memory
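
Note on the `spark.default.conf` lines in the client FAQ above: each example packs several Spark options into one value as `key=value` pairs separated by `;`. A minimal Python sketch of assembling such a line is given here. The helper name `build_spark_default_conf` is hypothetical and not part of OpenMLDB, and rejecting `;` inside keys or values is a conservative assumption, since the FAQ does not say whether the separator can be escaped.

```python
def build_spark_default_conf(options):
    """Join Spark options into one `spark.default.conf=...` line.

    Hypothetical helper for illustration only; it mirrors the
    `key=value;key=value` layout used in the FAQ examples.
    """
    for key, value in options.items():
        # Assumption: ';' is the pair separator and cannot be escaped,
        # so refuse keys or values that contain it.
        if ";" in key or ";" in str(value):
            raise ValueError(f"';' not allowed in option {key!r}")
    pairs = ";".join(f"{key}={value}" for key, value in options.items())
    return f"spark.default.conf={pairs}"


if __name__ == "__main__":
    line = build_spark_default_conf(
        {"spark.driver.memory": "16g", "spark.executor.memory": "16g"}
    )
    print(line)
```

With the two memory options from the OutOfMemoryError entry, the helper reproduces the same line shown in that entry, which can then be placed in `conf/taskmanager.properties`.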
diff --git a/docs/zh/faq/index.rst b/docs/zh/faq/index.rst
@@ -0,0 +1,10 @@
+=============================
+FAQ
+=============================
+
+
+.. toctree::
+    :maxdepth: 1
+
+    client_faq
+    server_faq
diff --git a/docs/zh/faq/server_faq.md b/docs/zh/faq/server_faq.md
@@ -0,0 +1,61 @@
+# Server FAQ
+
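+Whenever servers go online or offline or show any problem, first run openmldb_tool status + inspect online to check whether the cluster is healthy.
+
+## Deployment and Startup FAQ
+
+### 1. How do I confirm that the cluster is running normally?
+Although there is a one-click startup script, the many configuration options mean that problems such as "port already in use" or "no read/write permission on directory" can occur. These problems only surface after the server process starts, and the process exits without timely feedback. (If monitoring is configured, you can check directly through monitoring.)
+So first confirm that all server processes in the cluster are running normally.
+
+You can check with `ps axu | grep openmldb` or the sql command `show components;`. (Note: if you use a daemon, the openmldb server process may be in a start/stop loop, which does not mean it is running continuously; confirm through the logs or the connection time shown by `show components;`.)
+
+If all processes are alive but the cluster still misbehaves, check the server logs. Look at WARN and ERROR level logs first; they are very likely the root cause.
+
+### 2. What if data is not recovered automatically?
+
+Normally, when a service restarts, table data is recovered automatically, but recovery can fail in some cases, typically:
+
+- a tablet exits abnormally
+- the tablets holding multiple replicas of a multi-replica table restart at the same time or too quickly, so a tablet restarts before some `auto_failover` operations have finished
+- auto_failover is set to `false`
+
+After the services have started successfully, use `gettablestatus` to get the status of all tables: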
+```
+python tools/openmldb_ops.py --openmldb_bin_path=./bin/openmldb --zk_cluster=172.24.4.40:30481 --zk_root_path=/openmldb --cmd=gettablestatus
+```
+
+If any table has `Warnings`, use `recoverdata` to recover the data automatically:
+```
+python tools/openmldb_ops.py --openmldb_bin_path=./bin/openmldb --zk_cluster=172.24.4.40:30481 --zk_root_path=/openmldb --cmd=recoverdata
+```
+
+## Server FAQ
+
+### 1. Why is there a Fail to write into Socket warning in the log?
+```
+http_rpc_protocol.cpp:911] Fail to write into Socket{id=xx fd=xx addr=xxx} (0x7a7ca00): Unknown error 1014 [1014]
+```
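+This log is printed on the server side. Typically the client uses connection pool or short connection mode and closes the connection after an RPC timeout; when the server writes back the response it finds the connection already closed and reports this error. Got EOF means an EOF was received earlier (the peer closed the connection normally). When the client uses single connection mode, the server generally does not report this.
+
+### 2. The initial ttl of the table data is unsuitable. How do I adjust it?
+This must be done with nsclient; an ordinary client cannot do it. For how to start nsclient and its commands, see [ns client](../maintain/cli.md#ns-client).
+
+In nsclient, the `setttl` command changes a table's ttl, like this: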
+```
+setttl table_name ttl_type ttl [ttl] [index_name]
+```
+As shown, adding an index name at the end of the command changes the ttl of only that index.
+```{caution}
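+A `setttl` change does not take effect immediately; it is affected by the tablet server setting `gc_interval`. (Each tablet server's configuration is independent of the others.)
+
+For example, if a tablet server's `gc_interval` is 1h, the ttl configuration is reloaded at the very end of the next gc (in the worst case, 1h later). The gc pass that reloads the ttl does not evict data by the new ttl; only the following gc pass uses the new ttl to evict data.
+
+So **after changing the ttl, you need to wait two gc intervals for it to take effect**. Please be patient.
+
+Of course, you can adjust the tablet server's `gc_interval`, but it cannot be changed dynamically and only takes effect after a restart. So if memory pressure is high, try scaling out and migrating data shards to reduce it. Changing `gc_interval` casually is not recommended.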
+```
+
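+### 3. A warning log says: Last Join right table is empty. What does this mean?
+Generally this is normal and does not indicate a cluster problem. It only means that the right table of a join in the runner is empty, which can happen and is most likely a data issue.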
+
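
Note on the `setttl` caution in the server FAQ above: the "two gc intervals" rule can be checked with a little arithmetic. The sketch below is a simplified model, not OpenMLDB code. It assumes gc passes start at fixed multiples of `gc_interval` and take negligible time, and the function name `ttl_effective_delay` is hypothetical.

```python
def ttl_effective_delay(change_time_s, gc_interval_s):
    """Seconds from a setttl call until a gc pass evicts by the new ttl.

    Simplified model (assumption): gc passes start at 0, I, 2I, ...
    where I = gc_interval_s, and each pass is instantaneous.
    """
    # First gc pass strictly after the change: it only reloads the ttl.
    reload_gc = (change_time_s // gc_interval_s + 1) * gc_interval_s
    # The following pass is the first one that evicts by the new ttl.
    effective_gc = reload_gc + gc_interval_s
    return effective_gc - change_time_s


if __name__ == "__main__":
    hour = 3600
    # Change made right as a gc pass starts: worst case, two full intervals.
    print(ttl_effective_delay(0, hour))
    # Change made one second before the next pass: just over one interval.
    print(ttl_effective_delay(hour - 1, hour))
```

Under this model the delay always falls between one and two intervals, which is why waiting two full `gc_interval` periods is the safe bound.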
diff --git a/docs/zh/index.rst b/docs/zh/index.rst
@@ -16,3 +16,4 @@ OpenMLDB Documentation (|version|)
     maintain/index
     reference/index
     developer/index
+    faq/index