DAOS-16467 rebuild: add DAOS_POOL_RF ENV for massive failure case (#15057)

* DAOS-16467 rebuild: add DAOS_PW_RF ENV for massive failure case

Allow the user to set DAOS_PW_RF as pw_rf (pool-wise RF).
If an engine failure detected by SWIM would break pw_rf, do not change the
pool map and do not trigger a rebuild.
Instead, log a critical message asking the administrator to bring those
engines back as the top priority (just "system start --ranks=xxx"; there is
no need to reintegrate those engines).

A few functions are renamed to avoid confusion:
pool_map_find_nodes() -> pool_map_find_ranks()
pool_map_find_node_by_rank() -> pool_map_find_dom_by_rank()
pool_map_node_nr() -> pool_map_rank_nr()

Signed-off-by: Xuezhao Liu <[email protected]>
liuxuezhao authored Sep 6, 2024
1 parent f7d3523 commit 8a8c7d4
Showing 20 changed files with 196 additions and 74 deletions.
1 change: 1 addition & 0 deletions docs/admin/env_variables.md
@@ -53,6 +53,7 @@ Environment variables in this section only apply to the server side.
|DAOS\_DTX\_RPC\_HELPER\_THD|DTX RPC helper threshold. The valid range is [18, unlimited). The default value is 513.|
|DAOS\_DTX\_BATCHED\_ULT\_MAX|The max count of DTX batched commit ULTs. The valid range is [0, unlimited). 0 means to commit DTX synchronously. The default value is 32.|
|DAOS\_FORWARD\_NEIGHBOR|Set to enable I/O forwarding on neighbor xstream in the absence of helper threads.|
+ |DAOS\_POOL\_RF|Redundancy factor for the pool. The valid range is [1, 4]. The default value is 2.|

## Server and Client environment variables

24 changes: 24 additions & 0 deletions docs/admin/pool_operations.md
@@ -916,6 +916,30 @@ and possibly repair a pmemobj file. As discussed in the previous section, the
rebuild status can be consulted via the pool query and will be expanded
with more information.

+ ## Pool Redundancy Factor
+
+ If the DAOS system experiences cascading failures, where the number of failed
+ fault domains exceeds a pool's redundancy factor, unrecoverable errors can
+ occur and applications can suffer data loss. This can happen during power or
+ network outages that cause node or engine failures. In most cases these
+ failures are recoverable: the DAOS engines can be restarted and the system
+ can function again.
+
+ The administrator can set the default pool redundancy factor with the environment
+ variable "DAOS_POOL_RF" in the server yaml file. If SWIM detects and reports that
+ an engine is dead and the number of failed fault domains exceeds, or is about to
+ exceed, the pool redundancy factor, the pool map is not changed immediately.
+ Instead, a critical log message is emitted:
+     intolerable unavailability: engine rank x
+ In this case, the system administrator should check the failed engines, try to
+ recover them, and bring them back one by one with:
+     dmg system start --ranks=x
+ A reintegrate call is not needed.
+
+ For truly unrecoverable failures, the administrator can still exclude the engines.
+ However, data loss is expected because the number of unrecoverable failures
+ exceeds the pool redundancy factor.
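
The gating rule described in this section can be sketched in C as follows. This is an illustration under stated assumptions, not the code from this commit: `exclusion_allowed()` and its parameters are hypothetical names, and the comparison `failed_doms > pool_rf` is inferred from the wording "exceeds or is going to exceed". The `map_update_source` values mirror the enum this commit adds in src/pool/rpc.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Origin of a pool map update request (mirrors the enum added in src/pool/rpc.h). */
enum map_update_source {
	MUS_SWIM = 0, /* failure reported automatically by SWIM */
	MUS_DMG,      /* explicit administrator action */
};

/*
 * Hypothetical helper: decide whether excluding an engine may be applied to
 * the pool map. failed_doms is the number of failed fault domains counted
 * after this exclusion; pool_rf is the pool redundancy factor.
 */
static bool
exclusion_allowed(enum map_update_source src, uint32_t failed_doms,
		  uint32_t pool_rf, uint32_t rank)
{
	/* Assumption: only SWIM-reported failures are held back. */
	if (src == MUS_SWIM && failed_doms > pool_rf) {
		fprintf(stderr, "intolerable unavailability: engine rank %u\n",
			(unsigned int)rank);
		return false; /* leave the pool map unchanged, no rebuild */
	}
	return true;
}
```

With a redundancy factor of 2, a SWIM-reported failure that would bring the number of failed fault domains to 3 is held back and logged, while the same exclusion requested by the administrator is still applied.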

## Recovering Container Ownership

Typically users are expected to manage their containers. However, in the event
4 changes: 2 additions & 2 deletions src/chk/chk_engine.c
@@ -668,7 +668,7 @@ chk_engine_pool_mbs_one(struct chk_pool_rec *cpr, struct pool_map *map, struct c
int rc = 0;
bool unknown;

- dom = pool_map_find_node_by_rank(map, mbs->cpm_rank);
+ dom = pool_map_find_dom_by_rank(map, mbs->cpm_rank);
if (dom == NULL) {
D_ASSERT(mbs->cpm_rank != dss_self_rank());

@@ -777,7 +777,7 @@ chk_engine_find_dangling_pm(struct chk_pool_rec *cpr, struct pool_map *map)
int j;
bool down;

- rank_nr = pool_map_find_nodes(map, PO_COMP_ID_ALL, &doms);
+ rank_nr = pool_map_find_ranks(map, PO_COMP_ID_ALL, &doms);
if (rank_nr <= 0)
D_GOTO(out, rc = rank_nr);

18 changes: 9 additions & 9 deletions src/common/pool_map.c
@@ -1573,7 +1573,7 @@ add_domain_tree_to_pool_buf(struct pool_map *map, struct pool_buf *map_buf,
if (map) {
struct pool_domain *found_dom;

- found_dom = pool_map_find_node_by_rank(map, rank);
+ found_dom = pool_map_find_dom_by_rank(map, rank);
if (found_dom) {
if (found_dom->do_comp.co_status == PO_COMP_ST_NEW)
found_new_dom = true;
@@ -2038,7 +2038,7 @@ pool_map_find_domain(struct pool_map *map, pool_comp_type_t type, uint32_t id,
}

/**
- * Find all nodes in the pool map.
+ * Find all ranks in the pool map.
*
* \param map [IN] pool map to search.
* \param id [IN] id to search.
@@ -2048,7 +2048,7 @@ pool_map_find_domain(struct pool_map *map, pool_comp_type_t type, uint32_t id,
* 0 if none.
*/
int
- pool_map_find_nodes(struct pool_map *map, uint32_t id,
+ pool_map_find_ranks(struct pool_map *map, uint32_t id,
struct pool_domain **domain_pp)
{
return pool_map_find_domain(map, PO_COMP_TP_RANK, id,
@@ -2102,14 +2102,14 @@ pool_map_find_target(struct pool_map *map, uint32_t id,
* \return domain found by rank.
*/
struct pool_domain *
- pool_map_find_node_by_rank(struct pool_map *map, uint32_t rank)
+ pool_map_find_dom_by_rank(struct pool_map *map, uint32_t rank)
{
struct pool_domain *doms;
struct pool_domain *found = NULL;
int doms_cnt;
int i;

- doms_cnt = pool_map_find_nodes(map, PO_COMP_ID_ALL, &doms);
+ doms_cnt = pool_map_find_ranks(map, PO_COMP_ID_ALL, &doms);
if (doms_cnt <= 0)
return NULL;

@@ -2150,7 +2150,7 @@ pool_map_find_targets_on_ranks(struct pool_map *map, d_rank_list_t *rank_list,
for (i = 0; i < rank_list->rl_nr; i++) {
struct pool_domain *dom;

- dom = pool_map_find_node_by_rank(map, rank_list->rl_ranks[i]);
+ dom = pool_map_find_dom_by_rank(map, rank_list->rl_ranks[i]);
if (dom == NULL) {
pool_target_id_list_free(tgts);
return 0;
@@ -2191,7 +2191,7 @@ pool_map_find_target_by_rank_idx(struct pool_map *map, uint32_t rank,
{
struct pool_domain *dom;

- dom = pool_map_find_node_by_rank(map, rank);
+ dom = pool_map_find_dom_by_rank(map, rank);
if (dom == NULL)
return 0;

@@ -2867,7 +2867,7 @@ pool_map_find_by_rank_status(struct pool_map *map,

*tgt_ppp = NULL;
*tgt_cnt = 0;
- dom = pool_map_find_node_by_rank(map, rank);
+ dom = pool_map_find_dom_by_rank(map, rank);
if (dom == NULL)
return 0;

@@ -2902,7 +2902,7 @@ pool_map_get_ranks(uuid_t pool_uuid, struct pool_map *map, bool get_enabled, d_r
struct pool_domain *domains = NULL;
d_rank_list_t *ranklist = NULL;

- nnodes_tot = pool_map_find_nodes(map, PO_COMP_ID_ALL, &domains);
+ nnodes_tot = pool_map_find_ranks(map, PO_COMP_ID_ALL, &domains);
for (i = 0; i < nnodes_tot; i++) {
if (pool_map_node_status_match(&domains[i], ENABLED))
nnodes_enabled++;
2 changes: 1 addition & 1 deletion src/container/cli.c
@@ -3386,7 +3386,7 @@ dc_cont_node_id2ptr(daos_handle_t coh, uint32_t node_id,
pool = dc_hdl2pool(dc->dc_pool_hdl);
D_ASSERT(pool != NULL);
D_RWLOCK_RDLOCK(&pool->dp_map_lock);
- n = pool_map_find_nodes(pool->dp_map, node_id, dom);
+ n = pool_map_find_ranks(pool->dp_map, node_id, dom);
D_RWLOCK_UNLOCK(&pool->dp_map_lock);
dc_pool_put(pool);
dc_cont_put(dc);
15 changes: 7 additions & 8 deletions src/container/srv_container.c
@@ -1667,7 +1667,7 @@ cont_ec_agg_alloc(struct cont_svc *cont_svc, uuid_t cont_uuid,
{
struct cont_ec_agg *ec_agg = NULL;
struct pool_domain *doms;
- int node_nr;
+ int rank_nr;
int rc = 0;
int i;

@@ -1676,19 +1676,18 @@ cont_ec_agg_alloc(struct cont_svc *cont_svc, uuid_t cont_uuid,
return -DER_NOMEM;

D_ASSERT(cont_svc->cs_pool->sp_map != NULL);
- node_nr = pool_map_find_nodes(cont_svc->cs_pool->sp_map,
- PO_COMP_ID_ALL, &doms);
- if (node_nr < 0)
- D_GOTO(out, rc = node_nr);
+ rank_nr = pool_map_find_ranks(cont_svc->cs_pool->sp_map, PO_COMP_ID_ALL, &doms);
+ if (rank_nr < 0)
+ D_GOTO(out, rc = rank_nr);

- D_ALLOC_ARRAY(ec_agg->ea_server_ephs, node_nr);
+ D_ALLOC_ARRAY(ec_agg->ea_server_ephs, rank_nr);
if (ec_agg->ea_server_ephs == NULL)
D_GOTO(out, rc = -DER_NOMEM);

uuid_copy(ec_agg->ea_cont_uuid, cont_uuid);
- ec_agg->ea_servers_num = node_nr;
+ ec_agg->ea_servers_num = rank_nr;
ec_agg->ea_current_eph = 0;
- for (i = 0; i < node_nr; i++) {
+ for (i = 0; i < rank_nr; i++) {
ec_agg->ea_server_ephs[i].rank = doms[i].do_comp.co_rank;
ec_agg->ea_server_ephs[i].eph = 0;
}
14 changes: 7 additions & 7 deletions src/dtx/dtx_coll.c
@@ -112,7 +112,7 @@ dtx_coll_prep(uuid_t po_uuid, daos_unit_oid_t oid, struct dtx_id *xid, struct dt
struct dtx_coll_target *dct;
struct dtx_coll_entry *dce = NULL;
struct daos_obj_md md = { 0 };
- uint32_t node_nr;
+ uint32_t rank_nr;
d_rank_t my_rank = dss_self_rank();
d_rank_t max_rank = 0;
int rc = 0;
@@ -192,19 +192,19 @@ dtx_coll_prep(uuid_t po_uuid, daos_unit_oid_t oid, struct dtx_id *xid, struct dt
}
}

- node_nr = pool_map_node_nr(map->pl_poolmap);
- if (unlikely(node_nr == 1))
+ rank_nr = pool_map_rank_nr(map->pl_poolmap);
+ if (unlikely(rank_nr == 1))
D_GOTO(out, rc = 0);

- dce->dce_ranks = d_rank_list_alloc(node_nr - 1);
+ dce->dce_ranks = d_rank_list_alloc(rank_nr - 1);
if (dce->dce_ranks == NULL)
D_GOTO(out, rc = -DER_NOMEM);

- D_ALLOC_ARRAY(dce->dce_hints, node_nr);
+ D_ALLOC_ARRAY(dce->dce_hints, rank_nr);
if (dce->dce_hints == NULL)
D_GOTO(out, rc = -DER_NOMEM);

- for (i = 0; i < node_nr; i++)
+ for (i = 0; i < rank_nr; i++)
dce->dce_hints[i] = (uint8_t)(-1);

md.omd_id = oid.id_pub;
@@ -220,7 +220,7 @@ dtx_coll_prep(uuid_t po_uuid, daos_unit_oid_t oid, struct dtx_id *xid, struct dt
goto out;
}

- for (i = 0, j = 0; i < layout->ol_nr && j < node_nr - 1; i++) {
+ for (i = 0, j = 0; i < layout->ol_nr && j < rank_nr - 1; i++) {
if (layout->ol_shards[i].po_target == -1 || layout->ol_shards[i].po_shard == -1)
continue;

8 changes: 4 additions & 4 deletions src/include/daos/pool_map.h
@@ -281,7 +281,7 @@ int pool_map_find_target(struct pool_map *map, uint32_t id,
struct pool_target **target_pp);
int pool_map_find_domain(struct pool_map *map, pool_comp_type_t type,
uint32_t id, struct pool_domain **domain_pp);
- int pool_map_find_nodes(struct pool_map *map, uint32_t id,
+ int pool_map_find_ranks(struct pool_map *map, uint32_t id,
struct pool_domain **domain_pp);
int pool_map_find_tgts_by_state(struct pool_map *map,
pool_comp_state_t match_states,
@@ -311,7 +311,7 @@ bool
pool_map_node_status_match(struct pool_domain *dom, unsigned int status);

struct pool_domain *
- pool_map_find_node_by_rank(struct pool_map *map, uint32_t rank);
+ pool_map_find_dom_by_rank(struct pool_map *map, uint32_t rank);

int pool_map_find_by_rank_status(struct pool_map *map,
struct pool_target ***tgt_ppp,
@@ -339,9 +339,9 @@ pool_map_target_nr(struct pool_map *map)
}

static inline unsigned int
- pool_map_node_nr(struct pool_map *map)
+ pool_map_rank_nr(struct pool_map *map)
{
- return pool_map_find_nodes(map, PO_COMP_ID_ALL, NULL);
+ return pool_map_find_ranks(map, PO_COMP_ID_ALL, NULL);
}

/*
11 changes: 6 additions & 5 deletions src/include/daos_prop.h
@@ -464,11 +464,12 @@ enum {

/** container redundancy factor */
enum {
- DAOS_PROP_CO_REDUN_RF0,
- DAOS_PROP_CO_REDUN_RF1,
- DAOS_PROP_CO_REDUN_RF2,
- DAOS_PROP_CO_REDUN_RF3,
- DAOS_PROP_CO_REDUN_RF4,
+ DAOS_PROP_CO_REDUN_RF0 = 0,
+ DAOS_PROP_CO_REDUN_RF1 = 1,
+ DAOS_PROP_CO_REDUN_RF2 = 2,
+ DAOS_PROP_CO_REDUN_RF3 = 3,
+ DAOS_PROP_CO_REDUN_RF4 = 4,
+ DAOS_RF_MAX = 4,
};

/**
2 changes: 1 addition & 1 deletion src/object/cli_coll.c
@@ -139,7 +139,7 @@ obj_coll_oper_args_init(struct coll_oper_args *coa, struct dc_object *obj, bool
D_ASSERT(coa->coa_dcts == NULL);

D_RWLOCK_RDLOCK(&pool->dp_map_lock);
- pool_ranks = pool_map_node_nr(pool->dp_map);
+ pool_ranks = pool_map_rank_nr(pool->dp_map);
D_RWLOCK_UNLOCK(&pool->dp_map_lock);

D_RWLOCK_RDLOCK(&obj->cob_lock);
2 changes: 1 addition & 1 deletion src/object/srv_coll.c
@@ -291,7 +291,7 @@ obj_coll_punch_prep(struct obj_coll_punch_in *ocpi, struct daos_coll_target *dct
D_GOTO(out, rc = -DER_INVAL);
}

- size = pool_map_node_nr(map->pl_poolmap);
+ size = pool_map_rank_nr(map->pl_poolmap);
D_ALLOC_ARRAY(dce->dce_hints, size);
if (dce->dce_hints == NULL)
D_GOTO(out, rc = -DER_NOMEM);
4 changes: 2 additions & 2 deletions src/pool/cli.c
@@ -503,7 +503,7 @@ update_rsvc_client(struct dc_pool *pool)
{
struct subtract_rsvc_rank_arg arg;

- arg.srra_nodes_len = pool_map_find_nodes(pool->dp_map, PO_COMP_ID_ALL, &arg.srra_nodes);
+ arg.srra_nodes_len = pool_map_find_ranks(pool->dp_map, PO_COMP_ID_ALL, &arg.srra_nodes);
/* There must be at least one rank. */
D_ASSERTF(arg.srra_nodes_len > 0, "%d > 0\n", arg.srra_nodes_len);

@@ -2016,7 +2016,7 @@ choose_map_refresh_rank(struct map_refresh_arg *arg)
if (arg->mra_n <= 0)
return CRT_NO_RANK;

- n = pool_map_find_nodes(arg->mra_pool->dp_map, PO_COMP_ID_ALL, &nodes);
+ n = pool_map_find_ranks(arg->mra_pool->dp_map, PO_COMP_ID_ALL, &nodes);
/* There must be at least one rank. */
D_ASSERTF(n > 0, "%d\n", n);

10 changes: 10 additions & 0 deletions src/pool/rpc.h
@@ -147,6 +147,16 @@ CRT_RPC_DECLARE(pool_op, DAOS_ISEQ_POOL_OP, DAOS_OSEQ_POOL_OP)
CRT_RPC_DECLARE(pool_create, DAOS_ISEQ_POOL_CREATE, DAOS_OSEQ_POOL_CREATE)

/* clang-format on */

+ /* the source of pool map update operation */
+ enum map_update_source {
+ MUS_SWIM = 0,
+ /* May need to differentiate from administrator/csum scrubber/nvme healthy monitor later.
+ * Now all non-swim cases fall to DMG category.
+ */
+ MUS_DMG,
+ };

enum map_update_opc {
MAP_EXCLUDE = 0,
MAP_DRAIN,
16 changes: 15 additions & 1 deletion src/pool/srv.c
@@ -19,7 +19,12 @@
#include "rpc.h"
#include "srv_internal.h"
#include "srv_layout.h"
- bool ec_agg_disabled;
+
+ bool ec_agg_disabled;
+ uint32_t pw_rf; /* pool wise RF */
+ #define PW_RF_DEFAULT (2)
+ #define PW_RF_MIN (1)
+ #define PW_RF_MAX (4)

static int
init(void)
@@ -47,6 +52,15 @@ init(void)
if (unlikely(ec_agg_disabled))
D_WARN("EC aggregation is disabled.\n");

+ pw_rf = PW_RF_DEFAULT;
+ d_getenv_uint32_t("DAOS_POOL_RF", &pw_rf);
+ if (pw_rf < PW_RF_MIN || pw_rf > PW_RF_MAX) {
+ D_INFO("pw_rf %d is out of range [%d, %d], take default %d\n",
+ pw_rf, PW_RF_MIN, PW_RF_MAX, PW_RF_DEFAULT);
+ pw_rf = PW_RF_DEFAULT;
+ }
+ D_INFO("pool wise RF %d\n", pw_rf);

ds_pool_rsvc_class_register();

bio_register_ract_ops(&nvme_reaction_ops);
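
The environment handling added to init() above can be approximated with standard C alone. The following is a minimal sketch under stated assumptions, not the DAOS implementation: `parse_pool_rf()` and `pool_rf_from_env()` are hypothetical helpers, `strtoul()` stands in for DAOS's `d_getenv_uint32_t()`, and treating malformed text as "use the default" is an assumption rather than confirmed behaviour.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define PW_RF_DEFAULT 2
#define PW_RF_MIN     1
#define PW_RF_MAX     4

/*
 * Hypothetical helper: turn the text of DAOS_POOL_RF into a redundancy
 * factor. Unset, empty, malformed, or out-of-range values yield the default.
 */
static uint32_t
parse_pool_rf(const char *text)
{
	char          *end = NULL;
	unsigned long  val;

	if (text == NULL || *text == '\0')
		return PW_RF_DEFAULT;

	errno = 0;
	val   = strtoul(text, &end, 10);
	/* Reject conversion errors, trailing characters, and values outside [1, 4]. */
	if (errno != 0 || *end != '\0' || val < PW_RF_MIN || val > PW_RF_MAX)
		return PW_RF_DEFAULT;

	return (uint32_t)val;
}

/* Hypothetical wrapper: read the variable from the process environment. */
static uint32_t
pool_rf_from_env(void)
{
	return parse_pool_rf(getenv("DAOS_POOL_RF"));
}
```

Splitting the parsing from the `getenv()` call keeps the range check testable without modifying the process environment.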
2 changes: 2 additions & 0 deletions src/pool/srv_internal.h
@@ -16,6 +16,8 @@
#include <daos_security.h>
#include <gurt/telemetry_common.h>

+ extern uint32_t pw_rf;

/**
* Global pool metrics
*/