[Help] Cluster expansion fails with "Validate copyset failed in scatterWidthFloatingPercentage = 20"

While expanding the cluster we added several new servers with 1.8T SSDs; the pre-expansion servers all use 7.8T HDDs. Creating the logical pool fails, and the log shows: Validate copyset failed in scatterWidthFloatingPercentage = 20, actual minValue = 33, maxValue = 48, average = 37.8667. The config file has scatterWidthFloatingPercentage = 20, and modifying it seems to have no effect. Do I need to tear the cluster down and reinstall?

2023-03-21T17:45:15.711776+0800 91 copyset_policy.cpp:107] Generate copyset success, numCopysets = 2000
2023-03-21T17:45:15.719236+0800 91 copyset_validation.cpp:101] Validate copyset failed in scatterWidthFloatingPercentage = 20, actual minValue = 33, maxValue = 48, average = 37.8667
2023-03-21T17:45:15.719247+0800 91 copyset_manager.cpp:128] Validate copyset metric failed, retry = 9
2023-03-21T17:45:15.719249+0800 91 copyset_manager.cpp:132] GenCopyset retry times exceed, times = 10
2023-03-21T17:45:15.719259+0800 91 topology_service_manager.cpp:907] GenCopysetForPageFilePool failed, Cluster size = 60, copysetNum = 2000, scatterWidth = 5, logicalPoolid = 24
2023-03-21T17:45:15.719363+0800 91 topology_service_manager.cpp:828] CreateCopysetForLogicalPool fail in : GenCopysetForPageFilePool.
2023-03-21T17:45:15.720600+0800 91 topology_service.cpp:611] Send response[log_id=1] from 10.225.167.29:6700 to 10.225.167.29:54884. [CreateLogicalPoolResponse] statusCode: -12
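
Reading the numbers in that log line, the check apparently requires minValue and maxValue to stay within ±scatterWidthFloatingPercentage of the average. A minimal sketch of my reading (an assumption based on the log fields, not the actual Curve source; ValidateScatterWidth here is a made-up stand-in):

#include <cstdio>

// Band check implied by the log: min/max scatter width must lie
// within average * (1 +/- percent/100).
bool ValidateScatterWidth(double minValue, double maxValue,
                          double average, double percent) {
    double low  = average * (1.0 - percent / 100.0);
    double high = average * (1.0 + percent / 100.0);
    return minValue >= low && maxValue <= high;
}

int main() {
    // Values from the log: minValue = 33, maxValue = 48, average = 37.8667.
    // percent = 20 gives a band of [30.29, 45.44]; maxValue = 48 falls outside.
    std::printf("20%%: %s\n", ValidateScatterWidth(33, 48, 37.8667, 20) ? "pass" : "fail");
    // percent = 30 widens the band to [26.51, 49.23] and the same values pass.
    std::printf("30%%: %s\n", ValidateScatterWidth(33, 48, 37.8667, 30) ? "pass" : "fail");
    return 0;
}

By that arithmetic, maxValue = 48 sits about 27% above the average of 37.8667, so a tolerance of 20 fails while 30 would pass.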

Was this deployed with curveadm?
Could you paste the topology.yaml from before and after the expansion?

I used curveadm. The error message is:

Error-Code: 410019
Error-Description: create logical pool failed
Error-Clue: E 2023-03-22T11:23:01.855381+0800 17073 curvefsTool.cpp:193] CreateLogicalPool Rpc response fail. Message is :statusCode: -12
E 2023-03-22T11:23:01.855410+0800 17073 curvefsTool.cpp:1109] exec fail, ret = -12

Below are the mds and chunkserver sections of the pre-expansion topology.yaml:

mds_services:
  config:
    listen.ip: ${service_host}
    listen.port: 670${service_host_sequence}
    listen.dummy_port: 770${service_host_sequence}
    mds.curvefs.maxFileLength: 549755813888000
  deploy:
    - host: ${machine06}
    - host: ${machine12}
    - host: ${machine18}

chunkserver_services:
  config:
    listen.ip: ${service_host}
    listen.port: 82${format_replicas_sequence}
    data_dir: /data/chunkserver${service_replicas_sequence}
    copysets: 100
  deploy:
    - host: ${machine01}
      replicas: 12
    - host: ${machine02}
      replicas: 12
    - host: ${machine03}
      replicas: 12
    - host: ${machine04}
      replicas: 12
    - host: ${machine05}
      replicas: 12
    - host: ${machine06}
      replicas: 12
    - host: ${machine07}
      replicas: 12
    - host: ${machine08}
      replicas: 12
    - host: ${machine09}
      replicas: 12
    - host: ${machine10}
      replicas: 12
    - host: ${machine11}
      replicas: 12
    - host: ${machine12}
      replicas: 12
    - host: ${machine13}
      replicas: 12
    - host: ${machine14}
      replicas: 12
    - host: ${machine15}
      replicas: 12
    - host: ${machine16}
      replicas: 12
    - host: ${machine17}
      replicas: 12
    - host: ${machine18}
      replicas: 12
    - host: ${machine19}
      replicas: 12

  1. What about the topology.yaml used for the expansion?
  2. Did the cluster deploy normally before the expansion?

The expansion added 5 machines, also configured with 12 chunkservers each; the only difference is the disks: 2T SSDs instead of the previous 8T HDDs. The cluster had deployed successfully before the expansion. From the generated JSON config I can see the new servers are assigned to a new physical pool and a new logical pool; the physical pool is created successfully, but creating the logical pool fails. I tried setting mds.copyset.scatterWidthFloatingPercentage to 30 in topology.yaml, because from the max and min values in the log I calculate the actual scatterWidthFloatingPercentage to be 29, which exceeds 20. Unfortunately the setting did not take effect; the log still shows scatterWidthFloatingPercentage = 20.

I also tested the network bandwidth of the new servers: there is no difference from the old ones, except a slightly higher TCP reconnect count, which overall should not matter much. The two batches of machines are on different subnets; bandwidth between them tests fine, but UDP between the two subnets appears to be blocked. I don't know whether that has any effect.
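
For completeness, here is how I attempted the scatterWidthFloatingPercentage override in topology.yaml, mirroring how mds.curvefs.maxFileLength is passed to mds in the file above (only the relevant lines, a sketch of my placement; as noted, the log still showed 20 afterwards):

mds_services:
  config:
    # attempted override of the validation tolerance; did not take effect
    mds.copyset.scatterWidthFloatingPercentage: 30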

It is probably related to the number of servers, the number of chunkserver instances, and the default zone and replica counts; see the rough estimate below.
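
As a back-of-envelope illustration of that dependence (a uniform-placement approximation I am assuming, not Curve's actual placement algorithm; 3 replicas and 3 zones are taken as the usual defaults), the new pool's 60 chunkservers (5 hosts x 12) and 2000 copysets put the estimate in the same range as the average the log reports:

#include <cmath>
#include <cstdio>

int main() {
    // From the log: Cluster size = 60 (5 new hosts x 12 chunkservers),
    // copysetNum = 2000; replicas = 3 and zones = 3 assumed as defaults.
    const int servers = 60, copysets = 2000, replicas = 3, zones = 3;
    const int perZone = servers / zones;                           // 20 servers per zone
    const double perServer = 1.0 * copysets * replicas / servers;  // ~100 copyset replicas each

    // With one replica per zone and uniform placement, a server in another
    // zone shares at least one copyset with probability
    // 1 - (1 - 1/perZone)^perServer; peers can only live in the other zones.
    double pShare = 1.0 - std::pow(1.0 - 1.0 / perZone, perServer);
    double estimate = (servers - perZone) * pShare;

    std::printf("avg copyset replicas per server: %.0f\n", perServer);
    std::printf("estimated average scatter width: %.1f (log reports 37.87)\n", estimate);
    return 0;
}

This prints an estimate of roughly 39.8, close to the 37.8667 average in the log, which is consistent with the scatter-width spread being driven by these counts rather than by the disk type or network.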

Could you share the topology.yaml used for the expansion directly?

Thanks a lot for the support. After adjusting the configuration file, the problem is solved. The expansion topology.yaml is very simple: it just adds 4 machines, with no other changes.
