[Help] Cluster expansion fails with "Validate copyset failed in scatterWidthFloatingPercentage = 20"

While expanding the cluster we added several new servers with 1.8T SSDs; the pre-expansion servers all use 7.8T HDDs. Creating the logical pool fails, and the log shows: Validate copyset failed in scatterWidthFloatingPercentage = 20, actual minValue = 33, maxValue = 48, average = 37.8667. The config file has scatterWidthFloatingPercentage = 20, and modifying it seems to have no effect. Do I need to tear the cluster down and reinstall?

2023-03-21T17:45:15.711776+0800 91 copyset_policy.cpp:107] Generate copyset success, numCopysets = 2000
2023-03-21T17:45:15.719236+0800 91 copyset_validation.cpp:101] Validate copyset failed in scatterWidthFloatingPercentage = 20, actual minValue = 33, maxValue = 48, average = 37.8667
2023-03-21T17:45:15.719247+0800 91 copyset_manager.cpp:128] Validate copyset metric failed, retry = 9
2023-03-21T17:45:15.719249+0800 91 copyset_manager.cpp:132] GenCopyset retry times exceed, times = 10
2023-03-21T17:45:15.719259+0800 91 topology_service_manager.cpp:907] GenCopysetForPageFilePool failed, Cluster size = 60, copysetNum = 2000, scatterWidth = 5, logicalPoolid = 24
2023-03-21T17:45:15.719363+0800 91 topology_service_manager.cpp:828] CreateCopysetForLogicalPool fail in : GenCopysetForPageFilePool.
2023-03-21T17:45:15.720600+0800 91 topology_service.cpp:611] Send response[log_id=1] from 10.225.167.29:6700 to 10.225.167.29:54884. [CreateLogicalPoolResponse] statusCode: -12
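
Reading the numbers in that log line, the check apparently requires minValue and maxValue to stay within ±scatterWidthFloatingPercentage of the average. A minimal sketch of my reading (an assumption based on the log fields, not the actual Curve source; ValidateScatterWidth here is a made-up stand-in):

#include <cstdio>

// Band check implied by the log: min/max scatter width must lie
// within average * (1 +/- percent/100).
bool ValidateScatterWidth(double minValue, double maxValue,
                          double average, double percent) {
    double low  = average * (1.0 - percent / 100.0);
    double high = average * (1.0 + percent / 100.0);
    return minValue >= low && maxValue <= high;
}

int main() {
    // Values from the log: minValue = 33, maxValue = 48, average = 37.8667.
    // percent = 20 gives a band of [30.29, 45.44]; maxValue = 48 falls outside.
    std::printf("20%%: %s\n", ValidateScatterWidth(33, 48, 37.8667, 20) ? "pass" : "fail");
    // percent = 30 widens the band to [26.51, 49.23] and the same values pass.
    std::printf("30%%: %s\n", ValidateScatterWidth(33, 48, 37.8667, 30) ? "pass" : "fail");
    return 0;
}

By that arithmetic, maxValue = 48 sits about 27% above the average of 37.8667, so a tolerance of 20 fails while 30 would pass.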

Was this deployed with curveadm?
Could you paste the topology.yaml from before and after the expansion?

I used curveadm. The error message is:

Error-Code: 410019
Error-Description: create logical pool failed
Error-Clue: E 2023-03-22T11:23:01.855381+0800 17073 curvefsTool.cpp:193] CreateLogicalPool Rpc response fail. Message is :statusCode: -12
E 2023-03-22T11:23:01.855410+0800 17073 curvefsTool.cpp:1109] exec fail, ret = -12

Below are the mds and chunkserver sections of the pre-expansion topology.yaml:

mds_services:
  config:
    listen.ip: ${service_host}
    listen.port: 670${service_host_sequence}
    listen.dummy_port: 770${service_host_sequence}
    mds.curvefs.maxFileLength: 549755813888000
  deploy:
    - host: ${machine06}
    - host: ${machine12}
    - host: ${machine18}

chunkserver_services:
  config:
    listen.ip: ${service_host}
    listen.port: 82${format_replicas_sequence}
    data_dir: /data/chunkserver${service_replicas_sequence}
    copysets: 100
  deploy:
    - host: ${machine01}
      replicas: 12
    - host: ${machine02}
      replicas: 12
    - host: ${machine03}
      replicas: 12
    - host: ${machine04}
      replicas: 12
    - host: ${machine05}
      replicas: 12
    - host: ${machine06}
      replicas: 12
    - host: ${machine07}
      replicas: 12
    - host: ${machine08}
      replicas: 12
    - host: ${machine09}
      replicas: 12
    - host: ${machine10}
      replicas: 12
    - host: ${machine11}
      replicas: 12
    - host: ${machine12}
      replicas: 12
    - host: ${machine13}
      replicas: 12
    - host: ${machine14}
      replicas: 12
    - host: ${machine15}
      replicas: 12
    - host: ${machine16}
      replicas: 12
    - host: ${machine17}
      replicas: 12
    - host: ${machine18}
      replicas: 12
    - host: ${machine19}
      replicas: 12

  1. What about the topology.yaml used for the expansion?
  2. Did the cluster deploy normally before the expansion?

The expansion added 5 machines, also configured with 12 chunkservers each; the only difference is the disks: 2T SSDs instead of the previous 8T HDDs. The cluster had deployed successfully before the expansion. From the generated JSON config I can see the new servers are assigned to a new physical pool and a new logical pool; the physical pool is created successfully, but creating the logical pool fails. I tried setting mds.copyset.scatterWidthFloatingPercentage to 30 in topology.yaml, because from the max and min values in the log I calculate the actual scatterWidthFloatingPercentage to be 29, which exceeds 20. Unfortunately the setting did not take effect; the log still shows scatterWidthFloatingPercentage = 20.

I also tested the network bandwidth of the new servers: there is no difference from the old ones, except a slightly higher TCP reconnect count, which overall should not matter much. The two batches of machines are on different subnets; bandwidth between them tests fine, but UDP between the two subnets appears to be blocked. I don't know whether that has any effect.
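
For completeness, here is how I attempted the scatterWidthFloatingPercentage override in topology.yaml, mirroring how mds.curvefs.maxFileLength is passed to mds in the file above (only the relevant lines, a sketch of my placement; as noted, the log still showed 20 afterwards):

mds_services:
  config:
    # attempted override of the validation tolerance; did not take effect
    mds.copyset.scatterWidthFloatingPercentage: 30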

It is probably related to the number of servers, the number of chunkserver instances, and the default zone and replica counts; see the rough estimate below.
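
As a back-of-envelope illustration of that dependence (a uniform-placement approximation I am assuming, not Curve's actual placement algorithm; 3 replicas and 3 zones are taken as the usual defaults), the new pool's 60 chunkservers (5 hosts x 12) and 2000 copysets put the estimate in the same range as the average the log reports:

#include <cmath>
#include <cstdio>

int main() {
    // From the log: Cluster size = 60 (5 new hosts x 12 chunkservers),
    // copysetNum = 2000; replicas = 3 and zones = 3 assumed as defaults.
    const int servers = 60, copysets = 2000, replicas = 3, zones = 3;
    const int perZone = servers / zones;                           // 20 servers per zone
    const double perServer = 1.0 * copysets * replicas / servers;  // ~100 copyset replicas each

    // With one replica per zone and uniform placement, a server in another
    // zone shares at least one copyset with probability
    // 1 - (1 - 1/perZone)^perServer; peers can only live in the other zones.
    double pShare = 1.0 - std::pow(1.0 - 1.0 / perZone, perServer);
    double estimate = (servers - perZone) * pShare;

    std::printf("avg copyset replicas per server: %.0f\n", perServer);
    std::printf("estimated average scatter width: %.1f (log reports 37.87)\n", estimate);
    return 0;
}

This prints an estimate of roughly 39.8, close to the 37.8667 average in the log, which is consistent with the scatter-width spread being driven by these counts rather than by the disk type or network.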

Could you share the topology.yaml used for the expansion directly?

Thanks a lot for the support. After adjusting the configuration file, the problem is solved. The expansion topology.yaml is very simple: it just adds 4 machines, with no other changes.
