Duplicate target ID - BeeGFS

Check if you can help me.

We have an old BeeGFS installation running version 7.1.5 on EL7, and one of the storage targets went offline (it was not replaced). After it came back, the buddy mirror entered a failed state that we cannot recover from.

If we try to set the target back to online, it fails:

[root@headnode beegfs]# beegfs-ctl --nodetype=storage --setstate --state=good --force --targetid=13

Node did not accept state change. Error: Unknown storage target

The state looks like this:

[root@headnode ~]# beegfs-ctl --listtargets --nodetype=storage --state

TargetID Reachability Consistency NodeID
======== ============ =========== ======
1 Online Good 1
2 Online Good 2
3 Online Good 3
4 Online Good 4
5 Online Good 5
6 Online Good 6
7 Online Good 7
8 Online Good 8
9 Online Good 9
10 Online Good 10
11 Online Good 11
12 Online Good 12
13 Offline Good 13
14 Online Good 14
16 Online Good 13

Please note that a new target with ID 16 appeared on node 13, where the ID should be 13. I tried to swap it back to 13 but was unable to.

[[email protected] ~]# beegfs-ctl --removetarget 13

Given target is part of a buddy mirror group. Aborting.

[root@n13 ~]# beegfs-ctl --removemirrorgroup --mirrorgroupid=7 --nodetype=storage --dry-run

Could not remove buddy group: Communication error

I think we are doing something wrong, since the buddy mirror setup can be difficult to work with.

Any help is greatly appreciated. Thank you.

PS: For completeness, the checks seem to be fine:

[[email protected] ~]# beegfs-df

METADATA SERVERS:
TargetID Cap. Pool Total    Free    %   ITotal IFree  %
======== ========= =====    ====    =   ====== =====  =
1        normal    218.2GiB 66.9GiB 31% 109.2M 107.8M 99%

STORAGE TARGETS:
TargetID Cap. Pool Total    Free    %   ITotal IFree  %
======== ========= =====    ====    =   ====== =====  =
[ERROR from beegfs-storage n13.mintrop.usp.br [ID: 13]: Unknown storage target]
13       emergency 0.0GiB   0.0GiB  0%  0.0M   0.0M   0%

Answer by Jaqueline Botelho

Solution found: the problem was on the node, which was using different target IDs than the ones the headnode was seeing. The headnode sees the file below, whose first entries correspond to the nodes in ascending order (n01, n02...n14):

[root@headnode ~]# cat /data1/beegfs/mgmtd/targetNumIDs

0-5E3B6573-1=1
0-5E3B6592-2=2
0-5E3B65B2-3=3
0-5E3B65D1-4=4
0-5E3B65F1-5=5
0-5E3B6610-6=6
0-5E3B6630-7=7
0-5E3B664F-8=8
0-5E3B666E-9=9
0-5E3B6690-A=A
0-5E3B66B1-B=B
0-5E3B66D2-C=C
0-5E3B66F3-D=D
0-5E3B6714-E=E
0-626C29BD-D=F
0-62853797-D=10
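The numeric IDs on the right-hand side of this file are hexadecimal, which is easy to miss for values such as 10. As a minimal sketch, the helper below decodes each line into a decimal target ID. The function name decode_target_map is hypothetical (it is not a BeeGFS tool), and the sample input is the last three lines of the file above.

```shell
# Decode "stringID=hexNumID" lines into decimal numeric target IDs.
# decode_target_map is a hypothetical helper, not part of BeeGFS.
decode_target_map() {
    while IFS='=' read -r string_id hex_id; do
        # Skip empty lines.
        [ -n "$string_id" ] || continue
        # printf interprets a 0x-prefixed argument as hexadecimal.
        printf '%s -> %d\n' "$string_id" "0x$hex_id"
    done
}

# Example with the three entries that end in -D:
decode_target_map <<'EOF'
0-5E3B66F3-D=D
0-626C29BD-D=F
0-62853797-D=10
EOF
# prints:
# 0-5E3B66F3-D -> 13
# 0-626C29BD-D -> 15
# 0-62853797-D -> 16
```

On the headnode the same helper could read the real file with decode_target_map < /data1/beegfs/mgmtd/targetNumIDs. Several string IDs ending in the same suffix, as with -D here, suggest that the same target was registered more than once.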

On n13, the file /data1/beegfs/storage/targetID held the ID from the last entry above, 0-62853797-D=10. The number after the equals sign is hexadecimal, so 10 corresponds to 16 in decimal:

[root@headnode ~]# echo "obase=16; 16" | bc
10

So the solution was to change the targetID to the hexadecimal corresponding to the number 13:

[root@headnode ~]# echo "obase=16; 13" | bc
D

In the headnode's /data1/beegfs/mgmtd/targetNumIDs file, D corresponds to the entry 0-5E3B66F3-D=D. So two changes were made on n13: the targetNumID and targetID files, which held 16 and the 0-62853797-D=10 identifier respectively, were edited so that they now contain:

[root@n13 ~]# cat /data1/beegfs/storage/targetNumID
13
[root@n13 ~]# cat /data1/beegfs/storage/targetID
0-5E3B66F3-D
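To double-check an edit like this, the two files on the node can be compared against the management daemon's mapping. The sketch below assumes that a copy of the headnode's targetNumIDs file is readable from where the check runs. The function name check_target_files and its arguments are hypothetical.

```shell
# Compare a storage directory's targetID/targetNumID files with a copy of
# the mgmtd mapping file. check_target_files is a hypothetical helper.
# Usage: check_target_files <storage_dir> <map_file>
check_target_files() {
    storage_dir=$1
    map_file=$2
    string_id=$(cat "$storage_dir/targetID")
    num_id=$(cat "$storage_dir/targetNumID")
    # Look up the hexadecimal numeric ID registered for this string ID.
    hex_id=$(awk -F= -v id="$string_id" '$1 == id { print $2 }' "$map_file")
    if [ -z "$hex_id" ]; then
        echo "string ID $string_id not found in $map_file"
        return 1
    fi
    # Convert the hexadecimal value to decimal before comparing.
    expected=$(printf '%d' "0x$hex_id")
    if [ "$expected" -eq "$num_id" ]; then
        echo "consistent: $string_id = $num_id"
    else
        echo "mismatch: $string_id maps to $expected, node has $num_id"
        return 1
    fi
}
```

With the paths from this answer, the call would look like check_target_files /data1/beegfs/storage /path/to/copy/of/targetNumIDs, where the second path is a placeholder for wherever the copy was placed.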

Once this is done, restart the beegfs-storage and beegfs-meta services. The target list then shows target 13 online again:

[root@headnode ~]# beegfs-ctl --listtargets --nodetype=storage --state
TargetID Reachability Consistency NodeID
======== ============ =========== ======
1 Online Good 1
2 Online Good 2
3 Online Good 3
4 Online Good 4
5 Online Good 5
6 Online Good 6
7 Online Good 7
8 Online Good 8
9 Online Good 9
10 Online Good 10
11 Online Good 11
12 Online Good 12
13 Online Good 13
14 Online Good 14
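With many targets, a listing like this can be scanned automatically. As a rough sketch, the filter below reads output in the four-column format shown above and reports every target that is not Online and Good. The function name report_bad_targets is hypothetical.

```shell
# Print every target whose state is not "Online Good"; return non-zero
# if any was found. report_bad_targets is a hypothetical helper.
report_bad_targets() {
    awk '
        # Only data rows start with a numeric TargetID; headers are skipped.
        $1 ~ /^[0-9]+$/ && ($2 != "Online" || $3 != "Good") {
            printf "target %s on node %s: %s/%s\n", $1, $4, $2, $3
            bad = 1
        }
        END { exit bad }
    '
}
```

It could be used as beegfs-ctl --listtargets --nodetype=storage --state | report_bad_targets, which prints nothing and returns 0 when every target is healthy.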

Best regards, Jaqueline