
High availability/cluster problem with master node?

Started by elf4o, July 27, 2022, 06:32:41 PM



elf4o

Hi everyone, I have SUSE Linux Enterprise v15 SP1.
My issue: I have two servers, both the same version, identical machines...
The two servers form a high-availability pair for SAP HANA on Azure VMs on SLES. We observed an issue making Server01 the primary node, possibly due to some underlying issue in cluster management. The HANA services are running fine on both nodes, so ideally either node should be capable of being the master node. Since node Server01 is not working as expected under the cluster, we may face a problem in the scenario where node Server02 becomes unavailable; that would end in service unavailability.

Here is some more information...
We had a drive issue: both servers' HDDs were full of logs. That issue is fixed now and free space has been added.

So, before the drives filled up, node Server01 was working as the master node, and it seems the cluster failed over to node Server02 during this issue. We noticed this during our last patching activity; the database had been running with node 01 as primary until then. This needs more investigation as to why that is not happening now. We have already checked and found the HANA databases are running as expected, yet the cluster service is not making Server01 primary.

We have no evidence from a failover exercise, as one has not been performed yet.

Please, good people, advise what to do in order to fix this.

Mad Penguin

Hi elf4o,

So I'm not completely clear on your setup here .. you have two servers, or two virtual servers, or two servers and two virtual servers? And when you say cluster, are you talking about a specific type of Linux clustering, or some SAP application-specific clustering?

If this is Linux clustering with a highly available SAP database, you will need to do a little linking to make the two stay in sync and the linking would depend on your flavour of clustering. If this is all SAP / SAP clustering, then this sounds like an issue for SAP or at least a SAP engineer.
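
One quick way to tell which it is, from the command line (read-only checks; hdbnsutil is HANA's own tool and has to be run as the <sid>adm user):

  systemctl status corosync pacemaker   # the Linux HA stack (SUSE HA Extension)
  crm status                            # only answers if it's a Pacemaker/corosync cluster
  hdbnsutil -sr_state                   # HANA system replication state, run as <sid>adm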

At the end of the day, if you're running commercial/enterprise Linux and a commercial application on a Microsoft cloud service, you might want to consider a commercial support contract :)
https://twitter.com/garethbult
https://gareth.bult.co.uk


elf4o

Quote from: Mad Penguin: "So I'm not completely clear on your setup here .. you have two servers, or two virtual servers, or two servers and two virtual servers? And when you say cluster, are you talking about a specific type of Linux clustering, or some SAP application-specific clustering? ..."
Hi, I am a Windows admin.
These two machines are running as virtual machines. I don't know if it's a Linux cluster or a SAP HANA-specific cluster; can you tell me? I am not a Linux expert at all... I am a normal Windows admin, I don't own or run company budgets... so it's not possible to get a SAP engineer etc.


I will try to explain again:
because SAP HANA filled the disk with logs, the system decided to switch; it made Server02 master and Server01 slave.
All of the disk issues are fixed now, but I don't know how to revert this.. Here is roughly what the cluster status shows (exact numbers omitted):
Stack: corosync
Current DC: server01 (version 2.0.1-...) - partition with quorum
Last change: 2022 by root via crm_attribute on server02

2 nodes configured
7 resources configured

Online: [ server01 server02 ]

Full list of resources:

 rsc_st_azure    (stonith:fence_azure_arm):    Started server01
 Clone Set: cln_SAPHanaTopology_H71_HDB00 [rsc_SAPHanaTopology_H71_HDB00]
     Started: [ server01 server02 ]
 Clone Set: msl_SAPHana_H71_HDB00 [rsc_SAPHana_H71_HDB00] (promotable)
     Masters: [ server02 ]
     Slaves: [ server01 ]
 Resource Group: g_ip_... (and some more resources under that)

After that I get:

Failed Resource Actions:
 rsc_SAPHana_..._promote_0 on server01 'unknown error' (1): call=106, status=complete, exitreason='', last-rc-change='Mon ... 25 ...'


Maybe that 'unknown error' is the reason; that's simply a guess. At least I've got the right command now.

Can you tell me if this cluster is a Linux cluster or a special SAP/HANA cluster? How do I tell the difference?

Mad Penguin

Ok, so it doesn't look like your two nodes are in sync, in which case I wouldn't expect to be able to switch the master node. The first thing you need to do is get them in sync again. Run a status command on "both" nodes and see what each node thinks the status is. It may be that both nodes think they are primary, for example, i.e. split-brain syndrome. I'd expect to see 'something' against 'exitreason' from at least one of the servers, which should help.
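
Something like the following on each node should do the comparing (SAPHanaSR-showAttr is an assumption on my part; it only exists if the SUSE SAPHanaSR package is installed):

  crm status            # Pacemaker's view: DC, online nodes, masters/slaves
  crm_mon -1r           # one-shot monitor, including inactive resources
  SAPHanaSR-showAttr    # HANA replication state as the cluster sees it
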
It's always difficult with two servers: once the link is severed, getting quorum again means telling one of the servers it's not master, even though it might think it is. Having a third server makes arbitration much, much easier most of the time. (I know quite a few people who run a third "dummy" server just to get a quorum of >1, so if one server goes down you're left with a node count of two, which means the remaining server knows it has quorum and should still be master.)
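
For reference, a two-node corosync cluster normally relies on the "two_node" quorum option instead of a third vote; this is what a typical stanza in /etc/corosync/corosync.conf looks like (illustrative only, check your own file):

  quorum {
      provider: corosync_votequorum
      expected_votes: 2
      two_node: 1    # keep quorum when only one node is up
  }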

Corosync .. I'm not very familiar with it, but as I understand it, Corosync is essentially a lightweight cluster membership and messaging layer; Pacemaker sits on top of it and does the actual resource management (starting, stopping and promoting things). I have used it in the past; can't say I was terribly impressed.

Incidentally, I notice one of the servers lists "stonith" capabilities, which is designed to prevent split-brain. Historically STONITH (Shoot The Other Node In The Head) required special hardware; essentially, if a server spots a problem with the other, it literally kills it and takes over .. in your case the fence agent is fence_azure_arm, which does the killing via the Azure API. So did server #1 go down when server #2 got promoted to master?
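
If you want to see what fencing is actually configured, something along these lines should work (both read-only):

  crm configure show | grep -i stonith   # fencing resources and the stonith-enabled property
  stonith_admin --list-registered        # fence devices the cluster has registered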

I think the status commands are "crm status" and "pcs status" (SLES normally ships the crm shell), but it's a long time since I used it. Again, this is a non-trivial issue; if you're a Windows admin, you might want to find a local Linux person you could hand it off to.

https://twitter.com/garethbult
https://gareth.bult.co.uk

elf4o

Quote from: Mad Penguin: "Ok, so it doesn't look like your two nodes are in sync ... if you're a Windows admin, you might want to find a local Linux person you could hand it off to."

Hi, can you tell me: do you think these are Linux clusters,
or are they SAP/HANA-specific clusters? Which of the two is the case?


Mad Penguin

Quote from my earlier post: "Corosync is essentially a lightweight cluster membership and messaging layer; Pacemaker sits on top of it and does the actual resource management."
Corosync runs on Linux, and may also run on other *nix systems (not sure), but it is developed by a community that's not specifically linked to SAP. It may be that SAP provide it as a component of their product, since it's effectively an application in its own right rather than a "part" of Linux as such; on SLES it normally comes with the SUSE High Availability Extension. I guess it comes down to who/what installed or maintains it. Did you install and configure it manually, or did it get installed as part of the SAP installation?
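
The package metadata usually answers that; on SLES something like the following (output will vary):

  rpm -qi corosync pacemaker | grep -E '^(Name|Vendor|Install Date)'
  zypper info corosync   # shows which repository it came from
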
https://twitter.com/garethbult
https://gareth.bult.co.uk

elf4o

Quote from: Mad Penguin: "I guess it comes down to who/what installed or maintains it. Did you install and configure it manually, or did it get installed as part of the SAP installation?"


I inherited this, so I don't have a clue... I never installed it, configured it, or anything..

Mad Penguin

Mmm, well, unless you can get someone in who knows corosync/Linux, all I can recommend is:
  • Read the corosync docs, it's all online / open source
  • Aim to resolve "why" the servers aren't sync'd
  • After resolving, the "promote" option in the cluster shell (crm) should let you choose which node you prefer as master
If the cluster is set up correctly, my expectation would be that rebooting the current secondary should attempt to resolve the issue and reconnect the cluster, but depending on "why" it failed and how it was set up, it's not a given. You may still need to do a little work to resolve the issue. At the end of the day, the fallback process for this kind of thing would be to remove the secondary, clean its local config files, then re-add it to the cluster; there's a rough sketch below. If this is a production environment, I wouldn't attempt this unless you know what you're doing.
Typically I'd duplicate the setup in VMs, break it, then attempt to resolve it ... before trying it on the live servers.
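
Purely as a sketch of that fallback, using the SLES crmsh bootstrap commands and the resource names from your earlier output (verify every name against your own "crm status" before touching anything, and again: try it in VMs first, not on production):

  # once the underlying problem is fixed, clear the failed promote record
  crm resource cleanup rsc_SAPHana_...   # use the exact resource name from crm status

  # destructive fallback: remove the broken node, then re-join it
  crm cluster remove server01            # run from the healthy node (server02)
  crm cluster join -c server02           # run on server01 after cleaning its local config
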
https://twitter.com/garethbult
https://gareth.bult.co.uk

elf4o

Quote from: Mad Penguin: "If this is a production environment, I wouldn't attempt this unless you know what you're doing."

That's a production environment, so I can't do it...
alone, at least... it's way beyond my little brain.