Forums updated to SMF version 2.1.1
Started by elf4o, July 27, 2022, 06:32:41 PM
QuoteHi Elf40, So I'm not completely clear on your setup here .. you have two servers, or two virtual servers, or two servers and two virtual servers? And when you say cluster, are you talking about a specific type if Linux clustering, or some SAP Application specific clustering?If this is Linux clustering with a highly available SAP database, you will need to do a little linking to make the two stay in sync and the linking would depend on your flavour of clustering. If this is all SAP / SAP clustering, then this sounds like an issue for SAP or at least a SAP engineer.At the end of the day, if you're running commercial / enterprise Linux, a commercial application, and running on a Microsoft cloud service, ultimately you might want to consider a commercial support contract
Quote
Ok, so it doesn't look like your two nodes are in sync, in which case I wouldn't expect to be able to switch the master node. The first thing you need to do is get them in sync again. Run a status command on "both" nodes and see what each node thinks the status is. It may be that both nodes think they are primary, for example, i.e. split-brain syndrome. I would expect to see 'something' against 'exitreason' on at least one of the servers, which should help.

It's always difficult with two servers: once the link is severed, getting quorum again means telling one of the servers it's not master, even tho' it might think it is. Having a third server makes arbitration much, much easier most of the time. (I know quite a few people who run a third "dummy" server just to get a quorum of >1, so if one server goes down you're left with a node count of two, which means the remaining server knows it has quorum and should still be master.)

Corosync .. I'm not very familiar with it, but as I understand it, Corosync is essentially a very lightweight, generic cluster membership and messaging layer that keeps a series of 'clustered' servers in agreement about which nodes are up. I have used it in the past; can't say I was terribly impressed.

Incidentally, I notice one of the servers lists "stonith" capabilities, which is designed to prevent split-brain. Historically STONITH (Shoot The Other Node In The Head) required special hardware; essentially, if a server spots a problem with the other, it literally kills it and takes over .. so did server #1 go down when server #2 got promoted to master?

I think the status commands are "crm status" and "pcs status", but it's a long time since I used it. Again, this is a non-trivial issue; if you're a Windows admin you might want to find a local Linux guy you could hand it off to.
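The quorum arithmetic in the post above can be sketched in a few lines (an illustrative toy, not code from corosync or any cluster tool):

```python
# Majority quorum: a partition keeps quorum only if it holds more than
# half of the total configured votes, i.e. floor(n/2) + 1.
def quorum_threshold(total_votes: int) -> int:
    return total_votes // 2 + 1

def has_quorum(votes_present: int, total_votes: int) -> bool:
    return votes_present >= quorum_threshold(total_votes)

# Two-node cluster: lose one node and the survivor holds 1 of 2 votes,
# which is NOT a majority -- it cannot safely claim to be master.
assert not has_quorum(1, 2)

# Add a third "dummy" vote-only node: lose one node and the remaining
# two still hold 2 of 3 votes, so the surviving master keeps quorum.
assert has_quorum(2, 3)
```

This is exactly why the two-node case degenerates into split-brain: each isolated node sees 1 of 2 votes and has no way to arbitrate, whereas a third voter always leaves one side with a strict majority.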
Quote
Corosync .. I'm not very familiar with it, but as I understand it, Corosync is essentially a very lightweight, generic cluster membership and messaging layer that keeps a series of 'clustered' servers in agreement about which nodes are up.
Quote
Quote
Corosync .. I'm not very familiar with it, but as I understand it, Corosync is essentially a very lightweight, generic cluster membership and messaging layer that keeps a series of 'clustered' servers in agreement about which nodes are up.

Corosync runs on Linux, but it may also run on other *nix systems, not sure .. but it appears to be developed by a community that's not specifically linked to SAP. It may be that SAP provide it as a component of their product, as it's effectively an application in its own right rather than being a "part" of Linux as such. I guess it maybe comes down to who/what installed or maintains it. Did you install and configure it manually, or did it get installed as part of the SAP installation?
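If you want to see what's actually running and where corosync came from, a few read-only commands like these can help answer the "who installed it" question (a sketch assuming a systemd-based RPM distribution such as SLES or RHEL, which is typical for SAP; paths and package names may differ on your system):

```shell
# Are corosync and pacemaker running, and under which systemd units?
systemctl status corosync pacemaker

# Which package owns the daemon and its config file?
# (An SAP-bundled copy would show up under a different package/vendor.)
rpm -qf /usr/sbin/corosync /etc/corosync/corosync.conf
rpm -qi corosync        # vendor, build host, install date

# Corosync's own view of its communication rings/links:
corosync-cfgtool -s
```

These commands change nothing, so they are safe to run on the live nodes while you decide whether this is a Linux-side or SAP-side problem.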
Quote
Mmm, well unless you can get someone in who knows corosync/Linux, all I can recommend is:

- Read the corosync docs, it's all online / open source
- Aim to resolve "why" the servers aren't sync'd
- After resolving, the "promote" option from the crm shell should let you choose which master you prefer

If the cluster is set up correctly, my expectation would be that rebooting the current secondary should attempt to resolve the issue and reconnect the cluster, but depending on "why" it failed and how it was set up, it's not a 'given'. You may still need to do a little work to resolve the issue.

At the end of the day, the fallback process for this kind of thing would be to remove the secondary, clean its local config files, then re-add it to the cluster. If this is a production environment, I wouldn't attempt this unless you know what you're doing. Typically I'd duplicate the setup in VMs, break it, then attempt to resolve it ... before trying it on the live servers.
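The recovery path described above might look roughly like this with the pcs tooling (a hypothetical sketch for a RHEL-style cluster; SLES setups usually use crmsh/`crm` instead, and `<secondary-node>` stands in for your actual node name -- rehearse in VMs first, as the post says):

```shell
# 1. Establish what each side believes. Run on BOTH nodes:
pcs status
corosync-quorumtool -s      # corosync's own vote/quorum view

# 2. If the old secondary is merely out of date, restarting its
#    cluster services will often let it rejoin and resync:
pcs cluster stop  <secondary-node>
pcs cluster start <secondary-node>

# 3. Fallback: remove the secondary, wipe its local cluster state,
#    then re-add it. Destructive -- not on production without a rehearsal.
pcs cluster node remove <secondary-node>
pcs cluster destroy         # run ON the removed node: clears its local
                            # corosync/pacemaker configuration
pcs cluster node add <secondary-node>   # back on the surviving node
```

The point of step 3 is that the rejoining node comes back with no stale opinion about who is master, so it simply copies the surviving node's view of the cluster.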