Best Practice for High Availability...

nsavard

Has anybody here configured an HA environment with DA? By "high availability" I mean that all the data for DA / clients and the DB sitting on "Server A" is replicated live (or every 15 min) to "Server B", and when (or if) Server A freezes, Server B turns all the services on, so the downtime for the clients is less than 1 minute...

Or do you have a better way to do it that you can suggest to us?
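To be concrete, by replication every 15 minutes I'm thinking of something as simple as an rsync job in cron; the hostname and paths below are only examples, adjust for your own setup:

# /etc/crontab on Server A -- push data to Server B every 15 minutes over SSH
*/15 * * * * root rsync -a --delete /home/ serverB:/home/
*/15 * * * * root rsync -a --delete /usr/local/directadmin/data/ serverB:/usr/local/directadmin/data/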

Thanks

Nicolas
 
The best approach is VMware ESX Server.
We are running it now. Also, the replication idea is not so simple to implement. We are running the DB and user files from an iSCSI-mounted device that uses RAID 5.

We used this only for testing. It's much less painful to have a RAID 1 HDD configuration and make backups offsite.

Plus, as I recall, we have been using DA for 2-3 years and I don't remember any issues with it; everything is working great. We changed the kernel and built basically 90% of everything from source.
 
High availability replicating every fifteen minutes? I see several problems with this:

1) Session information is probably impossible to replicate; anyone who has an open session on a server that goes down will lose his/her session.

2) Generally an IP change will fail for someone who has accessed a page recently, even without sessions, meaning viewers who are on a site when one server goes down will probably lose access to it unless they know to clear every DNS cache between them and the authoritative servers (which may not be possible).
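One partial mitigation is keeping the TTL low on any record you expect to repoint, e.g. in a BIND zone file; the name and address below are placeholders, and plenty of resolvers ignore very low TTLs anyway:

; low TTL (300 seconds) so a failover change to the A record propagates reasonably fast
www    300    IN    A    192.0.2.10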

And then there's the issue of replicating your MySQL across various servers. Have you figured out how to do that across the 'net between datacenters?

Because if you're not in multiple datacenters, then you're not truly highly available; what happens if connectivity to the datacenter goes down, or (as has happened within the past year) if the entire datacenter goes down?

So the right way to implement high availability is to run the front end of the site on multiple servers in multiple datacenters, with the database managed on a separate high-availability server with multiple connections, hot-swappable RAID drives or a storage area network, and redundant power supplies.
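On the MySQL side, the basic building block is plain asynchronous master/slave replication; a minimal sketch, with the server IDs and log name as placeholders:

# my.cnf on the master
[mysqld]
server-id = 1
log-bin   = mysql-bin

# my.cnf on the slave
[mysqld]
server-id = 2

The slave is then pointed at the master with CHANGE MASTER TO ... and START SLAVE. Being asynchronous, it only gets you part of the way; keeping it healthy over a WAN link between datacenters, and handling failover and failback, is where the real work is.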

That said, we recently had a hard machine failure (our first in three years). It was easy enough to quickly swap the drives to another server, and get the sites up and running again.

In this case it didn't happen as quickly as we'd have liked, because the swap-server wasn't in the datacenter, and there was a major traffic accident on the road to the datacenter (they closed it in both directions for over an hour; they had to do a helicopter evacuation), and the same morning many streets in Los Angeles (within a quarter-mile of the datacenter) were closed for a parade honoring the Los Angeles Lakers basketball team.

But it could have been quick, if we'd already had the server in the datacenter, which we should have; we now know better.

If you've got an extra server on standby in the datacenter, and standby staff (most datacenters offer this at a cost), you can have the drives swapped rapidly at a much lower cost than true redundancy.

Note there are still issues; for example, you should NOT have the NIC IDs (MAC addresses) in your network configuration files: if they're wrong the network won't start (and they will be wrong after a drive swap), but if they're absent the network will start just fine.
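On a Red Hat style box, for example, the interface file should look roughly like this, with no HWADDR= line pinning it to a specific MAC (the addresses are placeholders; the same idea applies to udev's persistent-net rules if your distribution generates them):

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
# no HWADDR= line -- after a drive swap the new machine's MAC won't match the old one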

We're currently working on a combination of email redundancy from a different datacenter and redundant server availability in the hosting datacenter; that should satisfy most clients in a cost-effective manner.

Jeff
 
15 minutes? Well, it's better to do live replication plus load balancing; that way the "stand-by" server is contributing its resources too.
My configuration for replication: DRBD (since it's network RAID, replication is instantaneous) + inotify + csync2 for /etc/shadow and /etc/passwd.
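Roughly, the relevant bits look like this; hostnames, IPs and the backing partition are just examples:

# /etc/drbd.conf -- one resource mirrored between the two nodes
resource r0 {
    protocol C;              # synchronous: a write returns once both nodes have it
    device    /dev/drbd0;
    disk      /dev/sda3;     # backing partition on each node
    meta-disk internal;
    on node1 { address 10.0.0.1:7788; }
    on node2 { address 10.0.0.2:7788; }
}

# /etc/csync2.cfg -- keep the password files identical on both nodes
group cluster {
    host node1 node2;
    key  /etc/csync2.key;
    include /etc/passwd;
    include /etc/shadow;
}

csync2 only syncs when you run it, so it gets kicked off by an inotify watcher (or plain cron) whenever those files change.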

1) Session information is probably impossible to replicate; anyone who has an open session on a server that goes down will lose his/her session.
So how do YouTube, Google, LiveJournal, and similar sites work? It should be possible to replicate sessions somehow.
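As far as I know the usual trick is to keep sessions out of local files and put them in a store that every web server can reach; for PHP with the PECL memcache extension that's just a php.ini change (the hosts below are placeholders):

; store PHP sessions in shared memcached servers instead of local /tmp files
session.save_handler = memcache
session.save_path    = "tcp://10.0.0.10:11211,tcp://10.0.0.11:11211"

A database-backed session handler works the same way in principle.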
 
"We're currently working on a combination of email redundancy from a different datacenter and redundant server availability in the hosting datacenter; that should satisfy most clients in a cost-effective manner."

Jeff -- Would you want to share some details on the features and a timeline?
 
Jeff -- Would you want to share some details on the features and a timeline?
I have no idea at present; I don't even know when I'll have time to work on it.

But it has moved from "never" to my "eventually" list.

Jeff
 
VMware

The best solution I have found is running your server as a VMware image. Your applications should not be hardware- or network-dependent, and VMware provides the abstraction layer needed. It handles all the multipathing needed for the SAN, RARP requests for network teaming and failover, and your resource management for HA. We run our enterprise in a blade center with redundant power connections (each wired to a 20-amp socket on a separate power grid). The switches are set up the same way: 4 NICs (2 on the HBA) to 2 switches, redundant to the layer-3 core. The same applies to the SAN: 2 HBA cards with 2 controllers on the SAN. We also use VMware Fault Tolerance to cover local blade failure, so there is no downtime if a blade suddenly dies. And then we use a disaster recovery solution that allows us to fail over to another blade center in another location within 5-10 minutes (VM boot-up time). If all of this still results in an outage, we have much bigger problems that need to be addressed.
 
Isn't failure of the SAN hardware still a single point of failure?

If so, is it less likely to fail than your blade solution?

And do you protect against failure of the data center?

Jeff
 
Isn't failure of the SAN hardware still a single point of failure?

If so, is it less likely to fail than your blade solution?

And do you protect against failure of the data center?

Jeff

The SAN hardware is highly redundant. Not only do we have the standard arrays, we have redundant tray arrays with dual-link Fibre Channel fabric connections. Each tray has 2 power supplies that again split to the two different power grids. Each fibre link goes to a separate controller. It would take a very big crash for the SAN to fail completely, and at that point we would restore to our backup solution, which also has Fibre Channel access. For peace of mind we also bought the GOLD support: we had a drive fail, and within 10 minutes they called us to schedule a swap within 4 hours. Our blade center is the same way; it has 2 power supplies, each splitting to 4 20-amp sockets. As for failure of the datacenter, that's where our offsite location can resume trunking on minutes' notice.
 
It sounds as if you've got it covered. What do you tell clients who expect this to be cheap :)?

Jeff
 
LOL, it is used for a whole lot of stuff beyond DA. I really enjoy VMware because of the application abstraction it provides.
 