Mirror / Fail-over / Load Balance

IT_Architect

Correct me if I'm wrong, but there is no reliable, practical way to do mirrored / fail-over / load-balanced servers when e-mail and control panels are involved. A web site and a database are no problem. When it comes to control panels, users, e-mail, etc., it's a big problem.

It also seems like hardware load-balancing solutions introduce a new single point of failure. And if the data center where the hardware sits has a problem, it cannot switch the load to a server in another DC. Multiple DNS records, or changing DNS records, don't actually work either, because remote resolvers often don't respect your low TTLs; it can take 18 hours for the change to clear, not the 5 minutes you put in the TTL.

If you have a cool redundancy idea, I'd like to hear it.
 
You can take a look at the Linux Virtual Server project (http://www.linuxvirtualserver.org/).

You can build a very good failover and load-balancing solution with it. You have to use a file server (and connect to it from the different boxes) or use a synchronization utility to keep each server's user data (e.g. mail, web site files), DNS zones, etc. up to date.

It's even possible to set up a box in another data centre (VS/TUN).
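As a sketch of the synchronization side (nothing LVS-specific; the file layout below is made up for illustration), the core decision is just "which files on the primary are missing or stale on the replica". In practice rsync does this for you wholesale; this only spells out the comparison:

```python
def files_to_sync(primary, replica):
    """Return paths the replica is missing or holds stale copies of.

    primary and replica map a relative path -> (size_bytes, mtime).
    In a real deployment you would build these dicts from os.walk()
    on each box, or simply let rsync handle the whole comparison;
    this only illustrates the decision rsync makes for you.
    """
    stale = []
    for path, meta in primary.items():
        if replica.get(path) != meta:
            stale.append(path)
    return sorted(stale)
```

A cron job on the primary could compute this every minute or two and push the differences (mail spools, docroots, zone files) to the standby.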
 
I guess it all depends on what level of reliability you are trying to achieve and what you are willing to pay for it :)

For example, network related downtime isn't that common these days, I would guess most downtime is caused by hardware problems (read: HD failure). So, if you have 2 servers hosted in the same rack, you can switch IPs from one server to another instantly if one server goes down. This simple setup should increase the reliability dramatically.
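The IP-switch idea needs a watchdog on the standby box. A minimal sketch of just the decision logic (the actual takeover, i.e. adding the IP and announcing it with a gratuitous ARP, is left out, and the threshold of 3 probes is an arbitrary choice):

```python
def should_fail_over(health_history, threshold=3):
    """Decide whether the standby should seize the shared IP.

    health_history is the recent probe history for the primary,
    newest last (True = primary responded). Requiring several
    consecutive failures avoids flapping on a single dropped ping.
    """
    if len(health_history) < threshold:
        return False  # not enough evidence yet
    return not any(health_history[-threshold:])
```

Run the probe every few seconds from the standby; only when the last `threshold` probes all failed does it take over the primary's IP.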

Having a backup mail server is also fairly easy. You can set it up to use the first server as a "smarthost", meaning that all the mail eventually ends up on the first server unless that server becomes unavailable.
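A minimal sketch of that arrangement, assuming Postfix on the backup box (all hostnames are placeholders): the zone advertises the backup at a lower preference, and the backup relays everything it accepts to the primary as its smarthost.

```
; DNS zone: backup MX at lower preference (higher number)
example.com.   IN  MX 10 mail1.example.com.   ; primary
example.com.   IN  MX 20 mail2.example.com.   ; backup

# Postfix main.cf on mail2: accept mail for the domain, relay it to the primary
relay_domains = example.com
relayhost     = [mail1.example.com]
```

While mail1 is down, mail queues up on mail2 and is delivered once the primary returns.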

I would also like to enter a "not guilty" plea on behalf of multiple DNS records :) They have worked just fine in all the load-balancing projects we've implemented. Do you have a story to tell to support that argument, or was it based on a hunch? ;)
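For concreteness, "multiple DNS records" here just means several A records for the same name, i.e. round-robin DNS (addresses below are documentation placeholders). Most resolvers rotate the answer order across queries, and many clients will try the next address if the first refuses connections, which is what makes this usable as crude load spreading with some failover.

```
www.example.com.  300  IN  A  192.0.2.10
www.example.com.  300  IN  A  192.0.2.11
www.example.com.  300  IN  A  198.51.100.10
```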
 
Webcart said:
Do you have some story to tell to support that argument or was it based on a hunch? ;)
No hunch. It is based on threads on forums of other people who have tried it thinking it would work and learned that other DNS servers didn't respect the short TTL which resulted in 18 hours before things finished moving for the most part. I have never implemented this.

I do have something that happened about 3 weeks ago though when I moved our hotel site. The web site moved at least 4 hours before the MX. I never experienced that before.
 
IT_Architect said:
No hunch. It is based on threads on forums of other people who have tried it thinking it would work and learned that other DNS servers didn't respect the short TTL which resulted in 18 hours before things finished moving for the most part. I have never implemented this.
This all probably relates to changing DNS records, not to what you've referred to as "multiple DNS records".


I do have something that happened about 3 weeks ago though when I moved our hotel site. The web site moved at least 4 hours before the MX. I never experienced that before.
There is nothing strange about that.
For example, if you had sent e-mail to that domain (but hadn't accessed the site) before changing the DNS records, then the MX record info might have been cached on your end while the A record was not.
In any event, the way you describe it, there is no proof that the low TTL values weren't obeyed. Do you have some dig/nslookup snapshots?
 
Originally posted by Webcart This all probably relates to changing DNS records, not to what you've referred to as "multiple DNS records".
No, these discussions are only about fail-over and multiple DNS records.
Originally posted by Webcart There is nothing strange about that.
For example, if you've sent e-mail to that domain (but haven't accessed the site) before changing DNS records, then MX record info might have been cached on your end.
It makes sense, but I never had that happen to me with our own servers since we are on them all the time, both mail and http.
Originally posted by Webcart In any event, the way you describe it there is no proof that low TTL values haven't been obeyed. Do you have some dig/nslookup snapshots?
I think I was quite clear that I'm not the one who discovered this; the web forums are replete with testimonials from others who have tried it. Here is one article, different from all the others I've read, that states the same thing: http://www.hostpapers.com/article/265. There are people who were surprised by this when they actually had a failure. One of them waited 18 hours.
 
IT_Architect said:
I think I was quite clear that I'm not the one who discovered this; the web forums are replete with testimonials from others who have tried it. Here is one article, different from all the others I've read, that states the same thing: http://www.hostpapers.com/article/265. There are people who were surprised by this when they actually had a failure. One of them waited 18 hours.

I didn't say you should provide your own snapshots :) A quote from a technical article discussing this issue in detail would be just fine. The article you mention, however, looks purely theoretical. It doesn't establish that ignoring low TTL values is a common practice. In fact, there is no evidence it ever happened:
The problem with this scenario, is that, while some ISP's caching might respond to such low figures, other ISP's may decide to ignore,(to save on bandwidth utilization), any TTL's below a certain value, say, 60 minutes. So it is entirely possible that some of your visitors would see your websites and for others, your site would be down for 1 hour or more, even though one of your servers was operating perfectly.
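The clamping behaviour the article speculates about is easy to model. Below is a toy cache, not any ISP's documented policy, that enforces a 60-minute floor on TTLs; a record published with a 5-minute TTL then keeps being served stale for up to an hour after a change:

```python
class ClampingCache:
    """Toy resolver cache that ignores TTLs below a configured floor."""

    def __init__(self, min_ttl):
        self.min_ttl = min_ttl
        self.store = {}  # name -> (value, expires_at)

    def lookup(self, name, now, authoritative):
        """Return the cached answer if still 'fresh', else re-query upstream.

        authoritative(name) returns (value, ttl) as the zone publishes it.
        """
        if name in self.store:
            value, expires_at = self.store[name]
            if now < expires_at:
                return value  # serving the (possibly stale) cached answer
        value, ttl = authoritative(name)
        effective_ttl = max(ttl, self.min_ttl)  # here is the clamp
        self.store[name] = (value, now + effective_ttl)
        return value
```

With `min_ttl=3600`, an answer cached just before an IP switch keeps being returned until a full hour has passed, regardless of the 300-second TTL the zone publishes, which matches the "down for 1 hour or more for some visitors" scenario in the quote.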
 
Webcart said:
I didn't say you should provide your own snapshots :) A quote from a technical article discussing this issue in detail would be just fine. The article you mention, however, looks purely theoretical. It doesn't establish that ignoring low TTL values is a common practice. In fact, there is no evidence it ever happened:
Well, I didn't think I would need to dig up that thread about a 5-minute TTL with an 18-hour resolution time when I read it. However, you're not out of luck: there is a thread going on right now at WebHostingTalk about exactly this: http://webhostingtalk.com/showthread.php?t=548656. Perhaps you could straighten them out so they don't lead people astray while I try to remember. I also remember this from our own experience. Before the west coast office moved from Vancouver to Hillsboro, OR, when we changed servers, the switch twice took 4 DAYS on a 1440 timeout. They were on Shaw Cable, who got lots of calls about it, from us as well. In fact, we've never been able to figure out how they decide how long to cache things, other than that it seems to work normally on weekends.

PS: I'm rooting for you, because DNS would surely be the cheapest way to solve the problem.
 
IT_Architect said:
Also, I do remember this from our own experience. Before the west coast office moved from Vancouver to Hillsboro, OR, when we changed servers, the switch two different times took 4 DAYS for a 1440 timeout. They were on Shaw Cable and they got lots of calls about that, and from us as well.

Well, again, without technical details it's hard to know for sure why it happened. Maybe the switch involved a change of authoritative nameservers for that domain rather than a change of A records. Maybe the authoritative nameservers were under heavy load and couldn't respond promptly to DNS queries, in which case any reasonable nameserver wouldn't clear its cache for some time.

It could even be a network problem making the authoritative nameservers unreachable for some users. Yes, it still happens (a recent problem from my personal experience: part of the Global Crossing network was unreachable to one Canadian ISP).
 
Well... I tried it: I set the TTL to 10 minutes more than 24 hours ago. The TTL normally coming out of DA is 4 hours, so it should be safe enough, and I moved one of the less critical sites to another server. Here is how it went:
Our west coast office saw everything happen almost right away. The midwest office is still waiting after more than 4 hours, sort of. Here is what I mean:

- NS, MX, mail.domain.com, moved almost instantly.
- domain.com, www.domain.com, ftp.domain.com have yet to move, and it's been hours.
- Both offices use the same ISP, Comcast.

I asked a guy I sometimes hire to make server modifications, who also runs a web hosting company with several servers. His response was: "yeah its nothing new I am afraid, loads of isps have hardcoded TTL set in their servers to save on dns traffic, with them its just a case of riding it out. Nothing can be done about it whatsoever." I also see several posts on this forum supporting this from jlasman, another web hosting company owner. He mentions AOL doesn't respect TTL. Between Comcast and AOL we are talking about a pile of users, so setting the TTL low can help, but realize that loads of users are never going to be hitting your low-TTL records.
http://directadmin.com/forum/showthread.php?s=&threadid=7416&highlight=TTL
Well... now that I know for myself, and the mail-and-HTTP scenario has just repeated itself (mail records changing instantly and not the rest), I'm going to have to live with it, I guess, and accept that sites have to move at slow times and will probably take 12 hours or so to reach most users no matter what you put in the TTL.

The site isn't actually down; it's just that half the stuff is coming from one server and the other half from the other, for probably half the world.

That's not really the end of the story, though. There must be something special about how DynDNS works, because I know my LAN customers with dynamic IPs don't wait that long, and I've used it myself on sites; it's pretty quick.
 
I figured out some more. The domain finally resolved, and at the same times of day as when I changed IPs for other domains in the past. It appears that Comcast, where I am, dumps their cache at certain times of the day; the TTL has nothing to do with it. It happens at 10 AM and 8 PM Eastern. This has happened 4 different times now, at those times, since I started keeping track some months ago. So the time for me to move domains is 9:45 AM or 7:45 PM. Of course that just makes me happy. It's anybody's guess what happens elsewhere around the country.
 