How can you achieve a cloud setup where you keep plugging in new servers to grow it, with load balancing, auto-scaling, and failover?
It's one of those topics that a lot of people have opinions on, but not many have the direct experience to back them up. I don't know all of the things that do work, but I know a lot of the things that don't.
On the face of it, an expandable cloud server with some form of block storage gives you room to scale vertically. But not everything scales well vertically either. Network-attached storage (Ceph, for example) is the only way to keep scaling storage past the local disks, but it's neither easy nor cheap to set up a reliable cluster, and you'd need at least a dedicated local network port between the server and the cluster to keep from capping out on throughput. Even then, vertical scaling only grows you as far as the memory on the host node, though these days you can pack a lot of memory into a server, so that won't be your first roadblock.
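To make that concrete, here's a minimal sketch of carving a block device out of a Ceph cluster and attaching it to a host. The pool and image names are made up, and this assumes the genuinely hard part, a healthy cluster, already exists:

    # Assumes an existing, healthy Ceph cluster and the rbd kernel module.
    # Pool name, image name, and size are hypothetical.
    ceph osd pool create vmstorage 128               # 128 placement groups
    rbd pool init vmstorage                          # tag the pool for RBD use
    rbd create vmstorage/mail01-data --size 102400   # size in MB (~100 GB)
    rbd map vmstorage/mail01-data                    # typically appears as /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/mail01-data

Every one of those commands is simple; keeping the cluster behind them healthy and fast is where the cost lives.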
Every application is going to handle replication / failover and horizontal / vertical scaling differently. For example:
Dovecot replication: https://wiki.dovecot.org/Replication
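To give a flavor of what "differently" means here, below is a rough sketch of the dovecot.conf pieces for two-way dsync replication, based on that wiki page. The second hostname, port, and password are placeholders:

    # Sketch of Dovecot 2.x dsync replication, per the wiki above.
    # mail2.example.com, port 12345, and the password are placeholders.
    mail_plugins = $mail_plugins notify replication

    service aggregator {
      fifo_listener replication-notify-fifo {
        user = vmail
      }
      unix_listener replication-notify {
        user = vmail
      }
    }

    service replicator {
      process_min_avail = 1
    }

    service doveadm {
      inet_listener {
        port = 12345
      }
    }
    doveadm_port = 12345
    doveadm_password = changeme

    plugin {
      mail_replica = tcp:mail2.example.com:12345
    }

And that only covers the mailboxes; Exim, your web stack, and your DNS all need their own equivalents.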
Load balancers can be outsourced at most major cloud providers, but HAProxy is great if you want to run your own.
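For the DIY route, a minimal HAProxy sketch balancing IMAPS across two mail servers in TCP mode might look like this; the backend IPs are placeholders:

    # Minimal sketch: TCP-mode load balancing for IMAPS across two servers.
    # Backend IPs are placeholders.
    defaults
        mode tcp
        timeout connect 5s
        timeout client  1m
        timeout server  1m

    frontend imaps_in
        bind *:993
        default_backend imap_servers

    backend imap_servers
        balance leastconn
        server mail1 203.0.113.10:993 check
        server mail2 203.0.113.11:993 check

Note what this does and doesn't get you: the health checks give you failover for connections, but HAProxy knows nothing about whether the data behind those two servers actually matches.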
Load balancers and individual service replication processes aren't going to keep all of your data in sync. Take, for example, your SNI certificates for Dovecot/Exim, your user configs in /etc/virtual, all that jazz. A newbie will tell you "GlusterFS", and maybe, just maybe, two servers in the same rack running GlusterFS over a dedicated port between them could handle the replication needs of a vertically scaling server. Put them across the internet, though, and it would barely be reasonable to expect it to handle a few websites, much less a whole server that needs to stay in sync at all times and is constantly writing new files (because that one guy with a catchall email and no spam filters receives 500 spams per minute, right?).
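For what it's worth, standing up that two-server replicated volume is the easy part. A rough sketch, with hostnames and brick paths as placeholders:

    # Sketch of a 2-way replicated Gluster volume between two peers over a
    # dedicated link. Hostnames and brick paths are placeholders.
    # (gluster will warn that replica 2 volumes are prone to split-brain.)
    gluster peer probe server2
    gluster volume create etc-sync replica 2 \
        server1:/bricks/etc-sync server2:/bricks/etc-sync
    gluster volume start etc-sync
    mount -t glusterfs localhost:/etc-sync /etc/virtual

Every write then has to be acknowledged by both bricks, which is exactly why latency between the peers, not setup difficulty, is what kills it over distance.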
None of this is really reasonable at the scale most shared hosting providers operate at, or for the market they're selling to. If you pull all of this off and theoretically have no single point of failure, guess what... your Ceph cluster just took a shit and now 75,000 customers are down at once. Your customers are ticketing to tell you that their last host, the one that kept things simple and spread customers across plain single servers with 1TB HDDs in RAID1, never had that problem: "I thought your setup was so expensive because it would never go down?"
IMO it's not worth it. Build good servers, put the number of customers you're comfortable with on each one, make backups, and keep building new servers. At the end of the day, spending more doesn't mean better uptime. Office 365 has an outage every week or two. Gmail has outages. Amazon cloud regions go down. Spread the uptime over a 10-year period and I'd wager none of these overly complex failover / replicated / horizontally scaled systems have enough to show for the investment over the ones that don't bother.