Not only should you expect failures but you should expect FREQUENT failures.
If you can estimate that every server in your data centre will fail once every ten years, then that sounds pretty good, right?
Failure rate = once / 10 years = once / 120 months
But... if you have 120 servers, then you should expect a failure every month!
According to this article, in 2010 Facebook was running at least 60,000 servers across its data centres.
If each of these 60,000 servers is expected to fail once every 10 years, then at that time Facebook would have expected a server failure about every hour and a half (120 months / 60,000 servers ≈ 0.002 months ≈ 1.46 hours).
Eeeek.
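To play with the numbers yourself, here is a minimal Python sketch of the arithmetic above. It assumes failures are independent and evenly spread over time, and it uses a 365-day year; the function name hours_between_failures and its parameters are made up for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def hours_between_failures(servers, years_per_failure=10.0):
    """Expected hours between failures somewhere in a fleet of `servers`.

    Assumes each server fails once every `years_per_failure` years on
    average, independently of the others.
    """
    if servers <= 0:
        raise ValueError("servers must be a positive number")
    return years_per_failure * HOURS_PER_YEAR / servers


if __name__ == "__main__":
    for fleet_size in (1, 120, 60_000):
        hours = hours_between_failures(fleet_size)
        print(f"{fleet_size:>6} servers: one failure every {hours:,.2f} hours")
```

With 120 servers this gives 730 hours (roughly a month), and with 60,000 servers it gives about 1.46 hours, matching the figures above.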