Storm appears to be fault tolerant in the following ways:
HA for a worker failure
If a worker fails, the Supervisor restarts it. But if it fails continuously and cannot send heartbeats to Nimbus, Nimbus reassigns it to another node.
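To make the heartbeat idea concrete, here is a minimal sketch of a timeout check followed by reassignment. It is not Storm's actual code: the function names, the 30-second timeout and the round-robin placement are all assumptions made for illustration.

```python
# Hypothetical timeout for illustration; Storm's real setting may differ.
HEARTBEAT_TIMEOUT_SECS = 30.0


def stale_workers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECS):
    """Return the ids of workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout)


def reassign(assignments, stale, live_nodes):
    """Move each stale worker to a live node other than its current one.

    Placement is plain round-robin, which is an assumption of this sketch.
    """
    new_assignments = dict(assignments)
    for i, worker in enumerate(stale):
        candidates = [n for n in live_nodes if n != assignments[worker]]
        if candidates:
            new_assignments[worker] = candidates[i % len(candidates)]
    return new_assignments
```

A worker that keeps sending heartbeats stays where it is. Only a worker that has been silent for longer than the timeout is moved.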
HA for a Node failure
In this case one of the slave nodes is down. Its heartbeats time out and Nimbus reassigns its work to another node.
HA for the Nimbus + Supervisor Daemons
The Nimbus and Supervisor daemons are designed to be fail-fast and stateless. All state is kept in Zookeeper or on disk. It is said that these daemons should run under supervision. This is the key point: "under supervision" means we need to configure tools like monit or Upstart to restart the daemons seamlessly.
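As a rough picture of what such supervision does, here is a minimal restart loop. It is only a sketch of the idea, not a replacement for monit or Upstart, and the function name, restart limit and delay are my own assumptions.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, delay_secs=0.0):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the number of restarts performed. Stops on a clean exit or
    once max_restarts has been reached.
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        print(f"process exited with {code}, restart #{restarts}", file=sys.stderr)
        time.sleep(delay_secs)
```

Because the daemons are fail-fast and stateless, a blind restart like this is safe: the restarted process picks its state back up from Zookeeper or disk. In a real deployment the command would be something like `["storm", "nimbus"]`, and a proper tool would also handle logging, boot-time start and back-off.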
If we lose the Nimbus node, workers still continue to run and Supervisors continue to restart workers that die. But without Nimbus, workers won't be reassigned to other machines if a worker machine also dies.
Now, according to Nathan (as of Jan 17, 2012):
"There are plans to make Nimbus highly available in the future"
Now we can think of:
1. A Zookeeper ensemble to take care of rack failure
2. Zookeeper Observers for cross-data-center DR (http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html)
This comes with a caveat: none of this is available out of the box as of now, and it needs experimentation and a feasibility study before a production-ready deployment.
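For the rack-failure part, the arithmetic is simple enough to sketch. Zookeeper needs a strict majority of its voting servers to stay up, and Observers do not vote, so they are left out of the counts. The function names and the rack layout below are made up for illustration.

```python
def tolerated_failures(voters):
    """A majority quorum of `voters` servers survives losing (voters - 1) // 2 of them."""
    return (voters - 1) // 2


def survives_rack_loss(voters_per_rack):
    """True if losing any single rack still leaves a strict majority of voters.

    voters_per_rack maps a rack name to the number of voting servers in it.
    """
    total = sum(voters_per_rack.values())
    return all(total - n > total // 2 for n in voters_per_rack.values())
```

So three voters on three separate racks survive the loss of any one rack, while three voters split two-and-one across two racks do not, because losing the rack with two leaves no majority.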
I am also following up with Nathan (Storm's creator) and the Storm user group to understand any workarounds.
The good news is that many companies are still using Storm in production.