« on: January 20, 2015, 04:57:40 PM »
Long shot, but looking for some guidance. Maintenance/recovery question...
We will eventually migrate off of this to Office 365, but in the meantime I have inherited an Exchange 2007 environment with 2 Hub/CAS servers in an NLB and a CCR Failover Cluster. All server OSes are Server 2008. All servers have the latest patches and Exchange rollups. We have 25 Storage Groups, each containing a single DB. On the Mailbox server cluster, we have separate filesystems for logs and DBs. This is all virtual in VMware 5.0 (very behind here too).
The admin I inherited this from had started a test phase with O365 by setting up a 3rd Hub/CAS server with Exchange 2010 (Server 2008 R2, patched and Exchange rollups all the way).
Everything is running great with Exchange. The problem is that our backup software (NetBackup) is out of date and the backup hardware is taxed, so the occasional backup fails. Without a successful backup the transaction logs are never truncated, so our log drive fills, causing Exchange to stop sending mail. We had no monitoring solution that I know of, so during the last couple of months I have been documenting everything (for my own sake, to learn how everything works) and getting these upgrades done, including getting PowerShell v2 on all nodes. This all should be finished in the next couple of days. My plan is to set up a repeating PowerShell script to monitor drive fullness that will email me nightly and my boss weekly; eventually this (or something similar) will be deployed to all critical servers. Two things:
1) What else should I be doing? I took a class early last year for Exchange 2013, but that has been little help in a production environment of 2007; almost everything has changed. I have read most of an Exchange 2007 book, so along with my 2013 reading, I have a feel for how everything should work. I expanded on that with the documentation, of which we had none, so that I know all points of failure. I still need to trace mail flow from OWA (and other web services), as I think that may be hosted on the "new" 2010 box: there was recently a disruption after my boss patched it, and (re)starting some services on that box brought everything back online.
2) CCR Failover Cluster. What is the proper way to recover from our full log drive? It seems we have to reseed the passive node every time we recover. I think we may be causing that ourselves. Current steps are:
a) Dismount-Database -Identity dbname
b) eseutil /mh "path_to_db.edb"
c) assuming state = clean shutdown, delete all logs on the largest Storage Groups/DBs <THIS MIGHT BE POTENTIAL SPOT OF FAILURE>
d) Mount-Database -Identity dbname
DBs come back healthy, but the CCR copy stays in an initializing state for as long as we let it. We then reseed (Update-StorageGroupCopy -Identity dbname), which eventually gets everything back to a healthy copy state, ready for a failover.
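To make the steps above concrete, here is a rough sketch of what we run, written out as one Exchange Management Shell script. The names in the variables are placeholders (the real DB name, .edb path, and storage group name go there), and it assumes eseutil is on the PATH of the active node. It pauses at step c rather than deleting anything itself, and it adds a copy status check at the end so I can see what the passive copy reports before we reseed. It is only an illustration of our current procedure, not a recommended one.

```powershell
# Sketch of our current recovery steps (a-d). Placeholders, not real names.
$db      = 'dbname'             # placeholder: database identity
$edbPath = 'path_to_db.edb'     # placeholder: path to the .edb file
$sg      = 'storagegroupname'   # placeholder: storage group that holds the DB

# a) Dismount the database without prompting
Dismount-Database -Identity $db -Confirm:$false

# b) Dump the database header and pick out the "State:" line
$header    = & eseutil /mh $edbPath
$stateLine = $header | Where-Object { $_ -match '^\s*State:' }
Write-Host "Header reports: $stateLine"

# c) Only touch logs if the header shows a clean shutdown
if ($stateLine -match 'Clean Shutdown') {
    Read-Host 'Clean shutdown. Do the log cleanup now, then press Enter'
} else {
    Write-Warning 'Not a clean shutdown: leaving the logs alone.'
}

# d) Mount the database again
Mount-Database -Identity $db

# Extra check (not one of our original steps): what does the passive copy say?
Get-StorageGroupCopyStatus -Identity $sg |
    Format-List SummaryCopyStatus, CopyQueueLength, ReplayQueueLength
```

Get-StorageGroupCopyStatus takes a storage group identity, which is why the sketch keeps the storage group name separate from the DB name.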
Since I inherited my new role back in ~August last year, I have been in a state of perpetual fear and failure, making life in general stressful. I am happy that I have my new responsibilities, as I know that after I master this I will have experience with all major Windows Enterprise technologies: NLBs, Failover Clusters, Exchange Support, IIS, and basic database work (I have another project I was working on requiring MS SQL, but that has been on the back burner due to this). It has been a bad way to learn, though - maybe this is how everyone does it? Anyway, I have been pretty burnt out for about a year now, due to the building moves I have posted about previously, so once this gets handled in a way that I am happy with, I can relax a bit, then pick back up on my other task and learn some SQL.
Any help would be appreciated. Thanks.
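For reference, this is roughly the drive check I have in mind for the monitoring piece. It is a minimal sketch that sticks to PowerShell v2 features. The threshold, SMTP server, and addresses are made-up placeholders, and it only looks at the local machine. I am using Win32_Volume instead of Win32_LogicalDisk because it also reports volumes mounted as folders, in case the log and DB filesystems are mount points.

```powershell
# Sketch only: every value in this block is a placeholder.
$thresholdPercentFree = 20                        # hypothetical alert level
$smtpServer = 'smtp.example.local'                # placeholder relay
$from       = 'exchange-monitor@example.local'    # placeholder sender
$to         = 'admin@example.local'               # placeholder recipient

# DriveType 3 = local fixed disk; skip volumes that report no capacity
$volumes = Get-WmiObject -Class Win32_Volume -Filter 'DriveType = 3' |
    Where-Object { $_.Capacity -gt 0 }

# Build one report row per volume
$report = foreach ($v in $volumes) {
    New-Object PSObject -Property @{
        Volume      = $v.Name
        SizeGB      = [math]::Round($v.Capacity / 1GB, 1)
        FreeGB      = [math]::Round($v.FreeSpace / 1GB, 1)
        PercentFree = [math]::Round(($v.FreeSpace / $v.Capacity) * 100, 1)
    }
}

# Flag the run in the subject line if any volume is under the threshold
$low = @($report | Where-Object { $_.PercentFree -lt $thresholdPercentFree })
if ($low.Count -gt 0) {
    $subject = "WARNING: $($low.Count) volume(s) low on $env:COMPUTERNAME"
} else {
    $subject = "Disk report for $env:COMPUTERNAME"
}

# Fullest volumes first, rendered as plain text for the mail body
$body = $report | Sort-Object PercentFree |
    Select-Object Volume, SizeGB, FreeGB, PercentFree |
    Format-Table -AutoSize | Out-String

Send-MailMessage -SmtpServer $smtpServer -From $from -To $to -Subject $subject -Body $body
```

The idea would be to run it from Task Scheduler nightly, with a second weekly task that uses my boss's address.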