As many of you are probably aware, the Curse family of sites has been down since the early morning (PST) of the 22nd. This post is meant to clear up any questions that may arise about this event.
At approximately 7:30 AM PST on June 22nd, our primary SAN controller experienced a catastrophic failure, bringing it offline. This controller was the primary controller for most of the database nodes and most of the web nodes. Redundant systems failed to come online, even though they had reported as 'ready' before the primary systems failed.
After we replaced the failed controller, the replacement began booting and copying its configuration from its peer server. Unfortunately, as soon as the configuration was copied, the secondary controller also died.
After replacing the second failed controller, we began powering up the servers that relied on the SAN for their data - all of the database servers and the network-attached storage (NAS) file servers that store all the media, static content, and most of our web files as well.
This process only took a few hours. The main delay was an extended period (24 hours) of checking the continuity of the data on the disks. This accounted for the majority of the downtime experienced.
As we started bringing our servers online after the check was complete, we noticed an issue with the firmware on the new SAN controllers. The newer firmware versions were conflicting with the storage array, and thus the controllers couldn't talk to the disks.
The manufacturer told us that this was a known issue, and provided us with a method of repairing it. We then backed up all the data again, and proceeded to apply the firmware patch.
After the patch, we were able to restore the drives, and start booting up the critical systems, followed by the non-critical systems.
We can assure you that at no time during the hardware failure was any of your personal information compromised. We take the sacred trust you put in us with your information VERY seriously.
Thank You
We realize that you rely on the forums as part of your Minecraft experience. Once again, we sincerely apologize for the downtime and hope you'll continue to enjoy the Minecraft Forums.
The Minecraft Forums Team
Looking for Minecraft servers to join? Check out http://mcserverlist.net/, the largest list of public Minecraft servers!
Good job....And yeah....my anonymous login information for a game "official" forums is vital to me....slight sarcasm....sorry lol. Good job again on yall getting everything up and running.
Quote from GretchenMC »
"I'm an underaged girl on the internets, come play with me and we can video chat!"
Glad to see it back, as many people say!
During the downtime, I got to think of something: Since Minecraft Forum is part of the Curse Network, do you think there'll ever be support for Minecraft in the Curse Client?
AVATAR, PHOTO AND SKIN AT THE COURTESY OF WHISKERS
Welcome back guys! I do have a few questions that maybe someone knowing about these server array thingamabobs could answer:
1. How likely is it that this sort of equipment malfunction would occur in the first place?
2. Why did Curse administrators take the long way around and take out four days of site usability instead of immediately getting working copies of sites up so at least they were usable?
3. Would it be possible to shorten the time needed to restore and reinstall everything if the same hardware failure should occur again?
4. In the foreseeable future is there anything Curse could do about making site backups and restorations less dependent on critical hardware components like the ones which failed?
5. It seems possible from reading about the hardware that failed, Curse could remotely store a copy of their entire Atlanta datacentre at one or more other datacentres. Would they be able to put together some plan to do this on a regular basis, and to switch sites if such a failure occurs again?
6. Is there anything Curse could do to prevent a similar failure from taking out so many sites at once? Having a redundancy system that depends on a few critical components seems a lot like putting too many eggs in one basket to me!
And finally a question for the Minecraft forums staff: Could they please look into having the Minecraft Wiki site hosted on a different set of servers? It takes a lot of time to learn about redstone circuitry and I can't learn much while the wiki is down on the mat for four days!
EDIT: Wow, I just happened to visit the forum for the first time in a few hours, minutes after it went up? :biggrin.gif:
I say all's well that ends well.
[12:41] Coffeeeeeee!
---
[16:29] "And lo, the tacos were delicious"
When I tried opening the site.. 400 TIMES!!!!!
Two THUMBS UP FOR MINECRAFT!
During the downtime, I got to think of something: Since Minecraft Forum is part of the Curse Network, do you think there'll ever be support for Minecraft in the Curse Client?
http://www.businessweek.com/ap/financialnews/D9O3AQ201.htm
It's official - Curse Network is cursed.
Fierce as Fire, Immovable as a Mountain, Righteous as the Light!
I created WorldPainter. For support, please visit the WorldPainter subreddit.
It was the constant lies sent out by Curse Network that made me lose respect for them.
1. How likely is it that this sort of equipment malfunction would occur in the first place?
2. Why did Curse administrators take the long way around and take out four days of site usability instead of immediately getting working copies of sites up so at least they were usable?
3. Would it be possible to shorten the time needed to restore and reinstall everything if the same hardware failure should occur again?
4. In the foreseeable future is there anything Curse could do about making site backups and restorations less dependent on critical hardware components like the ones which failed?
5. It seems possible from reading about the hardware that failed, Curse could remotely store a copy of their entire Atlanta datacentre at one or more other datacentres. Would they be able to put together some plan to do this on a regular basis, and to switch sites if such a failure occurs again?
6. Is there anything Curse could do to prevent a similar failure from taking out so many sites at once? Having a redundancy system that depends on a few critical components seems a lot like putting too many eggs in one basket to me!
And finally a question for the Minecraft forums staff: Could they please look into having the Minecraft Wiki site hosted on a different set of servers? It takes a lot of time to learn about redstone circuitry and I can't learn much while the wiki is down on the mat for four days!
Cheers ...
BrickVoid
MeCraft = Survival