An emergency shipment of hard drives was not moving fast enough for Nyle Elison, who was scrambling to recover BYU computer systems halted by a massive data crash that he now calls “the event.”
Halted research, postponed graduations — engines of the university that may not be running until Christmastime were keenly on his mind. He was tracking a FedEx shipment of emergency equipment that was as close as Orem, but not in his hands. He drove to the shipper’s office and began piling the hard disk drives in his car.
“FedEx doesn’t deliver to campus on Saturday and they were going to wait until Monday morning,” said Elison, product line manager in the Office of Information Technology. The drives were slightly smaller than a shoebox. “But by the time you get 72 hard disk drives in, it filled up the whole back seat of my car.”
A scheduled software upgrade over Memorial Day weekend in BYU’s Data Center went horribly wrong, producing what may be the worst data crash the university has ever seen.
Sociology professor Vaughn Call was one of many immediately impacted by the data crash. He uses the OIT Data Center to store confidential research information that was inaccessible until July 3, well into the spring/summer window of best research time.
“My first reaction was disbelief and that an institution of this size and reputation would have something like that happen,” Call said. “I’ve never in my long career been in any circumstance like this. It just brought us to a dead halt.”
OIT centrally manages many university data functions, like payroll, but also houses data for about 30 colleges and departments that choose their own management and back-up procedures.
The College of Family, Home, and Social Sciences used OIT services almost exclusively, including for back-up storage. Joe Olson, FHSS assistant dean over information technology, said the crash essentially froze the work of faculty in his college. Spring and summer are a crucial time for professors to give their research undivided attention.
Most of the roughly 200 faculty members in his college do research. “Having this resource out of commission during an important time for faculty to do their research work has been very serious for a number of faculty members,” Olson said.
The McKay School of Education uses data back-up systems outside of the OIT Data Center. That practice lessened the impact of the crash.
“We ran back-up systems that we’ve used for a year,” said Technology Education Computing Lab administrator Steven Burton. “Without it we would’ve been in very sorry shape.”
Many computing services directors agreed on one point: the crash left a sense of vulnerability across the university.
“When things run smoothly, we get a sense of complacency,” Burton said. “A thing like this is a good reminder to be vigilant with a back-up running.”
Elison said universities around the West are watching to see how BYU manages the recovery.
“It’s having a dramatic ripple effect throughout similar institutions,” Elison said. “They are extremely interested in what we are doing and what we felt that the problem was.”
The crash affected students as well. Olson said some graduate students will have to postpone graduation because of lost research time. It is not yet known whether the university will accept the delay as a legitimate excuse, he said.
“The problem is that there’s not a precedent for this kind of a widespread problem,” Olson said. “In addition to faculty, you’ve got similar issues with graduate students who are trying to get their work done. They’re usually working with faculty on these projects so that may put them a semester behind in terms of their schedules. Yeah, it’s a serious problem.”
Part of the solution is transferring data to new hardware, such as three pallets of servers shipped in from Michigan and the carload of disk drives Elison picked up himself. Outside data-recovery firms are being paid to pluck data off damaged hardware. A contractor just returned 20 terabytes of recovered data belonging to FHSS.
OIT is involved in a complex process of assessing the cost and priority of data repairs that will continue through the end of the year, all while OIT is still investigating what went wrong in the first place, a process it calls a “root-cause analysis.”
“It’s very difficult to find the smoking gun. Frankly I think there was a variety of things every one could have done better,” Elison said. “I hate to use this term, but it was like a perfect storm.”
Temporary fixes will get the university through the storm and debris by the end of the year. After that, improved “enterprise class” equipment will better protect the university from future meltdowns, Elison said. Computer equipment has a finite lifespan.
“Just like your laptop, (data servers) only lasts for a given period of time, so it’s a few million dollars every three or four years,” Elison said.
Elison said he knew OIT would have to win back the trust of colleges and departments on campus. He recently met with about 75 Computer Support Representatives from across campus, where participants were both hesitant and frustrated.
“We clearly have to win your confidence back,” Elison told the group.
The Universe will continue to follow the story as information becomes available.