Although Hadoop is perhaps best known for underpinning big data-based analytics projects, Sydney-based MediaHub is using the open source project's distributed storage capabilities to deliver resilience and scalability for a mammoth — and growing — archive of broadcast video.
MediaHub is a 50-50 joint venture between the ABC and WIN Television. The company was originally envisaged as a joint venture between all of the nation's TV stations in the lead-up to the end of analog broadcasting in Australia.
In the end, only the two broadcasters decided to take part in establishing MediaHub, but the company is now responsible for delivering more than 170 channels.
Along with distributing content for WIN TV and ABC, the MediaHub playout facility in Ingleburn is responsible for the broadcasting of digital content for SBS, NITV, the ThoroughVisioN racing network, Australia Plus TV, and Imparja Television.
The facility operates 24 hours a day, seven days a week and has a headcount of just over 80, with 65 people comprising the pool of presentation coordinators and the rest made up of administrative and engineering staff.
MediaHub's collection of digital media currently weighs in at one petabyte — and it is growing exponentially. CEO Alan Sweeney predicts that the company's video archive will hit around five petabytes within 18 months.
"Then sometime after that — and predicting this becomes a little harder — we would probably move up to around 10 petabytes," he adds.
"We think it will balance itself out at somewhere around 10."
The ultimate size of the archive is limited by rights issues: Television stations generally buy rights to a show for a limited period.
"At that stage we'll reach the point where the amount coming in to the archive will be matched by the amount being deleted out of the archive," the CEO explains.
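The plateau Sweeney describes can be illustrated with a toy model. The sketch below assumes a constant monthly ingest and a fixed rights window, neither of which is a MediaHub figure; the values are hypothetical and chosen only so that the plateau lands near the 10 petabytes mentioned above.

```python
def simulate_archive(ingest_pb_per_month, rights_months, months):
    """Return the archive size in petabytes at the end of each month.

    Assumes, hypothetically, that every file is deleted exactly
    rights_months after it was ingested.
    """
    sizes = []
    for month in range(1, months + 1):
        # Only content ingested within the rights window is still held.
        held_months = min(month, rights_months)
        sizes.append(ingest_pb_per_month * held_months)
    return sizes


# Hypothetical inputs: 0.25 PB ingested per month, 40-month rights window.
sizes = simulate_archive(0.25, 40, 60)
plateau = sizes[-1]  # size once deletions match new arrivals
```

Under these assumptions the archive stops growing once the oldest content starts expiring, and settles at the ingest rate multiplied by the length of the rights window.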
The shift away from tape-based analog broadcasting to digital means that now every individual piece of content is a separate file.
"Whether it's a program, or if you're watching a particular program and there's a segment before it goes to an ad, that segment is a file. The ad then is a file; if there are three or four or five or six ads, each one is just a file. If there's a station promotion or a program promotion, every single one of those is a file," Sweeney explains.
On-screen graphics, such as a station logo in the corner of a screen or a tickertape-style banner scrolling across the bottom of the screen, are also separate files.
Each piece of content has a unique ID. Schedules provided by the TV channels start at 6am and run until 6am the next day and account for every second of broadcast time.
"What happens is, months in advance we're sent all those files and then maybe a week in advance the television [station] will send us their final schedule," the MediaHub CEO says.
"We will then match all the files to the schedule and put them together using an automation system. The automation system pulls them all together and stitches them all together into one long linear playout list."
In the case of on-screen logos and promotions, the system will overlay them over the main video file.
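The matching step described above can be sketched roughly as follows. This is an illustration only, not MediaHub's automation system: the `ScheduleEntry` type, the `build_playout_list` function, the content IDs and the idea of a `files` dictionary mapping each unique ID to a duration are all assumptions made for the example.

```python
from dataclasses import dataclass

DAY_SECONDS = 24 * 60 * 60  # a schedule runs from 6am to 6am the next day


@dataclass
class ScheduleEntry:
    start: int            # seconds after 6am
    content_id: str       # unique ID of the file to play
    overlay_id: str = ""  # optional logo or promotion graphic


def build_playout_list(schedule, files):
    """Match schedule entries to files and return a gap-free playout list.

    files maps content_id -> duration in seconds (hypothetical layout).
    """
    playout, cursor = [], 0
    for entry in sorted(schedule, key=lambda e: e.start):
        if entry.content_id not in files:
            raise KeyError(f"missing file for {entry.content_id}")
        if entry.start != cursor:
            raise ValueError(f"gap or overlap at second {cursor}")
        playout.append((entry.start, entry.content_id, entry.overlay_id or None))
        cursor += files[entry.content_id]
    if cursor != DAY_SECONDS:
        raise ValueError("schedule does not account for every second")
    return playout


# Hypothetical day: a long program with a logo overlay, one ad, one program.
files = {"PROG-1": 80000, "AD-1": 30, "PROG-2": 6370}
schedule = [
    ScheduleEntry(0, "PROG-1", overlay_id="LOGO-1"),
    ScheduleEntry(80000, "AD-1"),
    ScheduleEntry(80030, "PROG-2"),
]
playout = build_playout_list(schedule, files)
```

The checks mirror the two constraints in the text: every scheduled item must have a matching file, and the entries must cover the full 24 hours with no gaps.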
"Every single thing that happens is scheduled except for live sport, which we do a lot of," says Sweeney.
"That's where we've got manual intervention, with people talking right back to the source, whether it's cricket or tennis or whatever, and the director will indicate when to go to an ad."
When MediaHub was initially conceived, it included only minimal storage.
"We would only keep content for maybe three days and then the system would automatically purge it," the CEO says.
"The first sod was turned for the building of MediaHub in September '09 and the first ABC channels went to air in April 2010 — so in less than six months.
"It was built without storage but what we were always aware of was that we needed to build storage. We were getting pressure from clients — not just WIN and ABC — that wanted us to be able to store [their] content. So that's this new project, our archive, which we're now well and truly into and have up and operating commercially."
The archive includes the vast majority of nationally broadcast television advertisements.
"Any television station, whether it's WIN, whether it's Imparja, it doesn't matter. Any television station that schedules a commercial, the automation system will go to our archive now and pick up that particular commercial and insert it into the schedule automatically," Sweeney says.
"That's one of the big drivers for the archive."
The other driver has been the growth of Internet-based streaming and IPTV.
A year ago MediaHub wasn't retaining copies of TV programs it distributed. Now, at least in the case of ABC content, they're added to the archive and made accessible through the broadcaster's online iView service.
"We started the overall archive project 12 months ago," Sweeney says.
The initial stage of the project involved researching the available video archiving solutions. There are a lot of commercial tape-based and disk-based archives, but they tended to involve significant capital expenditure and operating costs, the MediaHub CEO says.
"We came to the conclusion that the archive systems that Google, Netflix and a number of major international banks use were so bloody good, and so efficient, that you could forget tape," the CEO says.
"If it's on a disk it can be accessed instantly, [the system] doesn't have to source the tape, load the tape, wind back and forth, find the segment — none of that."
The company uses Snell's Momentum media asset management system to manage the actual video, but for managing the storage media used by the archive system it relies on MapR's Hadoop distribution.
"The Hadoop/MapR software will actually look at all the disks, work out where its space is, work out whether any particular disk is not performing to the level it needs to," Sweeney says.
"If it's got problems it will move stuff away. If it's got any concerns it will mirror it elsewhere and it will give us reports telling us 'you've got a little problem here', which means all we have to do is, without any break in operations whatsoever and at no risk to the client, remove that disk and push a new disk in."
"That's why the MapR Hadoop software that we use is so important to us and why it works so bloody well — because it means we've got all this resilience from our disks," he says.
"And because we have more archive space than we need, there's plenty for the system to move the content around as it needs to. It also gives us reports, so we've got online, real-time reporting about the status of our archives."
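The behaviour Sweeney describes, moving content off a suspect disk into spare space and reporting what was done, can be illustrated with a toy model. This is not MapR's or Hadoop's API: the `evacuate_disk` function, the dictionary layout and the disk and block names are hypothetical and exist only to show the idea.

```python
def evacuate_disk(disks, failing):
    """Move every block off `failing` onto the disks with the most free space.

    disks maps disk name -> {"capacity": int, "blocks": {block_id: size}}.
    Returns a list of (block_id, source, target) moves, which doubles as a report.
    """
    def free(name):
        d = disks[name]
        return d["capacity"] - sum(d["blocks"].values())

    moves = []
    # Iterate over a copy because blocks are removed from the dict as they move.
    for block_id, size in list(disks[failing]["blocks"].items()):
        candidates = [n for n in disks if n != failing and free(n) >= size]
        if not candidates:
            raise RuntimeError(f"no spare space for block {block_id}")
        target = max(candidates, key=free)
        disks[target]["blocks"][block_id] = disks[failing]["blocks"].pop(block_id)
        moves.append((block_id, failing, target))
    return moves


# Hypothetical three-disk pool with disk-a flagged as under-performing.
disks = {
    "disk-a": {"capacity": 100, "blocks": {"b1": 30, "b2": 20}},
    "disk-b": {"capacity": 100, "blocks": {"b3": 50}},
    "disk-c": {"capacity": 100, "blocks": {}},
}
report = evacuate_disk(disks, "disk-a")
```

Once the flagged disk holds no blocks it can be pulled and replaced without affecting playout, which is why having more archive space than strictly needed matters.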
"We could have bought proprietary systems — and they're pretty good — but we would have been locked into the hardware/software combinations provided by those vendors," Sweeney says.
"We wanted to be totally independent."
"MapR was the [vendor] that best fit this open architecture approach that we had to building our archives," he adds.
Currently the collection of digital video is stored in six chassis, each containing 47 hard drives: 45 for storage and two for control.
"We chose Western Digital disks but we could change them tomorrow — just because you've got a chassis with 45 Western Digital disks in at the moment, I could put three or four or 10 or 45 other manufacturers' disks in there tomorrow if I want to," the CEO says.
Follow Rohan on Twitter: @rohan_p