CSIRO launches serverless 'search engine for the genome' on Alibaba Cloud
- 20 September, 2018 12:53
Trials of CRISPR/Cas9 are getting underway around the world. Clustered Regularly Interspaced Short Palindromic Repeats/CRISPR associated protein 9, to give it its full name, holds huge promise for tackling cancer, hereditary blindness, AIDS, cystic fibrosis, Duchenne’s muscular dystrophy, hepatitis B, Huntington’s disease and even high cholesterol.
CRISPR/Cas9 is adapted from the system found in bacteria used to cut up the DNA of invading viruses. By utilising the bacteria’s self-defence mechanism – essentially a tiny pair of virus-destroying DNA scissors – scientists can search a cell’s genetic material for a specific DNA sequence and slice it at the right point.
Introduce a disease-free section of DNA at the same time CRISPR/Cas9 is slicing, and the cell will often use the material to repair the cut.
The repaired cells could be replicated and then injected into patients to repopulate damaged organs, or the editing could happen at the embryonic stage. While most labs are running trials on mice or cells in petri dishes, some are already experimenting with human embryos.
One of the challenges of using CRISPR/Cas9 is figuring out where the cut should be made, and doing so within a reasonable timeframe. Given there are 3 billion letters in the human genome, it is no small feat.
To tackle the issue, researchers in CSIRO’s eHealth program developed a software tool called GT-Scan2. The tool – dubbed the “search engine for the genome” – ranks candidate CRISPR/Cas9 targets by predicted cutting efficiency and by the number of similar-sequence ‘off-targets’ elsewhere in the genome.
Having launched on AWS at the end of 2016, CSIRO has now made GT-Scan2 available on Alibaba Cloud to appeal to Chinese genomics researchers.
“We’re doing both. We like to be cloud agnostic. But a reason we’re trying Alibaba is they’re the number one cloud provider in China and that’s a huge market for us,” Dr Denis Bauer, head of CSIRO’s transformational bioinformatics group, told Computerworld.
China is emerging as a serious contender in genomics research, and catching up with the US in the field. At the end of last year the nation launched the world's largest human genome research project to sequence the DNA of 100,000 people (matching a similar UK effort). The country is home to the world’s largest genetic research centre, and its researchers are perceived as being less hampered by the ethical concerns around genetic engineering than those in other countries.
“Especially with genome engineering there’s a lot of innovation that has come out of China. They’ve often been the first to actually do things,” Bauer says.
“The AWS offering [in China] is substantial, but there seems to be a lot of new restrictions that have come in around handling data and cloud computing. [Chinese researchers] have a leaning towards using a company that comes from their own country.”
Put in the smarts
To realise the tool on Alibaba Cloud, CSIRO mirrored the AWS architecture, and used Alibaba’s fully hosted and serverless running environment, Function Compute.
“With serverless we can summon appropriately sized compute resources to evaluate even large genomic regions so researchers can do this now in real time while they are working in the lab,” Bauer said.
“It’s looking strikingly similar and there are similar components in both versions. This was really important, in being cloud agnostic, having a technology running as well on AWS, as well as it might be running on Alibaba,” she added.
Users submit a search through a web application, and the application puts the parameters into a NoSQL database table (Alibaba’s service is called Table Store) via an API call.
The database entry triggers the first Function Compute function which finds all the CRISPR targets in the DNA sequence the user has entered.
The potential targets “have fixed rules and can be easily found using a regular expression that completes in seconds,” Alibaba solution architect Sabith Venkitachalapathy said. They are then written to a second Table Store table.
A string matching tool called Bowtie – developed by Johns Hopkins University in the US – then evaluates the targets for their ‘off-target risk’.
Though Bowtie requires only a reduced representation of the three-billion-letter genomic sequence, in the AWS version of GT-Scan2 the 915 MB of index files exceeded the storage limit for each Lambda instance. While Alibaba’s Function Compute supports temporary storage of this size, the CSIRO team kept their AWS workaround, which divides the genome into smaller blocks to enable parallel processing.
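As a rough illustration of how such a regular expression can work, the Python sketch below scans both DNA strands for candidate sites. It assumes the rule commonly used for the Cas9 enzyme from Streptococcus pyogenes: a 20-letter target followed by a three-letter "NGG" motif. The article does not state the exact pattern GT-Scan2 uses, and the names `TARGET_RE` and `find_targets` are hypothetical.

```python
import re

# Assumption (not from the article): 20-letter target followed by an NGG motif.
# The lookahead lets overlapping candidate sites all be reported.
TARGET_RE = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")
SITE_LEN = 23

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite DNA strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_targets(seq):
    """Return sorted (forward_start, strand, site) tuples for both strands."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in TARGET_RE.finditer(seq)]
    n = len(seq)
    for m in TARGET_RE.finditer(reverse_complement(seq)):
        # Convert the position on the reverse strand to forward coordinates.
        hits.append((n - m.start() - SITE_LEN, "-", m.group(1)))
    return sorted(hits)
```

For a typical user query of a few thousand letters, a scan like this finishes almost instantly, which is consistent with the "completes in seconds" description; the expensive part is the off-target search that follows.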
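The block-splitting idea can be sketched in Python as follows. This is not the CSIRO implementation and it does not use Bowtie: a naive mismatch count stands in for the aligner, a thread pool stands in for separate serverless function invocations, and the names `split_blocks`, `count_near_matches` and `off_target_count` are hypothetical. The point it shows is that blocks must overlap slightly so that a site straddling a boundary is neither lost nor counted twice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(genome, block_size, site_len):
    """Cut the genome into blocks that overlap by site_len - 1 letters.

    Every window of length site_len then falls entirely inside exactly one block.
    """
    return [
        genome[start:start + block_size + site_len - 1]
        for start in range(0, len(genome), block_size)
    ]


def count_near_matches(block, target, max_mismatches):
    """Count windows in block that differ from target in at most max_mismatches letters."""
    n = len(target)
    count = 0
    for i in range(len(block) - n + 1):
        mismatches = sum(a != b for a, b in zip(block[i:i + n], target))
        if mismatches <= max_mismatches:
            count += 1
    return count


def off_target_count(genome, target, max_mismatches=2, block_size=1000, workers=4):
    """Search each block independently, then add up the per-block counts.

    The total includes the intended site itself if it occurs in the genome.
    """
    blocks = split_blocks(genome, block_size, len(target))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda block: count_near_matches(block, target, max_mismatches),
            blocks,
        )
        return sum(counts)
```

Because each block is searched independently, the per-block work can be handed to as many function instances as there are blocks, and only the small per-block counts need to be combined at the end.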
“Everything that we’ve done on AWS with the parallelisation, we replicated on Alibaba, even if we didn’t need to do it, because they have different limitations. But if you’ve already put in the smarts into coming up with a parallel architecture, there are advantages beyond being feasible, it’s also quicker,” Bauer said.
“In my mind, once you go serverless you never go back, because it’s so easy, it’s so convenient, it’s so cheap, it’s so economical. For burstable workloads serverless is the way to go. Innovation becomes easily affordable. You can stand up a minimum viable product quite cheaply. And because everything is modularised you can exchange individual components,” she said.
GT-Scan2 looks set to accelerate the progress of research into genome engineering, moving medical breakthroughs like a cure for cancer into the realms of possibility.
“It makes genome engineering more efficient and hence enables applications that previously would have not been possible because of the resource waste from failed experiments,” Bauer says.
“For example, researchers have found lots of mutations with unknown function that might be associated with diseases – re-creating these mutations in cell models using genome engineering and observing their effect will help validate their disease association and ultimately enable new treatment options. GT-Scan makes this process far more efficient,” she explained.
The promise of the research field is huge. And GT-Scan2 will help scientists realise its potential sooner.
“This tool democratises access to high-quality recommendations on editing-efficiencies. Before, only institutes with high-performance compute capability were able to tailor their experiments; now smaller labs can also more efficiently contribute towards growing the knowledge on which genomic mutations cause disease,” Bauer said.