1. SCALABLE APPROACHES TO EXPLORING MICROBIAL DIVERSITY
C. Titus Brown
[email protected] Professor, MMG / CSE; Michigan State University
1/15: Population Health & Reproduction, VetMed, UC Davis
Talk slides on slideshare.net/c.titus.brown

2. Funding and motivation:

3. The central question of my lab --
How can we most effectively use computation to extract information from large sequence data sets, for the purpose of better understanding non- and semi-model organisms?
Focus on environmental microbes, marine animals, & agricultural and veterinary animals.

4. Biology is becoming data rich -- and a rising tide lifts all boats!
http://susieinfrance.blogspot.com/2010/06/rising-tide-lifts-all-boats.html

5. …but sometimes the tide comes in a bit fast.

6. Our foil for today: Investigating soil microbial communities
Life on earth depends on soil microbes, but:
• 95% or more of soil microbes cannot be cultured in lab.
• Very little transport in soil and sediment => slow mixing rates.
• Estimates of immense diversity:
  • Billions of microbial cells per gram of soil.
  • Million+ microbial species per gram of soil (Gans et al, 2005)
  • One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)

7. “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.”
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
Microbes live in & on:
• Surfaces of aggregate particles;
• Pores within microaggregates;

8.
Specific questions to address:
• Role of soil microbes in nutrient cycling?
• How does agricultural soil differ from native soil?
• How do soil microbial communities respond to climate perturbation?
• Genome-level questions:
  • What kind of strain-level heterogeneity is present in the population?
  • What are the phage and viral populations & dynamics thereof?
  • What species are where, and how much is shared between different geographical locations?

9. Must use culture independent and metagenomic approaches
• Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc.
• If you want to get at underlying function, 16S analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.

10. Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
“Sequence it all and let the bioinformaticians sort it out”
Wikipedia: Environmental shotgun sequencing.png

11. Computational reconstruction of (meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

12. Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit. (Reference genomes.)
• Rare books will be harder to reconstruct than common books.
• Errors in OCR process matter quite a bit. (Sequencing error.)
• The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.)
• A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)
• Understanding the language would help you validate & understand the books.

13. Great Prairie Grand Challenge -- SAMPLING LOCATIONS 2008

14.
A “Grand Challenge” dataset (DOE/JGI)
[Figure: bar chart of basepairs of sequencing (Gbp, axis 0 to 600) by platform (GAII, HiSeq) for eight soil samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.]
Reference points: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.
Total: 1,846 Gbp soil metagenome

15. A “Grand Challenge” dataset (DOE/JGI) (repeat of slide 14)

16. My algorithm research: 3 methods.
1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). (Zhang et al., PLoS One, 2014.)
2. An online streaming approach to lossy compression of sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.)
3. Compressible de Bruijn graph representation for assembly. (Pell et al., PNAS, 2012.)

17. Method #2 - Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…

18-23. Digital normalization

24. Assembling Iowa prairie and Iowa corn:
Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Adina Howe

25.
Resulting contigs are all low coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
Howe et al., 2014

26. Iowa prairie & corn DNA abundances are very even.
[Figure panels: Corn; Prairie.]
Howe et al., 2014

27. Assembly is a good idea:
Howe et al., 2014

28. Analyses of metabolic potential begin to illuminate differences.
Howe et al., 2014

29. We see little strain variation in sample.
[Figure: top two allele frequencies vs. position within contig.]
Can measure by read mapping.
Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%

30. Biogeography: Iowa sample overlap?
Corn and prairie content graphs have 51% nucleotide overlap.
[Figure panels: Corn; Prairie.]
Suggests that at greater depth, samples may have similar genomic content.

31. Biogeography of genomic DNA in soil
How much genomic richness is shared between different sites?
Qingpeng Zhang

32. So, for soil:
• We really do need more data;
• But at least now we can assemble what we already have.
• Estimate required sequencing depth at 50 Tbp;
• Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.
• …still not saturated coverage, but getting closer.
Iowa soil work has been published: Howe et al., 2014, PNAS.

33. So, for soil:
Note! There are now much faster assembly approaches…!
See: Megahit, http://arxiv.org/abs/1409.7208
(Technology marches on!)

34. So, for soil:
• (Same five points as slide 32.)
But, diginorm approach turns out to also be widely useful.

35. Digital normalization is popular…
Estimated ~1000 users of our software.
Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users)
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)

36. The data problem: Looking forward 5 years…
Navin et al., 2011

37.
Some basic math:
• 1000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp each cell (40 x ~3 Gbp human genome)…
• …or 120 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling will require 2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in one month (~3 weeks of sequencing plus ~1 week of compute).

38. Similar math applies:
• Pathogen detection in blood;
• Environmental sequencing;
• Sequencing rare DNA from circulating blood.
• Two issues:
  • Volume of data & compute infrastructure;
  • Latency for clinical applications.

39. We face an infinite data problem.
• For all intents and purposes
• For example, Illumina estimates that 228,000 human genomes will be resequenced this year, primarily by researchers; this is only going to grow.
• Similar stories across all of biology (although #s lower :)

40. Current analysis approaches are multipass, e.g. variant calling:
[Diagram: Data; Mapping; Sorting; Calling; Answer.]
On infinite data, you really only want to look at the data once…

41. Streaming algorithms can be very efficient
[Diagram: Data; 1-pass; Answer.]
See also eXpress, Roberts et al., 2013.

42. Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass)
• Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.

43-47. Digital normalization

48. Some key points -- (repeat of slide 42)

49.
Error correction as the solution for our ills
Current work: error correction (??)
Errors in sequencing data are at the root of many problems:
• Assembly is 100x lower memory in the absence of errors.
• Mapping is computationally trivial when there are no errors.
• Variant calling and genotyping become simple, as does species detection.

50. We can error correct high-coverage shotgun data with k-mer spectra:
[Figure: k-mer spectrum separating true k-mers from erroneous k-mers.]
Chaisson et al., 2009

51. Streaming error correction on E. coli data
(Early days…)
1% error rate, 100x coverage.
Error correction:
TP (corrected)   FP (mistakes)   TN (OK)        FN (missed)
3,494,631        3,865           460,601,171    5,533
Michael Crusoe, Jordan Fish, Jason Pell

52. Error correction variant calling
Single pass, reference free, tunable, streaming online variant calling.

53. Streaming with reads…
[Diagram labels: Sequence... (repeated, one per read); Graph; Variants.]

54. Analysis is done after sequencing.
[Diagram labels: Sequencing; Analysis.]

55. Streaming with bases
[Diagram labels: k bases, k+1, k+2, ... (repeated per read); Graph; Variants.]

56. Integrate sequencing and analysis
[Diagram labels: Sequencing; Analysis; Are we done yet?]

57. What does the future hold?
• More emphasis on training and infrastructure.
• Data integration!
• Identifying the function of unknown genes…

58. Summer NGS workshop (2010-2017)

59. The infrastructure challenge
In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?)
We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.

60. Distributed graph database server
[Diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]

61.
Data integration?
Once you have all the data, what do you do?
"Business as usual simply cannot work."
Looking at millions to billions of genomes.
(David Haussler, 2014)

62. My charge: We don’t know what most genes do.
Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Howe et al, 2014; pmid 24632729

63. Data Intensive Biology
Opportunities & challenges; how can we best support the biology?
"I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won't last out the year." -- The editor in charge of business books for Prentice Hall, 1957

64. Thanks!
Key points:
• Facing nigh-infinite data situation;
• The first stages of sequence analysis, assembly and variant calling, are computationally intensive (but we’re hoping to fix that);
• Training in data intensive biology is critical to the future of biology.
• Data sharing and data integration infrastructure is also critical.

65. Graph alignment can detect read saturation

66-68. Proposal: distributed graph database server
[Same diagram as slide 60.]

69.
Proposal: distributed graph database server
[Same diagram as slide 60.]

70. Graph queries across public & walled-garden data sets:
[Diagram labels: assembled sequence; raw sequence; nitrite reductase; ppaZ; query relations SIMILARITY TO and ALSO CONTAINS.]
See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.
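
Note on slide 17 (digital normalization): the idea of discarding reads from already well-covered sequence can be sketched in a few lines. This is only an illustration, assuming the median-k-mer-abundance rule described in Brown et al. (arXiv, 2012): keep a read only if the median count of its k-mers, among reads kept so far, is below a cutoff. The function names, the tiny k, the cutoff, and the toy reads are all made up for the example, and an exact Python dict stands in for the CountMin Sketch mentioned on slide 16.

```python
from collections import defaultdict
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=4, cutoff=2):
    """Single pass over reads: keep a read only if its estimated
    coverage (median k-mer count so far) is below cutoff."""
    # Exact counts for clarity; a CountMin Sketch would bound memory instead.
    counts = defaultdict(int)
    for read in reads:
        kms = list(kmers(read, k))
        if not kms:
            continue  # read shorter than k, nothing to estimate from
        if median(counts[km] for km in kms) < cutoff:
            # Only kept reads update the counts, so coverage saturates.
            for km in kms:
                counts[km] += 1
            yield read


# Toy input mirroring slide 17: A is sampled 10 times for each B.
read_a = "ACGTTGCAAG"
read_b = "TTGACCATGG"
reads = [read_a] * 10 + [read_b]
kept = list(digital_normalize(reads, k=4, cutoff=2))
print(len(reads), "reads in,", len(kept), "reads kept")
```

With these toy reads, the abundant sequence A is capped at the cutoff while the rare sequence B is retained, which is the "we can discard it for you" step. Real data would use a much larger k and cutoff than shown here.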