Thursday, 29 September 2011

Keeping up with the Joneses, genomics style

23andMe have just launched a new programme offering exome sequencing for individuals. The Exome80 "Personal exome sequencing" service will cost just $999! This will put pressure on some of those who have had their genomes scanned on arrays to pay up for the exome or risk becoming social outcasts in California high society. In fact 23andMe state on their website, "Most excitingly, you'll be a trailblazer, one of the first people on the planet to know their personal exome sequence"; that's hamming things up if ever I heard it.

The previous service (which I just signed up to) only scans 1M SNPs. The exome sequencing will provide users with sequence data over about 50 million bases of DNA at 80-fold coverage. There will be none of the very well designed reports that come with the SNP data, though; just sequences and variants. This means only those with the skills to analyse the data will get anything useful out of it, and it is debatable what 'useful' means in this context. There is certainly not the same depth of information available on exome variants and their linkage to disease.

It sounds like 23andMe will develop tools, and so will the community. I expect the Genomes Unzipped team will all be in the first batch of samples, hopefully paid for by 23andMe. Most of these, if not all, are already customers who are "comfortable managing and understanding raw genetic data", as stated in the 23andMe press release. The release also says, "If you don't know your exons from your introns, this pilot is probably not for you." I think most undergrads know about exons and introns, yet they certainly could not do a huge amount with raw data.

This could of course be a great opportunity for a grad school programme to develop tools and interpretations as the data could be shared out across classes.

23andMe's description of what an exome is fails to mention regulatory sequences at all, although they do say "you can think of the exome as the DNA sequence of your genes" and "Your entire genome is made up of your exome plus...DNA that does not code for proteins". The dropping of regulation may seem trivial to a lay person, but it is becoming increasingly important for biologists. Where are the regulome capture products?!

PS: if anyone from 23andMe is reading this and wants to send me an early birthday present my DNA is ready for sequencing on my desk. Courtesy of an Oragene freebie at a conference. It's amazing what you can pick up for free nowadays.

Thursday, 22 September 2011

Resequencing cancer gene loci

There is an enormous demand for resequencing of specific cancer loci. BRCA1 & 2, TP53, PTEN and KRAS are already being Sanger sequenced in many labs, but this process is not scalable to all cancer patients' tumours. Some NHS hospitals do test cancer patients' tumours for mutations; however, they only test a few specific genes or exons at a time, and this is often done on a few samples at a time rather than hundreds or thousands.

There are several methods available to capture cancer-specific loci, and choosing between them is not simply a matter of cost. Other things to consider are the amount of material available for analysis, the time taken to generate libraries and data, the amount of on/off-target sequence, and so on.

I thought I would briefly review some of the available methods in this post, partly so I don't have to explain them each time someone asks at work!

In-solution 'exome' style methods: Illumina have just released their TruSeq custom capture kits, and a Cancer panel is available now, with autoimmune and ES cell gene panels coming soon. The kit allows capture of 700kb-15Mb of sequence in a 3.5-day protocol which uses two capture hybs rather than the single long hyb that Agilent and Nimblegen suggest. You have to make a TruSeq library from 1ug of input DNA first, at a cost of about £30-60 depending on how many libraries you are making. Agilent and Nimblegen both offer customisable capture methods as well. Flexgen also offer a custom capture product, and there is a Flexgen calculator you can use to get costings for a project. Update! These all use broadly similar protocols and a few comparisons have been published; an article appeared in Nature Biotech and is reviewed here.

The Cancer panel targets 5,265 exons across 372 genes and is available on the Illumina website. Up to 12 samples can be pooled for custom capture and run as multiplexed sequencing. It looks like 96 samples could be prepared using gel-free methods and sequenced in less than two weeks. The cost per sample also looks like it is going to be very attractive.

I can't help but wonder if you could reduce capture costs by 50% by running a single hyb, or keep costs the same but improve data quality by running two single-hyb replicates. It might be possible to get away with one hyb as a quick-and-dirty screen, but I am sure Illumina will not recommend this.

Pros: A large number of genes can be targeted.
Cons: You need to make a library first.


Multiplex PCR, or PCR-like methods: Molecular inversion probes and HaloGenomics' "Selector" probes allow capture of specific regions from genomic DNA. Illumina are releasing their TruSeq custom amplicon product next month, when MiSeq is finally delivered (AGBT seems a long time ago). This will use Illumina's GoldenGate assay to target up to 384 loci (I think) in a multiplex reaction. There are very few details available on the Illumina website, but the kit is likely to use the same locus-specific extension/ligation assay followed by PCR with tag sequences complementary to the flowcell primers. It is likely to be supplied in a 96-well plate configuration and would allow very large numbers of samples to be screened.

Pros: No need to make sequencing libraries. Very automatable workflows, low amounts of DNA are required.
Cons: A smallish number of genes can be targeted.


Next-generation PCR methods: The Fluidigm Access Array is a microfluidic chip that performs PCR on 48 DNA samples across 48-480 assays in single-plex or ten-plex reactions. The system is easy to use and can be set up by anyone who has previously pipetted into 384-well plates. Update! Costs for Fluidigm are about £200 for an Access Array chip, plus primer and reagent costs of £50 (dependent on the number of assays you intend to run), giving a cost per sample for 48 amplicons of about £5. Combine this with a MiSeq PE150 kit at £600 and it comes to around £18 per sample including sequencing! The first Access Array paper was published in BioTechniques this week; a team from Roche and Fluidigm resequenced EGFR and MET using the Access Array System and the GS Junior.
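
As a sanity check on those numbers, here's the back-of-envelope sum in a few lines of Python (all figures as quoted above, per 48-sample run):

```python
# Back-of-envelope check on the Fluidigm + MiSeq costs quoted above.
CHIP = 200          # Access Array chip (£)
REAGENTS = 50       # primers and reagents (£, varies with assay number)
MISEQ_PE150 = 600   # MiSeq PE150 reagent kit (£)
SAMPLES = 48

amplification = (CHIP + REAGENTS) / SAMPLES   # about £5.21 per sample
sequencing = MISEQ_PE150 / SAMPLES            # £12.50 per sample
print(f"£{amplification + sequencing:.2f} per sample, all-in")
```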

The RainDance system allows PCR amplification of up to 20,000 loci in a bead emulsion, and their new system will process up to 96 samples (called ThunderStorm, Hurricane or something like that). Update! GenomeWeb has a report from Ambry Genetics about their experiences with RainDance. They also report on the ThunderStorm instrument, which will process 96 samples per day. It will "complete a targeted resequencing protocol in about 15 minutes with no hands-on time, and will enable the company to offer sequence enrichment at between $100 and $150 per sample". This suggests ThunderStorm will run around the clock, taking 15 minutes per sample over 24 hours.

Both systems require specific hardware. RainDance requires more DNA than Fluidigm, which needs only 50-100ng per sample. Fluidigm also processes 48 samples with standard oligos, and these panels can be reconfigured very easily by users in their own labs, while RainDance amplicon panels need to be made to order. The separation into individual reaction chambers on both of these platforms may make them better suited to discovery of rare mutations, which in a traditional multiplex PCR might be out-competed by more common alleles.

Pros: No need to make sequencing libraries. Automatable workflows, low amounts of DNA are required.
Cons: A smallish number of genes can be targeted.

Which one to use? Personally I'm most interested in Illumina custom amplicon and the Fluidigm Access Array. Both allow very fast preparation of large numbers of samples as sequence-ready libraries directly from DNA, and they could be used to screen every sample or cell line coming into a lab. There is a lot of competition in this space, so I am also hoping the cost continues to drop, DNA input requirements fall and the number of amplicons targeted increases.

I'll update this post from time to time as I get a minute to add more detailed information. Feel free to comment on what I may have missed or ask questions.

Monday, 19 September 2011

My genome analysis part 1

It's my birthday in a few weeks and I thought I'd get myself genotyped by 23andMe as a birthday present to myself.

The lab I run offers genotyping and whole genome sequencing, so it feels a bit crazy to be sending this out to a company when I could run the chips myself. I thought I could do some of the analysis using web-based tools and Galaxy, and I also saw Dan MacArthur's blog about a Firefox plugin, which should throw up some interesting tidbits next time I am browsing through UCSC or the literature. There are probably some legal and ethical issues around performing my own genome analysis, but perhaps once I get the results I'll ask my boss if I can start using my genome library as the control on our flowcells.

The order has been placed and I now have an account on the 23andMe website. It will soon be time to start spitting and then waiting for results. I don't know if 23andMe use the Illumina multi-sample kit, but if they do, hopefully in another year or so 2-5M SNPs will be the order of the day. The current 23andMe v3 chip is one of Illumina's HumanOmniExpress arrays, genotyping 730,000 SNPs plus 200,000 custom SNPs selected by 23andMe.

I will keep posting about this genome adventure as I get my results and start to look at them. There are some other great sites already bringing these technologies into the public domain. My favourite is Genomes Unzipped; do take a look if you're not already aware of what that team is doing. I have also signed up to the Personal Genome Project but, as a non-US citizen, can't actually join yet.

PS: Sorry Dan, but I am adding one more white European ancestry sample to the 23andMe database.

PPS: If you do want to know, I'm after international copies of The Hobbit, which I collect, as my other pressies. If you have an old copy lying around, feel free to send it and I'll make a donation to charity for each copy I receive. The older the better, but do check yours is not worth a fortune before sending it! I am particularly looking for a Spanish copy.

Thursday, 15 September 2011

How much does the BBC reckon I get paid?

Are you considering a science degree? If so, the BBC has a finance calculator on its website designed for students in the UK who could be going off to uni in September 2012. It is designed to help students get an idea of how much they will ultimately pay back for their course.

Why am I writing about this?
Well, I saw the article on the website and noticed the calculator offers "Science Professional" as the job students might hope to do when they graduate. I am a science professional and wondered what the BBC thought I should be earning. Whoever the BBC spoke to is getting paid a lot more than me!

The website states that "the calculator uses estimates of predicted lifetime earnings based on the career areas people are considering pursuing, their age and sex. All the figures are based on averages" and "the calculator is designed to provide a general illustration of cost of financing in various scenarios, rather than to give precise predictions of how much individuals will have to pay."

When I entered course details for a three-year degree and the aim of becoming a science professional, the calculator told me I would pay back £36,448 over the next 23 years.

So how much does the Beeb reckon I get paid?
Interestingly, the predicted salary for a nearly-40-year-old (my birthday is in six weeks) is £64,259. At 39 it is "only" £49,071, so I am looking forward to a £15,188 pay rise!

This is ludicrous; only a handful of my peers earn anywhere close to this, and most are post-docs on quite a bit less. The few science grads who make it to PI might get a salary like this, but the rest of us are paid reasonable, though certainly much lower, salaries. Where do the BBC get their figures?

From the ONS, apparently, who say that Science and Technology Professionals earn an average of £36,313. The BBC website's average is £42,477, so the authors might want to check their figures.

The calculator uses median earnings, which is fairer than a mean, but I still find it hard to believe the figure is as high as this for the people I work with day to day.

Genome Technology have been running salary surveys for some time: $62,500 in 2011, $55,000 in 2010, $55,000 in 2009 and $60,000 in 2008. Their data suggest an average over the last four years of about $58,000, or £35-40,000.

What about the girls?
The calculator allows you to state whether you are male or female. When I swapped sexes (just for this experiment, of course) my total payback was about the same. However, the salary I could expect to earn at 40 was £44,997; nearly £20,000 lower than the blokes'.


Why do students need a loan anyway?
A student doing a three-year BSc in Biology (the same as me in 1992-1995) at the University of East Anglia in 2011 was paying £3,375. However, the fees are well hidden and it takes eight clicks to find the 2012 fees, which are a whopping maximum of £9,000. UEA is 27th in the Complete University Guide rankings. Perhaps £9,000 represents value for money, but a 167% jump in fees for exactly the same course might show how broken UK university funding is! Students need to borrow £27,000 for fees, then more cash (up to £16,500) to cover accommodation and living expenses before they even start spending on beer, condoms and dope.

A little background on UK university funding for students:
1980s: Students could get grants of up to £2,265 and could even sign on in the summer holidays. At the end of the decade, loans were introduced to compensate for the discontinuation of the annual increase in grants.

1990s: Loans grew, and in 1997 grants were scrapped. In 1998 tuition fees were introduced at £1,000.

2000s: Loans continued to grow and fees rose to £3,000 per year.

2010 onwards: Today English students can borrow up to £9,000 a year to cover tuition fees, plus maintenance loans of up to £5,500 (£7,675 in London). There are then some means-tested bursaries, grants and other packages available. When they graduate, they pay back 9% of any earnings over £21,000 and are charged interest on the loan at inflation plus 3%. After 30 years, any remaining debt is written off; the mechanics are sketched below.

See the BBC website for more details:
Q&A: Tuition fees and more details on loans here.

Friday, 9 September 2011

I am not a Bioinformatician: 1. Experiences with UCSC DAS

I am not a bioinformatician, but I often want to do things bioinformaticians say are easy. This is the first in a possible series of posts about my experiences with bioinformatics, text mining, etc. I'd be very happy to receive comments on the approaches I try, or hints on how to do what I am trying to do more easily. For now I am not learning to program or run scripts from a command line, though I am happy to try something in Galaxy. Part of the reason for these posts is so I can remember how to do this myself next time. This is meant to be the non-bioinformatician's way of doing things!

I used to just send an email to one of the bioinformaticians at work, but I couldn't help taking the hint that they had work to do as well. So now I'll usually try to find out if I can do what I want myself with publicly available, usually web-based, tools. I figure that even if I don't succeed, trying is evidence that I'm not just abusing their better experience. It also helps me explain what I want more clearly, so they don't do some work for me only for me to say how nice it is but that it is not really what I thought I wanted after all.


UCSC Distributed Annotation Service (DAS):
The DAS server allows you to download data directly from UCSC. I suspect it is built to be queried from under the hood rather than through a hand-typed URL; however, it is an easy way to get the sequence for a set of genomic coordinates, so I like it.

There is a simple FAQ here, and a query for a sequence looks like this: http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr2:51409549,51409749 which returns 201bp of sequence from chr2 in the hg18 build of the human genome (the build is named in the URL, and DAS coordinates are inclusive, hence 201bp rather than 200bp).
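
If you are slightly less script-averse than me, the same query is only a few lines of Python. This is a minimal sketch that fetches the URL and strips the XML wrapper from the DAS reply (the DASDNA/SEQUENCE/DNA structure is the standard DAS response format):

```python
# Fetch DNA from the UCSC DAS server and strip the XML wrapper.
# The reply is <DASDNA><SEQUENCE><DNA>...</DNA></SEQUENCE></DASDNA>.
import urllib.request
import xml.dom.minidom

def das_sequence(genome, chrom, start, end):
    """Return the sequence for an inclusive segment from UCSC DAS."""
    url = (f"http://genome.ucsc.edu/cgi-bin/das/{genome}/dna"
           f"?segment={chrom}:{start},{end}")
    with urllib.request.urlopen(url) as reply:
        doc = xml.dom.minidom.parseString(reply.read())
    dna = doc.getElementsByTagName("DNA")[0].firstChild.data
    return "".join(dna.split())  # drop embedded newlines

print(das_sequence("hg18", "chr2", 51409549, 51409749))  # the 201bp above
```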

Here's how I used DAS and other tools to design some primers from a SNP:

Designing PCR primers from a dbSNP rsID: I am hoping to design some PCR assays to look at regions around SNPs. I got a list of SNPs from a colleague with lots of information about them: position, alleles, MAF, etc. However, I wanted to take this list and design PCR primers to all 24 SNPs as quickly as possible.
Getting SNP locations: I had already been sent the locations along with the rsIDs, but I wanted to see if I could find them myself. I like UCSC and it is usually my first port of call when looking for sequence information. In this case I simply entered an rsID into the UCSC position/search box (Image: UCSC rsID search) and then clicked through the link (Image: UCSC rsID results). This gives me the all-important location information, chr1:4181020-4181020 in the case of rs10799216.
Getting flanking sequence: I used Excel (not a bioinformatician's favourite tool) to create a UCSC DAS address (Image: Excel to create DAS address) to pull out the flanking regions 100bp either side of the SNP (Image: UCSC DAS results).
Designing suitable primers: I used the sequence results from the DAS query in Primer3 (Image: Primer3 Input) with the parameters Targets=100,2 (primers surrounding the 2 bases at position 100), ProductSizeRange=30-75 (I want short products) and NumberToReturn=50 (50 sets of primers). Hey presto, PCR primers (Image: Primer3 Output). A scripted version of the Primer3 step is sketched below.
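
For completeness, here are the same Primer3 settings written as a Boulder-IO record for the command-line primer3_core tool. The tag names assume Primer3 v2 (the web form takes the same values under slightly different labels), and flank is a placeholder for the 201bp sequence returned by the DAS query:

```python
# Build a Primer3 Boulder-IO input mirroring the web settings above.
# Tag names assume Primer3 v2; 'flank' stands in for the real sequence.
flank = "ACGT..."  # paste the 201bp DAS sequence here (SNP at position 100)

record = "\n".join([
    "SEQUENCE_ID=rs10799216_flank",
    f"SEQUENCE_TEMPLATE={flank}",
    "SEQUENCE_TARGET=100,2",             # surround the 2 bases at position 100
    "PRIMER_PRODUCT_SIZE_RANGE=30-75",   # short products only
    "PRIMER_NUM_RETURN=50",              # ask for 50 primer pairs
    "=",                                 # Boulder-IO record terminator
]) + "\n"

with open("primer3_input.txt", "w") as out:
    out.write(record)
# then run: primer3_core < primer3_input.txt
```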

Now I can get on and order some primers to test in the lab, where I feel much more in my comfort zone.




[Screenshots from the original post: UCSC rsID search; UCSC rsID results; Excel to create DAS address; UCSC DAS results; Primer3 Input; Primer3 Output; SNAP from the Broad; SNAP query; SNAP results.]

Thursday, 8 September 2011

Sequencing versus arrays


This is the big question for many labs today, especially core facilities investing in technology for the longer term. Everyone wants to know when we might replace microarrays with NGS for applications like differential gene expression, splicing analysis, allele-specific expression, copy number variation, loss of heterozygosity, miRNA expression, etc.

I have been considering for some time the question of when CNV-Seq will replace SNP arrays for CNV and structural variation analysis, or when RNA-Seq will replace gene expression arrays. This is part of my job as the head of a genomics core, and it is a bit of the job I really like. From a data-quality perspective I'd say we should have stopped a year or so ago; however, other factors like nucleic acid requirements, ease of sample processing and analysis, cost of data generation and maturity of analysis tools need to be considered before we can make the switch. Over at GenomeWeb they interviewed several researchers about the rapid take-up of array CGH in clinical labs. The interview discusses how a decision to use one technology over another is not always easy; they mention similar pros and cons to those above, some objective and some subjective, and each lab can come up with a different answer. In the interview, John Iafrate, an assistant professor at Harvard Medical School, says it is too early to tell whether NGS will be more sensitive than aCGH, as there is not enough data yet, and that CNV algorithms are still developing and not widely adopted either.

Sequencing vs arrays: For this post I am going to focus primarily on differential gene expression and copy number analysis, and the costs associated with generating the primary data.

How much do gene expression arrays 'cost'? In my lab we run lots of Illumina HT12 arrays; over 4,000 in the last five years. We have an internal charge of £125 that includes the array, labelling of cRNA, personnel and service contracts on the scanner. This is not FEC accounting, but we try to cover the major expenditures. The HT12 format is very amenable to high-throughput processing and we typically run 48-192 samples as a batch, using 250ng of RNA as input. We have a fantastic bioinformatics core that has implemented an analysis pipeline based on beadarray to generate differentially expressed gene lists. Usually gene lists are ready for the user to look at 5-7 days after receipt of samples.

How much do snpCGH arrays 'cost'? I have been involved in outsourcing some snpCGH projects from my lab. We have used Aros in Denmark, who provide Affymetrix SNP6.0 and Illumina Omni1-Quad arrays at £250-350 and £290-360 each respectively, dependent on volume. Aros are a commercial service provider and these costs include the arrays and service provision. They typically run a minimum batch of 96 samples, using 500ng of DNA as input. We have had very good experiences with them so far. Our bioinformatics core uses a pipeline based around Affymetrix Power Tools (APT) and DNAcopy (Bioconductor) to generate segmented copy number calls, as well as structural variation and LOH analyses. Again, data can be ready in a matter of weeks, usually 2-4 from sample receipt by Aros.

How much does the sequencing cost? I generated the charts below to show what the cost per sample of a CNV/SV-Seq or RNA-Seq data set might be. I assumed we would use PE50bp runs; that 10M, 20M or 100M reads are required to generate data of the same quality as an array; and I included sample prep using TruSeq. Illumina TruSeq v3 SBS chemistry yields about 300M reads per HiSeq lane. There are two charts because including the 100M-read costs makes the axis too difficult to interpret for the 10M and 20M read costs.

[Charts from the original post: cost per sample at 10M, 20M and 100M reads per sample.]

A PE50 lane costs £750 at CRI, so we can run three multiplexed samples per lane for about the same cost as a snpCGH array (~£300 per sample) and get 100M reads per sample; or run ten multiplexed samples per lane for the same cost as an HT12 array (£125) and get 30M reads per sample. If we really can perform differential gene expression with only 3M reads, then we can multiplex 96 samples per lane. Pretty cool, I think! The arithmetic is sketched below.
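
Here's that per-sample arithmetic as a quick sketch, using the lane cost and yield quoted above plus a rough £45 for library prep (my own midpoint guess based on the TruSeq costs mentioned earlier, not an official price):

```python
# Per-sample cost and reads when multiplexing a PE50 HiSeq lane.
# £750/lane and ~300M reads/lane are the post's figures; the £45
# library prep is a rough assumption, not an official price.
LANE_COST, LANE_READS, PREP = 750, 300e6, 45

for n_plex in (3, 10, 96):
    cost = LANE_COST / n_plex + PREP
    reads = LANE_READS / n_plex
    print(f"{n_plex:3d}-plex: ~£{cost:.0f}/sample, {reads / 1e6:.1f}M reads")
```
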
So why have we not just switched? I have followed the developments in snpCGH, mainly Omni and Axiom, and as we started to look at changing to a newer array in the institute I immediately thought we might skip straight to NGS instead. The cost of generating a sequencing data set is dropping significantly (see above); however, the other costs need to be taken into account, and the sample prep and analysis pipelines are issues that I think need particularly careful consideration. The copy number analysis could probably be made from low-coverage sequencing data sets, perhaps just 4-fold genome coverage? But structural variation and LOH may require more data, and 1M SNPs from an array are cheap today. I do not believe it is clear what coverage is required to generate CNV and LOH data of the same quality as arrays. For gene expression the outlook is more promising: as long as users want differential gene expression only, then as few as 1-5M reads could be enough to generate data comparable to arrays.

Yet today we still have recommendations to use 1ug of nucleic acid for these sample prep protocols (I know the requirements are dropping), and the analysis tools are still changing very rapidly, so pipelines are slow to develop.

I hope we can be one of the first institutes to say we've swapped from arrays to sequencing for general GX or CNV analysis, and I think we're pretty close to doing so. However, it is always going to be after careful consideration of what is best for the project being discussed. Some of those projects are going to use arrays for at least another couple of years.

PS: Ideas on LOH analysis from low-coverage sequencing data: I do wonder if LOH might be possible from low-coverage sequencing if the analysis does not focus on high-quality individual SNP calls, but instead uses a haplotyping approach based on lower-coverage, lower-quality SNP calls aggregated over windows of 10-100 SNPs. Is anyone working on this? A toy version of the idea is sketched below.
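
Purely to illustrate the windowed idea, here's a toy sketch: rather than trusting any individual low-coverage genotype, pool heterozygous/homozygous flags into windows and flag windows where heterozygosity collapses. The window size and threshold are arbitrary choices, not a tested method:

```python
# Toy illustration of windowed LOH calling from noisy genotype flags.
# The window size and threshold are arbitrary, untested assumptions.
def loh_windows(het_flags, window=50, threshold=0.1):
    """het_flags: 0/1 per SNP in genome order (1 = called heterozygous)."""
    candidates = []
    for start in range(0, len(het_flags) - window + 1, window):
        chunk = het_flags[start:start + window]
        het_fraction = sum(chunk) / window
        if het_fraction < threshold:  # heterozygosity collapses: possible LOH
            candidates.append((start, start + window, het_fraction))
    return candidates

# e.g. 200 'normal' SNPs (~1 in 3 het) followed by 100 SNPs in an LOH region
flags = ([1, 0, 0] * 67)[:200] + [0] * 100
print(loh_windows(flags))
```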

Friday, 2 September 2011

Next generation sequencing acronyms

The list at the bottom is occasionally edited to include new methods.

There is an ever-increasing number of next generation sequencing acronyms (see the link to a table at the bottom of this post). Other than the commonly used ChIP-Seq and RNA-Seq, most have been used only once, in the article where the acronym first appeared. I suspect authors are trying to come up with something as catchy as "ChIP-Seq" in the hope that their paper will be cited more frequently. A search through PubMed suggests otherwise.

My list contains just over 30 acronyms presented in papers, posters or meetings over the past few years. A search for these in PubMed brings up almost 850 papers. RNA-Seq (including mRNA-Seq and RNASeq) accounts for 48%, ChIP-Seq for 47%, CLIP-Seq (including RIP-Seq and HITS-CLIP) for 2%, and about 1% of papers are versions of methylation sequencing.

The 2007 ChIP-Seq paper "Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing" in Nat Methods started this sequencing acronym name game. However, this game has been played out before with other technologies.

I am certain I will have missed a few, so feel free to comment and I'll update the table with the information. I'd also like to ask authors and editors to think twice before adding any more to this list unless there is a strongly compelling reason.

Here's my list: ALEXA-Seq, Apopto-Seq, AutoMeDip-Seq, Bind-n-Seq, Bisulfite-Seq, ChIA-PET, ChIP-Seq, ChIRP-seq, CLIP-Seq, CNV-Seq, Degradome-seq, DGE-Seq, DNA-Seq, DNase-Seq, dRNA-Seq, F-Seq, FAIRE-seq, FRT-Seq, HITS-CLIP, Immune-Seq, indel-Seq, MBD-Seq, MeDIP-Seq, MethylCap-Seq, microRNA-Seq, mRNA-Seq, NA-Seq, NET-Seq, NOMe-seq, NSR-Seq, PAS-Seq, Peak-Seq, PhIP-Seq, Protein-seq, ReChIP-Seq, RIP-Seq, RIT-seq, RNA-Seq, RNASeq, rSW-Seq, SAGE-Seq, Seq-Array, Sono-Seq, Sort-seq, Tn-Seq.

Others added by you:  (now all moved to the list above)
One comment suggested I should have labelled the post "Sequencing abbreviations". Having now taken the time to look at the difference between an acronym and an abbreviation, I think there is a bit of both here, so I'll leave it for now.


Here's the data...