Cancer Research Based on High-Throughput Sequencing Technology
Lin Zhao (林钊)
linzhao@genomics.cn
Cancer Background
CANCER GENOMICS
• Cancers are caused by changes that have occurred in the DNA sequence of the genomes of cancer cells
• Characteristic: high heterogeneity across different cancer tissues and different developmental stages
• Targets:
  • A comprehensive catalogue of somatic mutations in cancer samples
  • Identification of further potentially druggable cancer genes
  • Use of somatic mutations as biomarkers for prognosis
Hypothesis-driven -> data-driven, large-scale analysis
Problems and difficulties of classical methods
• Rare variants cannot be detected; only variants with MAF > 5% are captured.
• Rare SNPs can be true disease-risk variants.
• Classical methods only looked at cancer cells and sequenced genes known or suspected to be linked to cancer, so they may overlook key mutations, especially new ones.
• Candidate genes are chosen by hypothesis, with a long cycle time and a low success rate.
MR Stratton et al. Nature 458, 719-724 (2009)
All of these can be solved by sequencing
It's time to sequence!
Overview of Cancer Solutions
Solution | Research design | Deliverables
Exome sequencing | 100 tumor and 100 control, 50X/sample | SNV, indel
Whole genome sequencing | 10 groups (blood + tumor tissue), 30X per sample | SNV, indel, CNV, SV, virus integrations or rearrangements
Cell line | Whole genome sequencing: 50X with 170-800 bp PE libraries; 20X with 2-40 kbp PE libraries | SNV, indel
Single-cell sequencing | 50X exome of 20 normal and 100 tumor single cells | SNV, SV, novel sequence by assembly
Cancer Solution 1: Exome sequencing
100 tumor and 100 control samples, >50X per sample
Background:
• High heterogeneity within the same cancer tissue
• Hundreds of cases must be sequenced to identify a cancer gene that is mutated in
Scientific goal:
• Detect most of the somatic mutations
• Try to distinguish driver from passenger mutations
Analysis Pipeline
Exome sequencing: >50× depth
SNVs:
1. Alignment with SOAPaligner
2. SNVs detected by SNVdetector or other software
3. Quality control
4. Potential somatic SNVs
5. Excluding SNVs in dbSNP/YH/1000 Genomes
6. Somatic mutations
Indels (short reads):
1. Alignment to reference genome
2. Indels detected by SoapSV or other software
3. Excluding indels in dbSNP/YH/1000 Genomes
4. Filtering out indels in normal tissues
5. Somatic indels
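The filtering steps of the pipeline can be sketched in a few lines. This is a minimal illustration, not the SOAPaligner/SNVdetector/SoapSV implementation: the `Variant` record, the `somatic_candidates` function and the example calls are all hypothetical, and the sketch assumes variants are compared by exact chromosome, position, reference and alternate allele. It applies both the known-variant exclusion and the matched-normal filter to one call set.

```python
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a real tool's format)."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


def somatic_candidates(tumor_calls: Iterable[Variant],
                       normal_calls: Iterable[Variant],
                       known_variants: Set[Variant]) -> List[Variant]:
    """Keep tumor variants that are absent from the matched normal
    sample and from known-variant sets (e.g. dbSNP, YH, 1000 Genomes)."""
    excluded = set(normal_calls) | set(known_variants)
    return [v for v in tumor_calls if v not in excluded]


# Made-up calls for illustration only.
tumor = [
    Variant("chr1", 1000, "C", "T"),
    Variant("chr2", 2000, "G", "A"),   # also seen in the normal sample
    Variant("chr3", 3000, "A", "AT"),  # also present in the known set
]
normal = [Variant("chr2", 2000, "G", "A")]
known = {Variant("chr3", 3000, "A", "AT")}

result = somatic_candidates(tumor, normal, known)
print(result)
```

Exact-match comparison is the simplest possible choice; real pipelines would also need to normalize indel representation before comparing.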
Sequencing Data Production
Normal
Sequencing analysis GC-201 GC-202 GC-203 GC-204 GC-205 GC-206 GC-207 GC-208 GC-209 GC-210
Total effective reads(M) 11.7 11.76 11.75 11.83 21.44 12.19 12.46 21.02 21.52 9.33
Total effective yield(Mb) 856.08 861.88 808.88 823.36 1558.41 899.66 915.95 1509.57 1558.94 746.08
Effective sequence on target(Mb) 334.59 302.87 290.43 281.15 550.05 318.27 321.69 529.31 549.31 293.7
Average sequencing depth on target 9.81 8.88 8.51 8.24 16.13 9.33 9.43 15.52 16.1 8.61
Coverage of target region 92.7% 90.8% 91.8% 92.5% 94.3% 93.2% 92.0% 94.3% 94.6% 92.2%
Tumor
Sequencing analysis GC-201 GC-202 GC-203 GC-204 GC-205 GC-206 GC-207 GC-208 GC-209 GC-210
Total effective reads(M) 40.08 37.04 32.16 32.7 37.62 35.96 32.1 37.15 34.95 44.38
Total effective yield(Mb) 2930.61 2831.84 2395.21 2433.29 2864.62 2728.28 2381.05 2823.45 2644.37 3550.2
Effective sequence on target(Mb) 1075.9 971.22 824.17 851.02 1040.93 986.48 865.74 1024.37 995.13 1397.8
Average sequencing depth on target 31.54 28.47 24.16 24.95 30.52 28.92 25.38 30.03 29.18 40.98
Coverage of target region 95.5% 94.8% 94.8% 95.1% 95.0% 95.2% 94.6% 95.0% 95.3% 95.5%
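The rows of the two tables are related by simple arithmetic, which a short sketch can make explicit. It assumes that average depth on target equals on-target sequence divided by target-region size; that relation and the function names are assumptions for illustration, and the target size is not stated in the slides. The input numbers are the GC-201 columns from the tables above.

```python
def on_target_fraction(on_target_mb: float, total_yield_mb: float) -> float:
    """Share of the effective yield that falls on the target region."""
    return on_target_mb / total_yield_mb


def implied_target_size_mb(on_target_mb: float, mean_depth: float) -> float:
    """Target size implied by depth = on-target sequence / target size
    (an assumed relation, not stated in the source)."""
    return on_target_mb / mean_depth


# GC-201 values copied from the Normal and Tumor tables.
samples = {
    "GC-201 normal": {"yield_mb": 856.08, "on_target_mb": 334.59, "depth": 9.81},
    "GC-201 tumor": {"yield_mb": 2930.61, "on_target_mb": 1075.9, "depth": 31.54},
}

for label, s in samples.items():
    frac = on_target_fraction(s["on_target_mb"], s["yield_mb"])
    size = implied_target_size_mb(s["on_target_mb"], s["depth"])
    print(f"{label}: on-target fraction {frac:.3f}, implied target {size:.1f} Mb")
```

Under this assumption both samples imply a target region of about 34 Mb, which suggests the depth row was computed against the same target in both tables.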
Schematic diagram of the SNV filtering process and gene annotation
• 8277 somatic SNVs
  • 7517 present in dbSNP and the 1000 Genomes Project
  • 760 (9.2%) new SNVs
    • 414 (54.5%) non-synonymous and splice-site SNVs
    • 346 synonymous and UTR SNVs
• 249 randomly selected SNVs for technical validation: 216 (86.7%) validated
• 357 predicted cancer genes
  • 244 novel predicted cancer genes
  • 113 recorded in COSMIC
SNV profile: SNV spectrum and SNV location
Transcription factor network in 3 pathways
The expression alteration of MUC17
Patients with variant MUC17 had a better prognosis than those with wild-type MUC17
Cancer solution 2: Whole Genome Sequencing
10 groups (blood/normal tissue + tumor tissue), 30X per sample
Background:
• The whole genome, including intron and promoter regions, needs to be examined to find mutations
Research:
• Large-scale analyses of genes in tumors have shown that the mutation load in cancer is abundant, heterogeneous, and widespread
Workflow
DNA sample preparation -> Library construction -> HiSeq 2000 sequencing -> Basic bioinformatics analysis -> Advanced bioinformatics analysis -> Personalized bioinformatics analysis
Analysis modules: alignment; SNV calling and annotation; short InDel calling and annotation; SV calling and annotation; CNV calling and annotation; demographic analysis; selection; others
Mutations Summary
Cancer solution 3: Cell line
Introduction:
Human immortal cancer cell lines: an accessible, easily usable set of biological models
Advantages:
1. Gives a very clear picture of what happened in that cell line
2. Builds a systematic characterization of the genetics and genomics
3. High-accuracy SV and CNV information with a clear pattern
Workflow: de novo sequencing and re-sequencing
Cancer solution 4: Single-cell sequencing
Background:
• A cancer is a mixture of different cells, and tumor is hard to separate from adjacent tissue, so research at the single-cell level is necessary.
Advantages:
• Single-cell sequencing can give the real frequency of mutations
• It can trace how mutations arise during cancer development, through a phylogenetic tree of the sequenced single cells
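One way to read "the real frequency of mutations" is the fraction of sequenced cells that carry each mutation. The sketch below illustrates that reading only; the function name, the cell identifiers and the mutation labels are hypothetical, and it assumes each cell's calls are already available as a set.

```python
from collections import Counter
from typing import Dict, Set


def mutation_frequencies(cells: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return, for each mutation, the fraction of cells carrying it.

    `cells` maps a cell identifier to the set of mutations called in it.
    """
    n_cells = len(cells)
    if n_cells == 0:
        return {}
    counts = Counter(m for muts in cells.values() for m in muts)
    return {mutation: count / n_cells for mutation, count in counts.items()}


# Made-up calls for four cells; the labels are illustrative only.
cells = {
    "cell_1": {"geneA:c.100C>T", "geneB:c.5G>A"},
    "cell_2": {"geneA:c.100C>T"},
    "cell_3": {"geneA:c.100C>T", "geneC:c.7A>G"},
    "cell_4": set(),
}

freqs = mutation_frequencies(cells)
for mutation, freq in sorted(freqs.items()):
    print(f"{mutation}: {freq:.2f}")
```

Because every cell is counted once, the frequency is not diluted by the tumor/normal mixture the way an allele fraction from bulk tissue would be.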
Solution
50X exome of 20 normal and 100 tumor single cells
1. Coverage: 90% at >=1X; 80% at >=10X
2. SNV technical validation rate: 90%
3. Indel technical validation rate: 80%
Deliverables:
1. Point mutations in each cell
2. Mutation frequency spectrum of normal and cancer cells
3. Relationship between different cells
Demo Case: renal cancer tumor (BGI ongoing collaborative project)
• No significant differences in SNP and InDel detection between single-cell sequencing and multi-cell sequencing
• Genetic comparisons among cancer cells, normal cells and leukocytes of two renal cancer patients
Method Evaluation
• Sample set: a single cell from the first Asian genome donor (YH), and a control from the same tissue
• Data set: 13X and 18X for the two replicates
Metric | Single cell 1 | Single cell 2 | Control
Raw data (Gb) | 35.47 | 47.99 | 48.72
Average depth | 13.32 | 17.82 | 18.03
Genome coverage (%) | 95.77 | 94.46 | 99.91
• No obvious genome-wide coverage limitation in single-cell sequencing
• Depth bias is strongly affected by GC content
• Depth is not affected by repeats or chromosome location
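A GC-content effect on depth can be checked by computing the GC fraction of each genomic window and correlating it with the window's mean depth. The sketch below shows that check only: the windows and depths are hypothetical toy values, the helper names are made up, and the slides do not state the window size or the direction of the bias.

```python
import math
from typing import List, Sequence


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the A/C/G/T bases of a sequence."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy windows and depths, invented purely to exercise the functions.
windows: List[str] = ["ATATATATAT", "ATGCATATAT", "GCGCATATGC", "GCGCGCGCGC"]
depths: List[float] = [18.0, 15.0, 9.0, 4.0]

gc = [gc_fraction(w) for w in windows]
r = pearson(gc, depths)
print(gc, round(r, 3))
```

A correlation far from zero on real windows would indicate the kind of GC-dependent depth bias the slide describes.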
1000 single cell sequencing
Analysis pipeline: SNP calling -> Population analysis -> Inferring progression
Mutation types for CCRCC, ET and AML
CCRCC and the two different leukemia samples show no significantly higher proportion (P>0.05) of C:G->T:A than T:A->C:G mutations
Mutation types in BTCC, gastric and colorectal cancers
BTCC has a significantly higher proportion (P<0.01) of C:G->T:A than T:A->C:G mutations, while the others do not.
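The mutation-type labels used here (C:G->T:A, T:A->C:G) come from collapsing each SNV with its reverse-complement equivalent into one of six classes. The sketch below shows that collapsing and the resulting proportions; the function names and the example SNVs are hypothetical, and the significance test behind the P-values is not reproduced because the slides do not say which test was used.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ref: str, alt: str) -> str:
    """Collapse strand-equivalent SNVs into one of six classes,
    written with a pyrimidine reference, e.g. 'C:G->T:A'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in "AG":  # purine reference: switch to the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}->{alt}:{COMPLEMENT[alt]}"


def spectrum(snvs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Proportion of each mutation class among (ref, alt) pairs."""
    counts = Counter(mutation_class(ref, alt) for ref, alt in snvs)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up SNVs for illustration only.
snvs = [("C", "T"), ("G", "A"), ("T", "C"), ("C", "A")]
proportions = spectrum(snvs)
print(proportions)
```

Here C>T and G>A fall in the same class because they describe the same base-pair change seen from opposite strands.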
Differentiating cancer and normal cells by PCA
Legend: + cancer; * normal. Panel: Gastric
• Most cancer types are clearly distinguished, but ET, AML and BTCC are not, reflecting the heterogeneous nature of these cancers.
Legend: + cancer; * normal. Panels: ET, AML
• Phylogenetic trees clearly show subpopulations in AML
Inferring key genes in AML (a typical heterogeneous cancer)
AML consensus tree
Key gene?
Key gene for a subpopulation?
G1~G6: different subpopulations from AML
Key genes are cancer-specific or subpopulation-specific genes mutated at high prevalence during tumor progression
Branches below 50 were merged into one group
Mapping comprehensive progression pathways in five tumors
MP: Metabolic pathway
SP: Signaling pathway
CAM: Cell adhesion molecules
Enrichment cutoff: P-value < 0.01
Legend: related to 1, 2, 3, 4 or 5 cancers
BGI, Your Premier Scientific Partner
Welcome to join us!