Classification analysis of DNA microarrays

Bibliographic Information

Classification analysis of DNA microarrays

Leif E. Peterson

(Wiley series on bioinformatics : computational techniques and engineering / series editors, Yi Pan and Albert Y. Zomaya)

John Wiley & Sons, c2013

Note

Includes bibliographical references and index

Description and Table of Contents

Description

Wiley Series in Bioinformatics: Computational Techniques and Engineering. Yi Pan and Albert Y. Zomaya, Series Editors.

Wide coverage of traditional unsupervised and supervised methods and newer contemporary approaches that help researchers handle the rapid growth of classification methods in DNA microarray studies.

Proliferating classification methods in DNA microarray studies have resulted in a body of information scattered throughout the literature, conference proceedings, and elsewhere. This book unites many of these classification methods in a single volume. In addition to traditional statistical methods, it covers newer machine-learning approaches such as fuzzy methods, artificial neural networks, evolutionary-based genetic algorithms, support vector machines, swarm intelligence involving particle swarm optimization, and more.

Classification Analysis of DNA Microarrays provides highly detailed pseudo-code and rich, graphical programming features, plus ready-to-run source code. Along with primary methods that include traditional and contemporary classification, it offers supplementary tools and data preparation routines for standardization and fuzzification; dimensional reduction via crisp and fuzzy c-means, PCA, and non-linear manifold learning; computational linguistics via text analytics and n-gram analysis; recursive feature extraction during ANN training; kernel-based methods; and ensemble classifier fusion.

This powerful new resource provides information on the use of classification analysis for DNA microarrays used for large-scale high-throughput transcriptional studies; serves as a historical repository of general-use supervised classification methods as well as newer contemporary methods; brings the reader quickly up to speed on the various classification methods by implementing the programming pseudo-code and source code provided in the book; and describes implementation methods that help shorten discovery times.

Classification Analysis of DNA Microarrays is useful for professionals and graduate students in computer science, bioinformatics, biostatistics, systems biology, and many related fields.

Table of Contents

Preface
Abbreviations
1 Introduction
  1.1 Class Discovery
  1.2 Dimensional Reduction
  1.3 Class Prediction
  1.4 Classification Rules of Thumb
  1.5 DNA Microarray Datasets Used
  References
Part I Class Discovery
2 Crisp K-Means Cluster Analysis
  2.1 Introduction
  2.2 Algorithm
  2.3 Implementation
  2.4 Distance Metrics
  2.5 Cluster Validity
    2.5.1 Davies-Bouldin Index
    2.5.2 Dunn's Index
    2.5.3 Intracluster Distance
    2.5.4 Intercluster Distance
    2.5.5 Silhouette Index
    2.5.6 Hubert's Statistic
    2.5.7 Randomization Tests for Optimal Value of K
  2.6 V-Fold Cross-Validation
  2.7 Cluster Initialization
    2.7.1 K Randomly Selected Microarrays
    2.7.2 K Random Partitions
    2.7.3 Prototype Splitting
  2.8 Cluster Outliers
  2.9 Summary
  References
3 Fuzzy K-Means Cluster Analysis
  3.1 Introduction
  3.2 Fuzzy K-Means Algorithm
  3.3 Implementation
  3.4 Summary
  References
4 Self-Organizing Maps
  4.1 Introduction
  4.2 Algorithm
    4.2.1 Feature Transformation and Reference Vector Initialization
    4.2.2 Learning
    4.2.3 Conscience
  4.3 Implementation
    4.3.1 Feature Transformation and Reference Vector Initialization
    4.3.2 Reference Vector Weight Learning
  4.4 Cluster Visualization
    4.4.1 Crisp K-Means Cluster Analysis
    4.4.2 Adjacency Matrix Method
    4.4.3 Cluster Connectivity Method
    4.4.4 Hue-Saturation-Value (HSV) Color Normalization
  4.5 Unified Distance Matrix (U Matrix)
  4.6 Component Map
  4.7 Map Quality
  4.8 Nonlinear Dimension Reduction
  References
5 Unsupervised Neural Gas
  5.1 Introduction
  5.2 Algorithm
  5.3 Implementation
    5.3.1 Feature Transformation and Prototype Initialization
    5.3.2 Prototype Learning
  5.4 Nonlinear Dimension Reduction
  5.5 Summary
  References
6 Hierarchical Cluster Analysis
  6.1 Introduction
  6.2 Methods
    6.2.1 General Programming Methods
    6.2.2 Step 1: Cluster-Analyzing Arrays as Objects with Genes as Attributes
    6.2.3 Step 2: Cluster-Analyzing Genes as Objects with Arrays as Attributes
  6.3 Algorithm
  6.4 Implementation
    6.4.1 Heatmap Color Control
    6.4.2 User Choices for Clustering Arrays and Genes
    6.4.3 Distance Matrices and Agglomeration Sequences
    6.4.4 Drawing Dendograms and Heatmaps
  References
7 Model-Based Clustering
  7.1 Introduction
  7.2 Algorithm
  7.3 Implementation
  7.4 Summary
  References
8 Text Mining: Document Clustering
  8.1 Introduction
  8.2 Duo-Mining
  8.3 Streams and Documents
  8.4 Lexical Analysis
    8.4.1 Automatic Indexing
    8.4.2 Removing Stopwords
  8.5 Stemming
  8.6 Term Weighting
  8.7 Concept Vectors
  8.8 Main Terms Representing Concept Vectors
  8.9 Algorithm
  8.10 Preprocessing
  8.11 Summary
  References
9 Text Mining: N-Gram Analysis
  9.1 Introduction
  9.2 Algorithm
  9.3 Implementation
  9.4 Summary
  References
Part II Dimension Reduction
10 Principal Components Analysis
  10.1 Introduction
  10.2 Multivariate Statistical Theory
    10.2.1 Matrix Definitions
    10.2.2 Principal Component Solution of R
    10.2.3 Extraction of Principal Components
    10.2.4 Varimax Orthogonal Rotation of Components
    10.2.5 Principal Component Score Coefficients
    10.2.6 Principal Component Scores
  10.3 Algorithm
  10.4 When to Use Loadings and PC Scores
  10.5 Implementation
    10.5.1 Correlation Matrix R
    10.5.2 Eigenanalysis of Correlation Matrix R
    10.5.3 Determination of Loadings and Varimax Rotation
    10.5.4 Calculating Principal Component (PC) Scores
  10.6 Rules of Thumb For PCA
  10.7 Summary
  References
11 Nonlinear Manifold Learning
  11.1 Introduction
  11.2 Correlation-Based PCA
  11.3 Kernel PCA
  11.4 Diffusion Maps
  11.5 Laplacian Eigenmaps
  11.6 Local Linear Embedding
  11.7 Locality Preserving Projections
  11.8 Sammon Mapping
  11.9 NLML Prior to Classification Analysis
  11.10 Classification Results
  11.11 Summary
  References
Part III Class Prediction
12 Feature Selection
  12.1 Introduction
  12.2 Filtering versus Wrapping
  12.3 Data
    12.3.1 Numbers
    12.3.2 Responses
    12.3.3 Measurement Scales
    12.3.4 Variables
  12.4 Data Arrangement
  12.5 Filtering
    12.5.1 Continuous Features
    12.5.2 Best Rank Filters
    12.5.3 Randomization Tests
    12.5.4 Multitesting Problem
    12.5.5 Filtering Qualitative Features
    12.5.6 Multiclass Gini Diversity Index
    12.5.7 Class Comparison Techniques
    12.5.8 Generation of Nonredundant Gene List
  12.6 Selection Methods
    12.6.1 Greedy Plus Takeaway (Greedy PTA)
    12.6.2 Best Ranked Genes
  12.7 Multicollinearity
  12.8 Summary
  References
13 Classifier Performance
  13.1 Introduction
  13.2 Input-Output, Speed, and Efficiency
  13.3 Training, Testing, and Validation
  13.4 Ensemble Classifier Fusion
  13.5 Sensitivity and Specificity
  13.6 Bias
  13.7 Variance
  13.8 Receiver-Operator Characteristic (ROC) Curves
  References
14 Linear Regression
  14.1 Introduction
  14.2 Algorithm
  14.3 Implementation
  14.4 Cross-Validation Results
  14.5 Bootstrap Bias
  14.6 Multiclass ROC Curves
  14.7 Decision Boundaries
  14.8 Summary
  References
15 Decision Tree Classification
  15.1 Introduction
  15.2 Features Used
  15.3 Terminal Nodes and Stopping Criteria
  15.4 Algorithm
  15.5 Implementation
  15.6 Cross-Validation Results
  15.7 Decision Boundaries
  15.8 Summary
  References
16 Random Forests
  16.1 Introduction
  16.2 Algorithm
  16.3 Importance Scores
  16.4 Strength and Correlation
  16.5 Proximity and Supervised Clustering
  16.6 Unsupervised Clustering
  16.7 Class Outlier Detection
  16.8 Implementation
  16.9 Parameter Effects
  16.10 Summary
  References
17 K Nearest Neighbor
  17.1 Introduction
  17.2 Algorithm
  17.3 Implementation
  17.4 Cross-Validation Results
  17.5 Bootstrap Bias
  17.6 Multiclass ROC Curves
  17.7 Decision Boundaries
  17.8 Summary
  References
18 Naïve Bayes Classifier
  18.1 Introduction
  18.2 Algorithm
  18.3 Cross-Validation Results
  18.4 Bootstrap Bias
  18.5 Multiclass ROC Curves
  18.6 Decision Boundaries
  18.7 Summary
  References
19 Linear Discriminant Analysis
  19.1 Introduction
  19.2 Multivariate Matrix Definitions
  19.3 Linear Discriminant Analysis
    19.3.1 Algorithm
    19.3.2 Cross-Validation Results
    19.3.3 Bootstrap Bias
    19.3.4 Multiclass ROC Curves
    19.3.5 Decision Boundaries
  19.4 Quadratic Discriminant Analysis
  19.5 Fisher's Discriminant Analysis
  19.6 Summary
  References
20 Learning Vector Quantization
  20.1 Introduction
  20.2 Cross-Validation Results
  20.3 Bootstrap Bias
  20.4 Multiclass ROC Curves
  20.5 Decision Boundaries
  20.6 Summary
  References
21 Logistic Regression
  21.1 Introduction
  21.2 Binary Logistic Regression
  21.3 Polytomous Logistic Regression
  21.4 Cross-Validation Results
  21.5 Decision Boundaries
  21.6 Summary
  References
22 Support Vector Machines
  22.1 Introduction
  22.2 Hard-Margin SVM for Linearly Separable Classes
  22.3 Kernel Mapping into Nonlinear Feature Space
  22.4 Soft-Margin SVM for Nonlinearly Separable Classes
  22.5 Gradient Ascent Soft-Margin SVM
    22.5.1 Cross-Validation Results
    22.5.2 Bootstrap Bias
    22.5.3 Multiclass ROC Curves
    22.5.4 Decision Boundaries
  22.6 Least-Squares Soft-Margin SVM
    22.6.1 Cross-Validation Results
    22.6.2 Bootstrap Bias
    22.6.3 Multiclass ROC Curves
    22.6.4 Decision Boundaries
  22.7 Summary
  References
23 Artificial Neural Networks
  23.1 Introduction
  23.2 ANN Architecture
  23.3 Basics of ANN Training
    23.3.1 Backpropagation Learning
    23.3.2 Resilient Backpropagation (RPROP) Learning
    23.3.3 Cycles and Epochs
  23.4 ANN Training Methods
    23.4.1 Method 1: Gene Dimensional Reduction and Recursive Feature Elimination for Large Gene Lists
    23.4.2 Method 2: Gene Filtering and Selection
  23.5 Algorithm
  23.6 Batch versus Online Training
  23.7 ANN Testing
  23.8 Cross-Validation Results
  23.9 Bootstrap Bias
  23.10 Multiclass ROC Curves
  23.11 Decision Boundaries
  23.12 RPROP versus Backpropagation
  23.13 Summary
  References
24 Kernel Regression
  24.1 Introduction
  24.2 Algorithm
  24.3 Cross-Validation Results
  24.4 Bootstrap Bias
  24.5 Multiclass ROC Curves
  24.6 Decision Boundaries
  24.7 Summary
  References
25 Neural Adaptive Learning with Metaheuristics
  25.1 Multilayer Perceptrons
  25.2 Genetic Algorithms
  25.3 Covariance Matrix Self-Adaptation-Evolution Strategies
  25.4 Particle Swarm Optimization
  25.5 ANT Colony Optimization
    25.5.1 Classification
    25.5.2 Continuous-Function Approximation
  25.6 Summary
  References
26 Supervised Neural Gas
  26.1 Introduction
  26.2 Algorithm
  26.3 Cross-Validation Results
  26.4 Bootstrap Bias
  26.5 Multiclass ROC Curves
  26.6 Class Decision Boundaries
  26.7 Summary
  References
27 Mixture of Experts
  27.1 Introduction
  27.2 Algorithm
  27.3 Cross-Validation Results
  27.4 Decision Boundaries
  27.5 Summary
  References
28 Covariance Matrix Filtering
  28.1 Introduction
  28.2 Covariance and Correlation Matrices
  28.3 Random Matrices
  28.4 Component Subtraction
  28.5 Covariance Matrix Shrinkage
  28.6 Covariance Matrix Filtering
  28.7 Summary
  References
Appendixes
A Probability Primer
  A.1 Choices
  A.2 Permutations
  A.3 Combinations
  A.4 Probability
    A.4.1 Addition Rule
    A.4.2 Multiplication Rule and Conditional Probabilities
    A.4.3 Multiplication Rule for Independent Events
    A.4.4 Elimination Rule (Disease Prevalence)
    A.4.5 Bayes' Rule (Pathway Probabilities)
B Matrix Algebra
  B.1 Vectors
  B.2 Matrices
  B.3 Sample Mean, Covariance, and Correlation
  B.4 Diagonal Matrices
  B.5 Identity Matrices
  B.6 Trace of a Matrix
  B.7 Eigenanalysis
  B.8 Symmetric Eigenvalue Problem
  B.9 Generalized Eigenvalue Problem
  B.10 Matrix Properties
C Mathematical Functions
  C.1 Inequalities
  C.2 Laws of Exponents
  C.3 Laws of Radicals
  C.4 Absolute Value
  C.5 Logarithms
  C.6 Product and Summation Operators
  C.7 Partial Derivatives
  C.8 Likelihood Functions
D Statistical Primitives
  D.1 Rules of Thumb
  D.2 Primitives
  References
E Probability Distributions
  E.1 Basics of Hypothesis Testing
  E.2 Probability Functions: Source of p Values
  E.3 Normal Distribution
  E.4 Gamma Function
  E.5 Beta Function
  E.6 Pseudo-Random-Number Generation
    E.6.1 Standard Uniform Distribution
    E.6.2 Normal Distribution
    E.6.3 Lognormal Distribution
    E.6.4 Binomial Distribution
    E.6.5 Poisson Distribution
    E.6.6 Triangle Distribution
    E.6.7 Log-Triangle Distribution
  References
F Symbols and Notation
Index

by "Nielsen BookData"
