Kernel-based Data Fusion for Machine Learning: Methods and Applications in Bioinformatics and Text Mining

Authors: Shi Yu, Léon-Charles Tranchevent

File size: 3.26 MB
Pages: 228

The emerging problem of data fusion offers plenty of opportunities and also raises many interdisciplinary challenges in computational biology. Currently, developments in high-throughput technologies generate terabytes of genomic data at an awesome rate. How to combine and leverage this massive number of data sources to obtain significant and complementary high-level knowledge is a state-of-the-art interest in the statistics, machine learning and bioinformatics communities.

Incorporating various learning methods with multiple data sources is a rather recent topic. In the first part of the book, we theoretically investigate a set of learning algorithms in statistics and machine learning. We find that many of these algorithms can be formulated in a unified mathematical model, the Rayleigh quotient, and can be extended to dual representations on the basis of kernel methods. Using the dual representations, the task of learning with multiple data sources is related to kernel-based data fusion, which has been actively studied over the past five years.
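As a brief illustration of the Rayleigh quotient mentioned above, the sketch below evaluates r(w) = (w^T A w) / (w^T w) for a symmetric matrix and approximates its maximizer by power iteration. This is only a generic textbook form under stated assumptions: the matrix `A`, the helper names and the choice of power iteration are illustrative and are not taken from the book, whose specific formulations are not reproduced here.

```python
# Illustrative sketch: the Rayleigh quotient of a symmetric matrix (pure Python).
# The matrix A and all helper names are hypothetical examples, not from the book.

def mat_vec(A, w):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) for row in A]

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))

def rayleigh_quotient(A, w):
    """Return (w^T A w) / (w^T w) for a nonzero vector w."""
    return dot(w, mat_vec(A, w)) / dot(w, w)

def power_iteration(A, num_iters=200):
    """Approximate the vector that maximizes the Rayleigh quotient.

    Assumes A is symmetric positive semidefinite with a unique largest
    eigenvalue, and that the start vector is not orthogonal to the
    corresponding eigenvector.
    """
    w = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(num_iters):
        v = mat_vec(A, w)
        norm = dot(v, v) ** 0.5
        w = [x / norm for x in v]  # renormalize to unit length
    return w

A = [[2.0, 1.0],
     [1.0, 2.0]]  # symmetric, with eigenvalues 1 and 3

w_star = power_iteration(A)
lam = rayleigh_quotient(A, w_star)  # close to the largest eigenvalue of A
```

For a symmetric matrix, the maximum of the Rayleigh quotient equals the largest eigenvalue and is attained at the corresponding eigenvector, which is why many eigenvalue-based learning methods can be written in this form.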

In the second part of the book, we create several novel algorithms for supervised and unsupervised learning. We center our discussion on the feasibility and the efficiency of multi-source learning on large-scale heterogeneous data sources. These new algorithms show promise for solving a wide range of emerging problems in bioinformatics and text mining.

In the third part of the book, we substantiate the value of the proposed algorithms in several real bioinformatics and journal scientometrics applications. These applications are algorithmically categorized as ranking problems and clustering problems. In ranking, we develop a multi-view text mining methodology that combines different text mining models for disease-relevant gene prioritization. Moreover, we consolidate our data sources and algorithms in a gene prioritization software package, which is characterized as a novel kernel-based approach to combining text mining data with heterogeneous genomic data sources using phylogenetic evidence across multiple species. In clustering, we combine multiple text mining models and multiple genomic data sources to identify disease-relevant partitions of genes. We also apply our methods in the field of scientometrics to reveal the topic patterns of scientific publications. Using text mining techniques, we create multiple lexical models for more than 8000 journals retrieved from the Web of Science database. We also construct multiple interaction graphs by investigating the citations among these journals. These two types of information (lexical and citation) are combined to automatically construct a structural clustering of the journals. According to a systematic benchmark study, in both ranking and clustering problems, machine learning performance is significantly improved by the thorough combination of heterogeneous data sources and data representations.
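To make the idea of combining two views of the same objects concrete, the sketch below builds one kernel matrix per view and fuses them with a nonnegative weighted sum. It is a minimal sketch under stated assumptions, not the algorithms proposed in the book: the toy feature vectors, the function names, the linear kernel, the cosine normalization and the equal weights are all illustrative choices.

```python
# Illustrative sketch: fusing two data views by a weighted sum of kernel matrices.
# All data, names and weights are hypothetical and not taken from the book.

def linear_kernel(X):
    """Gram matrix of inner products between all pairs of rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def normalize_kernel(K):
    """Scale K so that every diagonal entry is 1 (requires a nonzero diagonal)."""
    n = len(K)
    d = [K[i][i] ** 0.5 for i in range(n)]
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices defined over the same set of samples."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative to keep the sum a valid kernel")
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Three objects (for example journals) described by two toy views.
lexical_view = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # made-up term weights
citation_view = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # made-up citation features

K_lex = normalize_kernel(linear_kernel(lexical_view))
K_cit = normalize_kernel(linear_kernel(citation_view))
K_fused = combine_kernels([K_lex, K_cit], [0.5, 0.5])
```

A nonnegative weighted sum of positive semidefinite kernel matrices is again positive semidefinite, so `K_fused` can be passed to any kernel method that accepts a precomputed kernel. Normalizing each kernel first keeps one view from dominating simply because of its scale.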

The topics presented in this book are meant for researchers, scientists and engineers who use Support Vector Machines or, more generally, statistical learning methods. Several topics addressed in the book may also be of interest to computational biologists and bioinformaticians who want to tackle data fusion challenges in real applications. This book can also be used as reference material for graduate courses such as machine learning and data mining. The background required of the reader is a good knowledge of data mining, machine learning and linear algebra.

This book is the product of our years of work in the Bioinformatics group of the Electrical Engineering department of the Katholieke Universiteit Leuven. It has been an exciting journey full of learning and growth, in a relaxing and quite Gothic town. We have been accompanied by many interesting colleagues and friends. This will go down as a memorable experience, as well as one that we treasure. We would like to express our heartfelt gratitude to Johan Suykens for introducing us to kernel methods in the early days. The mathematical expressions and the structure of the book were significantly improved thanks to his concrete and rigorous suggestions. We were inspired by the interesting work presented by Tijl De Bie on kernel fusion. Since then, we have been attracted to the topic, and Tijl has had many insightful discussions with us on various topics; the communication has continued even after he moved to Bristol. Next, we would like to convey our gratitude and respect to some of our colleagues. We wish to particularly thank S. Van Vooren, B. Coessen, F. Janssens, C. Alzate, K. Pelckmans, F. Ojeda, S. Leach, T. Falck, A. Daemen, X. H. Liu, T. Adefioye and E. Iacucci for their insightful contributions on various topics and applications. We are grateful to W. Glänzel for his contribution of the Web of Science data set in several of our publications.

This research was supported by the Research Council KUL (ProMeta, GOA Ambiorics, GOA MaNet, CoE EF/05/007 SymBioSys, KUL PFV/10/016), FWO (G.0318.05, G.0553.06, G.0302.07, G.0733.09, G.082409), IWT (Silicos, SBO-BioFrame, SBO-MoKa, TBM-IOTA3), FOD (Cancer plans), the Belgian Federal Science Policy Office (IUAP P6/25 BioMaGNet, Bioinformatics and Modeling: from Genomes to Networks), and the EU-RTD (ERNSI: European Research Network on System Identification, FP7-HEALTH CHeartED).