Optimization of Cancer Identification and Prognosis Based on DNA Methylation through Machine Learning:
With about 1.9 million new cancer cases diagnosed each year in the United States, early classification and identification of these cancers is vital for accurate and potentially life-saving treatment. In recent years, liquid biopsy DNA methylation testing of tumor suppressor genes, specifically at CpG sites, has gained popularity because it can identify epigenetic cancer biomarkers and predict cancer prognoses correctly. Yet most unsupervised machine-learning systems struggle to analyze comprehensive patterns of methylation levels at candidate genes because the data are high-dimensional. In this project, I investigated data clustering and algorithmic filtration as ways to better analyze methylation levels in the BRCA2, EGFR, MSH6, and CEBPA genes, using 3,579 tumor methylation features from The Cancer Genome Atlas Program. I found that a relatively simple switch from traditional K-means clustering to Gaussian mixture modeling increased accuracy by about 12.3%. Such algorithms offer a promising way to harness the recent surge of new methylation technology to identify and diagnose various cancers, including leukemia and breast, lung, and colon cancer, more accurately and accessibly, and ultimately to save lives.
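To make the contrast between K-means and Gaussian mixture modeling concrete, the following is a minimal sketch on synthetic data, not the project's pipeline and not TCGA data. It assumes a single hypothetical methylation feature (a beta value in [0, 1]) drawn from two groups, one tightly concentrated and one widely spread, and it does not attempt to reproduce the 12.3% figure. The function names `kmeans_1d`, `gmm_1d`, and `accuracy` are illustrative. The point it shows is that K-means splits the axis at the midpoint between centroids, while a mixture model also estimates each cluster's variance and weight, which can move the boundary when the clusters have unequal spread.

```python
import math
import random


def kmeans_1d(xs, iters=100):
    """Two-cluster K-means on scalars; returns (labels, centroids)."""
    c = [min(xs), max(xs)]  # deterministic initialisation at the extremes
    labels = None
    for _ in range(iters):
        new = [0 if abs(x - c[0]) <= abs(x - c[1]) else 1 for x in xs]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for k in (0, 1):
            members = [x for x, l in zip(xs, labels) if l == k]
            if members:
                c[k] = sum(members) / len(members)
    return labels, c


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmm_1d(xs, init_labels, iters=200, var_floor=1e-6):
    """Two-component 1D Gaussian mixture fitted by EM, started from hard labels."""
    n = len(xs)
    w, mu, var = [], [], []
    for k in (0, 1):
        members = [x for x, l in zip(xs, init_labels) if l == k]
        m = sum(members) / len(members)
        w.append(len(members) / n)
        mu.append(m)
        var.append(max(sum((x - m) ** 2 for x in members) / len(members), var_floor))
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            total = p[0] + p[1] + 1e-300  # guard against division by zero
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance of each component
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                         var_floor)
            w[k] = nk / n
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, w, mu, var


def accuracy(labels, truth):
    """Agreement with the true groups, allowing for swapped cluster ids."""
    agree = sum(l == t for l, t in zip(labels, truth)) / len(truth)
    return max(agree, 1 - agree)


# Synthetic beta values (assumption): a tight group and a widely spread group.
rng = random.Random(0)
truth = [0] * 300 + [1] * 300
xs = [min(1.0, max(0.0, rng.gauss(0.2, 0.03))) for _ in range(300)]
xs += [min(1.0, max(0.0, rng.gauss(0.6, 0.15))) for _ in range(300)]

km_labels, centroids = kmeans_1d(xs)
gmm_labels, weights, means, variances = gmm_1d(xs, km_labels)

print("K-means accuracy:", round(accuracy(km_labels, truth), 3))
print("GMM accuracy:    ", round(accuracy(gmm_labels, truth), 3))
```

The sketch is kept in the standard library so it runs anywhere. On real, high-dimensional methylation matrices one would more likely reach for an established implementation such as scikit-learn's `KMeans` and `GaussianMixture`, and any accuracy difference would depend on the data and on how cluster labels are matched to known tumor types.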