Sechenov Scientists Reveal New Technologies for Data Analysis in Medicine

Sechenov researchers examined the use of machine learning and big data in personalised medicine, and in oncology in particular. They pointed out a way to make the algorithms more flexible and proposed a new method of data processing. The article was published in Frontiers in Oncology.

Although the basic principles of machine learning were formulated more than half a century ago, these algorithms have become relatively widespread in medicine only over the past 20 years. Before that, they had already changed decision-making in engineering, banking, agriculture, and defense. The turning point for medicine came with the emergence of methods for collecting large amounts of data, including information about DNA (genome), RNA (transcriptome), proteins (proteome), and all the compounds taking part in metabolism in a cell (metabolome).

“Without the use of advisory systems, targeted antitumor medicines account for around 30–40% of success, and it passes for normal in the present-day clinical practice. Our technique has a success rate of more than 70%. To date, our laboratory is a world leader,” said Nicolas Borisov, coauthor of the paper and principal research scientist of the Institute for Personalized Medicine, Sechenov University.

Machine learning algorithms are based on creating a mathematical model and adjusting it using a training dataset (e.g. information about the patient’s condition, the measures taken, and, above all, treatment results). The model is then applied to predict the outcome of new cases (the test dataset). In medicine these algorithms are being introduced quite slowly because the data are complex and often inadequate. For instance, modern methods of DNA and RNA sequencing make it possible to find many more features (gene mutations) than there are patients examined. In this case substantial data processing and the combination of multiple datasets are required.
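The train-then-predict cycle described above can be illustrated with a minimal sketch. It assumes a toy nearest-centroid classifier and invented numbers (four patients, six features, so more features than patients); the function names `fit_centroids` and `predict` and the labels are hypothetical and are not taken from the paper.

```python
from statistics import mean

def fit_centroids(X_train, y_train):
    """Adjust the model: one mean feature vector (centroid) per outcome label."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        # zip(*rows) iterates over feature columns of this group
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Predict the outcome whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Toy training dataset (invented values): 4 patients, 6 features each
X_train = [
    [5.1, 0.2, 3.3, 1.0, 0.9, 2.0],
    [4.8, 0.1, 3.0, 1.2, 1.1, 2.2],
    [1.0, 2.9, 0.5, 1.1, 1.0, 2.1],
    [1.3, 3.1, 0.4, 0.9, 1.2, 1.9],
]
y_train = ["responder", "responder", "non-responder", "non-responder"]

model = fit_centroids(X_train, y_train)

# Test dataset: new cases the model has not seen
X_test = [
    [4.9, 0.3, 3.1, 1.0, 1.0, 2.0],
    [1.1, 3.0, 0.6, 1.0, 1.1, 2.0],
]
predictions = [predict(model, x) for x in X_test]
print(predictions)
```

Note that the last three features barely differ between the two groups; with so few patients, such uninformative features are exactly what feature selection is meant to remove.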

In personalised medicine, two types of data are used to predict the outcome of a chosen therapy. The first comprises the patient’s gender and age, medical history, risk factors, results of clinical laboratory tests, and functional diagnostic data (ECG, EEG, etc.). The second is multi-omics data (the structure and variety of DNA, RNA, proteins, and metabolic products). This second group can be compared with the results of research on cell cultures, which show how the activity (expression) of genes changes in response to medication.

The application of machine learning algorithms involves three main steps:

– Data preparation and compilation of the training dataset. Here the researcher faces data deficiency (there are more features than patients) and has to select the most crucial indicators. These could be genes encoding a particular enzyme, or the mutations that separate different groups of patients most accurately.
– Applying the algorithm. Scientists may either choose a suitable classification or clustering method or combine several of them.
– Processing the test dataset and evaluating the results.

The authors of the article proposed a new way to reduce the number of features, FloWPS (FLOating Window Projective Separator). This algorithm excludes from the analysis the features that are not relevant for each particular sample, and thus provides a more flexible and responsive set of characteristics. Although the time needed to execute the algorithms grows rapidly as the number of samples increases, this method can significantly improve the accuracy of different machine learning algorithms and will be useful in the analysis of various biomedical data.
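The idea of choosing features separately for each sample can be sketched as follows. This is not the published FloWPS implementation: the relevance criterion used here (keep a feature only if the new sample's value lies within the range seen in the training data) is an assumption made for illustration, and the names `relevant_features` and `project` and all numbers are hypothetical.

```python
def relevant_features(X_train, x_new):
    """Indices of features whose value in x_new lies inside the training range.

    Assumed criterion for illustration only, not the rule from the paper.
    """
    keep = []
    for j, column in enumerate(zip(*X_train)):
        if min(column) <= x_new[j] <= max(column):
            keep.append(j)
    return keep

def project(row, keep):
    """Restrict one sample to the selected feature indices."""
    return [row[j] for j in keep]

# Toy training data (invented values): 3 patients, 3 features
X_train = [
    [1.0, 10.0, 0.5],
    [2.0, 12.0, 0.7],
    [3.0, 11.0, 0.6],
]

sample_a = [2.5, 11.5, 0.9]  # feature 2 lies outside the training range [0.5, 0.7]
sample_b = [0.2, 10.5, 0.6]  # feature 0 lies outside the training range [1.0, 3.0]

keep_a = relevant_features(X_train, sample_a)
keep_b = relevant_features(X_train, sample_b)
print(keep_a)  # [0, 1]
print(keep_b)  # [1, 2]

# Each sample gets its own reduced view of the training data
X_train_a = [project(row, keep_a) for row in X_train]
X_train_b = [project(row, keep_b) for row in X_train]
```

In this sketch the two samples end up with different feature sets, so any downstream classifier would have to be refit on the reduced training data for each new sample.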

The research was conducted together with specialists from the Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry and OmicsWay Corporation (USA).