# Thesis: Analysis of Embryo Genetic Scores

This thesis was presented in the Faculty of Exact and Natural Sciences of the
University of Buenos Aires (UBA) in 2022. Itâ€™s a Bioinformatics and
Biostatistics analysis of several kinds of genetic scores associated to
disease, mainly utilised to diagnose embryos during *in vitro* fertilization.

Some results and figures are showcased here.

## Theoretical description of genetic data under different disorders

We described the expected distribution of a genetic data statistic called BAF under different disorders and levels of mosaicism, which means different percentages of mutated cells in a sample.

We described the expected behavior of a single measure $\theta$ that helped designing statistical tests to easily detect these disorders.

## Simulation of genetic data with different degrees of mosaicism

We simulated genetic data for different levels mosaicism, based on the theoretical framework developed for the thesis.

## Performance and empirical distribution of a new statistic

We developed and described a new statistic to detect aneuploidies in embryos, leveraging on a robust measure of dispersion called median absolute deviation (MAD).

## Performance of testing with overlapping-windows of data

We described a problem with non-overlapping windows of data when searching for genetic disorders that span few megabases in a chromosome (that means, mutations that span a relatively small region).

## Estimator performance evaluation

We evaluated the performance of a new estimator of the level of mosaicism with simulated embryo samples.

## Complex patterns of DNA contamination

We analyzed the BAF statistic under different levels of DNA mixture (or â€ścontaminationâ€ť) between mother and embryo.

## Genomic panel target distribution

We analyzed the distribution of â€śprobesâ€ť or targets in a genomic panel, with emphasis in the distribution of the size of the gaps with no data left by the panel.

## Anticorrelation between Polygenic Risk Scores

We analyzed the correlation of Polygenic Risk Scores (PRS), a kind of genetic score computed from hundreds or thousands of mutations in the genome. We found several pairs of diseases, like rheumatoid arthritis and multiple sclerosis, with negatively correlated scores. An example is given in the figure below.

## Variance trajectories of sub-scores

We analyzed the proportion of variance $\mathcal{V}_{(k)}$ captured by the
genetic scores when keeping sub-selections of the $k$ strongest-effect
mutations. These captured-variance trajectories help quantify the degree of
**polygenicity** of a given disease. Low polygenicity means that
few mutations determine most of the risk.

## Graph of diseases with correlated genetic risk

We found a surprising â€śnaturalâ€ť grouping of autoimmune diseases based on the correlation between their polygenic scores.

## Simulation of embryos and performance of embryo-selection strategies

We simulated genetic data from embryos and analyzed the results of selecting embryos based on their genetic scores. A potential problem was described where embryos with low genetic score of one disease might have increased risk of a correlated disease.

## Relation between increase in risk and number needed to harm

A knwon metric in epidemiology called Number Needed to Harm (NNH) is plotted against the increase in absolute risk (IRA) for several pairs of anticorrelated diseases. This illustrates that between 250 and 1250 couples applying a PRS-based embryo selection might be enough to produce unwanted new cases of some diseases.

## Heatmap of number of datapoints per genomic region

The number of SNPs in each window of 5 Mb (megabases, a kind of genetic distance) are plotted, which can be thought of as the number of datapoints for a given test.