## Compare true and imputed genotypes by calculating R-squared

When performing imputation analyses, it is often useful to obtain measures for the imputation accuracy. Imputation typically estimates genotypes which were not observed in a genome-wide association study (GWAS) using whole-genome sequencing data as a reference, where genotypes that are missing in the sample are imputed based on the similarity of the observed genotypes to the haplotypes in the reference. To evaluate the accuracy of imputation, it is necessary to know the true genotypes which are imputed.

In my current work, I use simulations to produce a complete set of ‘truth’ genotypes, which I thin down to a GWAS sample of typical size; that is with genetic variants that are typed in a typical GWAS (in my case I use a sample obtained with an Illumina Omni microarray). The removed genotypes are then re-imputed using an external reference panel, using IMPUTE2 as imputation software. As I want to evaluate the imputation accuracy, I can use the ‘truth’ (i.e. the genotypes that were previously removed) to determine how well the removed genotypes were estimated/imputed. For this, I am calculating R-squared in simple linear regression. R-squared, which is also called the coefficient of determination, is a measure for how well the data fits a statistical model (i.e. the linear regression) and is calculated as the square of the correlation coefficient.

As there are thousands/millions of imputed genotypes, I aggregate them by their corresponding minor allele frequency (MAF), where MAF is obtained from the (simulated) population sample. I define a frequency interval (or “MAF bin”) in which genotypes are pooled to calculate R-squared.

Such calculations can be easy in R, but require some more difficult data handling and, of course, some long computation time and much memory. Hence, I coded a small programme written in C++ (see code below) to speed up computation while being memory efficient.

This simple algorithm proceeds as follows:

1. Index ‘truth’ genotypes, calculate their MAF, and retain variants whose MAF fall within the defined interval.
2. Read estimated genotypes which match to those retained in (1) and calculate allele dosages.
3. Read ‘truth’ genotypes which match to those retained in (2) and calculate allele dosages.
4. Aggregate both ‘truth’ and estimated allele dosages and calculate R-squared.

The reason why ‘truth’ genotypes are read twice by the programme is memory efficiency. If all ‘truth’ allele dosages would be stored in the first round, computer memory would quickly be overburdened while giving only a little increase in speed. It is therefore more efficient to first index variants which fall within the defined interval.

The programme will match genetic variants based on their physical position on a chromosome and their alleles (reference and alternative), which I found to be more useful and less prone to error than simply matching variant IDs.

As it can be quite work-intense to calculate aggregated R-squared values for separate MAF intervals, the programme will compute R-squared for multiple intervals simultaneously, which requires to specify all MAF intervals in an external file.

The programme computes R-squared from allele dosages, and not from the actual genotype triplet. Genotypes are usually presented as three values; $\{g_{AA}, g_{AB}, g_{BB}\}$. Allele dosage is calculated as $d=g_{AB} + 2g_{BB}$. If the genotype is $[0, 0, 1]$, then its dosage is $d=2$; or if the genotype probabilities (e.g. for imputed genotypes) are $\{0.1, 0.3, 0.6\}$ then its dosage is $d=1.5$. Note that the genotype triplet always sums to one, and dosages can range between 0 and 2.

Note: The genotype file format required for this programme is the one used by IMPUTE2 and other software; see here.

The input files required by the programme, together with their respective command line options, are:

• -i: a file specifying the MAF intervals/bins
• -t: the ‘truth’ genotypes in .gen format
• -e: the estimated genotypes, also in .gen format
• -o: name of the output file to be generated
• [optional] -l: a file specifying the variants to be included, in .legend format (see here; option “-l”)

The optional argument “-l” limits the output to those variants contained in the input file. The input file is in .legend format, which contains four columns, and a header, for (1) rsID, (2) position, (3) reference allele, and (4) alternative allele.

After the programme has been compiled, to run it, the following command can be executed on the command line:

```./rsquared \
-i intervals.txt \
-t simulated.gen \
-e imputed.gen \
-l variants-to-include.legend \
-o rsquared.txt```

Prior to this, the file intervals.txt has to be generated. This can be done by executing the following code on the command line:

```INT_FILE="intervals.txt"
echo "0.000 0.005" >  \$INT_FILE
echo "0.005 0.010" >> \$INT_FILE
echo "0.01 0.02" >> \$INT_FILE
echo "0.02 0.03" >> \$INT_FILE
echo "0.03 0.04" >> \$INT_FILE
echo "0.04 0.05" >> \$INT_FILE
echo "0.05 0.10" >> \$INT_FILE
echo "0.1 0.2" >> \$INT_FILE
echo "0.2 0.3" >> \$INT_FILE
echo "0.3 0.4" >> \$INT_FILE
echo "0.4 0.5" >> \$INT_FILE```

The output file (specified by “-o”) contains the interval (“mac_from” and “maf_to”), the calculated R-squared value (“Rsquared”), and the number of variants (“n_variants”) found for each MAF bin on each row. For example, it can look as follows:

```maf_from maf_to Rsquared n_variants
0 0.005 0.903927 15669
0.005 0.01 0.922476 14604
0.01 0.02 0.927068 18255
0.02 0.03 0.954275 11195
0.03 0.04 0.964097 7883
0.04 0.05 0.972888 6174
0.05 0.1 0.979746 20360
0.1 0.2 0.985799 23347
0.2 0.3 0.985891 18866
0.3 0.4 0.982349 15534
0.4 0.5 0.980174 15147```

The programme took about 10min on a relatively new machine (2.7 Ghz Intel Core i7) for ‘truth’ and estimated genotype files of about 12 Gb size each.