Compare true and imputed genotypes by calculating R-squared

When performing imputation analyses, it is often useful to obtain measures for the imputation accuracy. Imputation typically estimates genotypes which were not observed in a genome-wide association study (GWAS) using whole-genome sequencing data as a reference, where genotypes that are missing in the sample are imputed based on the similarity of the observed genotypes to the haplotypes in the reference. To evaluate the accuracy of imputation, it is necessary to know the true genotypes which are imputed.

In my current work, I use simulations to produce a complete set of ‘truth’ genotypes, which I thin down to a GWAS sample of typical size; that is with genetic variants that are typed in a typical GWAS (in my case I use a sample obtained with an Illumina Omni microarray). The removed genotypes are then re-imputed using an external reference panel, using IMPUTE2 as imputation software. As I want to evaluate the imputation accuracy, I can use the ‘truth’ (i.e. the genotypes that were previously removed) to determine how well the removed genotypes were estimated/imputed. For this, I am calculating R-squared in simple linear regression. R-squared, which is also called the coefficient of determination, is a measure for how well the data fits a statistical model (i.e. the linear regression) and is calculated as the square of the correlation coefficient.

As there are thousands/millions of imputed genotypes, I aggregate them by their corresponding minor allele frequency (MAF), where MAF is obtained from the (simulated) population sample. I define a frequency interval (or “MAF bin”) in which genotypes are pooled to calculate R-squared.

Such calculations can be easy in R, but require some more difficult data handling and, of course, some long computation time and much memory. Hence, I coded a small programme written in C++ (see code below) to speed up computation while being memory efficient.

The programme rsquared is written in C++ and has to be compiled on the command line by typing

g++ rsquared.cpp -O3 -o rsquared

This simple algorithm proceeds as follows:

  1. Index ‘truth’ genotypes, calculate their MAF, and retain variants whose MAF fall within the defined interval.
  2. Read estimated genotypes which match to those retained in (1) and calculate allele dosages.
  3. Read ‘truth’ genotypes which match to those retained in (2) and calculate allele dosages.
  4. Aggregate both ‘truth’ and estimated allele dosages and calculate R-squared.

The reason why ‘truth’ genotypes are read twice by the programme is memory efficiency. If all ‘truth’ allele dosages would be stored in the first round, computer memory would quickly be overburdened while giving only a little increase in speed. It is therefore more efficient to first index variants which fall within the defined interval.

The programme will match genetic variants based on their physical position on a chromosome and their alleles (reference and alternative), which I found to be more useful and less prone to error than simply matching variant IDs.

As it can be quite work-intense to calculate aggregated R-squared values for separate MAF intervals, the programme will compute R-squared for multiple intervals simultaneously, which requires to specify all MAF intervals in an external file.

The programme computes R-squared from allele dosages, and not from the actual genotype triplet. Genotypes are usually presented as three values; \{g_{AA}, g_{AB}, g_{BB}\}. Allele dosage is calculated as d=g_{AB} + 2g_{BB}. If the genotype is [0, 0, 1], then its dosage is d=2; or if the genotype probabilities (e.g. for imputed genotypes) are \{0.1, 0.3, 0.6\} then its dosage is d=1.5. Note that the genotype triplet always sums to one, and dosages can range between 0 and 2.

Note: The genotype file format required for this programme is the one used by IMPUTE2 and other software; see here.

The input files required by the programme, together with their respective command line options, are:

  • -i: a file specifying the MAF intervals/bins
  • -t: the ‘truth’ genotypes in .gen format
  • -e: the estimated genotypes, also in .gen format
  • -o: name of the output file to be generated
  • [optional] -l: a file specifying the variants to be included, in .legend format (see here; option “-l”)

The optional argument “-l” limits the output to those variants contained in the input file. The input file is in .legend format, which contains four columns, and a header, for (1) rsID, (2) position, (3) reference allele, and (4) alternative allele.

After the programme has been compiled, to run it, the following command can be executed on the command line:

./rsquared \
-i intervals.txt \
-t simulated.gen \
-e imputed.gen \
-l variants-to-include.legend \
-o rsquared.txt

Prior to this, the file intervals.txt has to be generated. This can be done by executing the following code on the command line:

echo "0.000 0.005" >  $INT_FILE
echo "0.005 0.010" >> $INT_FILE
echo "0.01 0.02" >> $INT_FILE
echo "0.02 0.03" >> $INT_FILE
echo "0.03 0.04" >> $INT_FILE
echo "0.04 0.05" >> $INT_FILE
echo "0.05 0.10" >> $INT_FILE
echo "0.1 0.2" >> $INT_FILE
echo "0.2 0.3" >> $INT_FILE
echo "0.3 0.4" >> $INT_FILE
echo "0.4 0.5" >> $INT_FILE

The output file (specified by “-o”) contains the interval (“mac_from” and “maf_to”), the calculated R-squared value (“Rsquared”), and the number of variants (“n_variants”) found for each MAF bin on each row. For example, it can look as follows:

maf_from maf_to Rsquared n_variants
0 0.005 0.903927 15669
0.005 0.01 0.922476 14604
0.01 0.02 0.927068 18255
0.02 0.03 0.954275 11195
0.03 0.04 0.964097 7883
0.04 0.05 0.972888 6174
0.05 0.1 0.979746 20360
0.1 0.2 0.985799 23347
0.2 0.3 0.985891 18866
0.3 0.4 0.982349 15534
0.4 0.5 0.980174 15147

The programme took about 10min on a relatively new machine (2.7 Ghz Intel Core i7) for ‘truth’ and estimated genotype files of about 12 Gb size each.

Reduce size of IMPUTE2 genotype files

I am working a lot with IMPUTE2 for genotype imputation. Imputation generally predicts not-typed genotypes in a study sample by filling up the missing sites using a reference panel of sequence data (phased haplotypes). The file size of the data output, however, can become quite large. As I am working with simulated samples, of which I have thousands, storage is getting complicated, even when gzip compressed. To mitigate this, I wrote a short C++ code which reduces the overall size of a GEN file, and which is able to un-reduce the file back to its original.

The genotype format, which is used in IMPUTE2 and some other programmes for conducting genome-wide association studies (e.g. SNPTEST, SHAPEIT, and HAPGEN), is described here.

The file size is reduced simply by removing one of the three genotypes that are given for each sample. Because the three genotypes always sum to 1, removing one value implicates that it can be restored by subtracting the two other from 1. The programme (see below) removes the second genotype (heterozygote), which is restored when un-reducing/expanding the previously reduced GEN file.

The programme ReduceGen.cpp is written in C++ and compiled using

g++ ReduceGen.cpp -O3 -o reducegen

The programme is executed on the command line, and three options (-g, -r, and -o) must be specified:

# options:
# -g : GEN file --> reduce
# -r : previously reduced GEN file --> un-reduce/expand
# -o : name of output file 

# reduce GEN file
./reducegen -g imputed.gen -o imputed.rgen

# expand previously reduced GEN file (.rgen)
./reducegen -r imputed.rgen -o imputed.gen

I tested the programme with a 6.6G large test file (output of IMPUTE2). It was reduced to 4.3G, which took 2min and 8sec. Then, to expand the reduced file, it took 16min and 22sec. The latter is unfortunately much longer, but non-avoidable, because the genotype values or the imputed genotype probabilities need to be converted from text strings to numerical values, which takes some time in a file with thousands of samples and thousands/millions of variants.

The use the programme, simply copy&paste the following code in any text editor and compile the file on the command line:

//  reducegen_cpp
//  Created by Patrick Albers on 08/01/2014.
//  Copyright (c) 2014 Patrick Albers. All rights reserved.

#include <iostream>
#include <fstream>
#include <iomanip>
#include <vector>
#include <ctime>
#include <cmath>

using namespace std;

#define N_GENOTYPE  3
#define N_REDUCED   2
#define N_LEGEND    5

typedef struct {
	char * snp;
	char * id;
	unsigned long int pos;
	char * ref;
	char * alt;
} Variant;

typedef struct {
	double AA;
	double AB;
	double BB;
} Genotype;

class ReadFile {
	FILE * fp;

	char * line;
	ssize_t size;
	unsigned long n;
	ReadFile(string &filename) {
		this->fp = fopen(filename.c_str(), "r");
		if (this->fp == NULL) {
			cerr << "Cannot open file: " << filename << endl;
		this->line = NULL;
		this->n = 0;
	~ReadFile() {
	bool fetch() {
		size_t len = 0;
		this->size = getline(&this->line, &len, this->fp);
		if (this->size != -1) {
			return true;
		return false;

// fast convert string to double
// from: naive()
double str2dbl(const char *p) {
    double r = 0.0;
    bool neg = false;
    if (*p == '-') {
        neg = true;
    while (*p >= '0' && *p <= '9') {
        r = (r*10.0) + (*p - '0');
    if (*p == '.') {
        double f = 0.0;
        int n = 0;
        while (*p >= '0' && *p <= '9') {
            f = (f*10.0) + (*p - '0');
        r += f / pow(10.0, n);
    if (neg) {
        r = -r;
    return r;

int main(int argc, const char * argv[]) {
	const char *del = " \t\n";
	string filename_in;
	string filename_out;
	bool reduce;
	ofstream write;
	time_t t;
	// parse arguments
	while (argc > 1 && argv[1][0] == '-') {
		switch (argv[1][1]) {
			case 'g':
				while (argc > 2 && argv[2][0] != '-') {
					filename_in = argv[2];
					reduce = true;
			case 'r':
				while (argc > 2 && argv[2][0] != '-') {
					filename_in = argv[2];
					reduce = false;
			case 'o':
				while (argc > 2 && argv[2][0] != '-') {
					filename_out = argv[2];
	cout << ((reduce) ? "Reducing '" : "Expanding '") << filename_in << "' ..." << endl;
	ReadFile read (filename_in);;
	if (write.bad()) {
		cerr << "Cannot write to file: " << filename_out << endl;
	t = time(0);
	// reduce/expand each line
	while (read.fetch()) {
		unsigned long i = 0;
		unsigned long j;
		char * ptr;
		char * tok;
		double AA;
		double AB;
		double BB;
		for(tok = strtok_r(read.line, del, &ptr); tok != NULL; tok = strtok_r(NULL, del, &ptr)) {
			if (i < N_LEGEND) {
				write << tok << ((i < N_LEGEND - 1) ? " ": "");
			} else {
				if (reduce) {
					j = (i - N_LEGEND) % N_GENOTYPE;
					if (j != 1) {
						write << " " << tok;
				} else {
					j = (i - N_LEGEND) % N_REDUCED;
					if (j == 0) {
						AA = str2dbl(tok);
					} else {
						BB = str2dbl(tok);
						AB = 1.0 - AA - BB;
						write << " " << AA << " " << setprecision(4) << AB << " " << BB;
		write << endl;
		if ((reduce && j != N_GENOTYPE - 1) || (!reduce && j != N_REDUCED - 1)) {
			cerr << "Sample genotypes are not comlpete in line " << read.n << endl;
	t = difftime(time(0), t);
	cout << "# lines: " << read.n << endl;
	cout << "Done in " << floor(t/60) << "min and " << floor(t%60) << "sec" << endl;
	return 0;