Table 2

Detailed inspection of similar file pairs

| Dataset | File pair | File size (MB) | δ | Remarks |
|---|---|---|---|---|
| E61/fa | Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_SSTO.fa | 166.04 | 0.015 | These are two alternative haplotype "patch" files for the same chromosome locus. The dataset contains 11 other examples of similar file pairs with δ < 0.06 (when unpacked). All are related to the alternative haplotypes for the MHC locus. The next most similar pair of files has δ > 0.8. |
| | Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_MANN.fa | 166.06 | | |
| GPL570/cel | GSM405175.CEL | 12.93 | 8e-6 | The second file differs from the first by a single Affymetrix probe measurement. According to GEO metadata, the two files are simply different packagings of the same experimental data by two researchers. The GPL570 dataset contains 9 other examples of similar file pairs with δ < 0.002. The next most similar pair of files has δ > 0.3. |
| | GSM341406.CEL | 12.93 | | |
| GPL570/cel.gz | GSM405175.CEL.gz | 4.31 | 6e-4 | A gzip-compressed version of the pair above. The same remarks apply. The most similar pair of actually different data files has δ > 0.9. |
| | GSM341406.CEL.gz | 4.31 | | |
| BioC2.7/BSgenome/u | BSgenome.Athaliana.TAIR.01222004/extdata/chr1.rda | 29.04 | 2e-4 | Consecutive versions of the A. thaliana reference genome. The next most similar file pair in this dataset has δ > 0.5. Note that the compressed versions of the same files have δ > 0.9. |
| | BSgenome.Athaliana.TAIR.04232008/extdata/chr1.rda | 29.04 | | |

The table lists the suspiciously similar pairs of files from the studied datasets.
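
The table does not define δ. As a rough illustration only, the sketch below assumes that δ is the fraction of byte positions at which two files differ; that definition, the helper name `byte_dissimilarity`, and the synthetic data are assumptions made for this example, not details taken from the table. It compares a pair of buffers that differ in a single byte, first as raw bytes and then after gzip compression, which is the kind of comparison the cel and cel.gz rows report.

```python
import gzip


def byte_dissimilarity(a: bytes, b: bytes) -> float:
    """Fraction of byte positions at which a and b differ (assumed meaning of δ).

    Positions that exist in only one input (length mismatch) count as differing.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    differing = sum(x != y for x, y in zip(a, b))
    differing += abs(len(a) - len(b))
    return differing / longest


# Synthetic stand-in for two data files that differ in one measurement.
original = bytes(range(256)) * 1000      # 256,000 bytes
changed = bytearray(original)
changed[128_000] ^= 0xFF                 # flip exactly one byte
modified = bytes(changed)

# Raw comparison: one differing byte out of 256,000.
print("raw:", byte_dissimilarity(original, modified))

# Compressed comparison: mtime=0 (Python 3.8+) keeps the gzip header identical,
# so any difference comes from the compressed stream and its checksum.
# The resulting value depends on the data and the compressor.
packed_a = gzip.compress(original, mtime=0)
packed_b = gzip.compress(modified, mtime=0)
print("gzip:", byte_dissimilarity(packed_a, packed_b))
```

For real files the same function could be applied to the bytes read from disk, although reading two files of more than 100 MB fully into memory is a simplification made here for brevity.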

Tretyakov et al. BMC Genomics 2013 14(Suppl 2):S8   doi:10.1186/1471-2164-14-S2-S8
