The md5sum of the same file can differ after gzipping

I received the latest dataset from a collaborator yesterday; some of its files had already been included in the previous version. However, “md5sum -c md5sum.text” failed for those files, which really threw me. On further checking, I found no differences at all between the decompressed files, so it was only the compression step that changed the MD5!

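An email from Ray pointed me in the right direction. I found the following lines on the gzip wiki page, which explain it: the header carries a timestamp, so compressing identical data at different times can produce different .gz files. Now I can sleep soundly tonight~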

“gzip” is often also used to refer to the gzip file format, which is:

  • a 10-byte header, containing a magic number, a version number and a timestamp
  • optional extra headers, such as the original file name,
  • a body, containing a DEFLATE-compressed payload
  • an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
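To see the header timestamp at work, here is a minimal Python sketch. It is only an illustration, not what the collaborator actually ran: the helper names `gzip_bytes` and `md5_hex` and the sample payload are made up for this example. It compresses the same bytes twice with different timestamps and compares the checksums.

```python
import gzip
import hashlib
import io


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Gzip `data` in memory, writing `mtime` into the header timestamp field."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical payload standing in for a real data file.
payload = b"ACGT" * 1000

# Same content, compressed with two different header timestamps.
first = gzip_bytes(payload, mtime=1)
second = gzip_bytes(payload, mtime=2)

print("compressed MD5s equal:  ", md5_hex(first) == md5_hex(second))
print("decompressed data equal:", gzip.decompress(first) == gzip.decompress(second))
```

The two compressed blobs differ only in bytes 4 to 7 of the header (the timestamp), yet that is enough to change the MD5. When checksums of the .gz files themselves must be reproducible, the gzip command-line option `-n` (do not store the original name and timestamp) is one way to avoid this; otherwise compare checksums of the decompressed data.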

About the Flag in the SAM format [Sequence Alignment/Map]

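SAM is short for Sequence Alignment/Map. Its purpose is to give everyone a common format for sequence alignments, which makes downstream processing easier. See SAM.pdf, the official documentation.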

ILLUMINA-57021F:5:1:1361:5913#0 81 chr9 103745559 0 40M chr11 51579596 0 ATTTCCTTCTCCTGCCTGATTGCCCTGGCCAGAACTTCCA bT^bbTT`_ccac`caa^bccccccccdddddc^Yad XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2172 XM:i:0 XO:i:0 XG:i:0 MD:Z:40

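This is one line from a SAM-format alignment result provided by Fawn (I won't go into what each column means here). The second column is the SAM Flag, which describes the alignment mode, orientation, and other information bit by bit. The official documentation puts it this way: “Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:”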


[table id=1 /]

1. Flag 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.
2. If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should be present and 0x40 and 0x80 are both zero.







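In downstream processing, checking this flag means reading the alignment information bit by bit. To test the n-th bit, compute 2^(n-1) AND Flag: a nonzero result (equal to 2^(n-1)) means true, and 0 means false.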
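As a rough sketch of that bit test, here is how the flag 81 from the example line could be decoded in Python. The names `SAM_FLAG_BITS`, `bit_is_set`, and `decode_flag` are made up for this illustration, and the bit descriptions are my own paraphrase of the SAM specification rather than a copy of the table above, so check them against the official document before relying on them.

```python
# Bit masks paired with short descriptions (paraphrased, see the note above).
SAM_FLAG_BITS = [
    (0x001, "read is paired"),
    (0x002, "read is mapped in a proper pair"),
    (0x004, "read is unmapped"),
    (0x008, "mate is unmapped"),
    (0x010, "read is on the reverse strand"),
    (0x020, "mate is on the reverse strand"),
    (0x040, "first read in the pair"),
    (0x080, "second read in the pair"),
    (0x100, "alignment is not primary"),
    (0x200, "read fails quality checks"),
    (0x400, "read is a PCR or optical duplicate"),
]


def bit_is_set(flag: int, n: int) -> bool:
    """Test the n-th bit (1-based) of flag, i.e. 2^(n-1) AND flag."""
    return (flag & (1 << (n - 1))) != 0


def decode_flag(flag: int) -> list:
    """Return the descriptions of every bit that is set in flag."""
    return [name for mask, name in SAM_FLAG_BITS if flag & mask]


# Flag taken from the second column of the example line.
flag = 81
for description in decode_flag(flag):
    print(description)
```

Since 81 = 64 + 16 + 1 = 0x40 + 0x10 + 0x01, the read in the example is paired, aligned to the reverse strand, and the first read of its pair.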


Crazy DNA?

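As "Crazy DNA" suggests, I am, to put it bluntly, just a biology person. It is only that, after getting fed up with a life of digging up the earth, scrubbing tubes, and running gels, I now make my living from bioinformatics, a field that wears the vest of mathematics, basks in the halo of IT, and has exploded thanks to sequencing services that get closer to cabbage prices every day.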



