The md5sum of the same file can change after it is gzipped

I got the latest dataset from a collaborator yesterday; some of its files were already included in the previous version. However, “md5sum -c md5sum.text” failed on them, which really threw me. On further checking I found no differences at all between the decompressed files, so it was only the compression step that changed the MD5!

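An email from Ray pointed me in the right direction. I found the following lines on the wiki page for gzip, and they explain it: the header carries a timestamp, so the same data compressed at a different time gives a different file. I can sleep soundly tonight~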

“gzip” is often also used to refer to the gzip file format, which is:

  • a 10-byte header, containing a magic number, a version number and a timestamp
  • optional extra headers, such as the original file name,
  • a body, containing a DEFLATE-compressed payload
  • an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
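To convince myself, here is a minimal Python sketch of the effect. It assumes the timestamp in the header is the only thing that differs between two compressions; the helper names (gzip_bytes, md5_hex, header_mtime) and the payload are made up for illustration and are not part of the dataset or any tool mentioned above.

```python
import gzip
import hashlib
import io
import struct


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data to gzip format with an explicit header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def header_mtime(gz_data: bytes) -> int:
    """Read the timestamp: bytes 4..7 of the header, little-endian uint32."""
    return struct.unpack("<I", gz_data[4:8])[0]


# Hypothetical payload standing in for one of the dataset files.
payload = b"ACGT" * 1000

first = gzip_bytes(payload, mtime=1_000_000_000)
second = gzip_bytes(payload, mtime=1_000_000_001)

# The compressed bytes differ, so their MD5 sums differ.
print(md5_hex(first) == md5_hex(second))
# The decompressed content is identical, so these MD5 sums match.
print(md5_hex(gzip.decompress(first)) == md5_hex(gzip.decompress(second)))
# The timestamps stored in the two headers.
print(header_mtime(first), header_mtime(second))
```

So the safer habit is to checksum the decompressed content, or to compress with `gzip -n`, which leaves the original name and timestamp out of the header.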

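About the Flag in the SAM format [Sequence Alignment/Map]

SAM is short for Sequence Alignment/Map. Its purpose is to give everyone a common format for sequence alignments, which makes downstream processing easier. SAM.pdf official documentation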

ILLUMINA-57021F:5:1:1361:5913#0 81 chr9 103745559 0 40M chr11 51579596 0 ATTTCCTTCTCCTGCCTGATTGCCCTGGCCAGAACTTCCA bT^bbTT`_ccac`caa^bccccccccdddddc^Yad XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2172 XM:i:0 XO:i:0 XG:i:0 MD:Z:40

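This is one line of a SAM alignment result provided by Fawn (I won't go into what each field means here). The second column is the SAM flag, which describes the read's alignment mode, orientation and so on, bit by bit. The official documentation puts it this way: “Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:”

A rant, feel free to ignore... [All this just to get a table to display in wp, and I'm still not in bed... So annoying, wp is far behind Joomla on this...]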

[table id=1 /]

1. Flag 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.
2. If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should be present and 0x40 and 0x80 are both zero.

The first note says that bits 2, 4, 6, 7 and 8 (0x02, 0x08, 0x20, 0x40 and 0x80) are only meaningful if bit 1 (0x01) is set to 1.
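A small sketch of that rule in Python, assuming the 11-bit flag described in the official documentation quoted above. The names PAIRED, PAIR_ONLY_MASK, pair_only_positions and stray_pair_bits are my own, purely for illustration.

```python
PAIRED = 0x01
# The flags that note 1 says are only meaningful together with 0x01.
PAIR_ONLY_MASK = 0x02 | 0x08 | 0x20 | 0x40 | 0x80


def pair_only_positions():
    """List the 1-based bit positions covered by PAIR_ONLY_MASK."""
    return [n for n in range(1, 12) if PAIR_ONLY_MASK & (1 << (n - 1))]


def stray_pair_bits(flag):
    """Return the pair-only bits that are set although 0x01 is absent."""
    if flag & PAIRED:
        return 0
    return flag & PAIR_ONLY_MASK


print(pair_only_positions())       # [2, 4, 6, 7, 8]
print(stray_pair_bits(81))         # 0, because 0x01 is set in 81
print(hex(stray_pair_bits(0x50)))  # 0x40, set without 0x01
```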

The 81 in the example is 00001010001 in binary (it has fewer than 11 bits, so it is padded with leading zeros) [conversion method here]

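From the table we can see that:

this alignment record comes from paired-end (PE) sequencing [bit 1 is 1]

the read aligns to the reference on the reverse strand [bit 5 is 1]

and the read itself is #1 of the pair [bit 7 is 1]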

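For downstream processing, checking the flag means reading the alignment information off bit by bit. To test bit n, compute 2^(n-1) AND Flag: a nonzero result (it equals 2^(n-1)) means true, and 0 means false.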
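Here is the same check as a minimal Python sketch. The bit descriptions in FLAG_BITS are my own paraphrase of the table in the official documentation (the 11-bit version used in this post), and FLAG_BITS, bit_is_set and decode_flag are hypothetical names, not part of any SAM tool.

```python
# (mask, description) pairs; the descriptions paraphrase the official table.
FLAG_BITS = [
    (0x001, "read is paired in sequencing"),
    (0x002, "read is mapped in a proper pair"),
    (0x004, "read itself is unmapped"),
    (0x008, "mate is unmapped"),
    (0x010, "read is on the reverse strand"),
    (0x020, "mate is on the reverse strand"),
    (0x040, "read is the first in the pair"),
    (0x080, "read is the second in the pair"),
    (0x100, "alignment is not primary"),
    (0x200, "read fails quality checks"),
    (0x400, "read is a PCR or optical duplicate"),
]


def bit_is_set(flag, n):
    """Test bit n (1-based) with 2^(n-1) AND flag; nonzero means set."""
    return (flag & (1 << (n - 1))) != 0


def decode_flag(flag):
    """Return the descriptions of all bits set in flag."""
    return [name for mask, name in FLAG_BITS if flag & mask]


flag = 81  # second column of the example line
print(format(flag, "011b"))  # 00001010001
print(bit_is_set(flag, 1), bit_is_set(flag, 5), bit_is_set(flag, 7))
for description in decode_flag(flag):
    print(description)
```

Note that 81 AND 16 gives 16, not 1, which is why the test compares the result against zero.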

I can't stay awake any longer, off to bed...

Crazy DNA?

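As "Crazy DNA" suggests, I am, to put it bluntly, a biology person. It's just that, after getting fed up with a life of digging in the dirt, scrubbing tubes and running gels, I now make my living from bioinformatics, a field that wears a mathematical vest, basks in the halo of IT, and has exploded thanks to sequencing services that get cheaper by the day.

I studied biology as an undergraduate, but I was so interested in computers that I went to the trouble of spending two more years studying computer science, which prepared me well for the bioinformatics I do now (to be honest, back then I wanted to give up biology and switch to IT… I still regret it… if you haven't gone into bioinformatics yet, there is still time to switch to IT!).

So what does bioinformatics actually do? My shallow understanding is this: first reduce a biological problem to a mathematical model, then implement the model as a computer program and run it on biological data, and finally arrive at a meaningful conclusion. That was probably too academic, so let me put it simply: bioinformatics analysis means facing biological phenomena where anything is possible, racking your brains like a mathematician, slaving away like a programmer, and in the end distilling it all into one PDF. Summed up in terms of effort and reward: the work is more exhausting than a programmer's, and the pay is lower than a migrant worker's!

Here, I'd like to use this blog to record my struggles in the ocean of data~

A quick explanation: nomel=reverse('lemon'), and that is where my domain name comes from!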