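PDB files downloaded from the RCSB PDB database normally contain the complete sequence information.
When such a file is opened in Chimera, residues missing from the structure are outlined with red boxes in the sequence view. In most cases the gaps are a few loops, which Chimera can rebuild (it calls Modeller to do so). Sometimes, however, we only have coordinates, and the PDB file does not carry the full sequence. In that case it is enough to load an external FASTA sequence under Tools > Sequence and associate the sequence with the structure. What this post covers is the more tedious route: writing the sequence information into the PDB file itself. According to the format specification, sequence information is stored in lines that start with SEQRES, laid out as follows: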
#from pdb database
Record Format
COLUMNS   DATA TYPE      FIELD      DEFINITION
------------------------------------------------------------------------------
 1 -  6   Record name    "SEQRES"   Record identifier at the start of the line.
 8 - 10   Integer        serNum     Serial number within the chain; starts at 1 and increases by 1 on each line.
12        Character      chainID    Chain the residues belong to.
14 - 17   Integer        numRes     Number of residues in the chain.
20 - 22   Residue name   resName    Residue name.
24 - 26   Residue name   resName    Residue name.
28 - 30   Residue name   resName    Residue name.
32 - 34   Residue name   resName    Residue name.
36 - 38   Residue name   resName    Residue name.
40 - 42   Residue name   resName    Residue name.
44 - 46   Residue name   resName    Residue name.
48 - 50   Residue name   resName    Residue name.
52 - 54   Residue name   resName    Residue name.
56 - 58   Residue name   resName    Residue name.
60 - 62   Residue name   resName    Residue name.
64 - 66   Residue name   resName    Residue name.
68 - 70   Residue name   resName    Residue name; each line holds up to 13 residue names.
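Because SEQRES is a fixed-width record, its fields are safer to read by column position than by splitting on whitespace. The following is a minimal sketch in POSIX shell and awk that pulls the fields out of one record using the column ranges listed above; `parse_seqres` is a hypothetical helper name, not part of any PDB tool.

```shell
# Split SEQRES records read from stdin into their fixed-width fields.
# Column numbers are 1-based, as in the record format table.
parse_seqres() {
    awk '/^SEQRES/ {
        ser   = substr($0, 8, 3) + 0     # columns 8-10: serNum
        chain = substr($0, 12, 1)        # column 12: chainID
        nres  = substr($0, 14, 4) + 0    # columns 14-17: numRes
        res   = substr($0, 20, 51)       # columns 20-70: up to 13 residue names
        sub(/ +$/, "", res)              # drop trailing padding
        printf "serNum=%d chainID=%s numRes=%d residues=%s\n", ser, chain, nres, res
    }'
}

printf 'SEQRES   6 A   76  THR LEU HIS LEU VAL LEU ARG LEU ARG GLY GLY\n' | parse_seqres
# prints: serNum=6 chainID=A numRes=76 residues=THR LEU HIS LEU VAL LEU ARG LEU ARG GLY GLY
```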
For example, the sequence information in the ubiquitin file 1ubq.pdb:
#from 1ubq.pdb
SEQRES   1 A   76  MET GLN ILE PHE VAL LYS THR LEU THR GLY LYS THR ILE
SEQRES   2 A   76  THR LEU GLU VAL GLU PRO SER ASP THR ILE GLU ASN VAL
SEQRES   3 A   76  LYS ALA LYS ILE GLN ASP LYS GLU GLY ILE PRO PRO ASP
SEQRES   4 A   76  GLN GLN ARG LEU ILE PHE ALA GLY LYS GLN LEU GLU ASP
SEQRES   5 A   76  GLY ARG THR LEU SER ASP TYR ASN ILE GLN LYS GLU SER
SEQRES   6 A   76  THR LEU HIS LEU VAL LEU ARG LEU ARG GLY GLY
The sequence of ubiquitin residues 1-76 is:
#from Uniprot
>sp|P0CG48|1-76
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
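The numRes field of every SEQRES line must equal the number of residues in the chain, so it helps to count the residues in the FASTA record rather than rely on a number typed by hand. A small sketch, assuming a single-record FASTA on stdin; `fasta_len` is a hypothetical helper name.

```shell
# Print the number of residues in a single-record FASTA stream.
# Header lines (starting with ">") and all whitespace are ignored.
fasta_len() {
    grep -v '^>' | tr -d ' \r\n' | wc -c | tr -d ' '
}

fasta_len <<'EOF'
>sp|P0CG48|1-76
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
EOF
# prints: 76
```

The result can be captured with `seqlen=$(fasta_len < file.fasta)`, where `file.fasta` stands for whatever file holds the sequence.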
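To reproduce the SEQRES records shown above, first convert the one-letter amino acid codes to three-letter codes. You can write your own lookup table for this, or use a website such as https://www.bioinformatics.org/sms2/one_to_three.html, and save the result to a file named 1to3.seq.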
#3 letter sequence
MetGlnIlePheValLysThrLeuThrGlyLysThrIleThrLeuGluValGluProSer
AspThrIleGluAsnValLysAlaLysIleGlnAspLysGluGlyIleProProAspGln
GlnArgLeuIlePheAlaGlyLysGlnLeuGluAspGlyArgThrLeuSerAspTyrAsn
IleGlnLysGluSerThrLeuHisLeuValLeuArgLeuArgGlyGly
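For the "write your own lookup table" option, one possible sketch in shell and awk follows. It covers only the 20 standard amino acids; mapping anything else to UNK is an assumption made here, and `one_to_three` is a hypothetical helper name. Unlike the website output above, it emits upper-case codes directly.

```shell
# Convert one-letter codes on stdin to concatenated three-letter codes.
# Letters outside the 20 standard amino acids become UNK (an assumption).
one_to_three() {
    tr -d ' \r\n' | awk '
    BEGIN {
        codes = "A ALA R ARG N ASN D ASP C CYS Q GLN E GLU G GLY H HIS I ILE"
        codes = codes " L LEU K LYS M MET F PHE P PRO S SER T THR W TRP Y TYR V VAL"
        n = split(codes, t, " ")
        for (i = 1; i < n; i += 2) map[t[i]] = t[i + 1]   # letter -> 3-letter code
    }
    {
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = toupper(substr($0, i, 1))
            out = out ((c in map) ? map[c] : "UNK")
        }
        print out
    }'
}

printf 'MQIFVKTLTG\n' | one_to_three
# prints: METGLNILEPHEVALLYSTHRLEUTHRGLY
```

Header lines are not handled here, so a FASTA file would need its `>` line removed first, for example with `grep -v '^>'`.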
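First strip the line breaks, then insert a line break after every 39 characters (13 residues), add a space after every 3 characters, and finally convert lower case to upper case. The output must end with a newline, which is what sed -e '$a\' takes care of. The \n in the sed replacement and the '$a\' trick rely on GNU sed. Save the result to the file 3.fas:
#bash
cat 1to3.seq|tr -d "\n"|sed 's/.\{39\}/&\n/g'|sed 's/.\{3\}/& /g'|sed -e '$a\'|tr a-z A-Z >3.fas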
Then generate the lines that start with SEQRES:
#bash
seqlen=76
n=1;while read -r line;do printf "SEQRES %3d A %4d  %s\n" "$n" "$seqlen" "$line";n=$((n+1));done<3.fas
The output matches the records in the PDB file shown earlier:
#result
SEQRES   1 A   76  MET GLN ILE PHE VAL LYS THR LEU THR GLY LYS THR ILE
SEQRES   2 A   76  THR LEU GLU VAL GLU PRO SER ASP THR ILE GLU ASN VAL
SEQRES   3 A   76  LYS ALA LYS ILE GLN ASP LYS GLU GLY ILE PRO PRO ASP
SEQRES   4 A   76  GLN GLN ARG LEU ILE PHE ALA GLY LYS GLN LEU GLU ASP
SEQRES   5 A   76  GLY ARG THR LEU SER ASP TYR ASN ILE GLN LYS GLU SER
SEQRES   6 A   76  THR LEU HIS LEU VAL LEU ARG LEU ARG GLY GLY
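The remaining step is to put these records into the coordinate file. A minimal sketch, assuming the SEQRES lines were saved to a separate file and that the coordinate file has nothing between its header lines and the first ATOM, HETATM or MODEL record that should come after SEQRES: the function below copies the PDB file and inserts the SEQRES block just before the first such record. `insert_seqres` and the file names are hypothetical, and the ATOM line in the demo is a placeholder, not real 1ubq data.

```shell
# Print the PDB file $2 with the SEQRES records from file $1 inserted
# before its first ATOM, HETATM or MODEL record.
insert_seqres() {
    awk -v seqres="$1" '
    !inserted && /^(ATOM|HETATM|MODEL)/ {
        while ((getline rec < seqres) > 0) print rec   # copy the SEQRES block
        close(seqres)
        inserted = 1
    }
    { print }' "$2"
}

# Demo on throwaway files with placeholder content.
tmp=$(mktemp -d)
printf 'SEQRES   1 A    2  MET GLN\n' > "$tmp/seqres.txt"
printf 'REMARK toy example\nATOM      1  N   MET A   1       0.000   0.000   0.000\nEND\n' > "$tmp/coords.pdb"
insert_seqres "$tmp/seqres.txt" "$tmp/coords.pdb"
rm -r "$tmp"
```

In the demo output the SEQRES line appears between the REMARK line and the ATOM line. If the input contains no coordinate records at all, nothing is inserted, so it is worth checking the result with `grep -c '^SEQRES'`.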