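RDKit database cartridge

This document is mainly a translation of The RDKit database cartridge, with some adjustments.

Introduction to the cartridge

The cartridge is built on PostgreSQL and adds support for molecular structure queries. You can think of it as a PostgreSQL plugin.

Note

It is recommended to learn the basics of PostgreSQL first.

Installing the cartridge database service

For database security, it is recommended to create a dedicated user account for the cartridge.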

conda install -c rdkit rdkit-postgresql

Creating and configuring the database

Initialize the data directory

[conda folder]/envs/my-rdkit-env/bin/initdb -D /folder/where/data/should/be/stored

Start the postgres service

[conda folder]/envs/my-rdkit-env/bin/postgres -D /folder/where/data/should/be/stored

Create the database and connect to it

createdb my_rdkit_db
psql my_rdkit_db

Configure postgresql.conf

synchronous_commit = off      # immediate fsync at commit
full_page_writes = off            # recover from partial page writes
shared_buffers = 2048MB           # min 128kB (change requires restart)
work_mem = 128MB              # min 64kB

Cartridge quick tutorial

Creating a database table from a SMILES file

Create the database and enable the rdkit extension

~/RDKit_trunk/Data/emolecules > createdb emolecules
~/RDKit_trunk/Data/emolecules > psql -c 'create extension rdkit' emolecules

Define the table columns and load the data

~/RDKit_trunk/Data/emolecules > psql -c 'create table raw_data (id SERIAL, smiles text, emol_id integer, parent_id integer)' emolecules
NOTICE:  CREATE TABLE will create implicit sequence "raw_data_id_seq" for serial column "raw_data.id"
CREATE TABLE
~/RDKit_trunk/Data/emolecules > zcat emolecules-2013-02-01.smi.gz | sed '1d; s/\\/\\\\/g' | psql -c "copy raw_data (smiles,emol_id,parent_id) from stdin with delimiter ' '" emolecules
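
The sed step in the pipeline above drops the first line of the file and doubles every backslash. The doubling matters because the text format of COPY treats a backslash as an escape character, while SMILES uses it to mark bond stereochemistry. A minimal Python sketch of the same transformation, with a hypothetical helper name and made-up sample rows:

```python
def prepare_copy_lines(lines):
    # Mirror the sed script: '1d' removes the first (header) line,
    # and the substitution turns each backslash into two backslashes.
    body = lines[1:]
    return [line.replace("\\", "\\\\") for line in body]


# Made-up rows in the "smiles emol_id parent_id" layout used by raw_data.
sample = ["smiles emol_id parent_id", "C/C=C\\C 1 1", "CCO 2 2"]

for row in prepare_copy_lines(sample):
    print(row)
```

The escaped rows can then be fed to COPY ... FROM stdin in the same way as the output of the shell pipeline.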

Loading the ChEMBL compound library into the cartridge

Create the database

Connect to the database

Enable the rdkit extension

chembl_25=# create extension if not exists rdkit;
chembl_25=# create schema rdk;

Create the molecules and build the substructure search index

chembl_25=# select * into rdk.mols from (select molregno,mol_from_ctab(molfile::cstring) m  from compound_structures) tmp where m is not null;
SELECT 1870451
chembl_25=# create index molidx on rdk.mols using gist(m);
CREATE INDEX
chembl_25=# alter table rdk.mols add primary key (molregno);
ALTER TABLE

Create the fingerprints and build the similarity search indexes

chembl_25=# select molregno,torsionbv_fp(m) as torsionbv,morganbv_fp(m) as mfp2,featmorganbv_fp(m) as ffp2 into rdk.fps from rdk.mols;
SELECT 1870451
chembl_25=# create index fps_ttbv_idx on rdk.fps using gist(torsionbv);
CREATE INDEX
chembl_25=# create index fps_mfp2_idx on rdk.fps using gist(mfp2);
CREATE INDEX
chembl_25=# create index fps_ffp2_idx on rdk.fps using gist(ffp2);
CREATE INDEX
chembl_25=# alter table rdk.fps add primary key (molregno);
ALTER TABLE

Summary of psql commands

create extension if not exists rdkit;
create schema rdk;
select * into rdk.mols from (select molregno,mol_from_ctab(molfile::cstring) m  from compound_structures) tmp where m is not null;
create index molidx on rdk.mols using gist(m);
alter table rdk.mols add primary key (molregno);
select molregno,torsionbv_fp(m) as torsionbv,morganbv_fp(m) as mfp2,featmorganbv_fp(m) as ffp2 into rdk.fps from rdk.mols;
create index fps_ttbv_idx on rdk.fps using gist(torsionbv);
create index fps_mfp2_idx on rdk.fps using gist(mfp2);
create index fps_ffp2_idx on rdk.fps using gist(ffp2);
alter table rdk.fps add primary key (molregno);
create or replace function get_mfp2_neighbors(smiles text)
returns table(molregno bigint, m mol, similarity double precision) as
$$
select molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2) as similarity
from rdk.fps join rdk.mols using (molregno)
where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
$$ language sql stable ;

Substructure search

Molecule query examples

chembl_25=# select count(*) from rdk.mols where m@>'c1cccc2c1nncc2' ;
 count
-------
   461
(1 row)

Time: 107.602 ms
chembl_25=# select count(*) from rdk.mols where m@>'c1ccnc2c1nccn2' ;
 count
-------
  1124
(1 row)

Time: 216.222 ms
chembl_25=# select count(*) from rdk.mols where m@>'c1cncc2n1ccn2' ;
 count
-------
  2233
(1 row)

Time: 88.266 ms
chembl_25=# select count(*) from rdk.mols where m@>'Nc1ncnc(N)n1' ;
 count
-------
  7095
(1 row)

Time: 327.855 ms
chembl_25=# select count(*) from rdk.mols where m@>'c1scnn1' ;
 count
-------
 16526
(1 row)

Time: 568.675 ms
chembl_25=# select count(*) from rdk.mols where m@>'c1cccc2c1ncs2' ;
 count
-------
 20745
(1 row)

Time: 998.104 ms
chembl_25=# select count(*) from rdk.mols where m@>'c1cccc2c1CNCCN2' ;
 count
-------
  1788
(1 row)

Time: 1922.273 ms

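Notice that the last two queries are starting to take a while to execute and count all the results.

The database holds about 1.87 million compounds (1,870,451 rows in rdk.mols), so these search times are acceptable.

To get results back faster, retrieve only a limited number of compounds by capping the number of rows returned: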

chembl_25=# select * from rdk.mols where m@>'c1cccc2c1CNCCN2' limit 100;
 molregno |                                                      m
----------+--------------------------------------------------------------------------------------------------------------
  1671940 | Cc1cccc(C)c1N1C(=O)c2ccccc2NC(=O)C1C(=O)NCc1ccco1
  1318078 | COCN1C(=O)[C@@H]2C[C@@H](O)CN2C(=O)c2ccccc21
  1318783 | O/N=C1/Nc2ccccc2C(=S)N2CSCC12
  1318127 | CC(=O)O[C@H]1C[C@H]2C(=S)Nc3ccccc3C(=S)N2C1
  1308578 | O=C1Nc2cc([N+](=O)[O-])ccc2C(=O)N2CCC[C@@H]12
  1417168 | O=C(NCC(F)(F)F)C1C(=O)Nc2ccccc2C(=O)N1Cc1ccccc1
  ...
   793329 | Cc1ccc2c(c1)C(c1ccccc1)N(C(=O)c1ccc(OC(C)C)cc1)CC(=O)N2
   921215 | O=C1CN(C(=O)c2cc([N+](=O)[O-])ccc2Cl)C(c2ccc(F)cc2)c2cc(F)ccc2N1
   790949 | CCOC(=O)[C@H]1[C@H]2COc3ccc(Cl)cc3[C@@H]2N2C(=O)c3cc(C)ccc3NC(=O)[C@@]12C
   760998 | CC(=O)N1CC(=O)Nc2ccc(Cl)cc2C1c1ccc(F)cc1
(100 rows)

Time: 97.357 ms

SMARTS-based queries

Search for oxadiazoles or thiadiazoles:

chembl_25=# select * from rdk.mols where m@>'c1[o,s]ncn1'::qmol limit 500;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  1882516 | COc1cccc(CN(C)Cc2nc(C(C)C)no2)c1
  2194441 | Cc1nc([C@](C)(O)C#Cc2ccc3c(c2)-c2nc(C(N)=O)sc2[C@@H](F)CO3)no1
  1881742 | CCOc1ccc(C(F)(F)F)cc1NC(=O)NCc1noc(C)n1
  1949861 | FC(F)(F)c1ccc(-c2nc(-c3ccc4nc[nH]c4c3)no2)cc1
  1949860 | FC(F)(F)c1cccc(-c2nc(-c3ccc4nc[nH]c4c3)no2)c1
  2172627 | O=c1[nH]cc(-c2cc(Cl)ccc2Oc2cc(F)c(S(=O)(=O)Nc3ncns3)cc2F)n2cncc12
  ...
  1848026 | O=C1CCCN1c1cccc(-c2noc([C@H]3CCCCN3C(=O)COc3ccccc3)n2)c1
  1848027 | O=C1CN(c2cccc(-c3noc([C@H]4CCCCN4C(=O)COc4ccccc4)n3)c2)C(=O)N1
  1848036 | CN(C)C(=O)CCC(=O)Nc1cc(F)cc(-c2noc([C@H]3CCCCN3C(=O)COc3ccccc3)n2)c1
  1852688 | CC(Sc1nc(N)cc(N)n1)c1nc(C(C)(C)C)no1
(500 rows)

Time: 761.847 ms

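SMARTS-based queries are generally somewhat slower than SMILES-based queries.

Using stereochemistry in queries

Note that by default stereochemistry is not taken into account in substructure queries: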

chembl_25=# select * from rdk.mols where m@>'NC(=O)[C@@H]1CCCN1C=O' limit 10;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  2213985 | CC[C@H](C)[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H]2CCCN2C(=O)[C@H]([C@@H](C)CC)NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)[C@H]([C@H](C)O)NC(=O)[C@@H]2CSSC[C@H](NC1=O)C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1c[nH]c3ccccc13)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N2
  1956682 | NC(=O)[C@@H]1CCCN1C(=O)[C@H](Cc1nc(I)[nH]c1I)NC(=O)c1cnccn1
  2212188 | CN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@@H](Cc2ccc(O)cc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccc3ccccc3c2)NC(=O)[C@@H]1CC(=O)O
  2053463 | NCCCC[C@H](NC(=O)[C@H](Cc1ccc(OP(=O)(O)O)cc1)NC(=O)Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1ccccc1)C(N)=O
  2060743 | CCCCCCCCCCCCCCCCNC(=O)CN(CC(=O)NC(C)(C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(N)=O)C(N)=O)C(=O)c1cccnc1
  2060744 | CCCCCCCCCCCCCCCCN(CCCCCCCCCCCCCCCC)CCCCCC(=O)NC(C)(C)C(=O)NC(Cc1ccccc1)C(=O)NC(CC(C)C)C(=O)NC(Cc1ccccc1)C(=O)NC(CCCNC(=N)N)C(=O)N1CCCC1C(=O)NC(CCCNC(=N)N)C(=O)NC(CC(N)=O)C(N)=O
  2077784 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCSC)NC1=O
  2077779 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC[S+](C)[O-])NC1=O
  2077782 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](CCSC)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC[S+](C)[O-])NC1=O
  2077780 | CC(C)C[C@@H]1NC(=O)[C@H](CC[S+](C)[O-])NC(=O)[C@H](C(C)C)NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](CC[S+](C)[O-])NC1=O
(10 rows)

To take stereochemistry into account, set rdkit.do_chiral_sss to true:

chembl_25=# set rdkit.do_chiral_sss=true;
SET
Time: 0.241 ms
chembl_25=# select * from rdk.mols where m@>'NC(=O)[C@@H]1CCCN1C=O' limit 10;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  2213985 | CC[C@H](C)[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H]2CCCN2C(=O)[C@H]([C@@H](C)CC)NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)[C@H]([C@H](C)O)NC(=O)[C@@H]2CSSC[C@H](NC1=O)C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1c[nH]c3ccccc13)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N2
  1956682 | NC(=O)[C@@H]1CCCN1C(=O)[C@H](Cc1nc(I)[nH]c1I)NC(=O)c1cnccn1
  2212188 | CN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@@H](Cc2ccc(O)cc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccc3ccccc3c2)NC(=O)[C@@H]1CC(=O)O
  2053463 | NCCCC[C@H](NC(=O)[C@H](Cc1ccc(OP(=O)(O)O)cc1)NC(=O)Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1ccccc1)C(N)=O
  2060743 | CCCCCCCCCCCCCCCCNC(=O)CN(CC(=O)NC(C)(C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(N)=O)C(N)=O)C(=O)c1cccnc1
  2077784 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCSC)NC1=O
  2077779 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]2CCCN2C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC[S+](C)[O-])NC1=O
  2077782 | CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](CCSC)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC[S+](C)[O-])NC1=O
  2077780 | CC(C)C[C@@H]1NC(=O)[C@H](CC[S+](C)[O-])NC(=O)[C@H](C(C)C)NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H]2CCCN2C(=O)[C@H](CC[S+](C)[O-])NC1=O
  2211488 | CC[C@H](C)[C@H](N)C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)N[C@H](CCC(=O)N[C@@H](CCC(=O)N[C@@H](CC(C)C)C(=O)O)Cc1ccccc1)Cc1ccccc1)C(C)C)[C@@H](C)CC
(10 rows)

Time: 6.181 ms

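Tuning queries

mol_adjust_query_properties() gives fine-grained control over substructure searches without having to construct complex SMARTS.

The following example searches for 2,6-disubstituted pyridines: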

chembl_25=# select molregno,m from rdk.mols where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1') limit 10;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  1609520 | Cc1cccc(NC(=O)c2cc(Br)ccc2C(=O)O)n1
  1141456 | CCN(CC)CCCn1cc(NC(=O)Nc2cccc(-c3ccccc3)n2)c2ccccc21
  1431198 | Cc1cccc(NC(=O)c2nc(C)sc2Nc2cccnc2)n1
   734975 | Cc1cccc(NC(=O)CN(C)S(=O)(=O)c2ccc(Cl)cc2)n1
   760426 | Cc1cccc(NC(=O)CCCn2cc([N+](=O)[O-])cn2)n1
   782786 | Cc1cccc(NC(=O)CN2C(=O)NC(C)(c3ccc4ccccc4c3)C2=O)n1
  1478990 | Cc1cccc(NC(=O)Cn2c(=O)sc3cc(C(=O)c4ccccc4)ccc32)n1
  1478787 | Cc1cccc(NC(=O)Cn2c(=O)sc3cc(C(=O)c4ccccc4F)ccc32)n1
  1955608 | C[C@H](N)C(=O)Nc1cccc(N)n1
   773911 | Cc1cccc(NC(=O)c2c(-c3ccccc3)noc2C)n1
(10 rows)

Time: 11.895 ms

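By default, mol_adjust_query_properties() makes the following changes to the molecule:

  1. converts dummy atoms into "any" queries;
  2. adds a degree query to ring atoms;
  3. perceives aromaticity.

This behavior can be controlled by passing an additional JSON argument. The following example disables the degree queries: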

chembl_25=# select molregno,m from rdk.mols where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1',
chembl_25(# '{"adjustDegree":false}') limit 10;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  2146308 | CCn1ncc2cc3nc(c21)NCCOC[C@H](c1ccccc1)NC(=O)N3
  2137309 | CCn1ncc2cc3nc(c21)CCCO[C@@H](O)[C@H](c1ccccc1)NC(=O)N3
  2102593 | CCn1ncc2cc3nc(c21)CCCO[C@@H]([C@@H](C)O)[C@@H](c1ccccc1)NC(=O)N3
  2171613 | CCn1ncc2cc3nc(c21)CCCO[C@@H]([C@H](C)O)[C@@H](c1ccccc1)NC(=O)N3
  2111904 | CCn1ncc2cc3nc(c21)C[C@H](O)COC[C@H](c1cccc(Cl)c1)NC(=O)N3
  2173410 | CCn1ncc2cc3nc(c21)CCCOC[C@H](c1ccccc1)NC(=O)N3
  2189450 | Cn1ncc2cc3nc(c21)CCCOC[C@H](c1ccccc1)NC(=O)N3
  2195752 | CCn1ncc2cc3nc(c21)C[C@H](O)COC[C@H](c1ccccc1)NC(=O)N3
  1609520 | Cc1cccc(NC(=O)c2cc(Br)ccc2C(=O)O)n1
  1141456 | CCN(CC)CCCn1cc(NC(=O)Nc2cccc(-c3ccccc3)n2)c2ccccc21
(10 rows)

Time: 10.780 ms

The degree queries can also be restricted. Here they are not added to ring atoms or dummy atoms:

chembl_25=# select molregno,m from rdk.mols where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1',
chembl_25(# '{"adjustDegree":true,"adjustDegreeFlags":"IGNORERINGS|IGNOREDUMMIES"}') limit 10;
 molregno |                                                 m
----------+---------------------------------------------------------------------------------------------------
  2146308 | CCn1ncc2cc3nc(c21)NCCOC[C@H](c1ccccc1)NC(=O)N3
  2137309 | CCn1ncc2cc3nc(c21)CCCO[C@@H](O)[C@H](c1ccccc1)NC(=O)N3
  2102593 | CCn1ncc2cc3nc(c21)CCCO[C@@H]([C@@H](C)O)[C@@H](c1ccccc1)NC(=O)N3
  2171613 | CCn1ncc2cc3nc(c21)CCCO[C@@H]([C@H](C)O)[C@@H](c1ccccc1)NC(=O)N3
  2111904 | CCn1ncc2cc3nc(c21)C[C@H](O)COC[C@H](c1cccc(Cl)c1)NC(=O)N3
  2173410 | CCn1ncc2cc3nc(c21)CCCOC[C@H](c1ccccc1)NC(=O)N3
  2189450 | Cn1ncc2cc3nc(c21)CCCOC[C@H](c1ccccc1)NC(=O)N3
  2195752 | CCn1ncc2cc3nc(c21)C[C@H](O)COC[C@H](c1ccccc1)NC(=O)N3
  1609520 | Cc1cccc(NC(=O)c2cc(Br)ccc2C(=O)O)n1
  1141456 | CCN(CC)CCCn1cc(NC(=O)Nc2cccc(-c3ccccc3)n2)c2ccccc21
(10 rows)

Time: 12.827 ms

The available options are:

  1. adjustDegree (default: true) : adds a query to match the input atomic degree
  2. adjustDegreeFlags (default: ADJUST_IGNOREDUMMIES | ADJUST_IGNORECHAINS) controls where the degree is adjusted
  3. adjustRingCount (default: false) : adds a query to match the input ring count
  4. adjustRingCountFlags (default: ADJUST_IGNOREDUMMIES | ADJUST_IGNORECHAINS) controls where the ring count is adjusted
  5. makeDummiesQueries (default: true) : convert dummy atoms in the input structure into any-atom queries
  6. aromatizeIfPossible (default: true) : run the aromaticity perception algorithm on the input structure (note: this is largely redundant since molecules built from smiles always have aromaticity perceived)
  7. makeBondsGeneric (default: false) : convert bonds into any-bond queries
  8. makeBondsGenericFlags (default: false) : controls which bonds are made generic
  9. makeAtomsGeneric (default: false) : convert atoms into any-atom queries
  10. makeAtomsGenericFlags (default: false) : controls which atoms are made generic

The Flags arguments above are built by combining values from the following list with the | character:

  1. IGNORENONE : apply the operation to all atoms
  2. IGNORERINGS : do not apply the operation to ring atoms
  3. IGNORECHAINS : do not apply the operation to chain atoms
  4. IGNOREDUMMIES : do not apply the operation to dummy atoms
  5. IGNORENONDUMMIES : do not apply the operation to non-dummy atoms
  6. IGNOREALL : do not apply the operation to any atoms
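
The JSON argument used in the queries above can also be assembled programmatically rather than typed by hand. The following is a minimal sketch, assuming only the option and flag names listed in this section; the helper name adjust_query_params is hypothetical and not part of the cartridge:

```python
import json


def adjust_query_params(adjust_degree=True, degree_flags=None):
    # Hypothetical helper: builds the JSON string passed as the second
    # argument of mol_adjust_query_properties().
    params = {"adjustDegree": adjust_degree}
    if degree_flags:
        # Flag values are combined with the | character.
        params["adjustDegreeFlags"] = "|".join(degree_flags)
    # Compact separators keep the string in the same form as the examples.
    return json.dumps(params, separators=(",", ":"))


print(adjust_query_params(False))
print(adjust_query_params(True, ["IGNORERINGS", "IGNOREDUMMIES"]))
```

The resulting string can be passed as a query parameter from client code, which avoids quoting mistakes when several flags are combined.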

Similarity search

Basic similarity search:

chembl_25=# select count(*) from rdk.fps where mfp2%morganbv_fp('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1');
 count
-------
    67
(1 row)

Time: 177.579 ms

Return a list of neighbors sorted by similarity, from highest to lowest:

chembl_25=# create or replace function get_mfp2_neighbors(smiles text)
    returns table(molregno bigint, m mol, similarity double precision) as
  $$
  select molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2) as similarity
  from rdk.fps join rdk.mols using (molregno)
  where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
  order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
  $$ language sql stable ;
CREATE FUNCTION
Time: 0.856 ms
chembl_25=# select * from get_mfp2_neighbors('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1') limit 10;
 molregno |                                m                                 |    similarity
----------+------------------------------------------------------------------+-------------------
   751668 | COc1ccc2nc(NC(=O)[C@@H]3CCCN3C(=O)c3cccs3)sc2c1                  | 0.619718309859155
   740754 | Cc1ccc(NC(=O)C2CCCN2C(=O)c2cccs2)cc1C                            | 0.606060606060606
   732905 | O=C(Nc1ccc(S(=O)(=O)N2CCCC2)cc1)C1CCCN1C(=O)c1cccs1              | 0.602941176470588
   810850 | Cc1cc(C)n(-c2ccc(NC(=O)C3CCCCN3C(=O)c3cccs3)cc2)n1               | 0.583333333333333
  1224407 | O=C(Nc1cccc(S(=O)(=O)N2CCCC2)c1)C1CCCN1C(=O)c1cccs1              | 0.579710144927536
   779258 | CC1CCN(S(=O)(=O)c2ccc(NC(=O)[C@@H]3CCCN3C(=O)c3cccs3)cc2)CC1     | 0.569444444444444
   472441 | Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)C(C)C)CC4)cc3)sc2c1        | 0.569444444444444
   745651 | Cc1ccc(NC(=O)[C@@H]2CCCN2C(=O)c2cccs2)cc1S(=O)(=O)N1CCCCC1       | 0.567567567567568
   472510 | Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)c5cccc(Cl)c5)CC4)cc3)sc2c1 | 0.565789473684211
  1233426 | Cc1cccc2sc(NC(=O)[C@@H]3CCCN3C(=O)c3cccs3)nc12                   | 0.563380281690141
(10 rows)

Time: 28.909 ms
chembl_25=# select * from get_mfp2_neighbors('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') limit 10;
 molregno |                                m                         |    similarity
----------+----------------------------------------------------------+-------------------
  2138088 | CN(CC(=O)O)c1nc2ccc([N+](=O)[O-])cc2s1                   | 0.673913043478261
  1040255 | CC(=O)N(CCCN(C)C)c1nc2ccc(C)cc2s1                        | 0.571428571428571
   773946 | CC(=O)N(CCCN(C)C)c1nc2ccc(C)cc2s1.Cl                     |              0.56
  1044892 | Cc1ccc2nc(N(CCN(C)C)C(=O)c3cc(Cl)sc3Cl)sc2c1             | 0.518518518518518
   441378 | Cc1ccc2nc(NC(=O)CCC(=O)O)sc2c1                           | 0.510204081632653
  1047691 | Cc1ccc(S(=O)(=O)CC(=O)N(CCCN(C)C)c2nc3ccc(C)cc3s2)cc1    | 0.509090909090909
  1042958 | Cc1ccc2nc(N(CCN(C)C)C(=O)c3ccc4ccccc4c3)sc2c1            | 0.509090909090909
  1015485 | Cc1ccc2nc(N(Cc3cccnc3)C(=O)Cc3ccccc3)sc2c1               |               0.5
   994843 | Cc1ccc(S(=O)(=O)CC(=O)N(CCCN(C)C)c2nc3ccc(C)cc3s2)cc1.Cl |               0.5
   841938 | Cc1ccc2nc(N(CCN(C)C)C(=O)c3ccc4ccccc4c3)sc2c1.Cl         |               0.5
(10 rows)

Time: 41.623 ms

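Adjusting the similarity threshold

By default, the minimum similarity returned by a similarity search is 0.5. Each similarity metric has its own threshold setting: rdkit.tanimoto_threshold for Tanimoto and rdkit.dice_threshold for Dice.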

chembl_25=# select count(*) from get_mfp2_neighbors('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 count
-------
    21
(1 row)

Time: 181.438 ms
chembl_25=# set rdkit.tanimoto_threshold=0.7;
SET
Time: 0.047 ms
chembl_25=# select count(*) from get_mfp2_neighbors('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 count
-------
     0
(1 row)

Time: 161.228 ms
chembl_25=# set rdkit.tanimoto_threshold=0.6;
SET
Time: 0.045 ms
chembl_25=# select count(*) from get_mfp2_neighbors('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 count
-------
     2
(1 row)

Time: 184.275 ms
chembl_25=# set rdkit.tanimoto_threshold=0.5;
SET
Time: 0.055 ms
chembl_25=# select count(*) from get_mfp2_neighbors('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 count
-------
    21
(1 row)

Time: 181.100 ms
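
To make the effect of the threshold concrete, the sketch below computes Tanimoto and Dice similarities on toy fingerprints represented as sets of on-bit indices, then filters and sorts the hits. It only illustrates the arithmetic: the fingerprints are made up, the function names are hypothetical, and the cartridge itself performs this work inside PostgreSQL with index support.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice similarity of two fingerprints given as sets of on-bit indices."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def neighbors(query, library, threshold=0.5):
    """Return (name, similarity) pairs at or above threshold, most similar first."""
    scored = ((name, tanimoto(query, bits)) for name, bits in library.items())
    hits = [(name, sim) for name, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (made up for illustration, not real Morgan fingerprints).
query = {1, 2, 3, 4}
library = {"a": {1, 2, 3, 4, 5}, "b": {1, 2, 7, 8}, "c": {1, 2, 3, 9}}

for threshold in (0.5, 0.7):
    print(threshold, neighbors(query, library, threshold))
```

As in the psql session above, raising the threshold shrinks the hit list, and lowering it back restores the earlier results.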

Finding the maximum common substructure (MCS)

The MCS code finds the maximum common substructure of a set of molecules:

chembl_25=# select fmcs(m::text) from rdk.mols join compound_records using (molregno) where doc_id=4;
                                  fmcs
------------------------------------------------------------------------
 [#6](-[#6]-[#7]-[#6]-[#6](-,:[#6])-,:[#6])-,:[#6]-,:[#6]-,:[#6]-,:[#6]
(1 row)

Time: 31.041 ms
chembl_25=# select fmcs(m::text) from rdk.mols join compound_records using (molregno) where doc_id=5;
                                                                   fmcs
------------------------------------------------------------------------------------------------------------------------------------------
 [#6]-[#6](=[#8])-[#7]-[#6](-[#6](=[#8])-[#7]1-[#6]-[#6]-[#6]-[#6]-1-[#6](=[#8])-[#7]-[#6](-[#6](=[#8])-[#8])-[#6]-[#6])-[#6](-[#6])-[#6]
(1 row)

Time: 705.535 ms

The MCS can also be computed from SMILES:

chembl_25=# select fmcs(canonical_smiles) from compound_structures join compound_records using (molregno) where doc_id=4;
                                  fmcs
------------------------------------------------------------------------
 [#6](-[#7]-[#6]-[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6])-[#6](-,:[#6])-,:[#6]
(1 row)

Time: 128.879 ms

The parameters of the FMCS algorithm can be adjusted. Here are a few examples:

chembl_25=# select fmcs_smiles(str,'{"Threshold":0.8}') from
chembl_25-#    (select string_agg(m::text,' ') as str from rdk.mols
chembl_25(#    join compound_records using (molregno) where doc_id=4) as str ;

                                                                           fmcs_smiles
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [#6]-[#6]-[#8]-[#6](-[#6](=[#8])-[#7]-[#6](-[#6])-[#6](-,:[#6])-,:[#6])-[#6](-[#8])-[#6](-[#8])-[#6](-[#8]-[#6]-[#6])-[#6]-[#7]-[#6](-[#6])-[#6](-,:[#6])-,:[#6]
(1 row)

Time: 9673.949 ms
chembl_25=#
chembl_25=# select fmcs_smiles(str,'{"AtomCompare":"Any"}') from
chembl_25-#    (select string_agg(m::text,' ') as str from rdk.mols
chembl_25(#    join compound_records using (molregno) where doc_id=4) as str ;
                                                                              fmcs_smiles
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [#6]-,:[#6,#7]-[#8,#6]-[#6,#7](-[#6,#8]-[#7,#6]-,:[#6,#7]-,:[#6,#7]-,:[#7,#6]-,:[#6])-[#6,#7]-[#6]-[#6](-[#8,#6]-[#6])-[#6,#7]-[#7,#6]-[#6]-,:[#6,#8]-,:[#7,#6]-,:[#6]
(1 row)

Time: 304.332 ms

Several parameters (here AtomCompare, CompleteRingsOnly, Threshold and Timeout) can be set together in one MCS search:

chembl_25=# select fmcs_smiles(str,'{"AtomCompare":"Any","CompleteRingsOnly":true,"Threshold":0.8,"Timeout":60}') from
chembl_25-#    (select string_agg(m::text,' ') as str from rdk.mols
chembl_25(#    join compound_records using (molregno) where doc_id=3) as str ;

WARNING:  findMCS timed out, result is not maximal
                                                                                          fmcs_smiles

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------
 [#8]=[#6](-[#7]-[#6]1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#6](=[#8])-[#7]1-[#6]-[#6]-[#6]-[#6,#7]-[#6]2:[#6]-1:[#6]:[#6]:[#16]:2)-[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1-[#6]1:[#6]:
[#6]:[#6]:[#6]:[#6]:1
(1 row)

Time: 60479.753 ms

The available MCS parameters and their default values are:

  1. MaximizeBonds (true)
  2. Threshold (1.0)
  3. Timeout (-1, no timeout)
  4. MatchValences (false)
  5. MatchChiralTag (false) Applies to atoms
  6. RingMatchesRingOnly (false)
  7. CompleteRingsOnly (false)
  8. MatchStereo (false) Applies to bonds
  9. AtomCompare (“Elements”) can be “Elements”, “Isotopes”, or “Any”
  10. BondCompare (“Order”) can be “Order”, “OrderExact”, or “Any”
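
The parameter string passed to fmcs_smiles is plain JSON, so it can be generated and checked before it reaches the database. The sketch below is one possible approach, assuming the parameter names and defaults listed above; the helper fmcs_params is hypothetical and only validates names on the client side:

```python
import json

# Defaults copied from the list above.
FMCS_DEFAULTS = {
    "MaximizeBonds": True,
    "Threshold": 1.0,
    "Timeout": -1,
    "MatchValences": False,
    "MatchChiralTag": False,
    "RingMatchesRingOnly": False,
    "CompleteRingsOnly": False,
    "MatchStereo": False,
    "AtomCompare": "Elements",
    "BondCompare": "Order",
}


def fmcs_params(**overrides):
    # Hypothetical helper: reject misspelled parameter names early,
    # then emit only the overridden values as compact JSON.
    unknown = sorted(set(overrides) - set(FMCS_DEFAULTS))
    if unknown:
        raise ValueError("unknown FMCS parameter(s): " + ", ".join(unknown))
    return json.dumps(overrides, separators=(",", ":"))


print(fmcs_params(AtomCompare="Any", CompleteRingsOnly=True, Threshold=0.8, Timeout=60))
```

Catching a misspelled key in client code is cheaper than discovering it after a long-running MCS query.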

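Cartridge reference guide

Note

The translator has not yet used the cartridge in depth in practice, so the content below is kept in the original English for now.

New types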

  • mol : an rdkit molecule. Can be created from a SMILES via direct type conversion, for example: 'c1ccccc1'::mol creates a molecule from the SMILES 'c1ccccc1'
  • qmol : an rdkit molecule containing query features (i.e. constructed from SMARTS). Can be created from a SMARTS via direct type conversion, for example: 'c1cccc[c,n]1'::qmol creates a query molecule from the SMARTS 'c1cccc[c,n]1'
  • sfp : a sparse count vector fingerprint (SparseIntVect in C++ and Python)
  • bfp : a bit vector fingerprint (ExplicitBitVect in C++ and Python)

Parameters

  • rdkit.tanimoto_threshold : threshold value for the Tanimoto similarity operator. Searches done using Tanimoto similarity will only return results with a similarity of at least this value.
  • rdkit.dice_threshold : threshold value for the Dice similarity operator. Searches done using Dice similarity will only return results with a similarity of at least this value.
  • rdkit.do_chiral_sss : toggles whether or not stereochemistry is used in substructure matching. (available from 2013_03 release).
  • rdkit.sss_fp_size : the size (in bits) of the fingerprint used for substructure screening.
  • rdkit.morgan_fp_size : the size (in bits) of morgan fingerprints
  • rdkit.featmorgan_fp_size : the size (in bits) of featmorgan fingerprints
  • rdkit.layered_fp_size : the size (in bits) of layered fingerprints
  • rdkit.rdkit_fp_size : the size (in bits) of RDKit fingerprints
  • rdkit.torsion_fp_size : the size (in bits) of topological torsion bit vector fingerprints
  • rdkit.atompair_fp_size : the size (in bits) of atom pair bit vector fingerprints
  • rdkit.avalon_fp_size : the size (in bits) of avalon fingerprints

Operators

Similarity search

  • % : operator used for similarity searches using Tanimoto similarity. Returns whether or not the Tanimoto similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.tanimoto_threshold.
  • # : operator used for similarity searches using Dice similarity. Returns whether or not the Dice similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.dice_threshold.
  • <%> : used for Tanimoto KNN searches (to return ordered lists of neighbors).
  • <#> : used for Dice KNN searches (to return ordered lists of neighbors).

Substructure and exact structure search

  • @> : substructure search operator. Returns whether or not the mol or qmol on the right is a substructure of the mol on the left.
  • <@ : substructure search operator. Returns whether or not the mol or qmol on the left is a substructure of the mol on the right.
  • @= : returns whether or not two molecules are the same.

Molecule comparison

  • < : returns whether or not the left mol is less than the right mol
  • > : returns whether or not the left mol is greater than the right mol
  • = : returns whether or not the left mol is equal to the right mol
  • <= : returns whether or not the left mol is less than or equal to the right mol
  • >= : returns whether or not the left mol is greater than or equal to the right mol

Note: Two molecules are compared by making the following comparisons in order. Later comparisons are only made if the preceding values are equal:

  1. Number of atoms
  2. Number of bonds
  3. Molecular weight
  4. Number of rings

If all of the above are the same and the second molecule is a substructure of the first, the molecules are declared equal. Otherwise (should not happen) the first molecule is arbitrarily defined to be less than the second.
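
The comparison order described in the note behaves like a lexicographic comparison of four values followed by a substructure check. The sketch below illustrates that logic only; MolSummary and compare_mols are hypothetical names, the substructure test is supplied by the caller, and the numeric values are approximate heavy-atom figures chosen for illustration:

```python
from typing import Callable, NamedTuple


class MolSummary(NamedTuple):
    # Field order matches the comparison order in the note.
    num_atoms: int
    num_bonds: int
    mol_weight: float
    num_rings: int


def compare_mols(
    first: MolSummary,
    second: MolSummary,
    is_substructure: Callable[[MolSummary, MolSummary], bool] = lambda a, b: True,
) -> int:
    """Return -1, 0 or 1, following the ordering rules in the note."""
    if first != second:
        # Tuples compare field by field; later fields only matter on ties.
        return -1 if first < second else 1
    # All four values tie: equal if the second is a substructure of the first,
    # otherwise the first is arbitrarily treated as smaller.
    return 0 if is_substructure(first, second) else -1


# Illustrative values only.
benzene = MolSummary(6, 6, 78.11, 1)
pyridine = MolSummary(6, 6, 79.10, 1)

print(compare_mols(benzene, pyridine))
print(compare_mols(benzene, benzene))
```

In this toy case the atom and bond counts tie, so the molecular weight decides the order.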

There are additional operators defined in the cartridge, but these are used for internal purposes.

Functions

Generating fingerprints

  • morgan_fp(mol,int default 2) : returns an sfp which is the count-based Morgan fingerprint for a molecule using connectivity invariants. The second argument provides the radius. This is an ECFP-like fingerprint.
  • morganbv_fp(mol,int default 2) : returns a bfp which is the bit vector Morgan fingerprint for a molecule using connectivity invariants. The second argument provides the radius. This is an ECFP-like fingerprint.
  • featmorgan_fp(mol,int default 2) : returns an sfp which is the count-based Morgan fingerprint for a molecule using chemical-feature invariants. The second argument provides the radius. This is an FCFP-like fingerprint.
  • featmorganbv_fp(mol,int default 2) : returns a bfp which is the bit vector Morgan fingerprint for a molecule using chemical-feature invariants. The second argument provides the radius. This is an FCFP-like fingerprint.
  • rdkit_fp(mol) : returns a bfp which is the RDKit fingerprint for a molecule. This is a daylight-fingerprint using hashed molecular subgraphs.
  • atompair_fp(mol) : returns an sfp which is the count-based atom-pair fingerprint for a molecule.
  • atompairbv_fp(mol) : returns a bfp which is the bit vector atom-pair fingerprint for a molecule.
  • torsion_fp(mol) : returns an sfp which is the count-based topological-torsion fingerprint for a molecule.
  • torsionbv_fp(mol) : returns a bfp which is the bit vector topological-torsion fingerprint for a molecule.
  • layered_fp(mol) : returns a bfp which is the layered fingerprint for a molecule. This is an experimental substructure fingerprint using hashed molecular subgraphs.
  • maccs_fp(mol) : returns a bfp which is the MACCS fingerprint for a molecule (available from 2013_01 release).

Fingerprint operations

  • tanimoto_sml(fp,fp) : returns the Tanimoto similarity between two fingerprints of the same type (either two sfp or two bfp values).
  • dice_sml(fp,fp) : returns the Dice similarity between two fingerprints of the same type (either two sfp or two bfp values).
  • size(bfp) : returns the length of (number of bits in) a bfp.
  • add(sfp,sfp) : returns an sfp formed by the element-wise addition of the two sfp arguments.
  • subtract(sfp,sfp) : returns an sfp formed by the element-wise subtraction of the two sfp arguments.
  • all_values_lt(sfp,int) : returns a boolean indicating whether or not all elements of the sfp argument are less than the int argument.
  • all_values_gt(sfp,int) : returns a boolean indicating whether or not all elements of the sfp argument are greater than the int argument.
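
The sfp operations above can be pictured with a small sparse count vector model. The sketch below uses plain dictionaries mapping feature index to count; it is an illustration of element-wise arithmetic, not the cartridge implementation. Two choices here are assumptions: zero entries are dropped after arithmetic, and the all_values checks look only at stored entries.

```python
def add(a, b):
    """Element-wise sum of two sparse count vectors (dict: index -> count)."""
    out = dict(a)
    for index, count in b.items():
        out[index] = out.get(index, 0) + count
    # Assumption: entries that become zero are not stored.
    return {i: c for i, c in out.items() if c != 0}


def subtract(a, b):
    """Element-wise difference a - b of two sparse count vectors."""
    return add(a, {index: -count for index, count in b.items()})


def all_values_lt(vec, limit):
    """True if every stored count is less than limit (assumption: stored entries only)."""
    return all(count < limit for count in vec.values())


def all_values_gt(vec, limit):
    """True if every stored count is greater than limit (assumption: stored entries only)."""
    return all(count > limit for count in vec.values())


# Made-up count vectors.
a = {1: 2, 5: 1}
b = {1: 1, 7: 3}

print(add(a, b))
print(subtract(a, b))
print(all_values_lt(add(a, b), 4), all_values_gt(add(a, b), 1))
```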

Fingerprint I/O

  • bfp_to_binary_text(bfp) : returns a bytea with the binary string representation of the fingerprint that can be converted back into an RDKit fingerprint in other software. (available from Q3 2012 (2012_09) release)
  • bfp_from_binary_text(bytea) : constructs a bfp from a binary string representation of the fingerprint. (available from Q3 2012 (2012_09) release)

Molecule I/O and validation

  • is_valid_smiles(smiles) : returns whether or not a SMILES string produces a valid RDKit molecule.
  • is_valid_ctab(ctab) : returns whether or not a CTAB (mol block) string produces a valid RDKit molecule.
  • is_valid_smarts(smarts) : returns whether or not a SMARTS string produces a valid RDKit molecule.
  • is_valid_mol_pkl(bytea) : returns whether or not a binary string (bytea) can be converted into an RDKit molecule. (available from Q3 2012 (2012_09) release)
  • mol_from_smiles(smiles) : returns a molecule for a SMILES string, NULL if the molecule construction fails.
  • mol_from_smarts(smarts) : returns a molecule for a SMARTS string, NULL if the molecule construction fails.
  • mol_from_ctab(ctab, bool default false) : returns a molecule for a CTAB (mol block) string, NULL if the molecule construction fails. The optional second argument controls whether or not the molecule’s coordinates are saved.
  • mol_from_pkl(bytea) : returns a molecule for a binary string (bytea), NULL if the molecule construction fails. (available from Q3 2012 (2012_09) release)
  • qmol_from_smiles(smiles) : returns a query molecule for a SMILES string, NULL if the molecule construction fails. Explicit Hs in the SMILES are converted into query features on the attached atom.
  • qmol_from_ctab(ctab, bool default false) : returns a query molecule for a CTAB (mol block) string, NULL if the molecule construction fails. Explicit Hs in the CTAB are converted into query features on the attached atom. The optional second argument controls whether or not the molecule's coordinates are saved.
  • mol_to_smiles(mol) : returns the canonical SMILES for a molecule.
  • mol_to_smarts(mol) : returns SMARTS string for a molecule.
  • mol_to_pkl(mol) : returns binary string (bytea) for a molecule. (available from Q3 2012 (2012_09) release)
  • mol_to_ctab(mol,bool default true) : returns a CTAB (mol block) string for a molecule. The optional second argument controls whether or not 2D coordinates will be generated for molecules that don’t have coordinates. (available from the 2014_03 release)
  • mol_to_svg(mol,string default '',int default 250, int default 200, string default '') : returns an SVG with a drawing of the molecule. The optional parameters are a string to use as the legend, the width of the image, the height of the image, and a JSON with additional rendering parameters. (available from the 2016_09 release)

Substructure operations

  • substruct(mol,mol) : returns whether or not the second mol is a substructure of the first.
  • substruct_count(mol,mol,bool default true) : returns the number of substructure matches between the second molecule and the first. The third argument toggles whether or not the matches are uniquified. (available from 2013_03 release)
  • mol_adjust_query_properties(mol,string default '') : returns a new molecule with additional query information attached. (available from the 2016_09 release)
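
The substructure functions can be called from Python in the same way. The sketch below counts matches of a query in each matching row of rdk.mols (the table with columns molregno and m built earlier in this document). The helper name count_matches and the default limit are assumptions made for illustration.

```python
# Sketch only: `curs` is assumed to be a psycopg2-style cursor on a database
# that contains the rdk.mols(molregno, m) table.
SUBSTRUCT_COUNT_SQL = (
    "select molregno, substruct_count(m, mol_from_smiles(%s::cstring)) "
    "from rdk.mols "
    "where substruct(m, mol_from_smiles(%s::cstring)) "
    "limit %s"
)


def count_matches(curs, query_smiles, limit=10):
    """Return [(molregno, match_count), ...] for rows containing the query."""
    # The query SMILES is bound twice: once for the count, once for the filter.
    curs.execute(SUBSTRUCT_COUNT_SQL, (query_smiles, query_smiles, limit))
    return [(molregno, count) for molregno, count in curs.fetchall()]
```

The function form substruct(m, query) is used here only to mirror the list above. The m@>query operator form used elsewhere in this document is the one associated with the GiST index, so the function form may be slower on large tables.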

Descriptors

  • mol_amw(mol) : returns the AMW for a molecule.
  • mol_logp(mol) : returns the MolLogP for a molecule.
  • mol_tpsa(mol) : returns the topological polar surface area for a molecule (available from Q1 2011 (2011_03) release).
  • mol_fractioncsp3(mol) : returns the fraction of carbons that are sp3 hybridized (available from 2013_03 release).
  • mol_hba(mol) : returns the number of Lipinski H-bond acceptors (i.e. number of Os and Ns) for a molecule.
  • mol_hbd(mol) : returns the number of Lipinski H-bond donors (i.e. number of Os and Ns that have at least one H) for a molecule.
  • mol_numatoms(mol) : returns the total number of atoms in a molecule.
  • mol_numheavyatoms(mol) : returns the number of heavy atoms in a molecule.
  • mol_numrotatablebonds(mol) : returns the number of rotatable bonds in a molecule (available from Q1 2011 (2011_03) release).
  • mol_numheteroatoms(mol) : returns the number of heteroatoms in a molecule (available from Q1 2011 (2011_03) release).
  • mol_numrings(mol) : returns the number of rings in a molecule (available from Q1 2011 (2011_03) release).
  • mol_numaromaticrings(mol) : returns the number of aromatic rings in a molecule (available from 2013_03 release).
  • mol_numaliphaticrings(mol) : returns the number of aliphatic (at least one non-aromatic bond) rings in a molecule (available from 2013_03 release).
  • mol_numsaturatedrings(mol) : returns the number of saturated rings in a molecule (available from 2013_03 release).
  • mol_numaromaticheterocycles(mol) : returns the number of aromatic heterocycles in a molecule (available from 2013_03 release).
  • mol_numaliphaticheterocycles(mol) : returns the number of aliphatic (at least one non-aromatic bond) heterocycles in a molecule (available from 2013_03 release).
  • mol_numsaturatedheterocycles(mol) : returns the number of saturated heterocycles in a molecule (available from 2013_03 release).
  • mol_numaromaticcarbocycles(mol) : returns the number of aromatic carbocycles in a molecule (available from 2013_03 release).
  • mol_numaliphaticcarbocycles(mol) : returns the number of aliphatic (at least one non-aromatic bond) carbocycles in a molecule (available from 2013_03 release).
  • mol_numsaturatedcarbocycles(mol) : returns the number of saturated carbocycles in a molecule (available from 2013_03 release).
  • mol_inchi(mol) : returns an InChI for the molecule. (available from the 2011_06 release, requires that the RDKit be built with InChI support).
  • mol_inchikey(mol) : returns an InChI key for the molecule. (available from the 2011_06 release, requires that the RDKit be built with InChI support).
  • mol_formula(mol,bool default false, bool default true) : returns a string with the molecular formula. The second argument controls whether isotope information is included in the formula; the third argument controls whether “D” and “T” are used instead of [2H] and [3H]. (available from the 2014_03 release)
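
Descriptor functions can be used both in the select list and in the where clause. The following sketch builds such a query from a dictionary of upper bounds. The helper name build_descriptor_filter and the example thresholds are illustrative assumptions, not values from the cartridge documentation; the table and column defaults refer to rdk.mols from earlier in this document.

```python
# Sketch only: the thresholds below are illustrative, not recommendations.
DESCRIPTOR_LIMITS = {
    "mol_amw": 500,
    "mol_logp": 5,
    "mol_hbd": 5,
    "mol_hba": 10,
}


def build_descriptor_filter(limits, table="rdk.mols", column="m"):
    """Build (sql, params) selecting rows whose descriptors are <= the limits.

    Descriptor, table and column names are interpolated into the SQL text,
    so they must come from trusted code, never from user input. Only the
    threshold values are passed as bound parameters.
    """
    names = sorted(limits)
    selected = ", ".join(f"{name}({column})" for name in names)
    where = " and ".join(f"{name}({column}) <= %s" for name in names)
    sql = f"select molregno, {selected} from {table} where {where}"
    params = tuple(limits[name] for name in names)
    return sql, params
```

The returned pair can be passed directly to a psycopg2 cursor as curs.execute(sql, params).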

Connectivity descriptors

  • mol_chi0v(mol) - mol_chi4v(mol) : returns the ChiXv value for a molecule for X=0-4 (available from 2012_01 release).
  • mol_chi0n(mol) - mol_chi4n(mol) : returns the ChiXn value for a molecule for X=0-4 (available from 2012_01 release).
  • mol_kappa1(mol) - mol_kappa3(mol) : returns the kappaX value for a molecule for X=1-3 (available from 2012_01 release).
  • mol_numspiroatoms(mol) : returns the number of spiro atoms in a molecule (available from 2015_09 release).
  • mol_numbridgeheadatoms(mol) : returns the number of bridgehead atoms in a molecule (available from 2015_09 release).

MCS (maximum common substructure)

  • fmcs(mols) : an aggregation function that calculates the MCS for a set of molecules.
  • fmcs_smiles(text, json default '') : calculates the MCS for a space-separated set of SMILES. The optional json argument is used to provide parameters to the MCS code.
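
Because fmcs_smiles expects a single space-separated string, a small helper can assemble the call from a Python list. This is a sketch under stated assumptions: the helper name build_fmcs_smiles_call is hypothetical, and the optional dictionary is serialized to JSON and passed through unchanged, so which keys the MCS code accepts is not checked here.

```python
import json


def build_fmcs_smiles_call(smiles_list, params=None):
    """Build (sql, args) for a call to fmcs_smiles from a list of SMILES.

    Each SMILES must be non-empty and free of whitespace, because the
    cartridge function splits its input on spaces.
    """
    for smi in smiles_list:
        if not smi or any(ch.isspace() for ch in smi):
            raise ValueError(f"SMILES must be non-empty without whitespace: {smi!r}")
    joined = " ".join(smiles_list)
    if params is None:
        return "select fmcs_smiles(%s)", (joined,)
    # Assumption: the second argument is accepted as a JSON-formatted string.
    return "select fmcs_smiles(%s, %s)", (joined, json.dumps(params))
```

The result is meant to be used as curs.execute(*build_fmcs_smiles_call(smiles_list)) with a psycopg2-style cursor.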

Other

  • rdkit_version() : returns a string with the cartridge version number.

There are additional functions defined in the cartridge, but these are used for internal purposes.

Python interface to the cartridge (PostgreSQL)

>>> import psycopg2
>>> conn = psycopg2.connect(database='chembl_25')
>>> curs = conn.cursor()
>>> curs.execute('select * from rdk.mols where m@>%s',('c1cccc2c1nncc2',))
>>> curs.fetchone()
(9830, 'CC(C)Sc1ccc(CC2CCN(C3CCN(C(=O)c4cnnc5ccccc54)CC3)CC2)cc1')
>>> curs.execute('select molregno,mol_send(m) from rdk.mols where m@>%s',('c1cccc2c1nncc2',))
>>> row = curs.fetchone()
>>> row
(9830, <memory at 0x...>)
>>> from rdkit import Chem
>>> m = Chem.Mol(row[1].tobytes())
>>> Chem.MolToSmiles(m,True)
'CC(C)Sc1ccc(CC2CCN(C3CCN(C(=O)c4cnnc5ccccc54)CC3)CC2)cc1'
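
The session above fetches a single row. For larger result sets, the same query can be wrapped in a generator that reads rows in batches. This is a minimal sketch: the function name iter_substructure_hits and the batch size are assumptions, and the cursor is assumed to behave like a psycopg2 cursor.

```python
def iter_substructure_hits(curs, query, batch_size=1000):
    """Yield (molregno, pickle_bytes) for rows of rdk.mols matching `query`."""
    curs.execute(
        "select molregno, mol_send(m) from rdk.mols where m@>%s",
        (query,),
    )
    while True:
        rows = curs.fetchmany(batch_size)
        if not rows:
            break
        for molregno, pkl in rows:
            # psycopg2 returns bytea values as memoryview objects;
            # bytes() copies the buffer into an ordinary bytes object.
            yield molregno, bytes(pkl)
```

Each yielded byte string can be turned into an RDKit molecule with Chem.Mol(pkl), as in the session above.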

Python ORM operations with the cartridge (PostgreSQL)

Object-relational mapping (ORM) operations

Hands-on: building a compound library management system with Django

License

This document is copyright (C) 2013-2016 by Greg Landrum

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 License. To view a copy of this license, visit <http://creativecommons.org/licenses/by-sa/4.0/> or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

The intent of this license is similar to that of the RDKit itself. In simple words: “Do whatever you want with it, but please give us some credit.”