Large-Scale Molecular Clustering Based on Molecular Fingerprints

In early-stage drug discovery we often need to cluster a virtual compound library and pick diverse representative compounds from the different clusters. The scale of the clustering, however, places different demands on the algorithm and the hardware. This post shares two approaches that, in my tests, were friendly in both memory use and clustering time: for libraries of fewer than about 100,000 compounds, use RDKit's built-in method; for libraries far larger than 100,000, use chemfp.

1. RDKit (recommended for < 100,000 compounds)

RDKit implements the Butina clustering algorithm, described in Darko Butina, "Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci., 1999, 39 (4), 747-750.

from rdkit import Chem
from rdkit.Chem import AllChem

def ClusterFps(fps, cutoff=0.2):
    from rdkit import DataStructs
    from rdkit.ML.Cluster import Butina

    # first generate the distance matrix (the lower triangle, as a flat list):
    dists = []
    nfps = len(fps)
    for i in range(1, nfps):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend([1 - x for x in sims])

    # now cluster the data:
    cs = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
    return cs

With the function defined, we load the compounds:

ms = [x for x in Chem.SDMolSupplier('ZINC_example.sdf') if x is not None]  # skip records that fail to parse
print(len(ms))
output: 10000

Compute the fingerprints and cluster. Note that cutoff is a Tanimoto distance threshold (i.e. 1 - similarity): compounds whose distance from a cluster centroid is greater than cutoff are assigned to different clusters.

fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2, 1024) for x in ms]
clusters = ClusterFps(fps, cutoff=0.4)
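
Since the whole point of the clustering (per the introduction) is to pick diverse representatives, here is a minimal sketch of that step. It relies on the documented behavior of Butina.ClusterData, which returns a tuple of clusters, each a tuple of molecule indices with the cluster centroid first:

# one representative molecule per cluster: the centroid is the first index
representatives = [ms[cluster[0]] for cluster in clusters]
print(len(representatives))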

Let's visualize one of the clusters:

print(clusters[100])
output: (553, 297, 547)

from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
m1 = ms[553]
m2 = ms[297]
m3 = ms[547]
mols = (m1, m2, m3)
Draw.MolsToGridImage(mols)
Figure 1. Three molecules from this cluster

RDKit's clustering algorithm can handle libraries on the order of 100,000 compounds. However, when I tested 200,000-molecule clustering on a server with 125 GB of RAM, memory use kept growing as the run went on, and after about half an hour the job crashed with an out-of-memory error. The reason is the distance matrix: ClusterFps builds a flat list of all N(N-1)/2 pairwise distances, so memory grows quadratically with library size, as the quick estimate below shows.
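
A back-of-the-envelope check (a sketch; actual usage is even higher, since each Python float is a boxed object of roughly 24 bytes rather than a raw 8-byte double):

n = 200000
pairs = n * (n - 1) // 2             # ~2.0e10 pairwise distances
print(pairs * 8 / 1024.0**3, 'GB')   # ~149 GB even at 8 bytes each -- more than the 125 GB server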

2. chemfp (> 100,000 compounds)

chemfp is a commercial Python library, but its free version also includes molecular clustering; see the chemfp documentation at http://chemfp.com/. Install the free chemfp 1.6.1 under Python 2.7 from the command line:

# create a Python 2.7 environment named chemfp
conda create -n chemfp python=2.7
# install the required Python packages in this environment
conda activate chemfp
python2.7 -m pip install chemfp
# install the RDKit support chemfp needs; earlier versions also work,
# e.g. RDKit 2013.03, 2016.09, 2017.03, 2017.09 (dev)
conda install -c rdkit rdkit=2018.09.3
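
As a quick sanity check that both packages import (chemfp.__version__ is also used by the script in the appendix):

python2.7 -c "import chemfp; print(chemfp.__version__)"
python2.7 -c "from rdkit import rdBase; print(rdBase.rdkitVersion)"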

Once the installation succeeds we can start clustering. First prepare the input file test_100K_cluster.smi:

# chemfp accepts several input formats; here we write a SMILES file with two
# columns: the SMILES string and an integer index that serves as the ID.
# ZINC_list is assumed to be a list of SMILES strings.
with open('test_100K_cluster.smi', 'w') as f:
    for index, smi in enumerate(ZINC_list):
        f.write(smi + '\t' + str(index) + '\n')

From the command line, call chemfp's rdkit2fps tool to convert the .smi file into fingerprints, using Morgan fingerprints with radius 2, and save the result to test-mol.fps:

rdkit2fps test_100K_cluster.smi --morgan --radius 2 > test-mol.fps
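
For orientation, the resulting FPS file is plain text: a few '#' header lines followed by one hex-encoded fingerprint and its ID per line. Schematically (the exact header fields depend on your chemfp and RDKit versions, and the hex strings are shortened here for display):

#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 ...
#software=RDKit/2018.09.3 chemfp/1.6.1
002004000008...20	0
000104000100...08	1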

Before the actual clustering, download the clustering script taylor_butina.py into your working directory. In case the download link ever goes dead, the full script is also reproduced at the end of this article.
Run the script from the command line. threshold = 0.6 can be tuned to the project's needs (the default is 0.8):

python taylor_butina.py --profile --threshold 0.6 test-mol.fps -o test-mol.clusters

The output test-mol.clusters is a text file that reports the clusters in terms of the indices we assigned in the .smi file. Post-processing is up to you; a minimal parser is sketched below.
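
Here is a minimal parsing sketch, based on the report layout produced by report_cluster_results in the appended script: each cluster is a line "<centroid_id> has <N> other members" followed by a "=> id1 id2 ..." line, and clusters are already sorted largest-first.

def parse_clusters(path):
    # collect (centroid_id, member_ids) pairs from the report
    clusters = []
    with open(path) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if " has " in line and "other members" in line:
            centroid = line.split()[0]
            members = lines[i + 1].split()[1:]  # drop the leading "=>"
            clusters.append((centroid, members))
    return clusters

# e.g. print the three largest clusters
for centroid, members in parse_clusters('test-mol.clusters')[:3]:
    print("%s: %d other members" % (centroid, len(members)))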

In my tests, clustering one million compounds took only about 10 minutes. Compared with RDKit's built-in algorithm, which took half an hour for 100,000 compounds, that is more than a 30-fold speedup.

Profile of the one-million-compound run:

Stage                       Time         Memory
Generate fingerprint file   1 min        N/A
Load time                   2.9 sec      299 MB
Similarity time             548.5 sec    818 MB
Clustering time             2.9 sec      261 MB

Total time: 555.3 sec
The resulting clustering: at a 0.6 similarity threshold, 98,195 clusters.

Appendix: the taylor_butina.py script used with chemfp.

#!/usr/bin/env python

from __future__ import print_function

# An implementation of the Taylor-Butina clustering algorithm for
# chemical fingerprints.

# Robin Taylor, "Simulation Analysis of Experimental Design Strategies
# for Screening Random Compounds as Potential New Drugs and
# Agrochemicals", J. Chem. Inf. Comput. Sci., 1995, 35 (1), pp 59-67.
# DOI: 10.1021/ci00023a009
#
# Darko Butina, "Unsupervised Data Base Clustering Based on Daylight's
# Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
# Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci.,
# 1999, 39 (4), pp 747-750
# DOI: 10.1021/ci9803381

# Copyright 2017 Andrew Dalke <dalke@dalkescientific.com>
# Distributed under the "MIT license"
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation files
# (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge,
# publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

__version__ = "1.0"

import sys
import argparse
import time

# I use the third-party "psutil" package to get memory use information.
try:
    import psutil
except ImportError:
    # If it isn't available, then don't report memory use.
    psutil = None
    def get_memory_use():
        return None
else:
    import os
    _process = psutil.Process(os.getpid())
    def get_memory_use():
        info = _process.memory_info()
        return info.rss # or info.vms?

# This requires chemfp. See http://chemfp.com/
import chemfp
from chemfp import search
        
# Convert the number of bytes into a more human-readable form.
def human_memory(n):
    if n < 1024:
        return "%d B" % (n,)
    for unit, denom in (("KB", 1024), ("MB", 1024**2),
                         ("GB", 1024**3), ("PB", 1024**4)):
        f = n / float(denom)  # true division (the script runs under Python 2.7)
        if f < 10.0:
            return "%.2f %s" % (f, unit)
        if f < 100.0:
            return "%d %s" % (round(f, 1), unit)
        if f < 1000.0:
            return "%d %s" % (round(f, 0), unit)
    return ">1TB ?!?"

##### Measure the current time and memory use, so I can generate a delta report.
class ProfileStats(object):
    def __init__(self, timestamp, memory_use):
        self.timestamp = timestamp
        self.memory_use = memory_use

def get_profile_stats():
    return ProfileStats(time.time(), get_memory_use())

def get_profile_time():
    return ProfileStats(time.time(), None)

def profile_report(title, start, end):
    dt = end.timestamp - start.timestamp
    if start.memory_use is None or end.memory_use is None:
        # Memory use not available.
        sys.stderr.write("%s time: %.1f sec\n" % (title, dt))
    else:
        delta_memory = end.memory_use - start.memory_use
        memory_str = human_memory(delta_memory)
        sys.stderr.write("%s time: %.1f sec memory: %s\n" % (title, dt, memory_str))

######

# The results of the Taylor-Butina clustering
class ClusterResults(object):
    def __init__(self, true_singletons, false_singletons, clusters):
        self.true_singletons = true_singletons
        self.false_singletons = false_singletons
        self.clusters = clusters

# The clustering implementation
def taylor_butina_cluster(similarity_table):
    # Sort the results so that fingerprints with more hits come
    # first. This is more likely to be a cluster centroid. Break ties
    # arbitrarily by the fingerprint id; since fingerprints are
    # ordered by the number of bits this likely makes larger
    # structures appear first.:

    # Reorder so the centroid with the most hits comes first.  (That's why I do
    # a reverse search.)  Ignore the arbitrariness of breaking ties by
    # fingerprint index

    centroid_table = sorted(((len(indices), i, indices)
                                 for (i,indices) in enumerate(similarity_table.iter_indices())),
                            reverse=True)

    # Apply the leader algorithm to determine the cluster centroids
    # and the singletons:

    # Determine the true/false singletons and the clusters
    true_singletons = []
    false_singletons = []
    clusters = []

    seen = set()
    for (size, fp_idx, members) in centroid_table:
        if fp_idx in seen:
            # Can't use a centroid which is already assigned
            continue
        seen.add(fp_idx)

        if size == 0:
            # The only fingerprint in the exclusion sphere is itself:
            # a true singleton.
            true_singletons.append(fp_idx)
            continue

        # Figure out which ones haven't yet been assigned
        unassigned = set(members) - seen

        if not unassigned:
            false_singletons.append(fp_idx)
            continue

        # this is a new cluster
        clusters.append( (fp_idx, unassigned) )
        seen.update(unassigned)

    # Return the results:
    return ClusterResults(true_singletons, false_singletons, clusters)

def report_cluster_results(cluster_results, arena, outfile):
    true_singletons = cluster_results.true_singletons
    false_singletons = cluster_results.false_singletons
    clusters = cluster_results.clusters

    # Sort the singletons by id.
    print(len(true_singletons), "true singletons", file=outfile)
    print("=>", " ".join(sorted(arena.ids[idx] for idx in true_singletons)),
          file=outfile)
    print(file=outfile)

    print(len(false_singletons), "false singletons", file=outfile)
    print("=>", " ".join(sorted(arena.ids[idx] for idx in false_singletons)),
          file=outfile)
    print(file=outfile)

    # Sort so the cluster with the most compounds comes first,
    # then by alphabetically smallest id
    def cluster_sort_key(cluster):
        centroid_idx, members = cluster
        return -len(members), arena.ids[centroid_idx]

    clusters.sort(key=cluster_sort_key)

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)


#### Command-line driver

p = parser = argparse.ArgumentParser(
    description="An implementation of the Taylor-Butina clustering algorithm using chemfp")
p.add_argument("--threshold", "-t", type=float, default=0.8,
               help="threshold similarity (default: 0.8)")
p.add_argument("--output", "-o", metavar="FILENAME",
               help="output filename (default: stdout)")
p.add_argument("--profile", action="store_true",
               help="report time and memory use")
p.add_argument("--version", action="version",
               version="spam %(prog)s " + __version__ + " (using chemfp " + chemfp.__version__ + ")")
p.add_argument("fingerprint_filename", metavar="FILENAME")

# Turn the --output option into a file object and close function.
def _close_nothing():
    pass

def open_output(parser, filename):
    ## open a file, or use None to use stdout
    if filename is None:
        return sys.stdout, _close_nothing
    try:
        outfile = open(filename, "w")
    except IOError as err:
        parser.error("Cannot open --output file: %s" % (err,))
    return outfile, outfile.close

## 

def main(args=None):
    args = parser.parse_args(args)

    if args.profile and psutil is None:
        sys.stderr.write("WARNING: Must install the 'psutil' module to see memory statistics.\n")

    # Load the fingerprints
    start_stats = get_profile_stats()
    try:
        arena = chemfp.load_fingerprints(args.fingerprint_filename)
    except IOError as err:
        sys.stderr.write("Cannot open fingerprint file: %s" % (err,))
        raise SystemExit(2)

    # Make sure I can generate output before doing the heavy calculations
    outfile, outfile_close = open_output(parser, args.output)

    try:
        load_stats = get_profile_stats()

        # Generate the NxN similarity matrix for the given threshold
        similarity_table = search.threshold_tanimoto_search_symmetric(
            arena, threshold = args.threshold)
        similarity_stats = get_profile_stats()

        # Do the clustering
        cluster_results = taylor_butina_cluster(similarity_table)
        cluster_stats = get_profile_stats()

        # Report the results
        report_cluster_results(cluster_results, arena, outfile)

        # Report the time and memory use.
        if args.profile:
            print("#fingerprints:", len(arena), "#bits/fp:", arena.num_bits, "threshold:", args.threshold,
                  "#matches:", similarity_table.count_all(), file=sys.stderr)
            profile_report("Load", start_stats, load_stats)
            profile_report("Similarity", load_stats, similarity_stats)
            profile_report("Clustering", similarity_stats, cluster_stats)
            profile_report("Total", start_stats, get_profile_time())
    finally:
        outfile_close()

if __name__ == "__main__":
    main()