Date: 2021-09-22

Machine learning has been applied to many problems in cheminformatics and the life sciences, for example investigating molecular properties and developing new drugs. One critical step in the problem-solving pipeline for these applications is selecting a molecular representation that featurizes the target dataset and serves the downstream model. Figure 1 shows a conceptual framework for different molecular representations. Usually, a molecule is represented either in linear form, as a SMILES string, or in graph form, as an adjacency matrix, optionally together with a node attribute matrix for atoms and an edge attribute matrix for bonds. A SMILES string can be further converted into other formats such as a molecular fingerprint, a one-hot encoding, or a word embedding. The graph form, on the other hand, can be used directly by the downstream model or converted into a graph embedding for the task.
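
To make the two families of representation concrete, here is a minimal sketch for a toy molecule, ethanol (SMILES `CCO`). It is an illustration only, not the method of the article: the atom list and bond list are written by hand (a real pipeline would derive them with a cheminformatics toolkit), the helper names `one_hot` and `adjacency_matrix` are hypothetical, and the character-level tokenization is a simplification that would mishandle two-letter elements such as `Cl`.

```python
# Toy molecule: ethanol. The atoms and bonds below are hand-written
# assumptions for illustration; they are not parsed from the SMILES string.
smiles = "CCO"
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices joined by a bond


def one_hot(tokens, vocab):
    """Return one row per token, with a single 1 at the token's vocab index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [[1 if index[tok] == j else 0 for j in range(len(vocab))]
            for tok in tokens]


def adjacency_matrix(n_atoms, bonds):
    """Return a symmetric 0/1 adjacency matrix for an undirected graph."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


vocab = sorted(set(smiles))  # ['C', 'O']

# Linear form: character-level one-hot encoding of the SMILES string.
smiles_one_hot = one_hot(smiles, vocab)

# Graph form: adjacency matrix plus a node attribute matrix
# (here the only node attribute is the one-hot element type).
graph_adj = adjacency_matrix(len(atoms), bonds)
node_attrs = one_hot(atoms, vocab)
```

For this small molecule the SMILES one-hot matrix and the node attribute matrix happen to coincide, because every character in `CCO` is an atom. In general they differ: SMILES strings also contain bond, branch, and ring-closure symbols, while the node attribute matrix has exactly one row per atom.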
