Abstract:
We aim to leverage LLMs in drug discovery to generate new molecules from descriptions of their properties. We build our dataset from ZINC-250K and select several LLMs and state-of-the-art (SOTA) models, running baseline inference on the vanilla LLMs. The models are evaluated using three types of loss: token-level loss, structural loss, and property-level loss. Evaluation metrics such as validity, fragment similarity, and scaffold similarity are carefully studied and chosen for this task, and are then used to evaluate the LLMs. We observe that the LLMs do not perform well at preserving the structure of the molecule and fail to generate syntactically valid notations. They perform reasonably well at generating molecules that match the specified properties.