Abstract:
The rapid growth of unstructured biomedical literature has made it increasingly difficult to keep knowledge graphs up to date, yet these graphs are essential tools for advancing scientific research. Manually curating such data is not only time-consuming but also unsustainable at scale. In this thesis, I address this problem by developing an automated pipeline that uses Large Language Models (LLMs) to expand biomedical knowledge graphs more efficiently. The approach begins by constructing a base version of the biomedical knowledge graph (BioKG) from structured TSV files. This base is then enriched with information drawn from unstructured text: the pipeline ingests bulk annotations from PubTator, filters them to retain human-specific content, and stores the results in MongoDB for further processing. A key element of the system is an LLM-based extraction module, powered by GPT-4o-mini, which uses carefully crafted prompts aligned with the BioKG schema to identify and extract new entities and relationships. To maintain consistency and avoid redundancy, a validation and ingestion module integrates the extracted data into a Neo4j graph database, ensuring the final graph remains accurate and cohesive. The results show a clear increase in both the number of entities and the number of relationships in the graph, and an interactive visualization tool highlights the impact of the updates, providing qualitative insight into the improvements. Overall, this work offers a practical, scalable framework for continuously updating biomedical knowledge graphs, an important step toward making them more useful for research and healthcare.