Abstract:
This thesis presents the development and evaluation of a multi-agent system for retrieving information from a heterogeneous biomedical knowledge graph. The graph was built from BioKG data in TSV format and stored in Neo4j, with multi-valued properties represented as lists. Early attempts at semantic search with embedding-based models (Nomic Atlas v1, BioBERT, and PubMedBERT) ran into several obstacles: the data were highly heterogeneous, repeated terms introduced bias, and the models had limited ability to process structured key-value information. Because of these limitations, a BM25 retriever was adopted first for keyword-based node extraction. It served as a practical starting point, but its dependence on exact keyword matches proved restrictive. To address this shortcoming and improve retrieval accuracy, a layered multi-agent system was built using the LangChain supervisor agent framework, with GPT-4o Mini as the underlying language model. The system comprises several specialized agents: one for query expansion and rewriting (including web search via Tavily), one for initial node retrieval using BM25, and a graph traversal agent that navigates the graph with Cypher queries and generates comprehensive responses. Together, these components form a complete pipeline for querying complex biomedical datasets. The system improves on basic retrieval methods and illustrates the potential of agent-based architectures for exploring large, heterogeneous knowledge sources.
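
To make the keyword-based node retrieval step concrete, the following is a minimal sketch of BM25 scoring over graph nodes whose key-value properties (including list-valued ones) are flattened into text. It is an illustration under stated assumptions, not the thesis implementation: the node records, the property names, the helper functions `tokenize`, `node_to_tokens`, and `bm25_rank`, and the parameter values `k1=1.5` and `b=0.75` are all hypothetical, and the scoring uses a common Okapi BM25 variant written in plain Python.

```python
import math
import re
from collections import Counter

# Hypothetical node records; property names and values are illustrative only.
NODES = [
    {"id": "n1", "label": "Drug", "name": "aspirin",
     "synonyms": ["acetylsalicylic acid", "ASA"]},
    {"id": "n2", "label": "Protein", "name": "cyclooxygenase 1",
     "synonyms": ["COX-1", "PTGS1"]},
    {"id": "n3", "label": "Disease", "name": "migraine",
     "synonyms": ["migraine disorder"]},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def node_to_tokens(node):
    """Flatten a node's properties (scalars and lists) into one token list."""
    parts = []
    for key, value in node.items():
        if key == "id":  # the identifier is not searchable text
            continue
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return tokenize(" ".join(parts))


def bm25_rank(query, nodes, k1=1.5, b=0.75):
    """Return (node id, BM25 score) pairs sorted by descending score."""
    docs = [node_to_tokens(n) for n in nodes]
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: number of nodes containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for node, doc in zip(nodes, docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # BM25 only rewards exact token matches
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((node["id"], score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(bm25_rank("aspirin", NODES))
    print(bm25_rank("headache", NODES))
```

In this toy setup, the query "aspirin" ranks node `n1` first, while "headache" scores zero against every node because no node contains that exact token. This is the exact-match restriction that motivates placing a query expansion and rewriting step before retrieval.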