Abstract:
Chemical information science faces a major bottleneck: millions of chemical structures are trapped in visual formats throughout the scientific literature and patents, making them inaccessible to automated analysis and large-scale data mining. Traditional optical chemical structure recognition (OCSR) methods rely on rule-based approaches that show limited robustness to the diversity of real-world literature, while current deep learning approaches require large-scale computational resources and therefore remain impractical for broad deployment. This research addresses these limitations by developing an integrated three-phase deep learning pipeline that (1) detects atoms and bonds with a Faster R-CNN using a ResNet-50 backbone and Feature Pyramid Network architecture, adapted for chemical structure detection and handling diverse molecular configurations across 15 chemical elements and 4 bond types (19 classes total); (2) applies spatial connectivity analysis with K-D tree algorithms to generate adjacency and bond-order matrices for molecular graph representation; and (3) performs multi-strategy SMILES generation with progressive RDKit sanitization, fragment linking, and domain-aware validation. Key technical innovations include chemistry-aware anchor generation, class-specific confidence thresholds, a focal loss implementation, and training strategies that address severe class imbalance. The system shows strong performance in a comprehensive evaluation on 14,997 test images: 612,371 total detections (99.7% detection rate, 40.83 detections per image), 99.2% successful molecular graph conversion, 98.1% correct bond connectivity, and 41.2% valid SMILES generation. Training for 25 epochs on the full 100K dataset converged to a loss of 0.8877. The system achieves an mAP of 74.9%, and 88.1% of the successfully generated molecules receive high-quality scores (80) on the comprehensive verification metrics.
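The second phase (detections to molecular graph) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each detected bond box is attached to the two atom centers nearest to the bond's center within a search radius, and that bond order is encoded as an integer. The function names, the `radius` parameter, and the toy coordinates are hypothetical, and a small pure-Python K-D tree stands in for whatever K-D tree library the pipeline actually uses.

```python
import math

def build_kdtree(points, depth=0):
    """Build a 2-D k-d tree over (index, (x, y)) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def radius_query(node, center, radius, depth=0, found=None):
    """Collect (distance, index) for every point within `radius` of `center`."""
    if found is None:
        found = []
    if node is None:
        return found
    (idx, pt), left, right = node
    axis = depth % 2
    d = math.dist(pt, center)
    if d <= radius:
        found.append((d, idx))
    diff = center[axis] - pt[axis]
    # Always search the near side; search the far side only if the
    # splitting plane lies within the query radius.
    near, far = (left, right) if diff < 0 else (right, left)
    radius_query(near, center, radius, depth + 1, found)
    if abs(diff) <= radius:
        radius_query(far, center, radius, depth + 1, found)
    return found

def detections_to_graph(atoms, bonds, radius):
    """atoms: [(element, (x, y))]; bonds: [(order, (x, y))] box centers.

    Assumption: a bond connects the two atoms closest to its center.
    """
    n = len(atoms)
    tree = build_kdtree([(i, xy) for i, (_, xy) in enumerate(atoms)])
    adjacency = [[0] * n for _ in range(n)]
    bond_order = [[0] * n for _ in range(n)]
    for order, center in bonds:
        hits = sorted(radius_query(tree, center, radius))
        if len(hits) < 2:
            continue  # bond with fewer than two nearby atoms is skipped
        (_, i), (_, j) = hits[:2]
        adjacency[i][j] = adjacency[j][i] = 1
        bond_order[i][j] = bond_order[j][i] = order
    return adjacency, bond_order

# Toy input: a C-C=O chain laid out on a horizontal line.
atoms = [("C", (0.0, 0.0)), ("C", (10.0, 0.0)), ("O", (20.0, 0.0))]
bonds = [(1, (5.0, 0.0)), (2, (15.0, 0.0))]
adjacency, bond_order = detections_to_graph(atoms, bonds, radius=8.0)
```

For this toy input, `adjacency` links atom 0 to atom 1 and atom 1 to atom 2, and `bond_order` records the second link as a double bond. A real pipeline would also need to handle overlapping boxes, ties, and bonds with no atoms in range, which this sketch simply skips.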
The framework is optimized for standard computational infrastructure with efficient memory use