Results

Overview

We compared three approaches for constructing a high-precision corpus of genetic research articles on non-model organisms using a pre-filtered set of 531 articles. All methods were applied to the same input corpus to ensure a direct and fair comparison. Evaluation focused on precision, reflecting the primary requirement of literature curation workflows, where minimizing false positives is critical due to the substantial downstream costs of manual review.

Performance Comparison

Overall Precision and Retrieval Characteristics

Table 1 summarizes the performance of the three approaches. Custom Biomni achieved the highest precision (90.91%), outperforming Default Biomni (80.00%) and PubTator (60.00%).

Metric	Default Biomni	Custom Biomni	PubTator
Input corpus size	531	531	531
Papers retrieved	75	44	55
Precision	80.00%	90.91%	60.00%
Gene annotation precision	59.72%	71.11%	38.97%
Species annotation precision	63.98%	83.33%	65.55%

A retrieval volume–precision trade-off was observed between the two Biomni variants: Default Biomni retrieved the most papers (75) at 80.00% precision, whereas Custom Biomni retrieved fewer papers (44) at 90.91% precision. PubTator did not follow this pattern; it retrieved an intermediate number of papers (55) but had the lowest precision overall (60.00%).
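The counts behind the precision figures in Table 1 can be recovered from the number of papers retrieved and the reported precision. The following is a minimal sketch of that arithmetic; the names `TABLE_1` and `confusion_counts` are illustrative, and rounding to the nearest integer is an assumption about how the percentages were derived.

```python
# Values copied from Table 1: (papers retrieved, paper-level precision).
TABLE_1 = {
    "Default Biomni": (75, 0.8000),
    "Custom Biomni": (44, 0.9091),
    "PubTator": (55, 0.6000),
}


def confusion_counts(retrieved: int, precision: float) -> tuple[int, int]:
    """Return (true positives, false positives) implied by a precision value."""
    true_pos = round(retrieved * precision)  # assumed rounding to whole papers
    return true_pos, retrieved - true_pos


counts = {name: confusion_counts(*row) for name, row in TABLE_1.items()}
for name, (tp, fp) in counts.items():
    print(f"{name}: {tp} true positives, {fp} false positives")
```

Under this assumption the implied counts are 60 true and 15 false positives for Default Biomni, 40 and 4 for Custom Biomni, and 33 and 22 for PubTator.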

Entity Annotation Performance

Custom Biomni outperformed the other methods in both gene and species annotation precision. Gene annotation precision for PubTator was markedly low (38.97%), well below both Default Biomni (59.72%) and Custom Biomni (71.11%). The gap was much smaller for species annotations, where PubTator (65.55%) performed comparably to Default Biomni (63.98%), although both remained below Custom Biomni (83.33%).

These results support our hypothesis that gene annotation for non-model organisms remains a major limitation of existing automated annotation systems, whereas species recognition is relatively robust.

Error Analysis

False Positives

Analysis of false positives revealed distinct error modes across methods:

PubTator frequently admitted papers with generic or incidental gene mentions, contamination from model organism–centric studies, or non-genetic research falsely flagged by gene annotations.
Default Biomni errors were primarily due to overgeneralization and insufficient contextual discrimination between primary research focus and background mentions.
Custom Biomni produced the fewest false positives, mostly limited to edge cases involving complex multi-organism studies or borderline definitions of genetics research.

The reduction in false positives for Custom Biomni can be attributed to its integration of external validation using NCBI Gene and Taxonomy resources.
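The idea of database-aware validation can be illustrated with a minimal sketch. This is not the Custom Biomni implementation: the in-memory sets below are hypothetical stand-ins for NCBI Gene and NCBI Taxonomy lookups, the gene symbols are placeholders, and the acceptance rule (at least one validated gene in a non-model species) is an assumption made for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for NCBI Taxonomy and NCBI Gene lookups. A real
# pipeline would query those resources; the gene symbols are placeholders.
KNOWN_SPECIES = {"Apis cerana", "Mus musculus"}
MODEL_ORGANISMS = {"Mus musculus", "Drosophila melanogaster"}
KNOWN_GENES = {("Apis cerana", "geneA"), ("Mus musculus", "geneB")}


@dataclass(frozen=True)
class Annotation:
    gene: str
    species: str


def is_validated(ann: Annotation) -> bool:
    """True if the species resolves, is non-model, and the gene is known for it."""
    return (
        ann.species in KNOWN_SPECIES
        and ann.species not in MODEL_ORGANISMS
        and (ann.species, ann.gene) in KNOWN_GENES
    )


def keep_paper(annotations: list[Annotation]) -> bool:
    """Keep a paper only if at least one annotation passes validation."""
    return any(is_validated(ann) for ann in annotations)


# A paper with a validated non-model gene is kept; one whose only gene
# mention belongs to a model organism is rejected.
focal_paper = [Annotation("geneA", "Apis cerana"), Annotation("geneB", "Mus musculus")]
background_paper = [Annotation("geneB", "Mus musculus")]
print(keep_paper(focal_paper), keep_paper(background_paper))
```

A filter of this kind would reject papers whose gene mentions are incidental or tied to model organisms, which is the class of error most often seen for PubTator above.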

Computational Efficiency

Cost and Time Considerations

Both Biomni approaches required modest computational resources. Custom Biomni incurred slightly higher cost and processing time than Default Biomni, reflecting its additional validation steps.

Method	Total time	Total cost	Cost per true positive
Custom Biomni	6.09 h	$6.76	$0.169
Default Biomni	5.28 h	$6.01	$0.100
PubTator	<5 min	$0	$0*

* PubTator requires substantial downstream manual filtering due to lower precision.

Although Custom Biomni is more expensive per true positive ($0.169 vs. $0.100 for Default Biomni), its higher precision reduces the number of false positives that reach manual review, suggesting better overall cost-effectiveness for large-scale literature curation.
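The cost per true positive reported above follows from dividing total cost by the number of correctly retrieved papers. A minimal sketch, assuming the true-positive count is the number of papers retrieved multiplied by the reported precision and rounded to a whole paper (the names `RUNS` and `cost_per_true_positive` are illustrative):

```python
# Total cost from the cost table; retrieval figures from Table 1.
RUNS = {
    "Custom Biomni": {"cost_usd": 6.76, "retrieved": 44, "precision": 0.9091},
    "Default Biomni": {"cost_usd": 6.01, "retrieved": 75, "precision": 0.8000},
}


def cost_per_true_positive(cost_usd: float, retrieved: int, precision: float) -> float:
    """Total cost divided by the implied number of true positives."""
    true_positives = round(retrieved * precision)  # assumed rounding
    return cost_usd / true_positives


cost_per_tp = {name: cost_per_true_positive(**run) for name, run in RUNS.items()}
for name, value in cost_per_tp.items():
    print(f"{name}: ${value:.3f} per true positive")
```

PubTator is omitted because its direct cost is $0; its cost arises in downstream manual filtering, which this calculation does not capture.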

Scalability

Both Biomni methods demonstrated stable performance across the evaluated corpus. Extrapolating the measured runtimes and costs to larger datasets (e.g., 10,000 articles), and assuming both scale linearly with corpus size, suggests feasible runtimes (approximately 4–5 days) and moderate costs (roughly $110–130), supporting practical scalability for large literature collections.
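The extrapolation can be reproduced with a simple proportional estimate. The sketch below assumes that runtime and cost scale linearly with the number of articles and that processing is sequential; neither assumption was tested beyond the 531-article corpus, and the function name `extrapolate` is illustrative.

```python
CORPUS_SIZE = 531      # articles in the evaluated corpus
TARGET_SIZE = 10_000   # hypothetical larger corpus

# Measured (total hours, total cost in USD) from the cost table.
MEASURED = {
    "Custom Biomni": (6.09, 6.76),
    "Default Biomni": (5.28, 6.01),
}


def extrapolate(hours: float, cost_usd: float,
                n_from: int = CORPUS_SIZE, n_to: int = TARGET_SIZE) -> tuple[float, float]:
    """Scale runtime (returned in days) and cost linearly with corpus size."""
    scale = n_to / n_from
    return hours * scale / 24, cost_usd * scale


projected = {name: extrapolate(*row) for name, row in MEASURED.items()}
for name, (days, cost) in projected.items():
    print(f"{name}: about {days:.1f} days, about ${cost:.0f}")
```

Both projected runtimes fall within the 4–5 day range quoted above; parallel execution, if available, would shorten wall-clock time without changing the cost estimate under these assumptions.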

Summary of Key Findings

Custom Biomni achieved the highest precision (90.91%), substantially reducing false positives.
PubTator exhibited particularly weak gene annotation precision (38.97%) for non-model organisms.
Between the Biomni variants, higher precision came with lower retrieval volume (44 vs. 75 papers), indicating a trade-off.
Higher precision is expected to reduce downstream curation cost, because fewer false positives reach manual review.
AI agent–based approaches with database-aware validation offer a promising direction for high-precision literature curation in understudied biological domains.