Methods
Overview
This study compares three approaches for constructing a high-precision corpus of genetic research articles on non-model organisms: (1) PubTator-based annotation filtering, (2) a Biomni AI agent with default configuration, and (3) a Biomni AI agent with custom enhancements. We evaluated these methods on a corpus of open-access journal articles published over a three-day period, assessing their precision in identifying relevant literature.
Operational Definitions
Model vs. Non-Model Organisms
Model Organism Definition: We operationally defined model organisms as the top 20 most frequently studied species in genetic research, based on annotation frequency in genome editing meta-database (https://doi.org/10.1016/j.ggedit.2022.100024).
Data Source: Genome editing meta-database (data/20251008_ge_metadata_all.csv) (downloaded from https://github.com/szktkyk/gem_api)
Selection Criteria:
- Analyzed species frequency in studies using genome editing tools since 2000
- Top 20 species classified as "model organisms"
- All other species classified as "non-model organisms"
- Implementation: scripts/select_model_species.py
Coverage Analysis:
- The top 20 model organisms accounted for 92.21% of all papers in the database
- Results documented in config/top20_organisms_with_taxid.csv
- Visualization available in figures/model_species/top20_organisms_bar_chart.png
- coverage statistics available in figures/model_species/coverage_statistics.txt
Reference Sources: - NIH Model Organisms FAQ: https://public.csr.nih.gov/FAQs/ReviewersFAQs/ModelOrganisms - Howe et al. (2017). The Model Organism as a System: Integrating 'Omics' Data Sets. BMC Biology, 15:45. https://doi.org/10.1186/s12915-017-0391-5
Genetics Research Classification
Scope Definition: The scope of this study was operationally defined to focus on genetic research conducted within non-model organisms. Given the lack of a widely accepted definition of “non-model organism genetics research” in the literature, we adopted a pragmatic scope tailored to the objectives of this analysis. Specifically, we excluded studies in which non-model organisms were primarily investigated in the context of their effects on model organism genomes or biology (e.g., bacterial effects on human genomes), as such studies are centered on model organisms rather than on the genetics of non-model species themselves.
*Classification Framework: In the absence of standardized taxonomies for classifying genetic research in non-model organisms, we developed a six-category classification framework to capture the major types of genetic studies observed in the literature.
The six genetics research categories used in this study are as follows:
- Genomic Sequencing & Identification: Genome sequencing and novel gene identification
- Comparative Analysis & Annotation: Ortholog/homolog analysis for functional inference through sequence homology
- Gene Expression Profiling: Expression analysis across conditions, tissues, or developmental stages
- Phylogenetic and Evolutionary Analysis: Gene family expansion/contraction and evolutionary adaptation signatures
- Functional Validation & Bioengineering: Experimental validation using techniques such as CRISPR or RNAi
- Methodological Development & Diagnostics: Development of genetic diagnostic techniques and methodological advances
Inclusion Criteria: Papers satisfying at least one of the above six categories were classified as genetics research.
Note: This classification system represents an initial framework. More refined categorization may be possible upon completion of comprehensive literature annotation.
Data Collection
Literature Corpus
Search Query: all[filter] AND pubmed pmc open access[filter] AND journal article[pt]
Date Range: December 1-3, 2024 (3-day window)
Initial Corpus Size: 4,959 articles
Limitation: The narrow 3-day sampling window represents a methodological limitation. Ideally, a 1-month window with random sampling would provide more representative coverage. See LIMITATIONS.md for detailed discussion.
Workflow for AI Agent Input Selection
Implementation: wf_pre_agent.py
Configuration Parameters:
config = WorkflowConfig(
date_start="20241201",
date_end="20241203",
days_per_chunk_step1=1,
chunk_size_step2=900,
chunk_size_step3=400,
max_retries=3,
retry_delay=10,
model_species_config="config/top20_organisms_with_taxid.csv",
output_dir="data/pre_agent"
)
Filtering Criteria: - Used PubTator annotations on titles and abstracts - Selected papers where the most frequently mentioned species was a non-model organism - Filtered Corpus Size: 531 papers
Quality Control:
- Manual validation of main() function
- Unit tests implemented in test_wf_pre_agent.py
- Venn diagram visualization generated using create_venn_diagram.py (output: figures/wf_filtering/)
Workflow for PubTator Baseline
Implementation: filter_pubtator_annotations.py
Input: 531-paper corpus from AI agent workflow
Filtering Criteria: 1. Retrieved all PubTator annotations (species and genes) from titles and abstracts 2. Excluded papers with any model organism annotations 3. Excluded papers lacking both species and gene annotations
Filtered Corpus Size: 55 papers (10.4% of 531-paper corpus)
Stability Check: Replication in November 2024 yielded identical 55-paper result, confirming reproducibility.
Experimental Setup
PubTator Baseline
Approach: Rule-based filtering using existing PubTator annotations
Rationale: Establishes baseline performance of current state-of-the-art automated annotation systems
Process: 1. Extract species and gene annotations from PubTator API 2. Apply filtering rules (exclude model organisms, require gene annotations) 3. No manual intervention or AI-based decision making
Default Biomni Agent
Framework: Biomni (https://biomni.stanford.edu/) with CodeAct architecture
Model: GPT-4.1-mini (selected over GPT-5-mini due to 20-40 second latency reduction)
Tools Available: Standard Biomni toolkit for literature analysis and database queries
Limitations of Batch Processing: - Initial attempts to process all 531 papers in a single agent session failed - CodeAct architecture generated loop-based code that applied uniform processing to all PMIDs - This prevented the desired "organic processing" (identify gene → query database → validate → compare with species → query alternative candidates) - Solution: Implemented paper-by-paper processing approach
Custom Biomni Agent
Enhancements over Default Biomni:
1. MCP Server Integration
NCBI Gene Database Access (Enhanced Entity Linking): - Query NCBI Gene database for gene information - Retrieve gene names, IDs, and synonyms - Validate consistency by cross-referencing gene name with species taxonomy ID
NCBI Taxonomy Database Access (Enhanced Entity Linking): - Query NCBI Taxonomy database for species information - Retrieve taxonomy IDs, scientific names, and taxonomic classifications (class level) - Enable precise species identification and validation
Structured Data Generation (Robust Data Capture): - Incrementally append validated annotations to structured output files - Ensure data persistence in case of agent interruption - Enable resumable processing for large batches
2. Iterative Development and Validation
Conducted six rounds of agent behavior validation and prompt refinement:
- [Trial 1-6 documentation links preserved for internal reference]
- Final prompt design and paper-level processing approach determined in Trial 6
- Implementation: run_agent_20251030.py
3. Batch Processing Architecture
Challenge: Memory accumulation from repeated agent.go() calls
- Accumulated message history and MCP connections
- All log_messages retained in memory
- Memory exhaustion (OOM Killer) after ~76 papers
Solution: Subprocess-based batch processing
- Implementation: run_agent_batch_subprocess.py (orchestrator) + run_agent_worker.py (worker)
- Process papers in small batches with agent restart between batches
- Achieved stable processing of 419 papers without interruption
Performance and Cost Analysis
Processing Volume: 419 papers (manual termination; system remained stable)
Results: 62 papers identified as non-model organism genetics research (14.8% hit rate)
Computational Cost (300-paper subset): - Total Cost: $3.76 - Total Processing Time: 11,182.65 seconds (3.11 hours) - Average Time per Paper: 37.3 seconds - Average Cost per Paper: $0.0125
Evaluation Metrics
Precision
Definition: $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$
Rationale for Focus on Precision: - For this project, minimizing false positives (irrelevant papers incorrectly classified as relevant) has been mostly paid attention - High precision ensures curated sets contain primarily relevant papers - Recall (sensitivity) and F1-score evaluation deferred to future work
Ground Truth Establishment: Manual curation of sampled papers to establish true positive classifications
Evaluation Scope: This study reports precision only. Comprehensive evaluation including accuracy, recall, and F1-score remains for future investigation (see LIMITATIONS.md).
Implementation Details
Programming Language: Python 3.11.14
Key Dependencies: - Biomni framework - OpenAI API (GPT-4.1-mini) - NCBI E-utilities - PubTator API
Code Availability:
- Workflow implementations: wf_pre_agent.py, filter_pubtator_annotations.py
- Agent implementations: run_agent_batch_subprocess.py, run_agent_worker.py
- Analysis scripts: scripts/select_model_species.py, scripts/check_species_match.py
- Configuration files: config/top20_organisms_with_taxid.csv, config/mcp_config.yaml
Docker Environment:
- Dockerfile: setup/Dockerfile
- Environment specifications: setup/environment.yml, setup/customized_bio_env.yml
Reproducibility: All code, configurations, and processed data are available in this repository to enable full reproduction of results.
Default Biomni Configuration: In this study, Default Biomni refers to a configuration in which a subset of the standard Biomni tools was manually selected based on their relevance to the target task. This setting reflects a realistic, task-oriented use of Biomni rather than an unconstrained or fully enabled tool suite, and was intended to provide a fair and practical baseline for comparison.