Tutorial: Automated Research Workflow

📹 Tutorial Videos

Presentation Tutorial

Tutorial Video

How to design, organize, and execute a generic multi-agent system for JAMA-style paper generation.

This guide explains the architecture and operation of the BST236 Midterm Project Workflow—a system designed to turn raw public health data into publication-ready research papers with minimal human intervention.

1. Executive Summary

1.1 Why This Workflow Exists

Research is often slowed by the friction of switching between data analysis, literature review, and manuscript drafting. This workflow reduces that friction by:

Preventing Context Loss: Agents share a unified understanding of the data across all phases.
Enforcing Standards: Built-in "Quality Gates" ensure every output meets JAMA Network Open guidelines.
Scaling Effort: Complex statistical modeling and LaTeX typesetting are automated, allowing the researcher to focus on high-level interpretation.

1.2 The Goal

To move from Raw Data → Publication-Ready PDF + Presentation Materials in under 60 minutes.

2. Getting Started: Installation & Setup

2.1 System Requirements

Our workflow is designed to be cross-platform and has been tested on:

macOS (Primary development environment)
Linux/Ubuntu
Windows (via WSL/WSL2 recommended)

2.2 Software Prerequisites

You will need two core components installed on your system (Python and LaTeX), plus an optional AI assistant:

2.2.1 Python Environment (3.9+)

The orchestrator and analysis scripts require standard scientific computing packages.

# Install required Python packages
pip install pandas numpy scipy statsmodels matplotlib seaborn scikit-learn openpyxl

2.2.2 LaTeX Distribution

This is required for generating the final publication-ready PDF.

macOS: brew install --cask mactex
Ubuntu: sudo apt-get install texlive-full
Windows: Download MiKTeX

2.2.3 AI Assistant (Optional)

If you wish to use the multi-agent system via a CLI assistant, we recommend the GitHub Copilot CLI:

gh auth login
gh extension install github/gh-copilot

2.3 Repository Setup

Clone the Workflow:

git clone https://github.com/your-squad/midterm-project.git
cd midterm-project

Verify Installation:

Run a quick sanity check to ensure all dependencies are met:

python -c "import pandas, statsmodels, matplotlib; print('✅ Python OK')"
pdflatex --version | head -n 1

Important

PATH Issues: Ensure that pdflatex and bibtex are in your system's PATH. On macOS, you may need to add /Library/TeX/texbin to your path manually.
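
The one-line check above imports only three of the installed packages. A slightly fuller check might look like the sketch below. The module list is taken from the pip command in section 2.2.1 (scikit-learn is imported as `sklearn`); the function name `missing_dependencies` is a hypothetical helper, not part of the workflow.

```python
import importlib.util
import shutil

# Import names for the packages installed in section 2.2.1.
REQUIRED_MODULES = ["pandas", "numpy", "scipy", "statsmodels",
                    "matplotlib", "seaborn", "sklearn", "openpyxl"]
# Command-line tools that must be reachable through PATH.
REQUIRED_TOOLS = ["pdflatex", "bibtex"]


def missing_dependencies(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return (missing_modules, missing_tools) without importing the modules."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = missing_dependencies()
    print("Missing Python packages:", mods or "none")
    print("Missing tools on PATH:", tools or "none")
```

If `pdflatex` or `bibtex` shows up as missing on macOS, the PATH note above is the first thing to check.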

3. Quick Start

3.1 Method 1: The Python Orchestrator (Script Mode)

The fastest way to execute the workflow is via the Python orchestrator. This is recommended for the "90-minute exam" scenario where stability is key. Place your datasets in a folder (e.g., exam_data/) and run:

# Run the complete end-to-end workflow
python workflow/orchestrator.py exam_data/ -o exam_paper/
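
This guide does not show how orchestrator.py reads its arguments. As an illustration only, a command line of this shape (one positional data folder plus `-o` for the output folder) could be parsed as in the sketch below. The long option `--output` and the default value are assumptions, not confirmed features of the script.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a command line shaped like: orchestrator.py DATA_DIR -o OUTPUT_DIR."""
    parser = argparse.ArgumentParser(
        description="Run the end-to-end paper workflow (illustrative sketch)."
    )
    parser.add_argument("data_dir", type=Path,
                        help="folder containing the input datasets")
    # "--output" and the default are assumed; the guide only shows "-o".
    parser.add_argument("-o", "--output", type=Path, default=Path("exam_paper"),
                        help="folder that receives the generated paper")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["exam_data/", "-o", "exam_paper/"])
    print(args.data_dir, args.output)
```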

3.2 Method 2: The AI CLI (Agent Mode)

For interactive control or fine-tuning, you can use an AI CLI (like Gemini CLI or Claude Code) to trigger the agents directly. This allows the AI to "think" through each step and handle complex edge cases that a static script might miss.

The "Power Prompt":

Simply point the AI to the workflow/ folder and issue a high-level directive:

"Use the multi-agent workflow system in workflow/ to generate a JAMA Network Open paper 
from the data in exam_data/. 

1. Follow the Agent definitions in workflow/agents/
2. Use the Skills provided in workflow/skills/
3. Adhere to the Quality Standards in workflow/prompts/

Deliver the final PDF to the exam_paper/ directory."

Pro Tip: When using Agent Mode, you can pause between phases to review the research_question.md or analysis_plan.md before the AI proceeds to the statistical modeling phase.

Verification: After the workflow finishes, check exam_paper/paper.pdf. If citations or references appear as "??", they were not resolved, usually because the BibTeX phase encountered an error; check exam_paper/paper.blg (the BibTeX log) and exam_paper/paper.log (the LaTeX log) for details.
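
Unresolved citations also appear in the LaTeX log as warnings of the form "Citation `key' on page N undefined". The sketch below pulls the affected keys out of the log text. The helper name `undefined_citations` is hypothetical, and because LaTeX wraps long log lines, very long keys may be missed.

```python
import re

# Matches the standard LaTeX (and natbib) warning for an unresolved citation key.
_UNDEFINED_CITATION = re.compile(r"Citation `([^']+)' on page \d+ undefined")


def undefined_citations(log_text):
    """Return the sorted, de-duplicated citation keys LaTeX could not resolve."""
    return sorted(set(_UNDEFINED_CITATION.findall(log_text)))
```

A typical use would be `undefined_citations(Path("exam_paper/paper.log").read_text(errors="replace"))`; an empty list means every citation was resolved.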

4. System Architecture

4.1 The 3-Layer Design Pattern

Our workflow is built on a modular "3-layer cake" architecture, ensuring that domain expertise is separated from technical implementation.

Agent Layer (The Brains)
Skills Layer (The Hands)
Prompts Layer (The Rulebook)

4.1.1 Layer 1: The Agent Layer (`workflow/agents/`)

Specialized AI entities with specific "personalities" and goals.

orchestrator.agent.md: The "Project Manager" who handles phase transitions.
data-explorer.agent.md: The "Detective" who finds the research question.
statistician.agent.md: The "Math Expert" who runs regressions.
visualizer.agent.md: The "Designer" who creates JAMA-compliant figures.
literature-reviewer.agent.md: The "Librarian" who manages references.
paper-writer.agent.md: The "Author" who synthesizes everything into LaTeX.
quality-controller.agent.md: The "Critic" who reviews the final product.
post-production.agent.md: The "Marketer" who builds slides and websites.
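
Because every agent lives in a file named `<name>.agent.md`, the available agents can be discovered from the folder itself. The sketch below shows one way to do that; `list_agents` is a hypothetical helper and is not part of the workflow.

```python
from pathlib import Path


def list_agents(agents_dir):
    """Map agent names to definition files, e.g. 'statistician' -> Path(...)."""
    suffix = ".agent.md"
    return {
        path.name[: -len(suffix)]: path
        for path in sorted(Path(agents_dir).glob(f"*{suffix}"))
    }
```

Calling `list_agents("workflow/agents")` on the layout described above would yield eight entries, one per agent file.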

4.1.2 Layer 2: The Skills Layer (`workflow/skills/`)

Atomic, reusable code libraries that agents "invoke" to perform work.

data-loading.skill.md: Robust parsing for CSV/Excel/Markdown.
exploratory-analysis.skill.md: Automatic profiling and EDA.
statistical-modeling.skill.md: DiD, Logistic Regression, and Inference code.
plotting.skill.md: Matplotlib/Seaborn templates for high-DPI figures.
latex-generation.skill.md: LaTeX boilerplate and table formatting.
reference-management.skill.md: BibTeX generation and formatting.
paper-compilation.skill.md: Logic for the multi-pass PDF compilation.
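
The guide does not spell out the passes used by paper-compilation.skill.md. The conventional sequence for a BibTeX-based document is pdflatex, bibtex, then pdflatex twice more so that citations and cross-references settle. The sketch below assumes that sequence; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def compilation_commands(tex_file):
    """Return the conventional pdflatex/bibtex command sequence for tex_file."""
    stem = Path(tex_file).stem
    latex = ["pdflatex", "-interaction=nonstopmode", f"{stem}.tex"]
    # Pass 1 writes the .aux file, bibtex builds the .bbl file,
    # passes 2 and 3 resolve citations and cross-references.
    return [list(latex), ["bibtex", stem], list(latex), list(latex)]


def compile_paper(tex_file):
    """Run every pass inside the paper's folder, stopping at the first failure."""
    workdir = Path(tex_file).resolve().parent
    for command in compilation_commands(tex_file):
        subprocess.run(command, cwd=workdir, check=True)
```

With this sketch, `compile_paper("exam_paper/paper.tex")` raises `subprocess.CalledProcessError` as soon as one pass fails, which points directly at the step to inspect.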

4.1.3 Layer 3: The Prompts Layer (`workflow/prompts/`)

The "Rulebook" that defines quality standards and formatting requirements.

research-question-prompt.md: The FINER criteria (Feasible, Interesting, Novel, Ethical, Relevant).
analysis-plan-prompt.md: Statistical rigor standards.
visualization-guidelines.md: JAMA figure styling (colorblind-friendly, vector format).
writing-guidelines.md: The academic voice and section constraints.
review-checklist.md: The 20-point rubric for final approval.

5. The 7-Phase Execution Loop

The orchestrator.py script executes the following sequence:

Exploration → Analysis → Visualization → Literature → Writing → Review → Production

Phase	Script	Agent	Output
1. Exploration	`01_data_explorer.py`	Data Explorer	`research_question.md`, `analysis_plan.md`
2. Analysis	`02_statistician.py`	Statistician	`results/`, `analysis_code.py`
3. Visualization	`03_visualizer.py`	Visualizer	`figures/*.pdf`, `tables/*.tex`
4. Literature	`04_literature_reviewer.py`	Literature Reviewer	`references.bib`, `citations.md`
5. Writing	`05_paper_writer.py`	Paper Writer	`paper.tex` (Complete JAMA Draft)
6. Review	`06_quality_controller.py`	Quality Controller	`review_report.md`, Revised `.tex`
7. Production	`07_post_production.py`	Post-Production	`slides/`, `website/`, `social/`
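
The sequence in the table can be pictured as a simple loop over the phase scripts. The sketch below is not the actual orchestrator.py. It assumes each script takes the data folder and the output folder as its two arguments, and that the scripts live in workflow/scripts/ as listed in the file-structure appendix. The `start` and `stop` parameters are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names copied from the phase table.
PHASE_SCRIPTS = [
    "01_data_explorer.py",
    "02_statistician.py",
    "03_visualizer.py",
    "04_literature_reviewer.py",
    "05_paper_writer.py",
    "06_quality_controller.py",
    "07_post_production.py",
]


def phase_commands(data_dir, output_dir, scripts_dir="workflow/scripts",
                   start=1, stop=7):
    """Build one command per phase for phases start..stop (1-based, inclusive)."""
    selected = PHASE_SCRIPTS[start - 1:stop]
    return [
        [sys.executable, str(Path(scripts_dir) / script), str(data_dir), str(output_dir)]
        for script in selected
    ]


def run_phases(data_dir, output_dir, **kwargs):
    """Run the selected phases in order, stopping at the first failure."""
    for command in phase_commands(data_dir, output_dir, **kwargs):
        subprocess.run(command, check=True)
```

Stopping at the first failed phase keeps later phases from consuming incomplete outputs, since each phase reads what the previous one wrote.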

6. Operational Patterns

6.1 Pattern: The Adversarial Critic Loop

The system doesn't just "finish" at Phase 5. Phase 6 invokes the Quality Controller Agent to audit the paper against review-checklist.md. If the score falls below the "95% Human-Quality" threshold, the Paper Writer is reactivated with specific correction instructions.
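
The control flow of this loop can be sketched as follows. This is a minimal illustration, not the workflow's implementation. The `review` and `revise` callables stand in for the Quality Controller and Paper Writer, the score is assumed to be a fraction between 0 and 1, and the `max_revisions` cap is an added assumption to guarantee termination.

```python
def critic_loop(draft, review, revise, threshold=0.95, max_revisions=3):
    """Revise the draft until its review score reaches the threshold.

    review(draft) -> (score, issues); revise(draft, issues) -> new draft.
    Returns (final_draft, final_score, revisions_made).
    """
    score, issues = review(draft)
    revisions = 0
    while score < threshold and revisions < max_revisions:
        draft = revise(draft, issues)   # writer fixes the reported issues
        revisions += 1
        score, issues = review(draft)   # critic audits the new draft
    return draft, score, revisions
```

If the score were computed as the fraction of the 20 checklist points that pass, a 95% threshold would correspond to 19 of 20; how the workflow actually computes the score is not stated here.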

6.2 Pattern: Contractor Mode (Manual Intervention)

While orchestrator.py handles the "Big Picture," users can use the CLI to drop into a specific phase.

# Example: regenerate only the figures (e.g., after the data changed)
python workflow/scripts/03_visualizer.py exam_data/ exam_paper/

6.3 Pattern: Post-Paper Products

Once the paper is finalized, the Post-Production Agent automatically converts the LaTeX content into:

Beamer Slides: A 15-slide summary of the findings.
Interactive Website: A Plotly-based dashboard for exploring the results.
Social Media Kit: Summaries for LinkedIn and Twitter.

Pro Tip: Use the interactive website to verify that the numbers in the "Results" section match the raw data visualization. It's a great sanity check!

7. Reference Material

7.1 Appendix: File Structure

/
├── workflow/
│   ├── agents/    # Specialized AI instructions
│   ├── skills/    # Reusable code libraries (Python/LaTeX)
│   ├── prompts/   # Quality standards & checklists
│   └── scripts/   # Phase-specific execution logic (01-07)
├── exam_paper/    # The generated output folder
│   ├── paper.pdf  # Main Deliverable
│   ├── figures/   # Vector PDFs
│   └── post_paper_products/ # Slides & Website
└── sample/        # Testing datasets

7.2 Troubleshooting Guide

Issue	Potential Cause	Recommended Fix
LaTeX Compilation Error	Missing figure or unescaped special character ($ or %)	Check `paper.log` and verify all figure paths in `.tex`.
Phase 2 Error	Dataset too large or non-numeric data	Check `data_summary.md` for variable type mismatches.
Agent Refusals	Over-sensitive safety filters	Ensure your instructions are strictly scientific/medical.
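
For the special-character case in the first row, text taken from data (variable names, labels, free-text fields) can be escaped before it is written into the .tex file. The sketch below shows one common approach; `escape_latex` is a hypothetical helper and may differ from whatever latex-generation.skill.md does.

```python
# LaTeX special characters and their escaped forms for ordinary text.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one character at a time.

    Working per character avoids double-escaping the backslashes
    that the replacements themselves introduce.
    """
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)
```

This is meant for plain text only; strings that already contain intentional LaTeX markup or math should not be passed through it.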

Security First: Never store API keys or database credentials in the workflow/ folder. Use environment variables for all sensitive configuration.
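
One way to follow this advice in a script is to read each secret from the environment and fail early when it is absent. The sketch below illustrates the idea; the helper `require_env` and the variable name `WORKFLOW_API_KEY` are hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example (hypothetical variable name):
# api_key = require_env("WORKFLOW_API_KEY")
```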
