# Test Inputs

Fixtures used by `scripts/smoke_ai_endpoints.py` to drive every AI endpoint with realistic data. Every CSV here is loaded by `scripts/_fixtures.py:load_fixtures()` — the CSVs are the **single source of truth**; no inputs are hardcoded in the script.

## Layout

```
data/
├── screening/                          # Inputs for /v1/screening/* and /v1/criteria/*
│   ├── TestMVP_citations.csv           # 121 papers (the screening corpus)
│   ├── TestMVP_criteria.csv            # 50 inclusion/exclusion criteria
│   ├── TestMVP_questions.csv           # 6 research questions (multi-line)
│   └── TestMVP_project_description.csv # 1 row: project name + description
└── indexing/                           # Inputs for /v1/indexer/*
    ├── TestMVP2_Indexing.csv           # 46 records w/ ground-truth extraction values
    └── TestMVP2_extraction_fields.csv  # 11 IndexerField specs (the schema to extract)
```

## Per-file reference

### `screening/TestMVP_citations.csv` — 121 papers

| Column | Used as | Notes |
|---|---|---|
| `ID` | `papers[*].id` | String identifier (PMID-derived) |
| `Title` | `papers[*].title` | Paper title |
| `Abstract` | `papers[*].abstract` | Full abstract text |
| `Full_Citation` | `papers[*].citation` | Formatted citation string |
| Other columns | unused | Authors, DOI, journal, MeSH, etc. — kept for context, ignored by the loader |

**Project context:** ILL_ONC_1 — comprehensive genomic profiling / NGS in solid cancers.
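
As a rough illustration of the column mapping in the table above, the sketch below renames the four used columns and drops the rest. It does not reproduce the real `load_fixtures()` code: `CITATION_COLUMNS`, `rows_to_papers`, and the sample row are hypothetical names and made-up values.

```python
import csv
import io

# CSV column -> payload key, as listed in the table above.
CITATION_COLUMNS = {
    "ID": "id",
    "Title": "title",
    "Abstract": "abstract",
    "Full_Citation": "citation",
}


def rows_to_papers(rows):
    """Keep only the mapped columns and rename them to payload keys."""
    return [
        {key: row[column] for column, key in CITATION_COLUMNS.items()}
        for row in rows
    ]


# Tiny in-memory stand-in for TestMVP_citations.csv (values are invented).
sample_csv = io.StringIO(
    "ID,Title,Abstract,Full_Citation,DOI\n"
    "123,Example title,Example abstract,Example citation,10.0/x\n"
)
papers = rows_to_papers(csv.DictReader(sample_csv))
print(papers)
```

Extra columns such as `DOI` never reach the payload because only the keys in the mapping are copied.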

### `screening/TestMVP_criteria.csv` — 50 criteria

| Column | Used as | Example |
|---|---|---|
| `criteria_name` | `criteria[*].name` | `"Disease"` |
| `criteria_type` | `criteria[*].type` | `"exclude"` (or `"include"`) |
| `criteria_value` | `criteria[*].value` | `"Other diseases"` |

### `screening/TestMVP_questions.csv` — 6 research questions

| Column | Used as |
|---|---|
| `questions` | `questions[*]` |

Each question spans several lines (embedded `-` bullets inside a quoted field), so `wc -l` reports 25 physical lines, while pandas correctly parses 6 logical rows.
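
The gap between physical lines and logical rows can be reproduced on a small scale. This sketch uses the standard-library `csv` module rather than pandas to stay dependency-free; both treat a newline inside a quoted field as part of the value. The two questions are invented stand-ins, not rows from the real file.

```python
import csv
import io

# Stand-in for TestMVP_questions.csv: two questions, the first spanning three lines.
raw = (
    "questions\n"
    '"What is the clinical utility of CGP?\n'
    "- in first-line treatment\n"
    '- in later lines"\n'
    "What does NGS testing cost?\n"
)

physical_lines = raw.count("\n")  # what `wc -l` would count, header included
rows = list(csv.DictReader(io.StringIO(raw)))
questions = [row["questions"] for row in rows]

print(physical_lines, len(questions))
```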

### `screening/TestMVP_project_description.csv` — 1 project

| Column | Used as |
|---|---|
| `project_name` | `Fixtures.project_name` |
| `project_description` | `Fixtures.project_description` |

### `indexing/TestMVP2_Indexing.csv` — 46 records

| Column | Used as | Notes |
|---|---|---|
| `id` | `records[*].ID` | API expects capitalized keys |
| `title` | `records[*].Title` | |
| `abstract` | `records[*].Abstract` | |
| `country`, `study_size`, `year_coverage`, `gender`, `age`, `publication_type`, `model_type`, `data_source`, `subpopulation`, `age_exact`, `follow_up_years` | ground truth | Pre-populated extraction values — useful for spot-checking live LLM output |

The `country` column is also used standalone as input to `POST /v1/indexer/group-tags` (deduplicated unique values).
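
A minimal sketch of both transformations described in this section: renaming the lowercase columns to the capitalized keys, and collecting unique `country` values. The helper names are hypothetical, and keeping first-seen order and skipping blank cells are assumptions rather than documented loader behaviour.

```python
def to_record(row):
    """Rename the lowercase CSV columns to the capitalized keys the API expects."""
    return {"ID": row["id"], "Title": row["title"], "Abstract": row["abstract"]}


def unique_values(rows, column):
    """Return the non-empty values of `column`, deduplicated, in first-seen order."""
    seen = {}
    for row in rows:
        value = row[column].strip()
        if value:
            seen.setdefault(value, None)  # dicts preserve insertion order
    return list(seen)


# Invented rows standing in for TestMVP2_Indexing.csv.
rows = [
    {"id": "1", "title": "A", "abstract": "a", "country": "Denmark"},
    {"id": "2", "title": "B", "abstract": "b", "country": "United Kingdom"},
    {"id": "3", "title": "C", "abstract": "c", "country": "Denmark"},
    {"id": "4", "title": "D", "abstract": "d", "country": ""},
]
records = [to_record(row) for row in rows]
tags = unique_values(rows, "country")
print(records[0], tags)
```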

### `indexing/TestMVP2_extraction_fields.csv` — 11 indexer field specs

The schema the indexer is asked to extract. Mirrors `api/schemas/indexer.py:IndexerField`.

| Column | Type | Example |
|---|---|---|
| `name` | string | `country` |
| `description` | string | `Country or countries where the study population was recruited or data originates.` |
| `data_type_primary` | enum | `string` (Text), `number` (Number), `array-string` (List of strings) |
| `examples` | **JSON-encoded list** | `["United Kingdom", "United States", "Denmark"]` |
| `examples_mode` | enum | `guide` (suggestions only) or `enum` (strict allowed values) |
| `depth` | enum | `minimal` (value+confidence+evidence) or `full` (+reasoning+normalised_value) |

**Editing tip:** the `examples` cell must remain valid JSON. Open the file in a text editor, or regenerate the cell with `python -c "import json; print(json.dumps([...]))"`, replacing `[...]` with your list. Excel/Google Sheets will preserve it as a quoted string when saving back to CSV.
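
One way to check a cell before committing an edit is to round-trip it through `json`. The helper below is a hypothetical sketch, not part of `scripts/_fixtures.py`; it only verifies that the cell decodes to a JSON list.

```python
import json


def parse_examples(cell):
    """Parse an `examples` cell; raise ValueError if it is not a JSON list."""
    try:
        value = json.loads(cell)
    except json.JSONDecodeError as exc:
        raise ValueError(f"examples is not valid JSON: {exc}") from exc
    if not isinstance(value, list):
        raise ValueError("examples must be a JSON list")
    return value


# Regenerate a cell from a Python list, then confirm it parses back.
cell = json.dumps(["United Kingdom", "United States", "Denmark"])
print(cell)
print(parse_examples(cell))
```

A cell that a spreadsheet has mangled, for example `[United Kingdom, Denmark]` with the inner quotes stripped, raises `ValueError` instead of silently loading.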

## Endpoint → input mapping

| Endpoint | Inputs from |
|---|---|
| `POST /v1/screening/jobs` | `papers` (citations.csv) + `criteria` (criteria.csv) + `questions` (questions.csv) |
| `POST /v1/screening/estimate` | counts only (model + papers/criteria counts) |
| `POST /v1/criteria/picos` · `/refine-context` · `/generate` | `project_description` + `questions` |
| `POST /v1/criteria/analyze-question` | `questions[0]` |
| `POST /v1/criteria/refine` · `/consolidate` | `criteria` + `project_description` |
| `POST /v1/indexer/run` · `/jobs` | `records` (Indexing.csv) + `fields` (extraction_fields.csv) |
| `POST /v1/indexer/refine-fields` · `/suggest-fields` | `fields` + project context |
| `POST /v1/indexer/group-tags` | `field_name="country"` + unique `country` values from Indexing.csv |
| `POST /v1/indexer/estimate` | counts + `fields` |

## Live-pass downsampling

The smoke test's live pass (default `--live-sample 20`) sends only the first 20 papers and the first 20 indexer records to keep cost bounded (~$0.30–$0.80 with `gpt-5-nano`/`gpt-5-mini`). The mock pass still uses the full corpus, since mock mode is free.
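
The downsampling amounts to a prefix slice. The sketch below assumes that reading of "first 20"; `downsample` is a hypothetical helper, and the lists are placeholders sized like the real fixtures (121 papers, 46 records).

```python
def downsample(items, live_sample=20):
    """Return the first `live_sample` items; the input list is left untouched."""
    return items[:live_sample]


# Placeholder fixtures with the same sizes as the real CSVs.
papers = [{"id": str(i)} for i in range(121)]
records = [{"ID": str(i)} for i in range(46)]

live_papers = downsample(papers)    # live pass: bounded cost
live_records = downsample(records)
mock_papers = papers                # mock pass: full corpus

print(len(live_papers), len(live_records), len(mock_papers))
```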