# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Traverse is a semantic search tool for OpenStreetMap tags. It converts French natural language queries ("où manger", "parking vélo") into OSM tags (`amenity=restaurant`, `amenity=bicycle_parking`).

## Commands

Build the search indexes (requires a GPU, selected via `switcherooctl`):
```bash
switcherooctl launch uv run create-index.py
```

Run interactive search:
```bash
switcherooctl launch uv run search.py
```

Run evaluation (98.4% recall on 100 test cases):
```bash
switcherooctl launch uv run test/evaluate.py
```

## Architecture

Two-stage retrieval pipeline with pure, interchangeable functions:

```python
candidates, search_settings, rerank_settings = prepare()
results = search(query, candidates, search_settings)    # list[Candidate] → list[Candidate]
reranked = rerank(query, results, rerank_settings)       # list[Candidate] → list[Candidate]
```
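
Because both stages take `(query, candidates, settings)` and return a `list[Candidate]`, either one can be swapped for a stand-in, for example to exercise the pipeline without a GPU. The following is a minimal sketch under stated assumptions: `Candidate` is reduced to two fields, and `Stage`, `keyword_search`, `identity_rerank`, and `run_pipeline` are hypothetical names, not functions from this repository.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class Candidate:
    # Simplified stand-in; the real dataclass carries more fields.
    tag: str
    score: float = 0.0


# Shape shared by both pipeline stages (assumed from the signatures above).
Stage = Callable[[str, list[Candidate], Any], list[Candidate]]


def keyword_search(query: str, candidates: list[Candidate], settings: Any) -> list[Candidate]:
    """Hypothetical stub: score 1.0 when any query word occurs in the tag."""
    words = query.lower().split()
    scored = [
        replace(c, score=1.0 if any(w in c.tag for w in words) else 0.0)
        for c in candidates
    ]
    # Return a new list, best score first; the input list is left untouched.
    return sorted(scored, key=lambda c: c.score, reverse=True)


def identity_rerank(query: str, results: list[Candidate], settings: Any) -> list[Candidate]:
    """Hypothetical stub: keep the search order unchanged."""
    return list(results)


def run_pipeline(query: str, candidates: list[Candidate], search: Stage, rerank: Stage,
                 search_settings: Any = None, rerank_settings: Any = None) -> list[Candidate]:
    results = search(query, candidates, search_settings)
    return rerank(query, results, rerank_settings)


demo = run_pipeline(
    "restaurant",
    [Candidate("amenity=parking"), Candidate("amenity=restaurant")],
    keyword_search,
    identity_rerank,
)
```

Any function with the same shape can be passed in place of the stubs, which is what keeps the stages interchangeable.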

1. **Embedding Search** (`utils/embedding_search.py`): Embeds text with `intfloat/multilingual-e5-base`, using the `query:` prefix for queries and the `passage:` prefix for indexed descriptions. Searches both the POI and attribute FAISS indexes and returns the top candidates.

2. **Cross-Encoder Reranking** (`utils/rerank_with_crossencoder.py`): Uses `Qwen/Qwen3-Reranker-0.6B`, an LLM-based yes/no reranker, on CUDA. Splits results into popular (`usage_count` >= 10k) and niche tags, and returns the top 5 of each.
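
The popular/niche split in the reranking stage could look roughly like the sketch below. It assumes that a higher `score` is better and that the threshold is applied to `usage_count`; `split_popular_niche` and the constant names are hypothetical, and `Candidate` is reduced to the fields the split needs.

```python
from dataclasses import dataclass

POPULAR_THRESHOLD = 10_000  # "usage >= 10k" from the description above
TOP_K = 5                   # "top 5 of each"


@dataclass
class Candidate:
    # Simplified stand-in with only the fields used here.
    tag: str
    usage_count: int
    score: float


def split_popular_niche(
    reranked: list[Candidate],
    threshold: int = POPULAR_THRESHOLD,
    top_k: int = TOP_K,
) -> tuple[list[Candidate], list[Candidate]]:
    """Return (popular, niche), each sorted by score and cut to top_k."""
    # Assumption: higher score means more relevant.
    ranked = sorted(reranked, key=lambda c: c.score, reverse=True)
    popular = [c for c in ranked if c.usage_count >= threshold][:top_k]
    niche = [c for c in ranked if c.usage_count < threshold][:top_k]
    return popular, niche


# Illustrative values only, not real usage counts or scores.
sample = [
    Candidate("amenity=restaurant", 1_000_000, 0.9),
    Candidate("amenity=fast_food", 400_000, 0.7),
    Candidate("amenity=food_court", 5_000, 0.8),
]
popular, niche = split_popular_niche(sample)
```

Returning the two groups separately keeps rarely used tags visible instead of letting high-usage tags crowd them out of a single top-10 list.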

### Core Types

`Candidate` dataclass (`utils/types/__init__.py`), passed between every pipeline stage, with the fields:
- `tag`, `description_fr`, `description_natural`, `category`, `usage_count`, `score`
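
For orientation, the dataclass might be declared along these lines. Only the field names come from this file; the types, the default for `score`, and the `"poi"`/`"attribute"` category values are assumptions, and the descriptions in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    tag: str                  # OSM tag, e.g. "amenity=restaurant"
    description_fr: str       # French description (type assumed)
    description_natural: str  # natural-language French description (type assumed)
    category: str             # assumed values: "poi" or "attribute"
    usage_count: int          # tag usage count (type assumed)
    score: float = 0.0        # assumed default, filled in by search/rerank


# Illustrative instance; the texts and count are not taken from the data files.
example = Candidate(
    tag="amenity=restaurant",
    description_fr="Restaurant",
    description_natural="Un endroit où manger",
    category="poi",
    usage_count=0,
)
```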

### Data Flow

- `data/osm_wiki_tags_cleaned.json`: Source data with OSM tags, French descriptions, and enriched descriptions
- `data/osm_wiki_tags_natural_desc.json`: Natural French descriptions generated by Mistral Large
- `create-index.py`: Generates separate FAISS indexes for POI and attribute categories
- `data/poi.index`, `data/attributes.index`: FAISS vector indexes
- `utils/prepare.py`: Startup functions — `load_candidates()`, `load_search_settings()`, `load_rerank_settings()`, `prepare()`
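
Before the two indexes are built, the records have to be grouped by category and turned into E5-style passage texts. The sketch below shows that preparation step only, without the embedding or FAISS calls. It assumes each JSON record is a dict with `tag`, `category`, and `description_natural` keys and that `description_natural` is the text being embedded; `passages_by_category` and `format_query` are hypothetical helpers.

```python
from collections import defaultdict

# E5 models expect these prefixes on indexed texts and on queries.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def passages_by_category(records: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Group records by category as (tag, prefixed passage text) pairs."""
    groups: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for record in records:
        # Assumption: the natural description is the text that gets embedded.
        text = PASSAGE_PREFIX + record["description_natural"]
        groups[record["category"]].append((record["tag"], text))
    return dict(groups)


def format_query(query: str) -> str:
    """Prefix a user query the way E5 expects at search time."""
    return QUERY_PREFIX + query.strip()


# Illustrative records; the real JSON layout may differ.
records = [
    {"tag": "amenity=restaurant", "category": "poi",
     "description_natural": "Un endroit où manger"},
    {"tag": "cuisine=*", "category": "attribute",
     "description_natural": "Le type de cuisine servie"},
]
grouped = passages_by_category(records)
query_text = format_query(" où manger ")
```

Each group would then be embedded and written to its own index file, one for POI and one for attributes.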

### Tag Categories

- **POI**: Points of interest (restaurants, shops, etc.)
- **Attributes**: Characteristics (cuisine type, wheelchair access, etc.)

## Key Files

- `utils/types/__init__.py`: `Candidate` dataclass
- `utils/prepare.py`: Data/model loading functions
- `utils/embedding_search.py`: `search()` - embedding search
- `utils/rerank_with_crossencoder.py`: `rerank()` - cross-encoder reranking
- `test/evaluate.py`: Evaluation script with recall/MRR metrics
- `data/search_cases.json`: Test cases for evaluation
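
The recall and MRR metrics reported by the evaluation script can be computed as in the sketch below. It is not the code in `test/evaluate.py`: the case format (a set of expected tags paired with a ranked list of returned tags) and the cutoff `k` are assumptions.

```python
def recall_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """Fraction of expected tags found in the top k results."""
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)


def reciprocal_rank(expected: set[str], ranked: list[str]) -> float:
    """1 / rank of the first expected tag, or 0.0 if none is returned."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in expected:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[tuple[set[str], list[str]]], k: int = 10) -> dict[str, float]:
    """Average both metrics over all cases."""
    if not cases:
        return {"recall": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(expected, ranked, k) for expected, ranked in cases]
    rranks = [reciprocal_rank(expected, ranked) for expected, ranked in cases]
    return {"recall": sum(recalls) / len(cases), "mrr": sum(rranks) / len(cases)}


# Illustrative cases, not taken from data/search_cases.json.
cases = [
    ({"amenity=restaurant"}, ["amenity=cafe", "amenity=restaurant"]),
    ({"amenity=bicycle_parking"}, ["amenity=bicycle_parking"]),
]
metrics = evaluate(cases, k=1)
```

With `k=1`, the first case misses its expected tag and the second finds it, so the two metrics diverge: recall counts only the top result, while MRR still credits the hit at rank 2.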

## Future Work

- REST API (FastAPI) - planned
- Automatic POI/attribute detection (tested; heuristics performed best, at 87%)
- Query expansion with an LLM (implemented, but it adds latency without improving recall)