traverse/CLAUDE.md
2026-02-09 01:39:47 +01:00


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Traverse is a semantic search tool for OpenStreetMap tags. It converts French natural language queries ("où manger", "parking vélo") into OSM tags (`amenity=restaurant`, `amenity=bicycle_parking`).
## Commands
Build search indexes (requires GPU via switcherooctl):
```bash
switcherooctl launch uv run create-index.py
```
Run interactive search:
```bash
switcherooctl launch uv run search.py
```
Run evaluation (98.4% recall on 100 test cases):
```bash
switcherooctl launch uv run test/evaluate.py
```
## Architecture
Two-stage retrieval pipeline with pure, interchangeable functions:
```python
candidates, search_settings, rerank_settings = prepare()
results = search(query, candidates, search_settings) # list[Candidate] → list[Candidate]
reranked = rerank(query, results, rerank_settings) # list[Candidate] → list[Candidate]
```
1. **Embedding Search** (`utils/embedding_search.py`): Uses `intfloat/multilingual-e5-base` with "query:"/"passage:" prefixes. Searches both the POI and attribute FAISS indexes and returns the top candidates.
2. **Cross-Encoder Reranking** (`utils/rerank_with_crossencoder.py`): Uses `Qwen/Qwen3-Reranker-0.6B` (LLM-based yes/no reranker) on CUDA. Splits results into popular (`usage_count` >= 10,000) and niche (below that threshold), then returns the top 5 of each.
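
The popular/niche split in stage 2 can be sketched as below. This is a minimal illustration and not the code in `utils/rerank_with_crossencoder.py`: the function name `split_popular_niche`, the reduced `Candidate` stand-in, and the usage counts and scores in the example are all hypothetical. Only the 10k threshold and the top-5 cut come from the description above.

```python
from dataclasses import dataclass

POPULAR_THRESHOLD = 10_000  # "usage >= 10k" from the stage description
TOP_K = 5                   # "top 5 of each"


@dataclass
class Candidate:
    # Reduced stand-in for the project's Candidate; only the fields used here.
    tag: str
    usage_count: int
    score: float


def split_popular_niche(
    reranked: list[Candidate],
) -> tuple[list[Candidate], list[Candidate]]:
    """Order by score (descending), then keep the top popular and top niche tags."""
    ordered = sorted(reranked, key=lambda c: c.score, reverse=True)
    popular = [c for c in ordered if c.usage_count >= POPULAR_THRESHOLD][:TOP_K]
    niche = [c for c in ordered if c.usage_count < POPULAR_THRESHOLD][:TOP_K]
    return popular, niche


# Made-up usage counts and scores, for illustration only.
example = [
    Candidate("amenity=restaurant", 1_200_000, 0.97),
    Candidate("amenity=food_court", 8_000, 0.81),
    Candidate("amenity=fast_food", 400_000, 0.90),
]
popular, niche = split_popular_niche(example)
print([c.tag for c in popular])  # ['amenity=restaurant', 'amenity=fast_food']
print([c.tag for c in niche])    # ['amenity=food_court']
```

Keeping the two groups separate means a rarely used but relevant tag is not pushed out of the results by high-usage tags.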
### Core Types
`Candidate` dataclass (`utils/types/__init__.py`) — used everywhere:
- `tag`, `description_fr`, `description_natural`, `category`, `usage_count`, `score`
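
For orientation, a dataclass with these fields might look like the sketch below. The field names come from the list above; the types, the default `score`, and the example values are assumptions, so treat `utils/types/__init__.py` as the source of truth.

```python
from dataclasses import dataclass, replace


@dataclass
class Candidate:
    tag: str                  # e.g. "amenity=restaurant"
    description_fr: str       # assumed: French description from the source data
    description_natural: str  # assumed: natural-language French description
    category: str             # assumed: a POI/attribute marker
    usage_count: int          # assumed: integer usage count for the tag
    score: float = 0.0        # assumed default; set by search or rerank


# Illustrative values only, not taken from the project's data files.
base = Candidate(
    tag="amenity=restaurant",
    description_fr="Restaurant",
    description_natural="un endroit où manger",
    category="poi",
    usage_count=1_000_000,
)

# A pure stage can return a rescored copy instead of mutating its input.
rescored = replace(base, score=0.9)
print(base.score, rescored.score)  # 0.0 0.9
```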
### Data Flow
- `data/osm_wiki_tags_cleaned.json`: Source data with OSM tags, French descriptions, and enriched descriptions
- `data/osm_wiki_tags_natural_desc.json`: Natural French descriptions generated by Mistral Large
- `create-index.py`: Generates separate FAISS indexes for POI and attribute categories
- `data/poi.index`, `data/attributes.index`: FAISS vector indexes
- `utils/prepare.py`: Startup functions — `load_candidates()`, `load_search_settings()`, `load_rerank_settings()`, `prepare()`
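
As a rough sketch of the text preparation that could precede index building, the snippet below applies the E5 "passage: " and "query: " prefixes and groups passages by category so that each group can feed its own index. The helper names, the category values (`"poi"`, `"attribute"`), the choice of `description_natural` as the embedded text, and the sample records are assumptions; `create-index.py` may differ.

```python
from collections import defaultdict


def passage_text(description: str) -> str:
    """Index-time text: E5 models expect a 'passage: ' prefix."""
    return f"passage: {description}"


def query_text(query: str) -> str:
    """Query-time text: E5 models expect a 'query: ' prefix."""
    return f"query: {query}"


def group_passages_by_category(records: list[dict]) -> dict[str, list[str]]:
    """Collect prefixed passages per category, one list per future index."""
    groups: dict[str, list[str]] = defaultdict(list)
    for record in records:
        groups[record["category"]].append(passage_text(record["description_natural"]))
    return dict(groups)


# Made-up records for illustration; real entries live in the data/ JSON files.
records = [
    {"tag": "amenity=restaurant", "category": "poi", "description_natural": "un restaurant"},
    {"tag": "wheelchair=yes", "category": "attribute", "description_natural": "accessible en fauteuil roulant"},
]
groups = group_passages_by_category(records)
print(groups["poi"])            # ['passage: un restaurant']
print(query_text("où manger"))  # query: où manger
```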
### Tag Categories
- **POI**: Points of interest (restaurants, shops, etc.)
- **Attributes**: Characteristics (cuisine type, wheelchair access, etc.)
## Key Files
- `utils/types/__init__.py`: `Candidate` dataclass
- `utils/prepare.py`: Data/model loading functions
- `utils/embedding_search.py`: `search()` - embedding search
- `utils/rerank_with_crossencoder.py`: `rerank()` - cross-encoder reranking
- `test/evaluate.py`: Evaluation script with recall/MRR metrics
- `data/search_cases.json`: Test cases for evaluation
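
The recall and MRR metrics can be computed along the lines below. This is a generic sketch of the two metrics, not the logic in `test/evaluate.py`: the function names, the cutoff `k`, and the shape of a test case (a ranked list of tags plus a set of expected tags) are assumptions about how `data/search_cases.json` might be consumed.

```python
def recall_at_k(ranked: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected tags that appear in the top-k results."""
    top = ranked[:k]
    return sum(1 for tag in expected if tag in top) / len(expected)


def reciprocal_rank(ranked: list[str], expected: set[str]) -> float:
    """1 / rank of the first expected tag, or 0.0 if none is returned."""
    for rank, tag in enumerate(ranked, start=1):
        if tag in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical case: the expected tag is returned in second position.
ranked = ["amenity=fast_food", "amenity=restaurant", "amenity=cafe"]
expected = {"amenity=restaurant"}

print(recall_at_k(ranked, expected, k=3))  # 1.0
print(recall_at_k(ranked, expected, k=1))  # 0.0
print(reciprocal_rank(ranked, expected))   # 0.5
```

Averaging `reciprocal_rank` over all cases gives MRR; averaging `recall_at_k` gives the mean recall.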
## Future Work
- REST API (FastAPI): planned
- Automatic POI/attribute detection (tested; heuristics performed best, at 87%)
- Query expansion with an LLM (implemented, but it adds latency without improving recall)