# OCR Cleanup: Ottoman/Turkish Transliteration PDFs

Text extracted with PyMuPDF from religious or classical Ottoman-Turkish PDFs (such as evrâd, ilahis, dualar) contains recurring artifacts that have to be removed before the text is usable.

## Common Artifacts
- **Page number lines**: `- 321 -`, `- 322 -`
- **Decorative dot rows**: `• •• •• •• •• ••` or `.. .. .. .. ..`
- **Vertical separator artifacts**: isolated `I`, `l`, `i`, `1`, `t`, `f`, `¢`, `:`, `;`, `*`, `·`, `܀`, `܂` characters on a line of their own (OCR misreads of ornamental borders)
- **Folio markers**: `F. 21`, `F: 22`
- **Leading/trailing dot clusters**: `....`, `......`

## Cleanup Script Pattern

```python
import re

with open('raw_ocr.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Split on the '--- Seite N ---' page markers ("Seite" is German for "page")
pages = re.split(r'--- Seite \d+ ---', raw)
pages = pages[1:]  # drop whatever precedes the first marker (usually empty)

cleaned_pages = []
for page in pages:
    lines = page.strip().split('\n')
    cleaned_lines = []
    for line in lines:
        stripped = line.strip()
        # Skip page number lines
        if re.match(r'^-?\s*\d+\s*-?$', stripped):
            continue
        # Skip dot/bullet-only lines
        if re.match(r'^[\.•\s\-_]+$', stripped) and len(stripped) < 10:
            continue
        # Skip folio markers
        if re.match(r'^[Ff][\.:]\s*\d+$', stripped):
            continue
        # Skip single-char OCR artifacts
        if stripped in ('I', 'l', 'i', '1', 't', 'f', '¢', ':', ';', '*', '·', '܀', '܂'):
            continue
        # Skip repeated dots with spaces
        if re.match(r'^[\.\s]+$', stripped):
            continue
        # Strip leading/trailing dot clusters
        # (note: the trailing pattern also removes a sentence-final period)
        cleaned = re.sub(r'^[\.•\s]+', '', stripped)
        cleaned = re.sub(r'[\.\s]+$', '', cleaned)
        if cleaned:
            cleaned_lines.append(cleaned)
    if cleaned_lines:
        cleaned_pages.append('\n'.join(cleaned_lines))

full_cleaned = '\n\n'.join(cleaned_pages)
```
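
The script above expects `raw_ocr.txt` to already contain `--- Seite N ---` markers between pages. How that file was produced is not shown here, so the following is only a minimal sketch of one way to write it with PyMuPDF. The function names `join_pages` and `extract_raw` are hypothetical, and the marker format is simply copied from the split pattern above.

```python
def join_pages(page_texts):
    """Join per-page text, prefixing each page with a '--- Seite N ---' marker."""
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        chunks.append(f'--- Seite {number} ---\n{text}')
    return '\n'.join(chunks)


def extract_raw(pdf_path, out_path='raw_ocr.txt'):
    """Dump the text of every page of a PDF into one marker-separated file."""
    import fitz  # PyMuPDF; imported lazily so join_pages works without it

    with fitz.open(pdf_path) as doc:
        # page.get_text() returns the plain text of one page
        raw = join_pages(page.get_text() for page in doc)
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(raw)
    return raw
```

Numbering the markers from 1 keeps them aligned with the PDF viewer's page index, which makes it easier to look up a suspicious line in the original scan.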

## Pitfalls
- The OCR text will still contain character-level transliteration errors after cleanup (e.g. `i` vs `İ`, `ı` vs `I`, `f` vs `t`). Manual review is needed for liturgical accuracy.
- Arabic phrases embedded in Latin transliteration should be wrapped in `<span class="ar" dir="rtl">` for proper RTL rendering.
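- Page numbers in these PDFs often appear on every page (even blank ones), so the number of pages in the OCR output does not equal the number of content pages. The script above drops pages that have no text left after cleanup.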
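
The RTL wrapping mentioned above can be sketched as follows. This assumes the embedded phrases are written in Arabic script (not in Latin transliteration) and that the output is HTML. The helper name `wrap_arabic` and the chosen Unicode ranges are illustrative assumptions, not part of the original workflow.

```python
import html
import re

# Arabic, Arabic Supplement, Arabic Presentation Forms-A and -B
# (the last range stops at U+FEFC to exclude the byte-order mark U+FEFF).
_AR = r'\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC'

# One run = one or more Arabic-script words separated by whitespace.
ARABIC_RUN = re.compile(rf'[{_AR}]+(?:\s+[{_AR}]+)*')


def wrap_arabic(line: str) -> str:
    """HTML-escape a line and wrap each Arabic-script run in an RTL span."""
    parts = []
    pos = 0
    for m in ARABIC_RUN.finditer(line):
        # Text before the run is escaped and left in the default direction
        parts.append(html.escape(line[pos:m.start()], quote=False))
        run = html.escape(m.group(), quote=False)
        parts.append(f'<span class="ar" dir="rtl">{run}</span>')
        pos = m.end()
    parts.append(html.escape(line[pos:], quote=False))
    return ''.join(parts)
```

Grouping adjacent Arabic words into one span keeps a multi-word phrase in a single RTL run, so the word order inside the phrase renders correctly.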
