# OCR Cleanup: Ottoman/Turkish Transliteration PDFs

When extracting text from religious/classical Ottoman-Turkish PDFs (such as evrâd, ilahis, dualar), PyMuPDF's raw output contains significant artifacts:

## Common Artifacts

- **Page number lines**: `- 321 -`, `- 322 -`
- **Decorative dot rows**: `• •• •• •• •• ••` or `.. .. .. .. ..`
- **Vertical separator artifacts**: isolated `I`, `l`, `i`, `1`, `t`, `f`, `¢`, `:`, `;`, `*` characters (OCR misreads of ornamental borders)
- **Folio markers**: `F. 21`, `F: 22`
- **Leading/trailing dot clusters**: `....`, `......`

## Cleanup Script Pattern

```python
import re

with open('raw_ocr.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Split by page markers
pages = re.split(r'--- Seite \d+ ---', raw)
pages = pages[1:]  # skip empty first element

cleaned_pages = []
for page in pages:
    lines = page.strip().split('\n')
    cleaned_lines = []
    for line in lines:
        stripped = line.strip()
        # Skip page number lines
        if re.match(r'^-?\s*\d+\s*-?$', stripped):
            continue
        # Skip dot/bullet-only lines
        if re.match(r'^[\.•\s\-_]+$', stripped) and len(stripped) < 10:
            continue
        # Skip folio markers
        if re.match(r'^[Ff][\.:]\s*\d+$', stripped):
            continue
        # Skip single-char OCR artifacts
        if stripped in ('I', 'l', 'i', '1', 't', 'f', '¢', ':', ';', '*', '·', '܀', '܂'):
            continue
        # Skip repeated dots with spaces
        if re.match(r'^[\.\s]+$', stripped):
            continue
        # Clean leading/trailing dots
        cleaned = re.sub(r'^[\.•\s]+', '', stripped)
        cleaned = re.sub(r'[\.\s]+$', '', cleaned)
        if cleaned:
            cleaned_lines.append(cleaned)
    if cleaned_lines:
        cleaned_pages.append('\n'.join(cleaned_lines))

full_cleaned = '\n\n'.join(cleaned_pages)
```

## Pitfalls

- The OCR text will have transliteration errors (e.g. `i` vs `İ`, `ı` vs `I`, `f` vs `t`). Manual review is needed for liturgical accuracy.
- Arabic phrases embedded in Latin transliteration should be wrapped in `` for proper RTL rendering.
- Page numbers in these PDFs often appear on every page (even blank ones), so the page count from OCR does not equal the number of content pages.
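
The page-count pitfall can be checked with a small helper. This is a minimal sketch, not part of the original script: the function name `count_pages` and the sample text are hypothetical, and it assumes the same `--- Seite N ---` page markers and page-number pattern used in the cleanup script.

```python
import re

# Same patterns as in the cleanup script above
PAGE_MARKER = re.compile(r'--- Seite \d+ ---')
PAGE_NUMBER_LINE = re.compile(r'^-?\s*\d+\s*-?$')


def count_pages(raw: str) -> tuple[int, int]:
    """Return (total_pages, content_pages) for marker-delimited text.

    Hypothetical helper: a page counts as content only if at least one
    line is neither blank nor a bare page-number line.
    """
    pages = PAGE_MARKER.split(raw)[1:]  # drop text before the first marker
    content_pages = 0
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines()]
        if any(ln and not PAGE_NUMBER_LINE.match(ln) for ln in lines):
            content_pages += 1
    return len(pages), content_pages


# Illustrative sample: page 2 carries only its page number
sample = (
    "--- Seite 1 ---\n- 321 -\nBismillâhirrahmânirrahîm\n"
    "--- Seite 2 ---\n- 322 -\n"
    "--- Seite 3 ---\n- 323 -\nAllâhümme salli\n"
)
total, content = count_pages(sample)
print(total, content)
```

Comparing the two numbers before and after cleanup gives a quick sanity check that no content page was dropped by an overly aggressive filter.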