# Notion Workspace Audit Pattern

A repeatable script and checklist for analyzing and cleaning a messy Notion workspace via the API.

## Quick audit

```python
import subprocess, json, os

NOTION_API_KEY = os.environ.get("NOTION_API_KEY") or subprocess.run(
    "grep NOTION /DATA/.hermes/.env | cut -d= -f2-",
    shell=True, capture_output=True, text=True
).stdout.strip().strip('"')

def notion_api(method, endpoint, payload=None, version="2025-09-03"):
    cmd = ["curl", "-s", "-X", method, f"https://api.notion.com/v1{endpoint}",
           "-H", f"Authorization: Bearer {NOTION_API_KEY}",
           "-H", f"Notion-Version: {version}",
           "-H", "Content-Type: application/json"]
    if payload is not None:
        cmd += ["-d", json.dumps(payload)]
    r = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(r.stdout)

# Paginated search
all_results = []
cursor = None
while True:
    data = notion_api("POST", "/search", {"page_size": 100, **({"start_cursor": cursor} if cursor else {})})
    all_results.extend(data.get("results", []))
    cursor = data.get("next_cursor")
    if not cursor:
        break

pages = [r for r in all_results if r.get("object") == "page"]
databases = [r for r in all_results if r.get("object") in ("database", "data_source")]

print(f"Pages: {len(pages)}, Databases: {len(databases)}")

# Categorize titled vs. untitled
def get_title(r):
    if r.get("object") in ("database", "data_source"):
        return (r.get("title", [{}])[0].get("text", {}).get("content", "")
                if r.get("title") else r.get("name", "")).strip() or "(unnamed)"
    props = r.get("properties", {})
    for key in ["title", "Name", "Thema", "Titel"]:
        if key in props:
            try:
                return props[key]["title"][0]["text"]["content"].strip() or "(unnamed)"
            except:
                pass
    return "(unnamed)"

titled = [p for p in pages if get_title(p) != "(unnamed)"]
untitled = [p for p in pages if get_title(p) == "(unnamed)"]

print(f"Titled pages: {len(titled)}, Untitled pages: {len(untitled)}")
print(f"Named databases: {sum(1 for d in databases if get_title(d) != '(unnamed)')}")
print(f"Unnamed databases: {sum(1 for d in databases if get_title(d) == '(unnamed)')}")

# List untitled pages with parent info (likely cleanup candidates)
for p in untitled[:30]:
    print(f"- {p['id']} parent={p.get('parent',{}).get('type')} "
          f"created={p['created_time'][:10]} edited={p['last_edited_time'][:10]}")
```

## Cleanup priorities

1. **Unnamed databases** — highest visibility impact. Rename them via API or UI.
2. **Untitled pages** — usually empty DB rows or quick-capture junk. Inspect and either rename or archive/delete.
3. **Stale content** — pages not edited in >6 months may belong in an archive.
4. **Duplicate DBs** — especially in finance, media, and project tracking; merge or clearly separate.
5. **Top-level structure** — group pages under umbrella pages (Studium, YouTube, Privat, Tech, Finanzen).

## Pitfalls

- A page inside a database inherits no name from the DB; the title property must be filled.
- Child databases may only be queryable with API version `2022-06-28`, not `2025-09-03`.
- Before deleting anything, archive (move to an "Archive" page) so the user can review.

## One-shot shell variant

If you only need counts:

```bash
export NOTION_API_KEY="$(grep NOTION /DATA/.hermes/.env | cut -d= -f2-)"
curl -s -X POST https://api.notion.com/v1/search \
  -H "Authorization: Bearer $NOTION_API_KEY" \
  -H "Notion-Version: 2025-09-03" \
  -H "Content-Type: application/json" \
  -d '{"page_size":100}' | jq '
  {
    total: .results | length,
    pages: [.results[] | select(.object=="page")] | length,
    databases: [.results[] | select(.object=="database" or .object=="data_source")] | length
  }'
```
