Screen Capture & OCR
Screen Capture
ScreenCaptureManager captures a user-selected screen region using macOS's built-in screencapture tool, invoked via AppleScript.
Why AppleScript?
macOS ties TCC (Transparency, Consent, and Control) permissions, including Screen Recording, to the identity of the requesting binary (its path and code signature). This creates a problem during development: an Xcode rebuild can produce a binary that macOS treats as a different app, which invalidates the existing Screen Recording grant.
By running screencapture via AppleScript's do shell script, the capture executes as an independent process with its own TCC context:
do shell script "/usr/sbin/screencapture -i -s '/path/to/output.png'"
The -i flag enables interactive capture (like Cmd+Shift+4) and -s restricts it to mouse selection mode. The user draws a rectangle, and the screenshot is saved to the specified path.
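A minimal sketch of how the script source could be assembled and run. The helper names `makeCaptureScript` and `runCaptureScript` are hypothetical and not taken from ScreenCaptureManager. The quoting step is an assumption added for safety in case the output path contains a single quote.

```swift
import Foundation

/// Builds the AppleScript source that runs screencapture in interactive
/// selection mode. (Hypothetical helper, for illustration only.)
func makeCaptureScript(outputPath: String) -> String {
    // Quote the path for the POSIX shell: a single quote becomes '\''
    let shellQuoted = "'" + outputPath.replacingOccurrences(of: "'", with: "'\\''") + "'"
    let command = "/usr/sbin/screencapture -i -s \(shellQuoted)"
    // Escape backslashes and double quotes for the AppleScript string literal
    let escaped = command
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "\"", with: "\\\"")
    return "do shell script \"\(escaped)\""
}

#if os(macOS)
/// Runs the script synchronously and returns the AppleScript error
/// dictionary, or nil if the script reported no error.
func runCaptureScript(outputPath: String) -> NSDictionary? {
    var error: NSDictionary?
    let script = NSAppleScript(source: makeCaptureScript(outputPath: outputPath))
    _ = script?.executeAndReturnError(&error)
    return error
}
#endif
```

For a plain path such as `/tmp/a.png`, the generated source matches the one-liner shown above.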
Capture Flow
- Generate a unique temp file path: {tempDir}/mathy_capture_{UUID}.png
- Run the AppleScript via NSAppleScript on a background queue (DispatchQueue.global)
- User draws a selection rectangle on screen
- If the file exists after the script returns (user didn't cancel):
  - Copy to persistent storage: ~/Library/Application Support/Mathy/images/capture_{timestamp}.png
  - Delete the temp file
  - Return the persistent URL
- If no file exists (user pressed Escape): return nil
The entire operation uses withCheckedContinuation to bridge to async/await.
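The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual ScreenCaptureManager code: `captureRegion` and its `runCapture` parameter are hypothetical names, `runCapture` stands in for the blocking NSAppleScript call, and the timestamp is assumed to be whole seconds since the Unix epoch.

```swift
import Foundation

/// Hypothetical sketch of the capture flow. `runCapture` is expected to block
/// until the user finishes or cancels the selection.
func captureRegion(
    imagesDir: URL,
    runCapture: @escaping @Sendable (URL) -> Void
) async -> URL? {
    // Unique temp path for this capture
    let tempURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("mathy_capture_\(UUID().uuidString).png")

    return await withCheckedContinuation { (continuation: CheckedContinuation<URL?, Never>) in
        DispatchQueue.global(qos: .userInitiated).async {
            runCapture(tempURL)
            let fm = FileManager.default

            // No file means the user cancelled
            guard fm.fileExists(atPath: tempURL.path) else {
                continuation.resume(returning: nil)
                return
            }

            // Assumed timestamp format: seconds since 1970
            let timestamp = Int(Date().timeIntervalSince1970)
            let destination = imagesDir.appendingPathComponent("capture_\(timestamp).png")
            do {
                try fm.createDirectory(at: imagesDir, withIntermediateDirectories: true)
                try fm.copyItem(at: tempURL, to: destination)
                try? fm.removeItem(at: tempURL)
                continuation.resume(returning: destination)
            } catch {
                // Copy failed (for example, the destination already exists)
                try? fm.removeItem(at: tempURL)
                continuation.resume(returning: nil)
            }
        }
    }
}
```

Every branch resumes the continuation exactly once, which `withCheckedContinuation` requires. Injecting `runCapture` as a closure also lets the flow be exercised without triggering a real screen capture.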
Image Storage
Captured images are stored persistently at:
~/Library/Application Support/Mathy/images/capture_{timestamp}.png
These files are referenced by ConversionRecord.imagePath and displayed in the preview popup. When history entries are deleted (individually or via "Clear All"), the associated image files are also removed from disk.
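As a rough illustration of the cleanup step, the sketch below removes the image files behind a batch of history entries. Only `imagePath` is named in this document. The `ConversionRecord` shape shown here (an optional string path) and the `removeImages` helper are assumptions.

```swift
import Foundation

/// Hypothetical shape: only `imagePath` is named in this document.
struct ConversionRecord {
    let imagePath: String?
}

/// Removes the image files backing the given records.
/// Returns how many files were actually deleted.
@discardableResult
func removeImages(for records: [ConversionRecord],
                  fileManager: FileManager = .default) -> Int {
    var removed = 0
    for record in records {
        // Skip records without an image or whose file is already gone
        guard let path = record.imagePath,
              fileManager.fileExists(atPath: path) else { continue }
        do {
            try fileManager.removeItem(atPath: path)
            removed += 1
        } catch {
            print("Could not remove \(path): \(error)")
        }
    }
    return removed
}
```

The same function would cover both deletion paths: pass a single-element array for an individual delete, or the full history for "Clear All".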
OCR Service
OCRService is the HTTP client that sends captured images to the local Python server for LaTeX recognition.
Request Format
POST http://127.0.0.1:8765/predict
Content-Type: multipart/form-data; boundary={uuid}
Timeout: 30 seconds

--{boundary}
Content-Disposition: form-data; name="file"; filename="capture.png"
Content-Type: image/png

{binary image data}
--{boundary}--
The multipart body is constructed manually (no third-party HTTP library). A UUID is used as the boundary string.
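A minimal sketch of building this request by hand with Foundation only. The function names `makeMultipartBody` and `makePredictRequest` are hypothetical. The URL, field name, filename, and timeout come from the request format above.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms
#endif

/// Builds a single-part multipart/form-data body carrying a PNG image.
func makeMultipartBody(imageData: Data, boundary: String) -> Data {
    var body = Data()
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"capture.png\"\r\n".utf8))
    // A blank line (CRLF CRLF) separates the part headers from the payload
    body.append(Data("Content-Type: image/png\r\n\r\n".utf8))
    body.append(imageData)
    // Closing delimiter ends with two extra hyphens
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return body
}

/// Builds the POST request for the local prediction endpoint.
func makePredictRequest(imageData: Data) -> URLRequest {
    let boundary = UUID().uuidString
    var request = URLRequest(url: URL(string: "http://127.0.0.1:8765/predict")!)
    request.httpMethod = "POST"
    request.timeoutInterval = 30
    request.setValue("multipart/form-data; boundary=\(boundary)",
                     forHTTPHeaderField: "Content-Type")
    request.httpBody = makeMultipartBody(imageData: imageData, boundary: boundary)
    return request
}
```

The boundary in the `Content-Type` header must match the one used in the body exactly, which is why both are derived from the same local value.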
Response Handling
Success (200):
{"latex": "\\frac{a}{b}"}
The latex field is extracted and returned as a String.
Errors:
| Code | Meaning | Handling |
|---|---|---|
| 400 | Invalid image | Throws OCRError.serverError(detail) |
| 503 | Model not loaded | Throws OCRError.serverError(detail) |
| 500 | Prediction failed | Throws OCRError.serverError(detail) |
| Network error | Server unreachable | Throws underlying URLError |
Error responses are decoded from {"detail": "..."} when available, with a fallback to a generic HTTP status code message.
Error Types
enum OCRError: Error {
case invalidResponse // Non-HTTP response
case httpError(Int) // Non-200 status (fallback)
case serverError(String) // Detailed error from server
}
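The response handling and error mapping described above might look roughly like this. The sketch is an assumption about structure, not the actual OCRService code: `parsePredictResponse`, `PredictResponse`, and `ErrorResponse` are hypothetical names, and the enum is redeclared with `Equatable` only so the result is easy to compare in a test.

```swift
import Foundation

// Redeclared here so the sketch is self-contained; Equatable is an addition.
enum OCRError: Error, Equatable {
    case invalidResponse        // Non-HTTP response
    case httpError(Int)         // Non-200 status (fallback)
    case serverError(String)    // Detailed error from server
}

struct PredictResponse: Decodable { let latex: String }   // {"latex": "..."}
struct ErrorResponse: Decodable { let detail: String }    // {"detail": "..."}

/// Maps a status code and response body to a LaTeX string or an OCRError.
func parsePredictResponse(statusCode: Int, data: Data) throws -> String {
    guard statusCode == 200 else {
        // Prefer the server's detail message when the body carries one
        if let err = try? JSONDecoder().decode(ErrorResponse.self, from: data) {
            throw OCRError.serverError(err.detail)
        }
        // Otherwise fall back to the bare status code
        throw OCRError.httpError(statusCode)
    }
    return try JSONDecoder().decode(PredictResponse.self, from: data).latex
}
```

Keeping the parsing separate from the network call means the 400, 500, and 503 cases can be checked with canned JSON, without a running server.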
Integration
OCRService is called from AppState.startCapture(). Errors are caught and printed to console but not shown to the user (the capture simply appears to fail silently). The service uses URLSession.shared.data(for:) with native async/await — no callbacks or Combine.
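The call pattern described here can be sketched as below. This is a hypothetical stand-in for the relevant part of AppState.startCapture(), with the capture and OCR steps injected as closures. The function name `runCaptureAndOCR` and its parameters are assumptions.

```swift
import Foundation

/// Hypothetical sketch: capture a region, run OCR on it, and log any failure
/// to the console instead of surfacing it to the user.
func runCaptureAndOCR(
    capture: () async -> URL?,
    recognize: (URL) async throws -> String
) async -> String? {
    // A nil URL means the user cancelled the capture
    guard let imageURL = await capture() else { return nil }
    do {
        return try await recognize(imageURL)
    } catch {
        // Mirrors the behavior described above: printed, not shown in the UI
        print("OCR failed: \(error)")
        return nil
    }
}
```

Because both a cancelled capture and a failed OCR request return nil, the caller cannot tell them apart. That matches the silent-failure behavior noted above.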