Tesseract OCR 

Tesseract is an open-source OCR utility capable of extracting text from images and scanned documents.

Tesseract is local and private

Tesseract runs locally on your server. No data is uploaded to OpenAI or any other third party. This ensures that:

  • No sensitive media files leave your server
  • Tesseract can be used in secure or offline environments
  • It complies with strict data protection policies (e.g., GDPR)

Installation instructions (Ubuntu/Debian)

1. Update the package list and install Tesseract

sudo apt update
sudo apt install tesseract-ocr

2. Verify the installation

tesseract --help
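
As a slightly fuller check than the help output, the sketch below confirms that the binary is on the PATH, then prints its version and the installed language packs. The helper name have_cmd is illustrative and not part of Tesseract or the plugin.

```shell
#!/bin/sh
# Illustrative helper (not part of Tesseract): succeeds if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tesseract; then
    tesseract --version      # prints the Tesseract version and linked libraries
    tesseract --list-langs   # lists the installed language packs
else
    echo "tesseract not found on PATH" >&2
fi
```

To try OCR on a single image by hand, tesseract scan.png stdout prints the recognised text to standard output (scan.png is a placeholder file name).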

3. Ensure Tesseract is in the system PATH

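The apt package installs the binary to /usr/bin, which is already in the PATH. If Tesseract was installed elsewhere (for example in ~/.local/bin) and is not available globally, add that directory to your PATH or create a symlink manually: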

sudo ln -s ~/.local/bin/tesseract /usr/local/bin/tesseract
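
The PATH alternative can be sketched as follows, assuming the binary really is in ~/.local/bin. The helper name path_contains is illustrative. The change only applies to the current shell session.

```shell
#!/bin/sh
# Illustrative helper: succeeds if directory $1 is already an entry in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Append ~/.local/bin for this session only if it is missing.
if ! path_contains "$HOME/.local/bin"; then
    PATH="$PATH:$HOME/.local/bin"
    export PATH
fi

# Show where tesseract resolves, or report that it is still missing.
command -v tesseract || echo "tesseract still not found" >&2
```

The web server user usually runs with its own environment, so a PATH change made in your login shell may not reach it. In that case the symlink is likely the more dependable option.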

Security & permissions

Ensure the web server user (e.g., www-data) has:

  • Execute permission for tesseract 
  • Read and write access to resource files and temporary directories
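
One way to check these permissions is a small script run as the web server user. This is a sketch only: the function name check_access is hypothetical, and /tmp is a stand-in for whichever resource or temporary directory your installation uses.

```shell
#!/bin/sh
# Hypothetical check: report whether the user running this script can
# find tesseract on PATH and read/write the given directory.
check_access() {
    dir="$1"
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract: not executable or not on PATH"
    fi
    if [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "$dir: read/write OK"
    else
        echo "$dir: missing read or write access"
    fi
}

# /tmp is a stand-in; pass your real directory as the first argument.
check_access "${1:-/tmp}"
```

Saved as check_access.sh (a hypothetical file name), it could be run as the web server user with: sudo -u www-data sh check_access.sh /path/to/directory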

Plugin configuration

Plugin settings can be configured under Admin > System > Plugins > Tesseract > Configuration.

  • Select a field in which to store the extracted text.
  • Specify which file extensions will be processed. The default covers the most popular types.

Processing

Tesseract runs via the cron mechanism, so if cron is set up correctly on your system, processing happens automatically at regular intervals. New files will be processed on upload.

You can run the process manually via:

php plugins/tesseract/scripts/process.php
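
The path above is relative, so the command needs to be run from the installation root. A minimal wrapper sketch follows. The function run_processing and the variable APP_ROOT are hypothetical names used for illustration and are not shipped with the plugin.

```shell
#!/bin/sh
# Hypothetical wrapper: run the processing script from a given
# installation root. Returns non-zero if the script is not found there.
run_processing() {
    root="$1"
    script="plugins/tesseract/scripts/process.php"
    if [ ! -f "$root/$script" ]; then
        echo "not found: $root/$script" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    (cd "$root" && php "$script")
}

# APP_ROOT is an assumed variable naming your installation directory;
# it defaults to the current directory.
run_processing "${APP_ROOT:-.}" || echo "processing did not run" >&2
```

Running the wrapper as the web server user (for example via sudo -u www-data) is likely to avoid file ownership mismatches, assuming the permissions described under Security & permissions are in place.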

Combining with OpenAI GPT

Updates to metadata will trigger OpenAI GPT if it is configured to take the Tesseract field (set in the plugin settings, above) as input. This means you can use GPT to automatically generate titles, summaries, descriptions and translations from the extracted text, and to tag your resources based on the textual contents of the file.