Installing OCRmyPDF on a Mac and Configuring Its Dependencies

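Tools: Homebrew (x86)
Environment: conda virtual environment, python=3.7
tips: On an M1 chip, install miniconda through Homebrew and use it to create the Python 3.7 virtual environment.
If miniforge is also installed, remember to init your shell as prompted once the installation finishes.
OCRmyPDF does not yet seem to work on Python 3.8 and above (although the official recommendation quoted further down lists Python 3.7 or 3.8).
Official documentation: https://ocrmypdf.readthedocs.io/en/latest/installation.html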

Set up the virtual environment & install OCRmyPDF

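brew install miniconda
conda install python=3.7 # optional: only affects the base environment
conda create -n py37 python=3.7
conda activate py37
conda info | grep env # check which environment is active
pip install ocrmypdf
# to skip the manual dependency setup, run this instead: brew install ocrmypdf
ocrmypdf --version
# conda deactivate # leave the environment when you are done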

A pip install requires you to set up the dependencies yourself

From the official documentation:
As of ocrmypdf 7.2.1, the following versions are recommended:
Python 3.7 or 3.8
Ghostscript 9.23 or newer
qpdf 8.2.1
Tesseract 4.0.0 or newer
The following three are optional:
jbig2enc 0.29 or newer
pngquant 2.5 or newer
unpaper 6.1
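
To compare what is installed against the list above, a small sketch like the following may help. It assumes each tool prints its version when called with --version (the helper name report_tool is made up for this example), and it leaves out jbig2enc because its executable name and flags were not checked here.

```shell
#!/bin/sh
# Print the first line of a tool's --version output, or note that it is missing.
# Assumption: every tool listed below accepts --version.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$("$1" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$1"
    fi
}

# Ghostscript installs its executable as `gs`.
for tool in gs qpdf tesseract pngquant unpaper; do
    report_tool "$tool"
done
```

The output still has to be compared with the recommended versions by eye; the sketch only gathers the numbers in one place.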

Run the command below and install whatever its error messages ask for:

ocrmypdf -l eng --clean-final input.pdf ocr.pdf

Once the following message appears, the dependencies are configured correctly and only the input file is missing: InputFileError: File not found - input.pdf
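
The check described above can be scripted roughly as follows. This is only a sketch: the helper only_input_missing is a made-up name, and it assumes the text "File not found" appears in the output exactly as quoted above. If a real input.pdf exists in the current directory, the command will of course run OCR on it instead.

```shell
#!/bin/sh
# Succeed when the captured output shows that only the input file is missing.
only_input_missing() {
    case "$1" in
        *"File not found"*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v ocrmypdf >/dev/null 2>&1; then
    # Capture stdout and stderr together so the error text can be inspected.
    output=$(ocrmypdf -l eng --clean-final input.pdf ocr.pdf 2>&1)
    if only_input_missing "$output"; then
        echo "Dependencies look fine; only input.pdf is missing."
    else
        echo "Check this output for a missing dependency:"
        echo "$output"
    fi
else
    echo "ocrmypdf is not on PATH; activate the py37 environment first."
fi
```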

  1. Install tesseract-lang
    brew install tesseract-lang
  2. Install unpaper

    Error message: The program 'unpaper' could not be executed or was not found on your system PATH. This program is required when you use the ['--clean, --clean-final'] arguments. You could try omitting these arguments, or install the package.

    brew install unpaper
  3. Install Ghostscript

    Error message: Could not find program 'gs' on the PATH

    brew install ghostscript

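Running the test command again should now produce no missing-dependency errors. If you need anything else, install it with brew install as indicated in the official documentation.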
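
Rather than fixing one error at a time, the remaining brew commands can be listed in one pass. The sketch below only prints the commands and does not run them. The function names are invented for this example, and the mapping from executable to formula (gs to ghostscript) reflects the Homebrew formula names used above.

```shell
#!/bin/sh
# Map an executable name to the Homebrew formula that provides it.
formula_for() {
    case "$1" in
        gs) echo ghostscript ;;
        *)  echo "$1" ;;
    esac
}

# Print a brew command for every listed tool that is not on PATH.
plan_installs() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "brew install $(formula_for "$tool")"
    done
}

plan_installs gs qpdf tesseract unpaper pngquant
```

Printing instead of installing keeps the sketch safe to run; paste the lines it prints into the terminal once they look right.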