Troubleshooting PDF OCR using Python on Mac

I wrote a script to extract some text from a PDF (image-based text, so pdftotext wouldn’t work).

Using pdf2image convert_from_path I simply could not get any data out of the pdf. I tried multiple PDFs while testing and convert_from_path just kept returning an empty variable.

Turned out that my homebrew install of xpdf was interfering with my homebrew install of poppler.

Uninstalling xpdf (brew uninstall xpdf) and reinstalling poppler (brew install poppler) seemed to fix things up. My suspicion is that they both come with their own versions of pdfinfo which is used by pdf2image. Just a hunch, I don’t know enough about what’s going on under the hood. So, anyway, if pdf2image isn’t working correctly for you and you’re on a Mac, make sure you’ve got poppler installed and that xpdf’s pdfinfo isn’t being used.

Posted

in

Current Spins

Top Albums

Check out my album Set It All Down on your favorite streaming service.


Posts Worth Reading:


Letterboxd


Reading Notes

  • Who profits from our constant state of dissatisfaction? The answer, of course, is painfully obvious. Every industry that sells a solution to a problem you […]
  • the shifts have been in place for awhile. A certain kind of book—say those reviewed in the NYRB—will become like opera, or theater, or ballet, […]
  • • No more struggle: “Whatever arises, train again and again in seeing it for what it is. The innermost essence of mind is without bias. […]
  • The real problem, in my mind, isn’t in the nature of this particular Venture-Capital operation. Because the whole raison-d’etre of Venture Capital is to make […]
  • . The EU invokes a mechanism called the precautionary principle in cases where an innovation, such as GMOs, has not yet been sufficiently researched for […]

Saved Links

RSS Error: A feed could not be found at `https://links.jimwillis.org/feed/atom?`; the status code is `404` and content-type is `text/html; charset=utf-8`