from Fractal Reader Development and Operation Diary
Extracting clean text from PDFs

  • This looks quite troublesome. I couldn’t find any off-the-shelf library that extracts text cleanly from PDFs.
  • The code around renderTextLayer in pdf.js looks like it could be used for text extraction (takker).
  • Trial and Error Memo on Extracting Text from PDFs | Kan Hatakeyama
    • PyMuPDF seems to offer good accuracy.
    • It seems text can be cleanly selected even in a PDF viewer.
      • I think this is limited to cases where both the text information and the order of the text are embedded in the PDF (takker).
      • In a web browser, the viewer presumably constructs <span> elements following the order of the embedded text sequence.
      • Reading the text in span order would then give a clean extraction.
    • Of course, this approach is powerless against PDFs whose pages are embedded as images.
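
The notes above can be sketched with PyMuPDF roughly as follows. This is a minimal illustration, not a vetted pipeline: the helper names join_blocks and extract_pages are hypothetical, and it assumes the PDF embeds both the text and a sensible text order. Collapsing whitespace with single spaces is also an assumption that suits space-delimited languages; Japanese text would need different joining. A page that comes back as an empty string is likely an image-only page, which this approach cannot read.

```python
from typing import Iterable, List, Tuple


def join_blocks(blocks: Iterable[Tuple]) -> str:
    """Join text blocks in the order given (i.e. the order embedded in the PDF).

    Each block is expected in the shape PyMuPDF returns for get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text
    and 1 is an image.
    """
    parts = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type != 0:
            # Skip image blocks; they carry no extractable text.
            continue
        # Assumption: collapse line breaks inside a block into single spaces.
        cleaned = " ".join(text.split())
        if cleaned:
            parts.append(cleaned)
    return "\n".join(parts)


def extract_pages(path: str) -> List[str]:
    """Return one string per page; an empty string suggests an image-only page."""
    import fitz  # PyMuPDF, imported lazily so join_blocks works without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            pages.append(join_blocks(page.get_text("blocks")))
    return pages
```

Calling extract_pages on a file path (for example a hypothetical "paper.pdf") returns the per-page text; pages with empty results would need OCR or another method.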