lambda-text-extractor

lambda-text-extractor is a Python 3.6 app that runs on AWS Lambda and extracts text from common binary document formats.

Features

Some of its key features are:

  • out-of-the-box support for many common binary document formats (see the Supported Formats section),
  • scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,
  • creation of text-searchable PDFs after OCR,
  • a serverless architecture that makes deployment quick and easy,
  • detailed instructions for preparing the libraries and dependencies needed to process binary documents, and
  • sensible Unicode handling.
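
As a rough illustration of the parallel OCR idea, the sketch below fans out one task per page with asyncio and collects the results in page order. It is not the project's actual code: `ocr_page`, `ocr_document`, `extract_text` and the `max_concurrency` parameter are hypothetical names, and `ocr_page` only simulates the call that would invoke an OCR Lambda function.

```python
import asyncio


async def ocr_page(page_number):
    """Hypothetical stand-in for invoking an OCR Lambda function on one page."""
    await asyncio.sleep(0.01)  # simulates network and OCR latency
    return "text of page {}".format(page_number)


async def ocr_document(page_count, max_concurrency=4):
    """Run OCR on every page concurrently and return the texts in page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number):
        # The semaphore caps how many pages are processed at the same time.
        async with semaphore:
            return await ocr_page(page_number)

    # gather returns results in the order the awaitables were passed in,
    # so the pages stay in sequence even if they finish out of order.
    return await asyncio.gather(*(bounded(n) for n in range(1, page_count + 1)))


def extract_text(page_count):
    """Synchronous entry point, written to work on Python 3.6 (no asyncio.run)."""
    loop = asyncio.new_event_loop()
    try:
        pages = loop.run_until_complete(ocr_document(page_count))
    finally:
        loop.close()
    return "\n".join(pages)
```

Calling `extract_text(3)` would join the three simulated page texts with newlines. In a real deployment the body of `ocr_page` would presumably be replaced by an actual Lambda invocation, which this sketch deliberately leaves out.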

Supported Formats

lambda-text-extractor supports many common and legacy document formats:

  • Portable Document Format (.pdf),
  • Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
  • Microsoft Word 2007 OpenXML files (.docx) using python-docx,
  • Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
  • Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
  • OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
  • Rich Text Format (.rtf) using UnRTF v0.21.9,
  • XML files and HTML web pages (.html, .htm, .xml) using lxml,
  • CSV files (.csv) using the Python csv module,
  • Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
  • Plain text files (.txt).
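
One way to picture how a list like this maps onto code is a lookup from file extension to the tool that handles it. The sketch below is an assumption about structure, not the project's implementation: `EXTRACTORS` and `pick_extractor` are hypothetical names, only a few of the formats above are included, and the values are just labels taken from the list.

```python
import os

# Hypothetical mapping from file extension to the tool named in the list above.
# Only a subset of the supported formats is shown.
EXTRACTORS = {
    ".doc": "antiword (fallback: catdoc)",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
    ".rtf": "unrtf",
    ".html": "lxml",
    ".htm": "lxml",
    ".xml": "lxml",
    ".csv": "csv",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".jpeg": "tesseract",
    ".tiff": "tesseract",
}


def pick_extractor(filename):
    """Return the tool label for a file, or None if the extension is not mapped."""
    extension = os.path.splitext(filename)[1].lower()  # case-insensitive match
    return EXTRACTORS.get(extension)
```

For example, `pick_extractor("Report.DOCX")` resolves to the python-docx label, while a file with an extension missing from the table returns `None` so the caller can decide how to report it.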