lambda-text-extractor

lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.

Features

Some of its key features are:

out-of-the-box support for many common binary document formats (see section on Supported Formats),

scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,

creation of text searchable PDFs after OCR,

serverless architecture that makes deployment quick and easy,

detailed instructions for preparing the libraries and dependencies necessary for processing binary documents, and

sensible Unicode handling.
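
To illustrate the parallel OCR feature, here is a minimal sketch of fanning out per-page work with asyncio. It is not the project's actual code: ocr_page is a hypothetical stand-in for invoking a per-page OCR Lambda function, and ocr_document, max_concurrency and run are names assumed for this example only.

```python
import asyncio


async def ocr_page(page_number):
    # Hypothetical stand-in for invoking a per-page OCR Lambda function.
    # The sleep simulates network and processing latency.
    await asyncio.sleep(0.01)
    return "text of page {}".format(page_number)


async def ocr_document(page_count, max_concurrency=10):
    # Cap the number of in-flight OCR requests (assumed limit, not from the project).
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number):
        async with semaphore:
            return await ocr_page(page_number)

    # gather() returns results in the order the awaitables were passed in,
    # so pages stay in document order even if they finish out of order.
    pages = await asyncio.gather(
        *(bounded(n) for n in range(1, page_count + 1))
    )
    return "\n".join(pages)


def run(coro):
    # Explicit event loop management, which also works on Python 3.6.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


if __name__ == "__main__":
    print(run(ocr_document(3)))
```

Because the pages are processed concurrently, total latency in this sketch is governed by the slowest page and the concurrency cap rather than by the page count.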

Supported Formats

lambda-text-extractor supports many common and legacy document formats:

Portable Document Format (.pdf):

  PDFs with a text layer using Poppler utilities,

  PDFs requiring OCR using Tesseract, with Ghostscript 9.21 for PDF manipulation,

Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,

Microsoft Word 2007 OpenXML files (.docx) using python-docx,

Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,

Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,

OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,

Rich Text Format (.rtf) using UnRTF v0.21.9,

XML files and HTML web pages (.html, .htm, .xml) using lxml,

CSV files (.csv) using the Python csv module,

Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and

Plain text files (.txt).
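
One way to picture per-format support is a lookup from file extension to an extractor function. The sketch below is an assumption for illustration, not the project's implementation: EXTRACTORS, extract_text, extract_csv and extract_txt are hypothetical names, and only the two formats that need nothing beyond the standard library are shown.

```python
import csv
import io
import os


def extract_csv(data):
    # Parse CSV bytes with the standard csv module and join cells with spaces.
    text = data.decode("utf-8")
    rows = csv.reader(io.StringIO(text, newline=""))
    return "\n".join(" ".join(row) for row in rows)


def extract_txt(data):
    # Replace undecodable bytes instead of failing on malformed input.
    return data.decode("utf-8", errors="replace")


# Hypothetical dispatch table; a full version would map every supported
# extension (.pdf, .doc, .docx, ...) to its own extractor.
EXTRACTORS = {
    ".csv": extract_csv,
    ".txt": extract_txt,
}


def extract_text(filename, data):
    # Normalise the extension so "Report.CSV" and "report.csv" match.
    ext = os.path.splitext(filename)[1].lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError("unsupported format: {!r}".format(ext))
    return extractor(data)


if __name__ == "__main__":
    print(extract_text("Report.CSV", b"a,b\nc,d\n"))
```

A table like this keeps each format's handling isolated, so adding a format means adding one function and one entry.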