lambda-text-extractor
lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.

Features
Some of its key features are:
- out-of-the-box support for many common binary document formats (see section on Supported Formats),
- scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,
- creation of text-searchable PDFs after OCR,
- a serverless architecture that makes deployment quick and easy,
- detailed instructions for preparing the libraries and dependencies necessary for processing binary documents, and
- sensible Unicode handling.
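
The parallel OCR feature can be pictured with a small asyncio fan-out. This is only a sketch of the general pattern, not the project's actual code: `ocr_page` is a hypothetical stand-in that simulates one remote OCR job (for example, one Lambda invocation per page), and `max_concurrency` is an assumed tuning knob.

```python
import asyncio


async def ocr_page(page_number):
    """Hypothetical stand-in for one OCR job (e.g. one Lambda invocation per page)."""
    await asyncio.sleep(0.01)  # simulates waiting on a remote worker
    return "text of page {}".format(page_number)


async def ocr_document(page_count, max_concurrency=10):
    """Run OCR for every page concurrently and join the results in page order."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number):
        # The semaphore caps how many OCR jobs are in flight at once.
        async with limit:
            return await ocr_page(page_number)

    # gather() returns results in the order the awaitables were passed in,
    # so pages stay in order even if they finish out of order.
    pages = await asyncio.gather(*(bounded(n) for n in range(1, page_count + 1)))
    return "\n".join(pages)


def run(page_count):
    """Synchronous entry point; avoids asyncio.run(), which needs Python 3.7+."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(ocr_document(page_count))
    finally:
        loop.close()
```

Calling `run(3)` waits for all three simulated pages and returns their text joined by newlines. Replacing the body of `ocr_page` with a real remote call would keep the same fan-out and ordering logic.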
Supported Formats
lambda-text-extractor supports many common and legacy document formats:
- Portable Document Format (.pdf),
  - PDFs with a text layer using Poppler utilities,
  - PDFs with OCR using Tesseract and Ghostscript 9.21 for PDF manipulation,
- Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
- Microsoft Word 2007 OpenXML files (.docx) using python-docx,
- Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
- Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
- OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
- Rich Text Format (.rtf) using UnRTF v0.21.9,
- XML files and HTML web pages (.html, .htm, .xml) using lxml,
- CSV files (.csv) using the Python csv module,
- Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
- Plain text files (.txt).
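
One way to read the list above is as a lookup from file extension to the tools that handle it. The sketch below is illustrative only: `TOOLS_BY_EXTENSION` and `tools_for` are hypothetical names, the values are labels rather than the project's API, and the order shown for `.pdf` (text layer first, then OCR) is an assumption.

```python
import os

# Illustrative table only: extension -> tools from the list above, in the
# order they would be tried. The names are labels, not the project's API.
TOOLS_BY_EXTENSION = {
    ".pdf": ["poppler", "tesseract"],  # assumed order: text layer first, then OCR
    ".doc": ["antiword", "catdoc"],    # Catdoc is the documented fallback
    ".docx": ["python-docx"],
    ".pptx": ["python-pptx"],
    ".xls": ["xlrd"],
    ".xlsx": ["xlrd"],
    ".rtf": ["unrtf"],
    ".csv": ["csv"],
    ".txt": [],  # plain text needs no external tool
}
for ext in (".odm", ".odp", ".ods", ".odt", ".oth", ".otm", ".otp", ".ots", ".ott"):
    TOOLS_BY_EXTENSION[ext] = ["odfpy"]
for ext in (".html", ".htm", ".xml"):
    TOOLS_BY_EXTENSION[ext] = ["lxml"]
for ext in (".tiff", ".jpg", ".jpeg", ".png"):
    TOOLS_BY_EXTENSION[ext] = ["tesseract"]


def tools_for(filename):
    """Return the candidate tools for a file, based on its lowercased extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in TOOLS_BY_EXTENSION:
        raise ValueError("unsupported format: {!r}".format(ext))
    return TOOLS_BY_EXTENSION[ext]
```

For example, `tools_for("Report.DOC")` returns `["antiword", "catdoc"]`, while an extension outside the list raises `ValueError`.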