lambda-text-extractor
lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.
Features
Some of its key features are:
- out-of-the-box support for many common binary document formats (see section on Supported Formats),
- scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,
- creation of text searchable PDFs after OCR,
- serverless architecture makes deployment quick and easy,
- detailed instructions for preparing the libraries and dependencies necessary for processing binary documents, and
- sensible Unicode handling.
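The parallel OCR feature listed above can be pictured with a minimal asyncio sketch. This is not the project's actual code: `ocr_page`, `ocr_document`, `extract_text`, and the `max_concurrency` parameter are hypothetical names, and the per-page Lambda invocation is replaced by a short sleep so the example runs without AWS access.

```python
import asyncio


async def ocr_page(page_number):
    """Hypothetical stand-in for invoking one OCR Lambda on a single page."""
    await asyncio.sleep(0.01)  # simulates network and OCR latency
    return "text of page {}".format(page_number)


async def ocr_document(page_count, max_concurrency=10):
    """OCR every page concurrently and return the texts in page order."""
    # Created inside the coroutine so it binds to the running event loop.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number):
        # The semaphore caps how many pages are in flight at once.
        async with semaphore:
            return await ocr_page(page_number)

    # gather() returns results in argument order, so page order is preserved.
    return await asyncio.gather(
        *(bounded(n) for n in range(1, page_count + 1))
    )


def extract_text(page_count):
    """Synchronous entry point, written to also work on Python 3.6."""
    loop = asyncio.new_event_loop()
    try:
        pages = loop.run_until_complete(ocr_document(page_count))
    finally:
        loop.close()
    return "\n".join(pages)
```

Calling `extract_text(3)` joins the three simulated page texts with newlines. In a real deployment the body of `ocr_page` would presumably wrap an asynchronous Lambda invocation, and the concurrency cap would be tuned to the account's Lambda limits.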
Supported Formats
lambda-text-extractor supports many common and legacy document formats:
- Portable Document Format (.pdf),
  - PDFs with a text layer using Poppler utilities,
  - PDFs with OCR using Tesseract and Ghostscript 9.21 for PDF manipulation,
- Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
- Microsoft Word 2007 OpenXML files (.docx) using python-docx,
- Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
- Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
- OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
- Rich Text Format (.rtf) using UnRTF v0.21.9,
- XML files and HTML web pages (.html, .htm, .xml) using lxml,
- CSV files (.csv) using the Python csv module,
- Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
- Plain text files (.txt).
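One way to read the list above is as a lookup from file extension to extraction tool. The sketch below illustrates that idea only; `EXTRACTORS` and `pick_extractor` are hypothetical names that do not come from the project, and the values are just the tool names from the list, not callable extractors.

```python
import os

# Hypothetical lookup table built from the Supported Formats list.
EXTRACTORS = {
    ".pdf": "Poppler utilities (text layer) or Tesseract (OCR)",
    ".doc": "Antiword with fallback to Catdoc",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
    ".rtf": "UnRTF",
    ".csv": "csv module",
    ".txt": "no external tool listed",
}
# Extension groups that share a single tool in the list.
for ext in (".odm", ".odp", ".ods", ".odt", ".oth", ".otm", ".otp", ".ots", ".ott"):
    EXTRACTORS[ext] = "odfpy"
for ext in (".html", ".htm", ".xml"):
    EXTRACTORS[ext] = "lxml"
for ext in (".tiff", ".jpg", ".jpeg", ".png"):
    EXTRACTORS[ext] = "Tesseract"


def pick_extractor(filename):
    """Return the tool name for a file, based on its lowercased extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError("unsupported format: {!r}".format(ext))
```

For example, `pick_extractor("Report.DOCX")` resolves to `python-docx`, while an unlisted extension such as `.exe` raises `ValueError`. Matching on the lowercased extension is a simplifying assumption; a real service might also inspect file contents.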
Tags: aws-lambda, neotext, programming, python, text