Datasette: instantly create and publish an API for your SQLite databases

A key feature of Datasette is that the API it provides is deliberately read-only. This provides a number of interesting benefits:

  • It lets us use SQLite in production in high-traffic scenarios. SQLite is an incredible piece of technology, but it is rarely used in web applications because of its limitations with concurrent writes. Datasette opens SQLite files using the immutable option, which eliminates any concurrency concerns and allows SQLite to go even faster for reads.
  • Since the database is read-only, we can accept arbitrary SQL queries from our users!
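
To make the immutable idea concrete, here is a minimal sketch using Python's standard sqlite3 module. It is not Datasette's own code: the table, the file name and the helper names (build_demo_db, open_immutable, run_user_sql) are made up for illustration. It relies only on SQLite's documented URI parameters mode=ro and immutable=1.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db(path: Path) -> None:
    """Create a small database file once, before it is published."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE langs (name TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO langs VALUES (?, ?)",
        [("Python", 1991), ("R", 1993)],
    )
    conn.commit()
    conn.close()


def open_immutable(path: Path) -> sqlite3.Connection:
    """Open the file read-only and tell SQLite it will never change."""
    # mode=ro rejects writes; immutable=1 lets SQLite skip locking and
    # change detection because the file is promised not to change.
    uri = path.resolve().as_uri() + "?mode=ro&immutable=1"
    return sqlite3.connect(uri, uri=True)


def run_user_sql(conn: sqlite3.Connection, sql: str) -> list:
    """Run a query supplied by a user and return all rows."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db_path = Path(tempfile.mkdtemp()) / "demo.db"
    build_demo_db(db_path)
    conn = open_immutable(db_path)
    print(run_user_sql(conn, "SELECT name FROM langs ORDER BY year"))
    try:
        run_user_sql(conn, "DELETE FROM langs")
    except sqlite3.OperationalError as exc:
        print("write rejected:", exc)
```

A read-only connection protects the data, but it does not stop a user from submitting a very slow query, so a real service would presumably still want some kind of time limit on top of this.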

lambda-text-extractor

lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.

Features

Some of its key features are:

  • out-of-the-box support for many common binary document formats (see section on Supported Formats),
  • scalable PDF parsing that runs OCR in parallel using AWS Lambda and asyncio,
  • creation of text-searchable PDFs after OCR,
  • a serverless architecture that makes deployment quick and easy,
  • detailed instructions for preparing the libraries and dependencies necessary for processing binary documents, and
  • sensible Unicode handling
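
The parallel OCR item is easier to picture with a small sketch. This is not the project's actual code: ocr_page is a hypothetical stand-in that only sleeps where a real implementation would invoke a Lambda function for one page, and max_in_flight is an assumed knob for capping concurrency. It also uses asyncio.run, which needs Python 3.7 or later.

```python
import asyncio


async def ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for sending one page to an OCR worker."""
    await asyncio.sleep(0.01)  # simulates the network round trip
    return f"text of page {page_number}"


async def ocr_document(page_count: int, max_in_flight: int = 8) -> list:
    """OCR every page concurrently, with at most max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(page_number: int) -> str:
        async with semaphore:
            return await ocr_page(page_number)

    # gather returns results in the order the tasks were passed in,
    # so the pages come back in reading order.
    return await asyncio.gather(
        *(bounded(n) for n in range(1, page_count + 1))
    )


if __name__ == "__main__":
    for page_text in asyncio.run(ocr_document(3)):
        print(page_text)
```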

Supported Formats

lambda-text-extractor supports many common and legacy document formats:

  • Portable Document Format (.pdf),
  • Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
  • Microsoft Word 2007 OpenXML files (.docx) using python-docx,
  • Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
  • Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
  • OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
  • Rich Text Format (.rtf) using UnRTF v0.21.9,
  • XML files and HTML web pages (.html, .htm, .xml) using lxml,
  • CSV files (.csv) using Python csv module,
  • Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
  • Plain text files (.txt)
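
A list like this suggests a dispatch table keyed on file extension. Here is a rough sketch of that idea, covering only the two formats that need nothing beyond the standard library. The names EXTRACTORS, extract_txt, extract_csv and extract_text are invented for illustration and are not taken from lambda-text-extractor.

```python
import csv
import io
from pathlib import Path


def extract_txt(data: bytes) -> str:
    """Decode plain text, replacing bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


def extract_csv(data: bytes) -> str:
    """Flatten CSV cells into space-separated lines of text."""
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return "\n".join(" ".join(row) for row in rows)


# Hypothetical registry: one extractor function per file extension.
EXTRACTORS = {
    ".txt": extract_txt,
    ".csv": extract_csv,
}


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor from the file extension and run it."""
    suffix = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return extractor(data)


if __name__ == "__main__":
    print(extract_text("report.CSV", b"name,year\nPython,1991\n"))
```

Supporting another format would then mean writing one more function and adding one more entry to the table.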

The future is looking bright for Python

I’m no data scientist, but to me it’s pretty obvious that Python has, by a very large margin, the greatest positive slope (future?). In fact, it appears to be one of only two languages listed here that even have a positive slope (R is the other one, and it looks like Assembly is low but pretty steady).