Chris Albon: Machine Learning @ Wikimedia

Notes On Using
Data Science & Machine Learning
To Fight For Something That Matters

I am the Director of Machine Learning at the Wikimedia Foundation. I have spent over a decade applying statistical learning, artificial intelligence, and software engineering to political, social, and humanitarian efforts.


How To Set Up a Firewall with UFW on Ubuntu 18.04

Step 1 — Using IPv6 with UFW (Optional)

This tutorial is written with IPv4 in mind, but it also works for IPv6 as long as you enable it. If your Ubuntu server has IPv6 enabled, make sure UFW is configured to support IPv6 so that it manages firewall rules for IPv6 in addition to IPv4. To do this, open the UFW configuration file with nano or your favorite editor.

  • sudo nano /etc/default/ufw

Then make sure the value of IPV6 is yes. It should look like this:

/etc/default/ufw excerpt

IPV6=yes

Save and close the file. Now, when UFW is enabled, it will be configured to write both IPv4 and IPv6 firewall rules. However, before enabling UFW, we will want to ensure that your firewall is configured to allow you to connect via SSH. Let’s start with setting the default policies.
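Before moving on, the IPV6 setting described above can be double-checked programmatically. The following is only a minimal sketch: the function name ufw_ipv6_enabled is hypothetical, and it assumes the file uses plain KEY=value lines where the last uncommented assignment wins.

```python
def ufw_ipv6_enabled(config_text: str) -> bool:
    """Return True if the last uncommented IPV6= line is set to yes.

    Hypothetical helper, not part of UFW. Assumes simple KEY=value lines.
    """
    enabled = False
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip comments and anything that is not an assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "IPV6":
            # Tolerate optional quotes around the value.
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled


# Sample text standing in for the contents of /etc/default/ufw.
sample = "# Set to yes to apply rules to support IPv6\nIPV6=yes\n"
print(ufw_ipv6_enabled(sample))  # -> True
```

On a real server the same function could be fed the contents of /etc/default/ufw, for example via pathlib.Path("/etc/default/ufw").read_text().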

  • sudo ufw allow 443
  • sudo ufw allow https


Step 4 — Enabling UFW

To enable UFW, use this command:

  • sudo ufw enable

You will receive a warning that says the command may disrupt existing SSH connections. We already set up a firewall rule that allows SSH connections, so it should be fine to continue. Respond to the prompt with y and hit ENTER.

Javascript OCR Pdf-to-text


PDF-to-Text is a mobile-ready, pure JavaScript OCR tool, built on the tesseract.js API, that converts text images in PDFs to text.


PDF-to-Text uses a number of open source projects to work properly:

  • [JavaScript] – awesome!
  • [HTML] – HTML enhanced for web apps!
  • [CSS] – Fence!
  • [Magic] – that's nice!


PDF-to-Text requires Node.js v4+ or any server environment to run.

Start the server.

$ npm install http-server -g
$ cd pdf-to-text-master
$ http-server

Text Fragments Draft Community Group Report, 9 June 2020


Text Fragments adds support for specifying a text snippet in the URL fragment. When navigating to a URL with such a fragment, the user agent can quickly emphasise the snippet and/or bring it to the user’s attention.

The core use case for text fragments is to allow URLs to serve as an exact text reference across the web. For example, Wikipedia references could link to the exact text they are quoting from a page. Similarly, search engines can serve URLs that direct the user to the answer they are looking for in the page rather than linking to the top of the page.

2.1.2. User sharing

With text fragments, browsers may implement an option to ‘Copy URL to here’ when the user opens the context menu on a text selection. The browser can then generate a URL with the text selection appropriately specified, and the recipient of the URL will have the specified text conveniently indicated. Without text fragments, if a user wants to share a passage of text from a page, they would likely just copy and paste the passage, in which case the receiver loses the context of the page.

This specification intentionally doesn’t define what actions a user agent should or could take to “indicate” a text match. There are different experiences and trade-offs a user agent could make. Some examples of possible actions:

  • Providing visual emphasis or highlight of the text passage
  • Automatically scrolling the passage into view when the page is navigated
  • Activating a UA’s find-in-page feature on the text passage
  • Providing a “Click to scroll to text passage” notification
  • Providing a notification when the text passage isn’t found in the page


3.2. Syntax

This section is non-normative

A text fragment directive is specified in the fragment directive (see § 3.3 The Fragment Directive) with the following format:

#:~:text=[prefix-,]textStart[,textEnd][,-suffix]
          context  |-------match-----|  context

(Square brackets indicate an optional parameter)

The text parameters are percent-decoded before matching. Dash (-), ampersand (&), and comma (,) characters in text parameters must be percent-encoded to avoid being interpreted as part of the text directive syntax.
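As an illustration of the encoding rule above, here is a minimal Python sketch. The helper name encode_text_param is hypothetical and not part of the specification; it relies on urllib.parse.quote from the standard library, which never encodes the dash, so that character is handled separately.

```python
from urllib.parse import quote, unquote


def encode_text_param(text: str) -> str:
    """Percent-encode a text parameter for use in a text directive.

    Hypothetical helper for illustration only.
    """
    # quote() with safe="" encodes "&", "," and spaces, but it always
    # leaves "-" untouched, so the dash is replaced explicitly.
    return quote(text, safe="").replace("-", "%2D")


encoded = encode_text_param("well-known, R&D")
print(encoded)           # -> well%2Dknown%2C%20R%26D
print(unquote(encoded))  # -> well-known, R&D
```

Decoding with unquote recovers the original string, which mirrors the percent-decoding step applied before matching.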

The only required parameter is textStart. If only textStart is specified, the first instance of this exact text string is the target text.

#:~:text=an%20example%20text%20fragment indicates that the exact text “an example text fragment” is the target text.

If the textEnd parameter is also specified, then the text directive refers to a range of text in the page. The target text range starts at the first instance of textStart and runs until the first instance of textEnd that appears after textStart. This is equivalent to specifying the entire text range in the textStart parameter, but allows the URL to avoid being bloated with a long text directive.

#:~:text=an%20example,text%20fragment indicates that the first instance of “an example” until the following first instance of “text fragment” is the target text.
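The range rule can be sketched in Python as follows. This is a deliberately simplified model, not the matching algorithm a user agent would run: it treats the page as a single plain string, matches case-sensitively, and ignores word boundaries and document structure. The function name find_target_range is hypothetical.

```python
from typing import Optional, Tuple


def find_target_range(
    page_text: str, text_start: str, text_end: Optional[str] = None
) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the target text, or None if absent."""
    # First instance of textStart.
    start = page_text.find(text_start)
    if start == -1:
        return None
    after_start = start + len(text_start)
    if text_end is None:
        # Only textStart given: the exact string is the target.
        return (start, after_start)
    # First instance of textEnd that appears after textStart.
    end = page_text.find(text_end, after_start)
    if end == -1:
        return None
    return (start, end + len(text_end))


page = "This is an example of a text fragment, and an example again."
span = find_target_range(page, "an example", "text fragment")
if span is not None:
    print(page[span[0]:span[1]])  # -> an example of a text fragment
```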

3.2.1. Context Terms

This section is non-normative

The other two optional parameters are context terms. They are specified by the dash (-) character succeeding the prefix and preceding the suffix, to differentiate them from the textStart and textEnd parameters, as any combination of optional parameters may be specified.

Context terms are used to disambiguate the target text fragment. The context terms can specify the text immediately before (prefix) and immediately after (suffix) the text fragment, allowing for whitespace.

While the context terms must be the immediate text surrounding the target text fragment, any amount of whitespace is allowed between context terms and the text fragment. This allows context terms to span element boundaries, for example if the target text fragment is at the beginning of a paragraph and it must be disambiguated by the previous element’s text as a prefix.

The context terms are not part of the targeted text fragment and must not be visually indicated.

#:~:text=this%20is-,an%20example,-text%20fragment would match to “an example” in “this is an example text fragment”, but not match to “an example” in “here is an example text”.
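The example above can be reproduced with a short Python sketch that parses the value after text= and then searches a plain string. It is an approximation under stated assumptions: TextDirective, parse_text_directive and find_match are hypothetical names, the page is modelled as one string, and a regular expression stands in for the real matching procedure.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import unquote


class TextDirective(NamedTuple):
    prefix: Optional[str]
    text_start: str
    text_end: Optional[str]
    suffix: Optional[str]


def parse_text_directive(value: str) -> TextDirective:
    """Parse the part after 'text=' into its four parameters."""
    parts = value.split(",")
    prefix = suffix = None
    # A trailing dash marks the prefix, a leading dash marks the suffix.
    if len(parts) > 1 and parts[0].endswith("-"):
        prefix = unquote(parts.pop(0)[:-1])
    if len(parts) > 1 and parts[-1].startswith("-"):
        suffix = unquote(parts.pop()[1:])
    if len(parts) not in (1, 2):
        raise ValueError(f"malformed text directive: {value!r}")
    text_end = unquote(parts[1]) if len(parts) == 2 else None
    return TextDirective(prefix, unquote(parts[0]), text_end, suffix)


def find_match(page_text: str, d: TextDirective) -> Optional[str]:
    """Return the matched target text, excluding the context terms."""
    pattern = ""
    if d.prefix is not None:
        # Any amount of whitespace may separate context and target.
        pattern += re.escape(d.prefix) + r"\s*"
    pattern += "(" + re.escape(d.text_start)
    if d.text_end is not None:
        pattern += ".*?" + re.escape(d.text_end)
    pattern += ")"
    if d.suffix is not None:
        pattern += r"\s*" + re.escape(d.suffix)
    m = re.search(pattern, page_text, re.DOTALL)
    return m.group(1) if m else None


d = parse_text_directive("this%20is-,an%20example,-text%20fragment")
print(find_match("this is an example text fragment", d))  # -> an example
print(find_match("here is an example text", d))           # -> None
```

Only the captured group is returned, which reflects the rule that context terms are not part of the targeted text.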