Zero-Width Characters: Invisibly fingerprinting text

Journalists, watch out: you may be unintentionally revealing sources.

Countermeasures for journalists or others engaged with leakers, in decreasing order of effectiveness:

  • Avoid releasing excerpts and raw documents.
  • Get the same documents from multiple leakers and confirm the copies are byte-for-byte identical.
  • Manually retype excerpts to avoid invisible characters and homoglyphs.
  • Keep excerpts short to limit the amount of information shared.
  • Use a tool that strips non-whitelisted characters from text before sharing it with others (a sketch of such a filter follows this list).
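
One way to implement that last countermeasure is a small whitelist filter. The sketch below is only illustrative (the character set and the strip_invisible name are assumptions, not taken from any particular tool): it drops zero-width and other invisible "format" characters and, optionally, anything outside an allowed set, which also catches homoglyphs.

import unicodedata

# A few zero-width code points commonly used for fingerprinting (not exhaustive).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_invisible(text, allowed=None):
    """Drop zero-width/format characters; optionally enforce a whitelist."""
    cleaned = []
    for ch in text:
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf":
            continue  # invisible formatting character: a likely fingerprint
        if allowed is not None and ch not in allowed:
            continue  # outside the whitelist (also catches homoglyphs)
        cleaned.append(ch)
    return "".join(cleaned)

print(strip_invisible("confidential\u200b excerpt"))  # -> "confidential excerpt"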

w3lib.url.canonicalize_url

w3lib.url.canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None)

Canonicalize the given url by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent-encode paths; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
  • percent-encode query arguments; non-ASCII characters are percent-encoded using the passed encoding (UTF-8 by default)
  • normalize all spaces in query arguments to ‘+’ (plus symbol)
  • normalize the case of percent-encodings (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url(u'http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'
>>>
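>>> # The examples below are an illustrative continuation, not from the upstream
>>> # doctest; the outputs assume the default behaviour documented above.
>>>
>>> # blank query values are kept by default, dropped with keep_blank_values=False
>>> w3lib.url.canonicalize_url('http://www.example.com/do?b=1&a=')
'http://www.example.com/do?a=&b=1'
>>> w3lib.url.canonicalize_url('http://www.example.com/do?b=1&a=', keep_blank_values=False)
'http://www.example.com/do?b=1'
>>>
>>> # fragments are removed unless keep_fragments=True
>>> w3lib.url.canonicalize_url('http://www.example.com/page#section')
'http://www.example.com/page'
>>> w3lib.url.canonicalize_url('http://www.example.com/page#section', keep_fragments=True)
'http://www.example.com/page#section'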