Zero-Width Characters: Invisibly fingerprinting text

Journalists, watch out: you may be unintentionally revealing your sources.

Countermeasures for journalists or others engaged with leakers, in decreasing order of effectiveness:

  • Avoid releasing excerpts and raw documents.
  • Obtain the same documents from multiple leakers and confirm that the copies are identical byte for byte.
  • Manually retype excerpts to avoid invisible characters and homoglyphs.
  • Keep excerpts short to limit the amount of information shared.
  • Use a tool that strips non-whitelisted characters from text before sharing it with others.
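
The last countermeasure can be sketched in a few lines of Python. This is a minimal illustration, not a vetted tool: the function name `strip_non_whitelisted` and the whitelist (printable ASCII plus newline and tab) are assumptions made for the example. A real whitelist would need to admit whatever legitimate non-ASCII characters the text uses.

```python
import unicodedata

# Assumed whitelist for this sketch: printable ASCII plus newline and tab.
ALLOWED = set(map(chr, range(0x20, 0x7F))) | {"\n", "\t"}

def strip_non_whitelisted(text, allowed=ALLOWED):
    """Return (clean_text, removed), where removed lists (index, name) pairs
    for every character that was not on the whitelist."""
    kept = []
    removed = []
    for i, ch in enumerate(text):
        if ch in allowed:
            kept.append(ch)
        else:
            # Fall back to the code point when the character has no Unicode name.
            removed.append((i, unicodedata.name(ch, "U+%04X" % ord(ch))))
    return "".join(kept), removed

# A zero-width space and a zero-width non-joiner hidden in the text.
sample = "top\u200bsecret\u200c memo"
clean, removed = strip_non_whitelisted(sample)
print(clean)
print(removed)
```

Reporting what was removed matters as much as removing it: a dropped homoglyph (for example a Cyrillic letter standing in for a Latin one) leaves a visible gap in the word, which signals that the excerpt was marked and should be retyped by hand.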

w3lib.url.canonicalize_url

w3lib.url.canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None)

Canonicalize the given url by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent-encode paths; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
  • percent-encode query arguments; non-ASCII characters are percent-encoded using the passed encoding (UTF-8 by default)
  • normalize all spaces (in query arguments) to ‘+’ (plus symbol)
  • normalize percent encodings case (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url(u'http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'
>>>
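
To make the effect of keep_blank_values and keep_fragments concrete, here is a rough standard-library approximation of three of the procedures listed above (sorting query arguments, dropping blank values, dropping fragments). It is not w3lib's implementation: the function name `canonicalize_sketch` is hypothetical, and path encoding, the `encoding` parameter and percent-encoding case normalization are deliberately left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonicalize_sketch(url, keep_blank_values=True, keep_fragments=False):
    """Approximate a subset of the canonicalization steps using only urllib."""
    parts = urlsplit(url)
    # parse_qsl drops blank-valued arguments unless keep_blank_values is True.
    pairs = parse_qsl(parts.query, keep_blank_values=keep_blank_values)
    # Sorting (key, value) tuples orders by key first, then by value.
    pairs.sort()
    # urlencode writes spaces in query arguments as '+'.
    query = urlencode(pairs)
    fragment = parts.fragment if keep_fragments else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))

print(canonicalize_sketch("http://www.example.com/do?c=3&b=5&b=2&a=50"))
print(canonicalize_sketch("http://www.example.com/do?b=&a=1#frag",
                          keep_blank_values=False))
print(canonicalize_sketch("http://www.example.com/do?b=&a=1#frag",
                          keep_fragments=True))
```

The first call reproduces the sorted-query result shown in the example above. The other two show that a blank-valued argument survives only when keep_blank_values is true and a fragment survives only when keep_fragments is true.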

 

Related Pins at Pinterest: The Evolution of a Real-World Recommender System

Related Pins is the Web-scale recommender system that powers over 40% of user engagement on Pinterest. This paper is a longitudinal study of three years of its development, exploring the evolution of the system and its components from prototypes to present state. Each component was originally built with many constraints on engineering effort and computational resources, so we prioritized the simplest and highest-leverage solutions. We show how organic growth led to a complex system and how we managed this complexity. Many challenges arose while building this system, such as avoiding feedback loops, evaluating performance, activating content, and eliminating legacy heuristics. Finally, we offer suggestions for tackling these challenges when engineering Web-scale recommender systems.