Journalists watch out—you may be unintentionally revealing sources.
Countermeasures for journalists or others engaged with leakers, in decreasing order of effectiveness:
- Avoid releasing excerpts and raw documents.
- Get the same documents from multiple leakers to ensure they have the exact same content on a byte-by-byte level.
- Manually retype excerpts to avoid invisible characters and homoglyphs.
- Keep excerpts short to limit the amount of information shared.
- Use a tool that strips non-whitelisted characters from text before sharing it with others.
Canonicalize the given url by applying the following procedures:
- sort query arguments, first by key, then by value
- percent encode paths ; non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
- percent encode query arguments ; non-ASCII characters are percent-encoded using passed encoding (UTF-8 by default)
- normalize all spaces (in query arguments) ‘+’ (plus symbol)
- normalize percent encodings case (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)
The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).>>> import w3lib.url >>> >>> # sorting query arguments >>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50') 'http://www.example.com/do?a=50&b=2&b=5&c=3' >>> >>> # UTF-8 conversion + percent-encoding of non-ASCII characters >>> w3lib.url.canonicalize_url(u'http://www.example.com/r\u00e9sum\u00e9') 'http://www.example.com/r%C3%A9sum%C3%A9' >>>
Track your expenses, send invoices, get paid and balance your books with Wave.