URL Decoding

Some security products that detect phishing URLs in an email rewrite the URL so it’s not easy to click on. One example is the poorly named ATP (it stands for Advanced Threat Protection, and is easy to confuse with APT, Advanced Persistent Threat), which includes a feature called ‘safelinks’. Outlook Web Access uses it to encode links that are marked as unsafe, like this (I changed every http to hXXp):

hXXps://system.safelinks.protection.outlook.com/?url=hXXp%3A%2F%2Fwww.companydomain.com%2F&data=02%7C01%7CJohn.Doe%40companydomain.com%7Cd8229186601149a0a50e08d6bcf96782%7C660292d2cfd54a3db7a7e8f7ee458a0a%7C0%7C0%7C636904176696803378&sdata=O2s6j74FrcOVv1T1x4todcmOe5wOyBAcuDn8skOcfds%3D&reserved=0

This seems kind of nice, but for me and other researchers, it’s pretty annoying. When someone sends us a potential phishing email to investigate, we have to spend time and brain power figuring out the real URL. It’s right there in the long URL, but it’s buried under percent-encoding and extra parameters, so a quick glance isn’t enough to pull it out. If you have enough minor annoyances like this during your work day, they add up to a big headache and a desire to leave the security world forever.

It’s easy to write a Python script that takes a URL like this and prints out the actual URL for quicker analysis.

from urllib.parse import unquote # python 3
import re

obfuscated_url = 'hXXps://system.safelinks.protection.outlook.com/?url=hXXp%3A%2F%2Fwww.companydomain.com%2F&data=02%7C01%7CJohn.Doe%40companydomain.com%7Cd8229186601149a0a50e08d6bcf96782%7C660292d2cfd54a3db7a7e8f7ee458a0a%7C0%7C0%7C636904176696803378&sdata=O2s6j74FrcOVv1T1x4todcmOe5wOyBAcuDn8skOcfds%3D&reserved=0'

def decode_url(url):
    # Re-fang the scheme. Adjust if the obfuscated URL doesn't write http as hxxp.
    # Note: lower() also lowercases the path and query string.
    url = url.lower().replace('hxxp', 'http')
    decoded_url = unquote(url)
    try:
        final_url = ''
        # Each chunk after a split on 'http' is a candidate URL; the last one wins.
        for chunk in decoded_url.split('http'):
            if not chunk:
                continue
            _url = 'http' + chunk
            # Keep only runs of URL characters that contain a dot.
            matches = re.findall(r'(?:https?://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', _url)
            final_url = "/".join(matches)
        return final_url
    except Exception:
        return url

de_obfuscated_url = decode_url(obfuscated_url)
print(de_obfuscated_url)

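When you run this, you’ll get back the original URL, followed by fragments of the recipient’s address that the regex also picked up from the data parameter: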

http://www.companydomain.com//john.doe/companydomain.com
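The output above still carries pieces of the data parameter after the real URL. If the wrapped link always keeps the target in a query parameter named url, as it does in the sample here, a shorter approach is to let the standard library parse the query string. This is only a sketch under that assumption; the function name extract_safelinks_target is made up for illustration, and the sample link is shortened.

```python
import re
from urllib.parse import urlsplit, parse_qs

def extract_safelinks_target(wrapped_url):
    """Return the decoded 'url' query parameter, or None if it is missing."""
    # Re-fang hXXp/hxxp to http without lowercasing the rest of the URL.
    refanged = re.sub(r'hxxp', 'http', wrapped_url, flags=re.IGNORECASE)
    # parse_qs splits the query string and percent-decodes each value.
    params = parse_qs(urlsplit(refanged).query)
    targets = params.get('url')
    return targets[0] if targets else None

# Shortened version of the sample link (assumed layout: target in 'url').
wrapped = ('hXXps://system.safelinks.protection.outlook.com/'
           '?url=hXXp%3A%2F%2Fwww.companydomain.com%2F'
           '&data=02%7C01%7CJohn.Doe%40companydomain.com&reserved=0')
print(extract_safelinks_target(wrapped))
```

Because only the url parameter is read, the email address in data never leaks into the result, and the case of the path is preserved. If a product stores the target under a different parameter name, the lookup key would need to change.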