Collecting Favicon data

Favicons are the little icons that appear in the tab of your web browser when you visit most websites. They’re small image files, usually a PNG or ICO, often served with the extension ‘.ico’. While generally considered benign, it’s possible they could be used for some sort of malicious activity with a little creativity. However, what I’m going to write about here is collecting favicons to compare phishing pages against the legitimate websites they impersonate.

Most commodity phishing borrows components from the real website it imitates: things like images, text, and favicons. If you build a benign set of favicon hashes from popular domains and then hash the favicon of something that might be phishing, you’ll likely see a matching hash, since a phishing actor probably just took the real website’s favicon.
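To make the idea concrete, here’s a minimal sketch of the comparison, assuming you’ve already built a benign_set dictionary mapping favicon hashes to domains (the names here are mine, not part of the scraper below):

# hypothetical lookup table built from the benign set: {sha256: domain}
benign_set = {
    '6da5620880159634213e197fafca1dde0272153be3e4590818533fab8d040770': 'google.com',  # from the output further down
}

def match_benign(suspect_hash):
    # return the legitimate domain that uses this favicon, or None if it's unseen
    return benign_set.get(suspect_hash)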

How does this help in phishing detection? Well, I’m not 100% sure yet – it’s a work in progress. I’m thinking that if you have a site which is pretending to be PayPal[.]com, but it’s at a different domain, you could compare things like the text in the page, the location a submit button sends you to, the A record the domain is hosted on, and anything else you can get your hands on. If the domain is using different infrastructure, but the favicon is the same, maybe that’s something of interest – or maybe it’s not. A more interesting idea might be to take a set of favicons from known malicious sites and see which ones are sharing icons, or to otherwise find patterns in them.
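If you wanted to play with the infrastructure angle, a rough sketch might just resolve both domains and compare (socket.gethostbyname only returns one A record, so this is a coarse check, and the function name is mine):

import socket

def a_records_differ(suspect_domain, legit_domain):
    # a matching favicon plus different hosting is the combination that might be interesting
    try:
        return socket.gethostbyname(suspect_domain) != socket.gethostbyname(legit_domain)
    except socket.gaierror:
        return True  # treat a domain that won't resolve as "different"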

This is still in-progress research, but I thought it might be interesting to write about where it’s at and to share some of the code I’ve written.

The Benign Set:

Let’s start by building a benign set of favicons. To do that, we need legitimate domains. Thankfully, my work makes available a list of the top 1 million domains as seen through our resolvers. It’s calculated daily, with the most popular domains starting at the top.

The list can be found here.

Either download and uncompress the CSV manually or use the following code to do so:

import requests, zipfile

def download_top_domains(): # download and unzip the top domains list
    r = requests.get('http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip',stream=True)
    with open('top-1m.csv.zip', 'wb') as f:       # save the zip locally first
        f.write(r.content)
    with zipfile.ZipFile('top-1m.csv.zip') as z:  # then extract top-1m.csv
        z.extractall()

I don’t really want to scrape all 1 million domains, so in a terminal we can trim the list by typing head -500 top-1m.csv > 500.csv. That gives us the top 500 domains; change 500 to whatever number you’re interested in.

To get favicons, I’m using the great favicon module. I also use hashlib to generate the SHA256, tldextract to get rid of subdomains, and urllib to break apart URLs.

The imports for the favicon scraper script:

import favicon, requests, hashlib, csv, tldextract
from urllib.parse import urlparse   # get the filename from the URL path

Specify the name of the input file and the folder to save the downloaded favicons. Make sure to create the ‘favicons’ folder in the same directory where you’re running this script.

inputfile = '500.csv'
savefolder = './favicons'
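If you’d rather have the script create the folder itself, a one-liner with os.makedirs does it (exist_ok keeps it from erroring if the folder is already there):

import os
os.makedirs(savefolder, exist_ok=True)  # creates ./favicons if it doesn't exist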

Function to hash the icons after downloading them:

def hash_icon(filename):    # compute sha256 hash
    BUF_SIZE = 65536  # read stuff in 64kb chunks
    sha256 = hashlib.sha256()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            sha256.update(data)
    return(sha256.hexdigest())

Function to read the top domains file:

def read_top_domains(filename):
    top_domains = []
    with open(filename,newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            extracted_domain = tldextract.extract(row[1])
            domain = "{}.{}".format(extracted_domain.domain,extracted_domain.suffix)
            top_domains.append(domain)
    return(top_domains)

The main ‘process_icons’ function works as follows.

We’ll add ‘https://’ in front of the domain if it doesn’t already have a protocol, because the favicon library looks up the domain as-is and will fail without one. Prepending the protocol may not always work and you’ll likely see some failures, but it works well enough.

Following that, we change the user agent from ‘python-requests’ to one that looks like a browser, in case any of these sites refuses to return content to Python scrapers.

Then, we get any icons found on the website, and iterate through them doing the following:

  • Get the icon filename from the icon URL and the first-level domain (no subdomains).
  • Download the icon and save it as ‘domain_iconname’.
  • Calculate a SHA256 hash of the icon and check that it hasn’t already been collected.
  • Add the information to a spreadsheet. In my systems at work, I write this to a Google Sheet, but for now it just prints to the terminal (there’s a small sketch of writing to a local CSV after the function below). I’m working on a write-up on sending information to Google Sheets.

shahashdeduplicator = []
def process_icons(top_domains):
    for d in top_domains:

        if not d.startswith('http'):
            domain = 'https://' + d
        else:
            domain = d
        try:
            user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            headers = {'User-Agent': user_agent}

            icons = favicon.get(domain, headers=headers, timeout=2)
            for icon in icons:  # for each icon in the list of found icons

                response = requests.get(icon.url, stream=True, headers=headers)
                filename = urlparse(icon.url).path.split('/')[-1]
                domain_name = urlparse(domain).netloc
                save_icon_filename = "{}/{}_{}".format(savefolder,domain_name,filename)
                with open(save_icon_filename, 'wb') as image:
                    for chunk in response.iter_content(1024):
                        image.write(chunk)
                sha256 = hash_icon(save_icon_filename)
                line = [d,filename,sha256]
                if sha256 not in shahashdeduplicator:
                    print(line)
                    shahashdeduplicator.append(sha256)
        except Exception as E:
            print("FAILED: {}".format(d))

At the bottom of the script, you make your function calls:

top_domains = set(read_top_domains(inputfile))
print("{} Total first level domains from this list".format(len(top_domains)))
process_icons(top_domains)

Something like the following will be displayed to the terminal (Domain, filename, SHA256):

['google.com', 'favicon.ico', '6da5620880159634213e197fafca1dde0272153be3e4590818533fab8d040770']
['netflix.com', 'nficon2016.ico', 'abe8012eb65c0dc0ac3e87dcc1e60e1908ebd8f12b7c47a5df1856f7a7bb1edd']
['netflix.com', 'nficon2016.png', '7341f7b8b0ae3c0da4aea559efc31f0b53d9db9dd291664fdcf7d618fd95ed8a']

And the icons can be found in the favicons folder, named with the domain_filename. I don’t really use them after this.

Here’s the complete script:

import favicon, requests, hashlib, zipfile, csv, tldextract
from urllib.parse import urlparse   # to get the filename from the URL path

inputfile = '500.csv'
savefolder = './favicons'

def hash_icon(filename):    # compute sha256 hash
	BUF_SIZE = 65536  # read stuff in 64kb chunks
	sha256 = hashlib.sha256()
	with open(filename, 'rb') as f:
		while True:
			data = f.read(BUF_SIZE)
			if not data:
				break
			sha256.update(data)
	return(sha256.hexdigest())

def download_top_domains(): # download and unzip the top domains list
	r = requests.get('http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip',stream=True)
	with open('top-1m.csv.zip', 'wb') as f:       # save the zip locally first
		f.write(r.content)
	with zipfile.ZipFile('top-1m.csv.zip') as z:  # then extract top-1m.csv
		z.extractall()
	
def read_top_domains(filename):
	top_domains = []
	with open(filename,newline='') as csvfile:
		csvreader = csv.reader(csvfile, delimiter=',')
		for row in csvreader:
			extracted_domain = tldextract.extract(row[1])
			domain = "{}.{}".format(extracted_domain.domain,extracted_domain.suffix)
			top_domains.append(domain)
	return(top_domains)

shahashdeduplicator = []

def process_icons(top_domains):
	for d in top_domains:
		
		if not d.startswith('http'):
			domain = 'https://' + d
		else:
			domain = d
		try:
			user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
			headers = {'User-Agent': user_agent}

			icons = favicon.get(domain, headers=headers, timeout=2)
			for icon in icons:  # for each icon in the list of found icons

				response = requests.get(icon.url, stream=True, headers=headers)
				filename = urlparse(icon.url).path.split('/')[-1]
				domain_name = urlparse(domain).netloc
				save_icon_filename = "{}/{}_{}".format(savefolder,domain_name,filename)
				with open(save_icon_filename, 'wb') as image:
					for chunk in response.iter_content(1024):
						image.write(chunk)
				sha256 = hash_icon(save_icon_filename)
				line = [d,filename,sha256]
				if sha256 not in shahashdeduplicator:
					print(line)
					shahashdeduplicator.append(sha256)
		except Exception as E:
			print("FAILED: {}".format(d))

# only if you need to download the top domains list for popularity
# download_top_domains()

top_domains = set(read_top_domains(inputfile))
print("{} Total first level domains from this list".format(len(top_domains)))
process_icons(top_domains)