Analyzing Phishing Emails

When looking at phishing emails, I am interested primarily in:

  • The email, including the source and headers
  • The URL in the email
  • Any attachments

To get to those, you have to open the email. This can be done by literally opening it and manually pulling that stuff out, or you can do it programmatically.

The manual way

You can actually open the email using an email client. Best practice is to use an email client that isn’t also used for your email (as in, a separate email program) and to disable loading of remote assets (so no network connections are made to grab images or other items in the email).

I find Thunderbird to be the best for my needs. Each time after starting it, I have to do the same steps:

1: Click out of the prompt to set up an email account

2: Click through a menu to open (no keyboard shortcut for this in Thunderbird)

3: Notice that the email is not the default format for Thunderbird to open

4: Change the option to open all files

5: FINALLY open the file

6: Then deal with ANOTHER account set up page

7: THEN, confirm you really want to exit the account setup:

Ok, great. There’s the URL. If there are any attachments, they can be exported for analysis. But you run a few risks: if you fudge the right-click on the URL to copy it, you might accidentally click it and open it in your browser. If you download the attachment, it might open automatically or land somewhere on your drive, and you may forget to delete it after submitting it to VirusTotal or wherever.

In a pinch, doing all this works, but:

We can probably automate this!

The file I’m working with is all kinds of wrong. Before we successfully read it, let’s look at my failed attempt at opening the email with Python:

We should be able to open the email with:

import email
msgfile = 'FW Validate Your Email Now !.msg'
fp = open(msgfile)
msg = email.message_from_file(fp)

But we have a problem with this specific email. When attempting to open it, we get the following error message after that last line:

There’s a problem with the encoding. Usually, an email opens fine with the default encoding that Python’s open() picks (UTF-8 on most systems), or by specifying ISO-8859-1 with fp = open(msgfile, encoding="ISO-8859-1"). But it’s just not working.

There’s no error message when specifying ISO-8859-1, because that encoding maps every possible byte to a character and so never fails to decode. But len(msg) comes back as 0, meaning the parser found no headers in ‘msg’.
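To see why a clean decode doesn’t mean a good parse, here is a minimal sketch using a few made-up binary bytes as a stand-in for the file (they are not the contents of the actual email). UTF-8 rejects them, ISO-8859-1 accepts them, and the email parser still finds zero headers:

```python
import email

# Stand-in for a file that is not plain-text email (hypothetical bytes).
raw = b"\xd0\xcf\x11\xe0 not an RFC 822 message"

# UTF-8 is strict: invalid byte sequences raise UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to a character, so this never raises.
text = raw.decode("iso-8859-1")

# The parser accepts the text, but nothing in it looks like a header.
parsed = email.message_from_string(text)
header_count = len(parsed)  # len() of a Message is its header count

print(utf8_ok, header_count)
```

So a header count of 0 is a decent hint that the file isn’t a plain-text email at all, whatever encoding you throw at it.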

Ok, so we have to go back into Thunderbird, since in it we can find out the encoding of an email by analyzing the source. It should be listed in the headers, looking something like this:
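For a well-formed plain-text email, the same information can be read in Python from the Content-Type header. This is a small sketch with a made-up sample message (the addresses, subject, and charset are assumptions for illustration, not taken from the phishing email):

```python
import email

# Hypothetical, well-formed message used only for illustration.
sample = (
    "From: sender@example.com\n"
    "Subject: test\n"
    'Content-Type: text/plain; charset="ISO-8859-1"\n'
    "\n"
    "body\n"
)

msg_sample = email.message_from_string(sample)

# get_content_charset() returns the charset parameter, lowercased,
# or None when the header does not declare one.
charset = msg_sample.get_content_charset()
print(charset)
```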

However, if we try to view the source of this email back in Thunderbird, we get a mess:

Ok, so this isn’t working…why?

Actually, this email extraction would work on some emails, but this appears to be an Outlook msg file. It’s ‘special’, so we have to try a different approach.
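Outlook msg files are stored in the OLE2 compound file format, which starts with a fixed 8-byte signature. A rough sketch of checking for it before picking a parser follows; the function names are my own, and the check is only a hint, since other OLE2 files (old Word and Excel documents, for example) share the same signature:

```python
# First 8 bytes of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def has_ole2_signature(data):
    """Return True if the given bytes start with the OLE2 signature."""
    return data[:8] == OLE2_MAGIC


def looks_like_outlook_msg(path):
    """Read the first 8 bytes of a file and check for the OLE2 signature."""
    with open(path, "rb") as fh:
        return has_ole2_signature(fh.read(8))
```

If the check passes, it makes sense to reach for an msg-specific parser rather than the standard email module.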

Let’s start again:

Let’s install a Python module called ‘extract_msg’. It’s used for extracting data in Outlook msg files and can be installed with pip install extract_msg.

Starting our script, we import the new module:

import extract_msg

Then open the file and get it ready for action:

msgfile = r'FW Validate Your Email Now !.msg'
msg = extract_msg.Message(msgfile)

We can now look at various parts of the message, such as the sender, date, subject, body, and more, with things like: msg.sender, msg.date, msg.subject, msg.body
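To pull those fields together in one place, something like the sketch below could work. The summarize helper is hypothetical, and the stand-in object only mimics the four attributes named above so the snippet runs without a real msg file; with extract_msg you would pass the Message object instead:

```python
from types import SimpleNamespace


def summarize(msg, body_chars=200):
    """Collect the main fields of a message-like object into a dict."""
    body = msg.body or ""
    return {
        "sender": msg.sender,
        "date": msg.date,
        "subject": msg.subject,
        "body_preview": body[:body_chars],
    }


# Stand-in object with made-up values, used only for illustration.
fake_msg = SimpleNamespace(
    sender="Someone <someone@example.com>",
    date="Mon, 01 Jan 2024 00:00:00 +0000",
    subject="Example subject",
    body="Example body text\r\n",
)

summary = summarize(fake_msg)
for field, value in summary.items():
    print(f"{field}: {value!r}")
```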

For example, if running in an interactive shell and we want to read the body of the email:

This displayed it with all the \n and \r, and is difficult to read. We can print the body so it’s cleaner by just using the print command:
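The difference comes down to the shell echoing the string’s repr, while print writes the characters themselves. A tiny sketch with a made-up body (not the text of this email):

```python
# Hypothetical body text with Windows-style line endings.
sample_body = "Dear user,\r\n\r\nPlease validate your account.\r\n"

# What an interactive shell echoes: escape sequences shown literally.
echoed = repr(sample_body)
print(echoed)

# What print shows: the line breaks are actually rendered.
print(sample_body)
```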

This particular email doesn’t have an attachment, but it does have a URL. There are several ways to get it, such as regular expressions, or splitting the email body into a list of words and using the tldextract module to pick out the ones that contain a domain. It’s more efficient to use regular expressions on the raw text first and then run tldextract on the matches.

Import the module for regular expressions:

import re

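And using the following regex code, we should get most of the URLs stored in a list called urls. Because the scheme is optional, it will also match other dotted strings such as bare domains and file names:

msg_body = msg.body
urls = re.findall(r'(?:https?://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', msg_body)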

That gives us:

Now we can analyze the malicious URL:
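One way to start on that without ever opening the link is to reduce each match to its hostname. This sketch uses the standard library’s urllib.parse rather than tldextract, and the hostnames helper and the example inputs are my own, not the URL from this email:

```python
from urllib.parse import urlsplit


def hostnames(urls):
    """Return the unique hostnames found in a list of URL-like strings."""
    seen = []
    for url in urls:
        # urlsplit only finds a host when a scheme is present, so add one
        # to scheme-less matches such as "example.com/login".
        candidate = url if "://" in url else "http://" + url
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            continue  # skip strings that cannot be parsed as a URL
        if host and host not in seen:
            seen.append(host)
    return seen


# Made-up matches standing in for the output of re.findall.
found = hostnames([
    "http://example.com/login?id=1",
    "example.com/other",
    "www.example.org",
])
print(found)
```

The resulting list is a reasonable thing to hand to a reputation lookup, since it strips paths and query strings and collapses duplicates.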

Ok, we’ve made some progress, but there’s a lot more that can be done. This ended up being a custom solution to one problem. It’s not perfect and it’s not complete yet.

I’ll cover the following in future posts:

  • Detecting the type of email file so it can be correctly opened
  • Sending any domains and URLs to third-party analysis, and automatically visiting them through Tor
  • Pulling out and submitting attachments to VirusTotal and/or Cuckoo Sandbox
  • Automatic analysis of the headers
  • Processing emails as they come in on a mail server