Practical Security

I'd like to discuss security in the practical digital world.

Extracting Data from XML with Python

Posted on May 31, 2016 by Hojin

I usually run md5deep64.exe with the “-d” parameter, which writes the result as XML containing both the full path and the MD5 value of each file.

C:\> md5deep64.exe -r -d * > C:\%COMPUTERNAME%_%DATE%.xml
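The same run can also be started from Python, which helps when %DATE% expands to text containing "/" or spaces under some regional settings and so cannot be used in a file name. The following is only a minimal sketch: the helper names output_path and run_md5deep, the ISO date in the file name, and passing a root directory instead of "*" are my assumptions, not part of the original command.

```python
import datetime
import subprocess


def output_path(computer_name, day=None, folder="C:\\"):
    """Build '<folder><computer>_<YYYY-MM-DD>.xml' (naming scheme is an assumption)."""
    day = day or datetime.date.today()
    return "%s%s_%s.xml" % (folder, computer_name, day.isoformat())


def run_md5deep(root, out_file, exe="md5deep64.exe"):
    """Run md5deep recursively (-r) with XML output (-d); save stdout to out_file."""
    with open(out_file, "wb") as out:
        subprocess.check_call([exe, "-r", "-d", root], stdout=out)


# Hypothetical usage (needs md5deep64.exe on the PATH):
#   run_md5deep("C:\\", output_path("MYPC"))
```

An ISO date keeps the file name valid and sortable regardless of the Windows locale.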

The XML file produced by md5deep looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<dfxml xmloutputversion='1.0'>
<metadata
xmlns='http://md5deep.sourceforge.net/md5deep/'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:type>Hash List</dc:type>
</metadata>
<creator version='1.0'>
<program>MD5DEEP</program>
<version>4.3</version>
<build_environment>
<compiler>GCC 4.7</compiler>
</build_environment>
<execution_environment>
<command_line>c:\temp\md5deep64.exe -r -d *</command_line>
<start_time></start_time>
</execution_environment>
</creator>
<configuration>
<algorithms>
<algorithm name='md5' enabled='1'/>
<algorithm name='sha1' enabled='0'/>
<algorithm name='sha256' enabled='0'/>
<algorithm name='tiger' enabled='0'/>
<algorithm name='whirlpool' enabled='0'/>
</algorithms>
</configuration>
<fileobject>
<filename>C:\bootmgr</filename>
<filesize>398356</filesize>
<ctime></ctime>
<mtime></mtime>
<atime></atime>
<hashdigest type='MD5'>55272fe96ad87017755fd82f7928fda0</hashdigest>
</fileobject>
<fileobject>
<filename>C:\BOOTNXT</filename>
<filesize>1</filesize>
<ctime></ctime>
<mtime></mtime>
<atime></atime>
<hashdigest type='MD5'>93b885adfe0da089cdf634904fd59f71</hashdigest>
</fileobject>
</dfxml>

To extract the MD5 value and the file path of each file from the XML, we can use Python's minidom library.

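from xml.dom import minidom
xml_path = 'result.xml'  # placeholder: path to the XML file created by md5deep
xmldoc = minidom.parse(xml_path)
files = xmldoc.getElementsByTagName('fileobject')
for fileobject in files:
  # each <fileobject> holds one <filename> and one <hashdigest>
  name = fileobject.getElementsByTagName('filename')[0]
  md5 = fileobject.getElementsByTagName('hashdigest')[0]
  print(name.firstChild.data + ", " + md5.firstChild.data)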

When this code parses a huge XML file, however, it can easily hit a MemoryError. To avoid this error, I used BeautifulSoup.

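from bs4 import BeautifulSoup  # the 'xml' parser also needs lxml installed
xml_path = 'result.xml'  # placeholder: path to the XML file created by md5deep
with open(xml_path, 'rb') as fp:
  soup = BeautifulSoup(fp, 'xml')
for node in soup.find_all('fileobject'):
  try:
    print("%s, %s" % (node.hashdigest.string, node.filename.string))
  except UnicodeEncodeError:
    # skip paths that the console encoding cannot print
    continue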
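If memory is still a problem with very large files, another option is to stream the XML and discard each record right after reading it. This is not what my script does. It is only a sketch using iterparse from the standard library's xml.etree.ElementTree, and the names iter_hashes and SAMPLE are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_hashes(source):
    """Yield (md5, path) pairs from md5deep XML without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of <dfxml>
    for event, elem in context:
        if event == "end" and elem.tag == "fileobject":
            yield elem.findtext("hashdigest"), elem.findtext("filename")
            root.clear()  # drop the records that were already processed


# Small in-memory sample shaped like the md5deep output shown earlier.
SAMPLE = (
    b"<?xml version='1.0' encoding='UTF-8'?>"
    b"<dfxml xmloutputversion='1.0'>"
    b"<fileobject><filename>C:\\bootmgr</filename>"
    b"<hashdigest type='MD5'>55272fe96ad87017755fd82f7928fda0</hashdigest>"
    b"</fileobject>"
    b"<fileobject><filename>C:\\BOOTNXT</filename>"
    b"<hashdigest type='MD5'>93b885adfe0da089cdf634904fd59f71</hashdigest>"
    b"</fileobject>"
    b"</dfxml>"
)

pairs = list(iter_hashes(io.BytesIO(SAMPLE)))
for md5, path in pairs:
    print("%s, %s" % (md5, path))
```

For a real file, pass the path of the XML file (or a file opened in binary mode) to iter_hashes instead of the in-memory sample.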

The complete code is available on my GitHub:
https://github.com/hojinpk/CodeSnippets/blob/master/extracting_md5_from_XML.py
