Extracting Data from XML with Python

I usually run md5deep64.exe with the “-d” parameter, which writes the result in XML format, including both the full file path and the MD5 value of each file.

C:> md5deep64.exe -r -d * > C:\%COMPUTERNAME%_%DATE%.xml

The XML file produced by the command above looks like this.

<?xml version='1.0' encoding='UTF-8'?>
<dfxml xmloutputversion='1.0'>
<metadata
xmlns='http://md5deep.sourceforge.net/md5deep/'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:type>Hash List</dc:type>
</metadata>
<creator version='1.0'>
<program>MD5DEEP</program>
<version>4.3</version>
<build_environment>
<compiler>GCC 4.7</compiler>
</build_environment>
<execution_environment>
<command_line>c:\temp\md5deep64.exe -r -d *</command_line>
<start_time></start_time>
</execution_environment>
</creator>
<configuration>
<algorithms>
<algorithm name='md5' enabled='1'/>
<algorithm name='sha1' enabled='0'/>
<algorithm name='sha256' enabled='0'/>
<algorithm name='tiger' enabled='0'/>
<algorithm name='whirlpool' enabled='0'/>
</algorithms>
</configuration>
<fileobject>
<filename>C:\bootmgr</filename>
<filesize>398356</filesize>
<ctime></ctime>
<mtime></mtime>
<atime></atime>
<hashdigest type='MD5'>55272fe96ad87017755fd82f7928fda0</hashdigest>
</fileobject>
<fileobject>
<filename>C:\BOOTNXT</filename>
<filesize>1</filesize>
<ctime></ctime>
<mtime></mtime>
<atime></atime>
<hashdigest type='MD5'>93b885adfe0da089cdf634904fd59f71</hashdigest>
</fileobject>
</dfxml>

To extract the MD5 value and the file path from the XML, we can use Python's minidom library.

from xml.dom import minidom

# fn is the path to the XML file produced by md5deep
xmldoc = minidom.parse(fn)
files = xmldoc.getElementsByTagName('fileobject')
for fileobject in files:
  # Use a separate name so the input path in fn is not overwritten
  name = fileobject.getElementsByTagName('filename')[0]
  md5 = fileobject.getElementsByTagName('hashdigest')[0]
  print(name.firstChild.data + ", " + md5.firstChild.data)

When this code parses a huge XML file, however, it can easily raise a MemoryError, because minidom builds the whole document tree in memory. To avoid this kind of error, I used BeautifulSoup.

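from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
with open(fn, 'rb') as fp:
  soup = BeautifulSoup(fp, 'xml')
for node in soup.find_all('fileobject'):
  try:
    print("%s, %s" % (node.hashdigest.string, node.filename.string))
  except UnicodeEncodeError:
    # Skip entries whose path cannot be printed in the console encoding
    continue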
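
As an alternative that my script does not use, the standard library's xml.etree.ElementTree.iterparse can read the file as a stream and handle one fileobject at a time. The following is only a minimal sketch. It assumes the fileobject elements carry no XML namespace, as in the sample above. The names iter_hashes and sample are hypothetical and exist only for this illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_hashes(source):
    """Yield (path, md5) pairs from md5deep XML, one fileobject at a time."""
    # Only "end" events are requested, so each element is complete when seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "fileobject":
            continue
        path = elem.findtext("filename")
        md5 = elem.findtext("hashdigest")
        yield path, md5
        # Drop the children and text of the processed element. The emptied
        # element itself stays attached to the root, so memory still grows
        # slowly on very large inputs.
        elem.clear()


# Tiny in-memory sample shaped like the md5deep output shown earlier.
sample = b"""<?xml version='1.0' encoding='UTF-8'?>
<dfxml xmloutputversion='1.0'>
<fileobject>
<filename>C:\\bootmgr</filename>
<filesize>398356</filesize>
<hashdigest type='MD5'>55272fe96ad87017755fd82f7928fda0</hashdigest>
</fileobject>
<fileobject>
<filename>C:\\BOOTNXT</filename>
<filesize>1</filesize>
<hashdigest type='MD5'>93b885adfe0da089cdf634904fd59f71</hashdigest>
</fileobject>
</dfxml>
"""

for path, md5 in iter_hashes(io.BytesIO(sample)):
    print("%s, %s" % (path, md5))
```

For a real run, the path of the XML file (or a file object opened in binary mode) can be passed to iter_hashes instead of the in-memory sample.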

The whole code is available on my GitHub.
https://github.com/hojinpk/CodeSnippets/blob/master/extracting_md5_from_XML.py

error C2679

binary ‘+=’ : no operator found which takes a right-hand operand of type ‘BYTE [6]’ (or there is no acceptable conversion)

CString ret;
typedef struct _SID_IDENTIFIER_AUTHORITY {
BYTE Value[6];
} SID_IDENTIFIER_AUTHORITY, *PSID_IDENTIFIER_AUTHORITY;

ret += sid.IdentifierAuthority.Value;

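I just added a “(TCHAR)” cast to change the value type, like this, and the error went away.

ret += (TCHAR)sid.IdentifierAuthority.Value;

Note that this cast converts the address of the array into a single TCHAR rather than appending the six bytes themselves, so the authority value still has to be formatted explicitly if it should appear in the string.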

 

error MSB8031

When you build source code that was originally written in VS6 with VS 2013, you may encounter this error message.

Building an MFC project for a non-Unicode character set is deprecated. You must change the project property to Unicode or download an additional library. See http://go.microsoft.com/fwlink/p/?LinkId=286820 for more information.

To fix it, you can download the Multibyte MFC Library for VS 2013. This add-on for VS 2013 contains the multibyte character set (MBCS) version of the Microsoft Foundation Class (MFC) Library.

http://www.microsoft.com/en-US/download/details.aspx?id=40770