Sunday, July 28, 2013

Extract a certain tag with its inner tags from a huge XML file

Task: You have a huge XML file (300 MB) and need to extract from it a single tag, together with its inner tags, selected by a specific attribute value.

On Ubuntu 12.04, install the XSLT processor:
sudo apt-get install xsltproc

Structure of huge XML (hugeFile.xml):
<?xml version="1.0" encoding="ISO-8859-1"?>
<Feed ExtractDate="07/25/2013" ExtractTime="15:30:15">
  ..... a lot of companies information
  <COMPANY ... LegalName="MyCompany" .....>
     ..... a lot of inner tags .....
  </COMPANY>
  ..... a lot of companies information
</Feed>
 

Create the extract.xsl file:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point: wrap the result in a new root element and process only the wanted COMPANY -->
  <xsl:template match="/">
    <xsl:element name="TagToExtract">
      <xsl:apply-templates select="//COMPANY[@LegalName='MyCompany']" />
    </xsl:element>
  </xsl:template>

  <!-- Copy the matching COMPANY element itself, then descend into its attributes and children -->
  <xsl:template match="//COMPANY[@LegalName='MyCompany']">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Identity template: copy every attribute and node below COMPANY unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Run the XSLT processor (it took 12 seconds):
xsltproc extract.xsl hugeFile.xml > 1.xml
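
If xsltproc is not available, or the file is too large to load into memory at once, roughly the same result can be sketched in Python with the standard library's streaming parser. This is not the method used above, only an alternative under assumptions: the function name extract_company and its parameters are made up for illustration, and it assumes that COMPANY elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET


def extract_company(src, dst, legal_name="MyCompany"):
    """Stream src and write matching COMPANY elements to dst.

    The result is wrapped in a TagToExtract root, like the stylesheet does.
    Returns the number of COMPANY elements that were copied.
    """
    wrapper = ET.Element("TagToExtract")
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root (Feed)
    for event, elem in context:
        if event == "end" and elem.tag == "COMPANY":
            if elem.get("LegalName") == legal_name:
                # Keep the fully parsed element, inner tags included
                wrapper.append(elem)
            # Detach already processed companies from the root so that
            # memory use stays roughly constant while reading
            root.clear()
    ET.ElementTree(wrapper).write(dst, encoding="utf-8", xml_declaration=True)
    return len(wrapper)
```

Calling extract_company("hugeFile.xml", "1.xml") would then play the role of the xsltproc command above. Timing will differ from the 12 seconds measured for xsltproc.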


