
Sunday, July 28, 2013

Extract a certain tag with its inner tags from a huge XML file

Task: You have a huge XML file (300 MB) and need to extract from it a single tag with a specific attribute value, together with all of its inner tags.

Under Ubuntu 12.04, install the XSLT processor:
sudo apt-get install xsltproc

Structure of the huge XML file (hugeFile.xml):
<?xml version="1.0" encoding="ISO-8859-1"?>
<Feed ExtractDate="07/25/2013" ExtractTime="15:30:15">
  ..... a lot of companies information
  <COMPANY ... LegalName="MyCompany" .....>
    ..... a lot of inner tags .....
  </COMPANY>
  ..... a lot of companies information
</Feed>
 

Create the extract.xsl file:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Root template: wrap every matching COMPANY element in a new root element -->
  <xsl:template match="/">
    <xsl:element name="TagToExtract">
      <xsl:apply-templates select="//COMPANY[@LegalName='MyCompany']" />
    </xsl:element>
  </xsl:template>

  <!-- Identity template: copies each selected COMPANY with all of its
       attributes and inner tags, so no separate COMPANY template is needed -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Run the XSLT processor (it took 12 seconds on the 300 MB file):
xsltproc extract.xsl hugeFile.xml > 1.xml
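
xsltproc parses the whole input into an in-memory tree before it applies the stylesheet, so memory use grows with the file size. If that becomes a problem, a streaming parser is one possible alternative. Below is a minimal sketch in Python (not the XSLT approach used above) built on the standard library's xml.etree.ElementTree.iterparse. The function name extract_company and its parameters are made up for this illustration, and the sketch assumes that the COMPANY elements are direct children of Feed, as in the structure shown earlier.

```python
import xml.etree.ElementTree as ET


def extract_company(src, dst, legal_name, tag="COMPANY", wrapper="TagToExtract"):
    """Stream src and write every <tag> whose LegalName equals legal_name to dst.

    The matches are wrapped in a <wrapper> root element, like the XSLT version.
    Returns the number of elements written. (Hypothetical helper, not part of
    any library.)
    """
    count = 0
    with open(dst, "w", encoding="utf-8") as out:
        out.write("<%s>" % wrapper)
        context = ET.iterparse(src, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root (<Feed>)
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                if elem.get("LegalName") == legal_name:
                    elem.tail = None  # drop whitespace that follows the element
                    out.write(ET.tostring(elem, encoding="unicode"))
                    count += 1
                # Assumption: COMPANY is a direct child of the root, so the
                # finished subtrees can be discarded to keep memory flat.
                root.clear()
        out.write("</%s>" % wrapper)
    return count
```

With the file names from this post, the call would be extract_company("hugeFile.xml", "1.xml", "MyCompany"). The output is written as UTF-8 without an XML declaration. Whether this is faster or slower than xsltproc on a given file is something to measure rather than assume.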