Using XML::Twig to read XML (sent by NCBI eUtils)

This page discusses the use of the Perl module XML::Twig to process XML files. Examples are based on XML returned from requests for data from the National Center for Biotechnology Information (NCBI). The NCBI programmatic interface supported by the Entrez eUtils can retrieve data in a variety of formats, but here it is used to retrieve XML content.

In particular, all the examples herein involve eFetch requests for Single Nucleotide Polymorphism (SNP) data from the NCBI dbSNP database. Here is an archived version of the results of the following request for data for SNP ID 243, dating from November 2008. The "live" version is at:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml

Note that the returned data is composed of one ExchangeSet element containing several other elements. Twig will be used to extract data from some of those elements while ignoring the rest. (Actually, Twig itself processes the complete file, but returns element data only as requested.)

Also, in the examples in this document, desired information will be stored in a global hash named %stuff for use by the calling routine and/or element handlers to be discussed below.

The key values used with %stuff are similar to the XPath strings used to identify elements. For example, all data enclosed by the element tag pairs

<Rs><Sequence>...</Sequence></Rs>

will be placed into

$stuff{ 'Rs#Sequence' }

Note that this will include any data enclosed by child elements within the <Sequence>...</Sequence> tag pair. In this case, the data enclosed by the child elements Seq5, Seq3, and Observed will be returned along with any data that lies within the Sequence element but outside of those child tag pairs.

In addition, all attributes associated with an element tag will be placed into %stuff using keys that include the string "attribute". For example, the "exemplarSs" attribute value in the Sequence tag

<Sequence exemplarSs=". . .">

will be stored into

$stuff{ 'Rs#Sequence#attribute#exemplarSs' }

An example that extracts attribute values

Here is Perl code to extract the attributes from the Rs tag

<Rs rsId="243" snpClass="snp" snpType="notwithdrawn" molType="cDNA" bitField="030100080001060100000100">

and place the attribute values into the %stuff hash.

-bash-3.2$ cat Twig-example-0.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.

use XML::Twig;

$url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";

# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new( twig_handlers =>
                       { 'Rs' => \&process_Rs,
                       }
                   );

# Start the parsing process. The parser will call the sub "process_Rs" when
# it reaches the </Rs> tag, as instructed by the handler set up above.
$t->parseurl( $url );

print "\nAnd now print the contents of the stuff hash\n\n";
foreach $a_thingie (sort( keys( %stuff ) ) )
{
    $value_for_key = $stuff{ $a_thingie };
    print "Stuff value for key: $a_thingie is: $value_for_key\n";
}
exit;

#####################
# Here is the handler for processing the Rs element. It will be
# called when the </Rs> tag is encountered by the parser.
#####################

sub process_Rs    # get only the attributes.
{
    my ( $t, $elt ) = @_;
    my ( $thingie, $an_attribute_value );

    $count{ 'Rs' }++;

    # Get each attribute and store it in %stuff. Since there
    # is only one Rs element per SNP document, there will
    # be no collisions here.
    foreach $thingie ( keys( %{$elt->atts} ) )
    {
        $an_attribute_value = ${$elt->atts}{ $thingie };
        $count{ "Rs#attribute#$thingie" }++;
        $stuff{ "Rs#attribute#$thingie" } = $an_attribute_value;
    }
}

When this program runs it prints the contents of the %stuff hash, showing the value of each attribute within the Rs tag. The result looks like:

-bash-3.2$ perl Twig-example-0.pl

And now print the contents of the stuff hash

Stuff value for key: Rs#attribute#bitField is: 030100080001060100000100
Stuff value for key: Rs#attribute#molType is: cDNA
Stuff value for key: Rs#attribute#rsId is: 243
Stuff value for key: Rs#attribute#snpClass is: snp
Stuff value for key: Rs#attribute#snpType is: notwithdrawn

Note that this program also keeps a count of the number of times the handler was called. This count is not useful here, but may become useful later when XML input contains multiple elements of the same type.

Note also that storing the attribute values with special keys that include the string "attribute" could conflict with inter-tag data within an element structure like

<Rs><Sequence><attribute><exemplarSs>... </exemplarSs></attribute></Sequence></Rs>

Since there are no elements named "attribute" within the SNP data, this is not currently a problem.

An example that extracts inter-tag content

Here is another example that shows how inter-tag content is processed. It gets the content between the <Seq5> and </Seq5> tags, and also the content between the <Sequence> and </Sequence> tags. Note that the latter includes the former, as well as any content within the enclosed <Seq3> and <Observed> elements.

-bash-3.2$ cat Twig-example-1.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.

use XML::Twig;

$url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";

# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new( twig_handlers =>
                       { 'Rs/Sequence/Seq5' => \&process_Rs_Sequence_Seq5,
                         'Rs/Sequence'      => \&process_Rs_Sequence,
                       }
                   );

# Start the parsing process.
$t->parseurl( $url );

print "\nAnd now print the contents of the stuff hash\n\n";
foreach $a_thingie (sort( keys( %stuff ) ) )
{
    $value_for_key = $stuff{ $a_thingie };
    print "Stuff value for key: $a_thingie is: $value_for_key\n";
}
exit;

#####################
# Here are the handlers.
#####################

sub process_Rs_Sequence_Seq5    # Here we get the inter-tag content only.
{
    my ( $t, $elt ) = @_;

    $count{ 'Rs#Sequence#Seq5' }++;
    $stuff{ 'Rs#Sequence#Seq5' } = $elt->text;
    1;
}

sub process_Rs_Sequence    # Sequence has both inter- and intra-tag
{                          # content.
    my ( $t, $elt ) = @_;
    my ( $inter_tag_content, $thingie, $an_attribute_value );

    $count{ 'Rs#Sequence' }++;
    $inter_tag_content = $elt->text;
    if( $inter_tag_content ne "" )
    {
        $stuff{ 'Rs#Sequence' } = $inter_tag_content;
    }
    foreach $thingie ( keys( %{$elt->atts} ) )
    {
        $an_attribute_value = ${$elt->atts}{ $thingie };
        $count{ "Rs#Sequence#attribute#$thingie" }++;
        $stuff{ "Rs#Sequence#attribute#$thingie" } = $an_attribute_value;
    }
    1;
}

The output from running this program is presented below, with the long DNA sequences "staggered" to fit the page. Note that the Seq5 data appears twice: under its own key, and as the leading portion of the Rs#Sequence value.

-bash-3.2$ perl Twig-example-1.pl

And now print the contents of the stuff hash

Stuff value for key: Rs#Sequence is: 
TTAGAGCAGGCTGTGTGGCCAATCCTGACACCAATGGGGCAGTGAGCTATAATCTTTCCCCAGGGAAGGGCAACAAAAATTATGAGCAA
  CAGTATAATTTATCACATTATTCGTTTTCTTCTACATCTTGGGGACCATAAAGAAGAAAGAAGCAGCTGTCTTTTTGTGGTAGTTTTGCTC
    AGAGCTGCCTAGAGCGAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGTTGGCACAGCCCCAG
      GCTGCACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCAAGGAGAAATTTTACTGGATGGGCTGTCTTTT
        CCAAAGTTTACCACCCAACACCCAATGCCCTTTGGGGCATTAGTGAATCCATTTTTCTTGACTTCTAGCATAAATTCACCCACTTATGTGTTTC
          CTTCCCAGCTGTCTTTTGGGGAGACATTGCCTTAGATGAAGATGACTTGAAGCTGTTTCACATTGACAAAGCCAGAGACTGGACCAAGCAGACAG
            TGGGGGCAACAGGACACAGCACAGGTAGGTACTGCTTCCTCCCTTCTC

Stuff value for key: Rs#Sequence#Seq5 is:
TTAGAGCAGGCTGTGTGGCCAATCCTGACACCAATGGGGCAGTGAGCTATAATCTTTCCCCAGGGAAGGGCAACAAAAATTATGAGCAA
  CAGTATAATTTATCACATTATTCGTTTTCTTCTACATCTTGGGGACCATAAAGAAGAAAGAAGCAGCTGTCTTTTTGTGGTAGTTTTGCTC
    AGAGCTGCCTAGAGCGAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGTTGGCACAGCCCCAG
      GCTGCACAGAACTAATACCTGCTGTTC

Stuff value for key: Rs#Sequence#attribute#exemplarSs is: 38524064
-bash-3.2$ 

Processing multiple elements of the same type

Special procedures are required for processing multiple instances of the same element. For example, the most difficult aspect of processing the dbSNP data is related to processing <Rs><Ss>...</Ss><Ss>...</Ss> . . .<Ss>...</Ss></Rs> structures. In the following example, the attributes associated with each Ss element are stored in a hash called %temp_attributes, which is then stored in a "temporary" array called @Ss_temp_array for processing later.

Here the array values are accessed in the calling routine, but they could instead be processed by the Rs handler, since all the Ss elements are enclosed within the Rs element and will, therefore, have been processed before the </Rs> tag triggers that handler.

-bash-3.2$ cat Twig-example-2.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.

use XML::Twig;

$url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";

# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new( twig_handlers =>
                       { 'Rs/Ss' => \&process_Rs_Ss
                       }
                   );

# Start the parsing process.
$t->parseurl( $url );

print "\nAnd now print the hashes stored in the Ss_temp_array\n\n";
$array_count = 0;
foreach $an_Ss_attribute_hash ( @Ss_temp_array )
{
    print "\nHere is attribute hash $array_count:\n";
    foreach $an_Ss_attribute_hash_entry (keys( %$an_Ss_attribute_hash ) )
    {
        print "Entry $array_count: $an_Ss_attribute_hash_entry has value " .
              "$an_Ss_attribute_hash->{ $an_Ss_attribute_hash_entry }\n";
    }
    $array_count++;
}
exit;

#####################
# Here are the handlers.
#####################

sub process_Rs_Ss    # Here we get and save the attributes for all Rs_Ss
{                    # tags, so we can eventually keep the one whose id
                     # matches the Rs_Sequence exemplar ID.
    my ( $t, $elt ) = @_;
    my ( $an_attribute_value, $thingie, $inter_tag_content, %temp_attributes );
    # @Ss_temp_array is NOT local, but %temp_attributes IS local.

    $count{ 'Rs#Ss' }++;
    $inter_tag_content = $elt->text;

    # Save the attribute values for this Rs_Ss tag in a hash.
    undef %temp_attributes;
    foreach $thingie ( keys( %{$elt->atts} ) )
    {
        $an_attribute_value = ${$elt->atts}{ $thingie };
        $temp_attributes{ "Rs#Ss#attribute#$thingie" } = $an_attribute_value;
    }

    # Also save the inter-tag content for this element in the hash.
    if( $inter_tag_content ne "" )
    {
        $temp_attributes{ 'Rs#Ss' } = $inter_tag_content;
    }

    # Now push a copy of the hash containing the values for this Ss entry
    # into an array for later use in the calling routine or
    # as process_Rs wraps up.
    push( @Ss_temp_array, {%temp_attributes} );
    1;
}

When this script is run, the output for the first 2 Ss elements will look like:

-bash-3.2$ perl Twig-example-2.pl|more

And now print the hashes stored in the Ss_temp_array

Here is attribute hash 0:
Entry 0: Rs#Ss#attribute#ssId has value 243
Entry 0: Rs#Ss#attribute#handle has value KWOK
Entry 0: Rs#Ss#attribute#buildId has value 36
Entry 0: Rs#Ss#attribute#subSnpClass has value snp
Entry 0: Rs#Ss#attribute#validated has value by-frequency
Entry 0: Rs#Ss#attribute#molType has value cDNA
Entry 0: Rs#Ss#attribute#strand has value top
Entry 0: Rs#Ss#attribute#batchId has value 47
Entry 0: Rs#Ss#attribute#locSnpId has value D10S1257
Entry 0: Rs#Ss has value
CGTTTTCTTCTACATCTTGGGGNACATAAAGANGAAAGAAGNAGCTGTCTTTTTGTGGTAGTTTTGCTCAGAGC
  TGCCTAGAGCNAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGT
    TGGCACAGCCCAGGCTCAACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCA
      AGGAGAAATTTTACTGGATGGGCTGTCTTTTCCAAA
Entry 0: Rs#Ss#attribute#orient has value forward
Entry 0: Rs#Ss#attribute#methodClass has value computed

Here is attribute hash 1:
Entry 1: Rs#Ss#attribute#ssId has value 834
Entry 1: Rs#Ss#attribute#handle has value WIAF
Entry 1: Rs#Ss#attribute#buildId has value 36
Entry 1: Rs#Ss#attribute#subSnpClass has value snp
Entry 1: Rs#Ss#attribute#validated has value by-submitter
Entry 1: Rs#Ss#attribute#molType has value cDNA
Entry 1: Rs#Ss#attribute#strand has value top
Entry 1: Rs#Ss#attribute#batchId has value 485
Entry 1: Rs#Ss#attribute#locSnpId has value WIAF-1435
Entry 1: Rs#Ss has value
CGTTTTCTTCTACATCTTGGGGNACATAAAGANGAAAGAAGNAGCTGTCTTTTTGTGGTAGTTTTGCTCAGAGC
  TGCCTAGAGCNAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGT
    TGGCACAGCCCAGGCTCAACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCA
      AGGAGAAATTTTACTGGATGGGCTGTCTTTTCCAAA
Entry 1: Rs#Ss#attribute#orient has value forward
Entry 1: Rs#Ss#attribute#methodClass has value sequence
. . .
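As mentioned before this example, the Ss elements could also be handled while the enclosing Rs element is processed. The following is only a minimal sketch of that alternative, not part of the original scripts. It parses a small, made-up XML string (a simplified stand-in for the NCBI data) instead of fetching the URL, and the array name @Ss_attribute_hashes is a hypothetical choice.

```perl
use strict;
use warnings;
use XML::Twig;

# Simplified, made-up stand-in for the dbSNP XML (not real NCBI output).
my $xml = <<'END_XML';
<ExchangeSet>
  <Rs rsId="243">
    <Ss ssId="243" handle="KWOK"/>
    <Ss ssId="834" handle="WIAF"/>
  </Rs>
</ExchangeSet>
END_XML

# Hypothetical array that collects one attribute hash per Ss element.
my @Ss_attribute_hashes;

my $t = XML::Twig->new(
    twig_handlers => { 'Rs' => \&process_Rs },
);
$t->parse( $xml );

foreach my $ss ( @Ss_attribute_hashes ) {
    print "ssId $ss->{ssId} was submitted by $ss->{handle}\n";
}

# Called at the </Rs> tag, when every enclosed Ss element has been parsed.
sub process_Rs {
    my ( $t, $elt ) = @_;

    # children() with a name returns only the child elements of that type.
    foreach my $ss ( $elt->children( 'Ss' ) ) {
        # Store a copy of the attribute hash (empty if there are none).
        push( @Ss_attribute_hashes, { %{ $ss->atts || {} } } );
    }
    return 1;
}
```

With the sample data above, this prints one line for ssId 243 (KWOK) and one for ssId 834 (WIAF). The trade-off is that no separate Rs/Ss handler is needed, but the whole Rs element must stay in memory until its end tag is reached.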

Choosing a specific Ss element from the array

In the case of the Ss elements included in SNP data, there may be multiple Ss elements, but only the one whose ssId matches the exemplarSs attribute of the Sequence element is needed by the calling routine.

Since the Sequence element is not guaranteed to be processed before the Ss elements, the attributes associated with each Ss element are stored in @Ss_temp_array, as shown above, and the required array entry is then selected either when the enclosing Rs element is processed or in the calling routine.
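The selection step described above might be sketched as follows. This is an illustration under stated assumptions rather than the original author's code: the XML string is made up (including the handle "EXAMPLE"), the attribute name exemplarSs is taken from the sample output shown earlier, and the hash %exemplar is a hypothetical name. For brevity the stored keys are the bare attribute names rather than the "Rs#Ss#attribute#..." keys used elsewhere on this page.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample; Sequence deliberately follows the Ss elements to show
# that the selection does not depend on element order.
my $xml = <<'END_XML';
<ExchangeSet>
  <Rs rsId="243">
    <Ss ssId="243" handle="KWOK"/>
    <Ss ssId="38524064" handle="EXAMPLE"/>
    <Sequence exemplarSs="38524064"/>
  </Rs>
</ExchangeSet>
END_XML

my ( @Ss_temp_array, %exemplar );    # %exemplar is a hypothetical name

my $t = XML::Twig->new(
    twig_handlers => {
        'Rs/Ss' => \&process_Rs_Ss,
        'Rs'    => \&process_Rs,
    },
);
$t->parse( $xml );

print "Exemplar Ss: ssId $exemplar{ssId}, handle $exemplar{handle}\n";

# Save a copy of the attributes of every Ss element.
sub process_Rs_Ss {
    my ( $t, $elt ) = @_;
    push( @Ss_temp_array, { %{ $elt->atts || {} } } );
    return 1;
}

# At </Rs>, both the Sequence element and all Ss elements are available.
sub process_Rs {
    my ( $t, $elt ) = @_;

    my $sequence = $elt->first_child( 'Sequence' ) or return 1;
    my $wanted   = $sequence->att( 'exemplarSs' );
    return 1 unless defined $wanted;

    foreach my $ss ( @Ss_temp_array ) {
        if ( defined $ss->{ssId} && $ss->{ssId} eq $wanted ) {
            %exemplar = %$ss;
            last;
        }
    }
    @Ss_temp_array = ();    # start fresh for the next Rs element, if any
    return 1;
}
```

With the sample data above, the script prints "Exemplar Ss: ssId 38524064, handle EXAMPLE".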

Note that other elements that appear multiple times may not be so difficult to process, because they can be selected "on the fly." For example, multiple Assembly elements are easier to handle if the desired one can be identified within the Assembly element handler itself, as when you want only the Assembly element whose "reference" attribute value is "true".
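One way to do this kind of "on the fly" selection is to put the attribute test into the handler trigger itself, since XML::Twig accepts XPath-like conditions such as elt[@att="val"] as handler keys. The sketch below assumes that feature and uses a made-up XML string; the genomeBuild attribute and its values are purely illustrative, and the "reference" attribute follows the description above.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample data; attribute names other than "reference" are illustrative.
my $xml = <<'END_XML';
<ExchangeSet>
  <Rs rsId="243">
    <Assembly genomeBuild="36_2" reference="false"/>
    <Assembly genomeBuild="36_3" reference="true"/>
  </Rs>
</ExchangeSet>
END_XML

my %stuff;

# The handler only fires for Assembly elements whose "reference"
# attribute equals "true"; other Assembly elements are ignored.
my $t = XML::Twig->new(
    twig_handlers => {
        'Assembly[@reference="true"]' => \&process_reference_Assembly,
    },
);
$t->parse( $xml );

foreach my $key ( sort keys %stuff ) {
    print "Stuff value for key: $key is: $stuff{ $key }\n";
}

sub process_reference_Assembly {
    my ( $t, $elt ) = @_;
    my $atts = $elt->atts || {};
    foreach my $name ( keys %$atts ) {
        $stuff{ "Rs#Assembly#attribute#$name" } = $atts->{ $name };
    }
    return 1;
}
```

Because the unwanted Assembly elements never reach the handler, no temporary array is needed here. The same test could also be written inside a plain 'Assembly' handler by checking $elt->att( 'reference' ).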

Additional information

For more information about using XML::Twig, see the XML::Twig documentation.

For more information about using the NCBI eUtils, see the NCBI eUtils documentation.