will be stored into
$stuff{ 'Rs#Sequence#attribute#exemplarId' }
An example that extracts attribute values
Here is Perl code to extract the attributes from the Rs tag
and place the attribute values into the %stuff hash.
-bash-3.2$ cat Twig-example-0.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.
use XML::Twig;
$url =
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";
# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new(
twig_handlers => {
'Rs' => \&process_Rs,
}
);
# Start the parsing process. The parser will call the sub "process_Rs" when
# it reaches the closing Rs tag, as instructed by the handler set up above.
$t->parseurl( $url );
print "\nAnd now print the contents of the stuff hash\n\n";
foreach $a_thingie (sort( keys( %stuff ) ) )
{
$value_for_key = $stuff{ $a_thingie };
print "Stuff value for key: $a_thingie is: $value_for_key\n";
}
exit;
#####################
# Here is the handler for processing the Rs element. It will be
# called when the closing Rs tag is encountered by the parser.
#####################
sub process_Rs # get only the attributes.
{
my ( $t, $elt ) = @_;
my ( $thingie, $an_attribute_value );
$count{ 'Rs' }++;
# Get each attribute and store it in %stuff. Since there
# is only one Rs element per SNP document, there will
# be no collisions here.
foreach $thingie ( keys( %{$elt->atts} ) )
{
$an_attribute_value = ${$elt->atts}{ $thingie };
$count{ "Rs#attribute#$thingie" }++;
$stuff{ "Rs#attribute#$thingie" } = $an_attribute_value;
}
}
When this program runs it prints the contents of the
%stuff hash, showing the value of each attribute
within the Rs tag. The result looks like:
-bash-3.2$ perl Twig-example-0.pl
And now print the contents of the stuff hash
Stuff value for key: Rs#attribute#bitField is: 030100080001060100000100
Stuff value for key: Rs#attribute#molType is: cDNA
Stuff value for key: Rs#attribute#rsId is: 243
Stuff value for key: Rs#attribute#snpClass is: snp
Stuff value for key: Rs#attribute#snpType is: notwithdrawn
Note that this program also keeps a count of the number of
times the handler was called. This count is not useful here,
but may become useful later when XML input contains
multiple elements of the same type.
Note also that storing the attribute values under special keys
that include the string "attribute" could conflict with inter-tag
data within an element structure like
...
Since there are no elements named "attribute" within the SNP
data, this is not currently a problem.
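One way to rule out such a collision, sketched here under assumptions, is to keep attribute values and inter-tag text in separate sub-hashes instead of encoding both in one flat key. The helper names store_attribute and store_text, the "atts" and "text" sub-keys, and the nested element path used below are all hypothetical; none of them appear in the examples in this document.

```perl
use strict;
use warnings;

# Hypothetical layout: each element path maps to a sub-hash that holds
# attributes under "atts" and inter-tag content under "text".
my %stuff;

sub store_attribute {
    my ( $path, $name, $value ) = @_;
    $stuff{$path}{atts}{$name} = $value;
}

sub store_text {
    my ( $path, $text ) = @_;
    $stuff{$path}{text} = $text;
}

# The rsId attribute of the Rs element.
store_attribute( 'Rs', 'rsId', '243' );

# Inter-tag content of an imaginary element whose path happens to look
# like a flat attribute key. With flat keys, both values would be
# written to $stuff{ 'Rs#attribute#rsId' }.
store_text( 'Rs#attribute#rsId', 'text of an imaginary nested element' );

print "Attribute value: ", $stuff{'Rs'}{atts}{rsId}, "\n";
print "Inter-tag text:  ", $stuff{'Rs#attribute#rsId'}{text}, "\n";
```

The cost of this layout is that the final printing loop has to walk two levels instead of one.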
An example that extracts inter-tag content
Here is another example that shows how inter-tag content is processed.
It gets the content between the <Seq5> and </Seq5> tags, and also
the content between the <Sequence> and </Sequence> tags.
Note that the latter includes the former, as well as any content
within the enclosed <Seq3> and <Observed> elements.
-bash-3.2$ cat Twig-example-1.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.
use XML::Twig;
$url =
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";
# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new(
twig_handlers => { 'Rs/Sequence/Seq5' => \&process_Rs_Sequence_Seq5,
'Rs/Sequence' => \&process_Rs_Sequence,
} );
# Start the parsing process.
$t->parseurl( $url );
print "\nAnd now print the contents of the stuff hash\n\n";
foreach $a_thingie (sort( keys( %stuff ) ) )
{
$value_for_key = $stuff{ $a_thingie };
print "Stuff value for key: $a_thingie is: $value_for_key\n";
}
exit;
#####################
#Here are the handlers.
#####################
sub process_Rs_Sequence_Seq5 # Here we get the intertag content only.
{
my ( $t, $elt ) = @_;
$count{ 'Rs#Sequence#Seq5' }++;
$stuff{ 'Rs#Sequence#Seq5' } = $elt->text;
1;
}
sub process_Rs_Sequence # Sequence has both inter- and intra-tag
{ # content.
my ( $t, $elt ) = @_;
my ( $inter_tag_content, $thingie, $an_attribute_value );
$count{ 'Rs#Sequence' }++;
$inter_tag_content = $elt->text;
if( $inter_tag_content ne "" )
{
$stuff{ 'Rs#Sequence' } = $inter_tag_content;
}
foreach $thingie ( keys( %{$elt->atts} ) )
{
$an_attribute_value = ${$elt->atts}{ $thingie };
$count{ "Rs#Sequence#attribute#$thingie" }++;
$stuff{ "Rs#Sequence#attribute#$thingie" } = $an_attribute_value;
}
1;
}
The output from running this program is shown below. Long DNA
sequences have been wrapped to fit the page. Note that the value
stored for Rs#Sequence begins with the Seq5 data:
-bash-3.2$ perl Twig-example-1.pl
And now print the contents of the stuff hash
Stuff value for key: Rs#Sequence is:
TTAGAGCAGGCTGTGTGGCCAATCCTGACACCAATGGGGCAGTGAGCTATAATCTTTCCCCAGGGAAGGGCAACAAAAATTATGAGCAA
CAGTATAATTTATCACATTATTCGTTTTCTTCTACATCTTGGGGACCATAAAGAAGAAAGAAGCAGCTGTCTTTTTGTGGTAGTTTTGCTC
AGAGCTGCCTAGAGCGAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGTTGGCACAGCCCCAG
GCTGCACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCAAGGAGAAATTTTACTGGATGGGCTGTCTTTT
CCAAAGTTTACCACCCAACACCCAATGCCCTTTGGGGCATTAGTGAATCCATTTTTCTTGACTTCTAGCATAAATTCACCCACTTATGTGTTTC
CTTCCCAGCTGTCTTTTGGGGAGACATTGCCTTAGATGAAGATGACTTGAAGCTGTTTCACATTGACAAAGCCAGAGACTGGACCAAGCAGACAG
TGGGGGCAACAGGACACAGCACAGGTAGGTACTGCTTCCTCCCTTCTC
Stuff value for key: Rs#Sequence#Seq5 is:
TTAGAGCAGGCTGTGTGGCCAATCCTGACACCAATGGGGCAGTGAGCTATAATCTTTCCCCAGGGAAGGGCAACAAAAATTATGAGCAA
CAGTATAATTTATCACATTATTCGTTTTCTTCTACATCTTGGGGACCATAAAGAAGAAAGAAGCAGCTGTCTTTTTGTGGTAGTTTTGCTC
AGAGCTGCCTAGAGCGAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGTTGGCACAGCCCCAG
GCTGCACAGAACTAATACCTGCTGTTC
Stuff value for key: Rs#Sequence#attribute#exemplarSs is: 38524064
-bash-3.2$
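The overlap between the two values can be reproduced on a small scale. The following sketch uses a made-up Sequence fragment (not real dbSNP data) to show that the text method, called on an enclosing element, returns the concatenated text of everything inside it, while text called on the Seq5 child returns only that child's content.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up fragment; the sequences and the exemplarSs value are placeholders.
my $xml = '<Sequence exemplarSs="1"><Seq5>AAA</Seq5>'
        . '<Observed>C/T</Observed><Seq3>GGG</Seq3></Sequence>';

my $t = XML::Twig->new;
$t->parse($xml);

my $sequence = $t->root;

# text() on the enclosing element gathers the text of all descendants.
my $all_text = $sequence->text;

# text() on one child returns only that child's content.
my $seq5_text = $sequence->first_child('Seq5')->text;

print "Sequence text: $all_text\n";
print "Seq5 text:     $seq5_text\n";
```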
Processing multiple elements of the same type
Special procedures are required for processing multiple instances of the
same element. For example, the most difficult aspect of
processing the dbSNP data is handling structures in which
a single Rs element encloses several Ss elements.
In the following example, the attributes associated with each
Ss element are stored in a hash called %temp_attributes, which
is then stored in a "temporary" array called @Ss_temp_array
for processing later.
Here the array values are accessed in the calling routine, but
they could be processed as the Rs element is processed, since
all the Ss elements are enclosed within the Rs element and will,
therefore, have been processed before the Rs tag is processed.
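The alternative mentioned above, consuming the saved Ss data inside the Rs handler, might look like the following sketch. It assumes a heavily trimmed, made-up XML fragment in place of the downloaded document, and the @summary array and the process_Rs body are illustrative only.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up, trimmed stand-in for a dbSNP document.
my $xml = <<'END_XML';
<Rs rsId="243">
  <Ss ssId="243" handle="KWOK"/>
  <Ss ssId="834" handle="WIAF"/>
</Rs>
END_XML

my @Ss_temp_array;    # filled by process_Rs_Ss, consumed by process_Rs
my @summary;          # hypothetical result list built by process_Rs

my $t = XML::Twig->new(
    twig_handlers => {
        'Rs/Ss' => \&process_Rs_Ss,
        'Rs'    => \&process_Rs,
    }
);
$t->parse($xml);

print "$_\n" for @summary;

sub process_Rs_Ss    # called once per Ss element, before process_Rs
{
    my ( $t, $elt ) = @_;
    push @Ss_temp_array, { %{ $elt->atts } };    # save a copy of the attributes
    1;
}

sub process_Rs       # called when the enclosing Rs element is complete
{
    my ( $t, $elt ) = @_;
    my $rs_id = $elt->att('rsId');
    foreach my $ss (@Ss_temp_array)
    {
        push @summary, "Rs $rs_id encloses Ss $ss->{ssId}";
    }
    @Ss_temp_array = ();    # ready for the next Rs element, if any
    1;
}
```

Emptying the array at the end of process_Rs keeps Ss data from one Rs element from leaking into the next when a document contains several.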
-bash-3.2$ cat Twig-example-2.pl
# This hack uses the XML::Twig module to parse Rs elements
# downloaded from the dbSNP database at NCBI.
use XML::Twig;
$url =
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=243&report=XML&mode=xml";
# Set up handlers for processing specific elements and/or tags.
$t = XML::Twig->new(
twig_handlers => {
'Rs/Ss' => \&process_Rs_Ss
} );
# Start the parsing process.
$t->parseurl( $url );
print "\nAnd now print the hashes stored in the Ss_temp_array\n\n";
$array_count = 0;
foreach $an_Ss_attribute_hash ( @Ss_temp_array )
{
print "\nHere is attribute hash $array_count:\n";
foreach $an_Ss_attribute_hash_entry (keys( %$an_Ss_attribute_hash ) )
{
print "Entry $array_count: $an_Ss_attribute_hash_entry has value " .
"$an_Ss_attribute_hash->{ $an_Ss_attribute_hash_entry }\n";
}
$array_count++;
}
exit;
#####################
#Here are the handlers.
#####################
sub process_Rs_Ss # Here we get and save the attributes for all Rs_Ss
{ # tags, so we can eventually keep the one whose id
# matches the Rs_Sequence exemplar ID.
my ( $t, $elt ) = @_;
my ( $an_attribute_value, $thingie, $inter_tag_content, %temp_attributes );
# @Ss_temp_array is NOT local, but %temp_attributes IS local.
$count{ 'Rs#Ss' }++;
$inter_tag_content = $elt->text;
# Save the attribute values for this Rs_Ss tag in a hash.
undef %temp_attributes;
foreach $thingie ( keys( %{$elt->atts} ) )
{
$an_attribute_value = ${$elt->atts}{ $thingie };
$temp_attributes{ "Rs#Ss#attribute#$thingie" } = $an_attribute_value;
}
# Also save the intertag content for this element in the hash.
if( $inter_tag_content ne "" )
{
$temp_attributes{ 'Rs#Ss' } = $inter_tag_content;
}
# Now push a copy of the hash containing the values for this Ss entry
# onto an array for later use, either in the calling routine or
# as process_Rs wraps up.
push( @Ss_temp_array, {%temp_attributes} );
1;
}
When this script is run, the output for the first 2 Ss elements
will look like:
-bash-3.2$ perl Twig-example-2.pl|more
And now print the hashes stored in the Ss_temp_array
Here is attribute hash 0:
Entry 0: Rs#Ss#attribute#ssId has value 243
Entry 0: Rs#Ss#attribute#handle has value KWOK
Entry 0: Rs#Ss#attribute#buildId has value 36
Entry 0: Rs#Ss#attribute#subSnpClass has value snp
Entry 0: Rs#Ss#attribute#validated has value by-frequency
Entry 0: Rs#Ss#attribute#molType has value cDNA
Entry 0: Rs#Ss#attribute#strand has value top
Entry 0: Rs#Ss#attribute#batchId has value 47
Entry 0: Rs#Ss#attribute#locSnpId has value D10S1257
Entry 0: Rs#Ss has value
CGTTTTCTTCTACATCTTGGGGNACATAAAGANGAAAGAAGNAGCTGTCTTTTTGTGGTAGTTTTGCTCAGAGC
TGCCTAGAGCNAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGT
TGGCACAGCCCAGGCTCAACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCA
AGGAGAAATTTTACTGGATGGGCTGTCTTTTCCAAA
Entry 0: Rs#Ss#attribute#orient has value forward
Entry 0: Rs#Ss#attribute#methodClass has value computed
Here is attribute hash 1:
Entry 1: Rs#Ss#attribute#ssId has value 834
Entry 1: Rs#Ss#attribute#handle has value WIAF
Entry 1: Rs#Ss#attribute#buildId has value 36
Entry 1: Rs#Ss#attribute#subSnpClass has value snp
Entry 1: Rs#Ss#attribute#validated has value by-submitter
Entry 1: Rs#Ss#attribute#molType has value cDNA
Entry 1: Rs#Ss#attribute#strand has value top
Entry 1: Rs#Ss#attribute#batchId has value 485
Entry 1: Rs#Ss#attribute#locSnpId has value WIAF-1435
Entry 1: Rs#Ss has value
CGTTTTCTTCTACATCTTGGGGNACATAAAGANGAAAGAAGNAGCTGTCTTTTTGTGGTAGTTTTGCTCAGAGC
TGCCTAGAGCNAGGACAAGACAGGTGACCTTTCAAAATACCTTACAGACTTAGGATTTGGATTTTCATGGTGGT
TGGCACAGCCCAGGCTCAACAGAACTAATACCTGCTGTTCC/TTCTGCCTCCACCAGCCCTATCTCTTAGGCTCA
AGGAGAAATTTTACTGGATGGGCTGTCTTTTCCAAA
Entry 1: Rs#Ss#attribute#orient has value forward
Entry 1: Rs#Ss#attribute#methodClass has value sequence
.
.
.
Choosing a specific Ss element from the array
In the case of the Ss elements included in SNP data, there may be
multiple Ss elements, but only the one whose ID matches the
exemplarId in the Sequence element is needed by the calling routine.
Since the Sequence element is not guaranteed to be processed before
the Ss elements, the attributes associated with each
Ss element are stored in @Ss_temp_array, as shown above, and
the required array entry is then selected either when the enclosing Rs
element is processed or in the calling routine.
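The selection step itself can be sketched as follows. The hash and array contents here are made-up stand-ins for what the handlers would have collected, and the attribute name exemplarSs is an assumption taken from the key printed by Twig-example-1.pl.

```perl
use strict;
use warnings;

# Made-up stand-ins for data gathered by the handlers.
my %stuff = ( 'Rs#Sequence#attribute#exemplarSs' => '834' );
my @Ss_temp_array = (
    { 'Rs#Ss#attribute#ssId' => '243', 'Rs#Ss#attribute#handle' => 'KWOK' },
    { 'Rs#Ss#attribute#ssId' => '834', 'Rs#Ss#attribute#handle' => 'WIAF' },
);

# Keep the first saved Ss hash whose ssId equals the exemplar ID.
my $wanted_id = $stuff{'Rs#Sequence#attribute#exemplarSs'};
my ($exemplar) =
    grep { $_->{'Rs#Ss#attribute#ssId'} eq $wanted_id } @Ss_temp_array;

if ( defined $exemplar )
{
    print "Exemplar Ss handle: ", $exemplar->{'Rs#Ss#attribute#handle'}, "\n";
}
else
{
    print "No Ss element matches exemplar ID $wanted_id\n";
}
```

The IDs are compared as strings with eq, which avoids numeric-conversion warnings if an ID is ever missing digits or padded.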
Note that other elements that appear multiple times may be less
difficult to process because they can be selected "on the fly."
For example, multiple Assembly elements are easier to handle if
the desired one can be identified within the Assembly element handler,
as when you want only the Assembly element whose "reference" attribute
value is "true".
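A handler that makes this choice on the fly might look like the sketch below. The XML fragment is made up, the groupLabel attribute and its values are placeholders, and placing Assembly directly under Rs is an assumption about the document layout.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up fragment; real Assembly elements carry more attributes and children.
my $xml = <<'END_XML';
<Rs rsId="243">
  <Assembly groupLabel="alternate" reference="false"/>
  <Assembly groupLabel="primary" reference="true"/>
</Rs>
END_XML

my %stuff;

my $t = XML::Twig->new(
    twig_handlers => { 'Rs/Assembly' => \&process_Rs_Assembly } );
$t->parse($xml);

foreach my $key ( sort keys %stuff )
{
    print "Stuff value for key: $key is: $stuff{$key}\n";
}

sub process_Rs_Assembly    # keep only the reference assembly
{
    my ( $t, $elt ) = @_;
    my $reference = $elt->att('reference');

    # Skip every Assembly element that is not flagged as the reference.
    return 1 unless defined $reference && $reference eq 'true';

    my $atts = $elt->atts;
    foreach my $name ( keys %$atts )
    {
        $stuff{"Rs#Assembly#attribute#$name"} = $atts->{$name};
    }
    1;
}
```

Because the unwanted elements are discarded inside the handler, no temporary array is needed.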
Additional information
For more information about using the XML::Twig see