Bio::Graph::ProteinGraph.3pm

Language: en

Version: 2008-06-24 (ubuntu - 07/07/09)

Section: 3 (Library functions)

NAME

Bio::Graph::ProteinGraph - a representation of a protein interaction graph.

SYNOPSIS

   # Read in from file
   my $graphio = Bio::Graph::IO->new(-file   => 'myfile.dat',
                                     -format => 'dip');
   my $graph   = $graphio->next_network();
 
 

Using ProteinGraph

   # Remove duplicate interactions from within a dataset
   $graph->remove_dup_edges();
 
   # Get a node (represented by a sequence object) from the graph.
   my $seqobj = $graph->nodes_by_id('P12345');
 
   # Get clustering coefficient of a given node.
   my $cc = $graph->clustering_coefficient($graph->nodes_by_id('NP_023232'));
   if ($cc != -1) {  ## result is -1 if cannot be calculated
     print "CC for NP_023232 is $cc\n";
   }
 
   # Get graph density
   my $density = $graph->density();
 
   # Get connected subgraphs
   my @graphs = $graph->components();
 
   # Remove a node
   $graph->remove_nodes($graph->nodes_by_id('P12345'));
 
   # How many interactions are there?
   my $count = $graph->edge_count;
 
   # How many nodes are there?
   my $ncount = $graph->node_count();
 
   # Let's get interactions above a threshold confidence score.
   my $edges = $graph->edges;
   for my $edge (keys %$edges) {
     if (defined($edges->{$edge}->weight()) &&
         $edges->{$edge}->weight() > 0.6) {
       print $edges->{$edge}->object_id(), "\t",
             $edges->{$edge}->weight(), "\n";
     }
   }
 
   # Get interactors of your favourite protein
   my $node      = $graph->nodes_by_id('NP_023232');
   my @neighbors = $graph->neighbors($node); 
   print "NP_023232 interacts with ";
   print join ", ", map { $_->object_id() } @neighbors;
   print "\n";
 
   # Annotate your sequences with interaction info
   my @seqs; ## array of sequence objects
   for my $seq(@seqs) {
     if ($graph->has_node($seq->accession_number)) {
        my $node = $graph->nodes_by_id( $seq->accession_number);
        my @neighbors = $graph->neighbors($node);
        for my $n (@neighbors) {
          my $ft = Bio::SeqFeature::Generic->new(
                       -primary_tag => 'Interactor',
                       -tags        => { id => $n->accession_number }
                       );
             $seq->add_SeqFeature($ft);
         }
      }
   }
 
   # Get proteins with > 10 interactors
   my @nodes = $graph->nodes();
   my @hubs;
   for my $node (@nodes) {
     if ($graph->neighbor_count($node) > 10) {
        push @hubs, $node;
     }
   }
   print "the following proteins have > 10 interactors:\n";
   print join "\n", map{$_->object_id()} @hubs;
 
   # Merge graphs 1 and 2 and flag duplicate edges
   $g1->union($g2);
   my @duplicates = $g1->dup_edges();
   print "these interactions exist in both \$g1 and \$g2:\n";
   print join "\n", map{$_->object_id} @duplicates;
 
 

Creating networks from your own data

If you have interaction data in your own format, e.g.

   edgeid  node1  node2  score

you can build a graph from it one line at a time:

   my $io = Bio::Root::IO->new(-file => 'mydata');
   my $gr = Bio::Graph::ProteinGraph->new();
   my %seen = (); # to record seen nodes
   while (my $l = $io->_readline() ) {
     chomp $l;
 
     # Parse out your data...
     my ($e_id, $n1, $n2, $sc) = split /\s+/, $l;
 
     # ...then make nodes if they don't already exist in the graph...
     my @nodes = ();
     for my $n ($n1, $n2) {
       if (!exists($seen{$n})) {
         push @nodes, Bio::Seq->new(-accession_number => $n);
         $seen{$n} = $nodes[$#nodes];
       } else {
         push @nodes, $seen{$n};
       }
     }
 
     # ...and add a new edge to the graph, inside the loop so that
     # @nodes is still in scope
     my $edge = Bio::Graph::Edge->new(-nodes  => \@nodes,
                                      -id     => $e_id,
                                      -weight => $sc);
     $gr->add_edge($edge);
   }
 
 

DESCRIPTION

A ProteinGraph is a representation of a protein interaction network. It derives most of its functionality from the Bio::Graph::SimpleGraph module, but is adapted to be able to use protein identifiers to identify the nodes.

This graph can use any objects that implement Bio::AnnotatableI and Bio::IdentifiableI interfaces. Bio::Seq (but not Bio::PrimarySeqI) objects can therefore be used for the nodes but any object that supports annotation objects and the object_id() method should work fine.

At present it is fairly 'lightweight' in that it represents nodes and edges but does not contain all the data about experiment IDs etc. found in the Proteomics Standards Initiative schema. Hopefully that will be available soon.

A dataset may contain duplicate or redundant interactions. Duplicate interactions are interactions that occur twice in the dataset but with a different interaction ID, perhaps from a different experiment. The dup_edges method will retrieve these.

Redundant interactions are interactions that occur twice or more in a dataset with the same interaction ID. These are more likely to be due to database errors. The dup_edges and redundant_edge methods are useful when merging 2 datasets using the union() method: interactions present in both datasets, with different IDs, will be duplicate edges.
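The distinction between new, duplicate and redundant interactions can be sketched in plain Perl, without BioPerl. This is only an illustration under stated assumptions, not the module's implementation: the `classify_edge` helper and the `%ids_for_pair` hash are hypothetical names, and each node is identified by a single ID.

```perl
use strict;
use warnings;

# Hypothetical bookkeeping: "nodeA,nodeB" => { edge_id => 1, ... }
my %ids_for_pair;

# Hypothetical helper: returns 'new', 'duplicate' or 'redundant'
# for an edge given as (edge id, node id 1, node id 2).
sub classify_edge {
    my ($id, $n1, $n2) = @_;

    # Sort the node ids so that P1-P2 and P2-P1 give the same key.
    my $pair = join ',', sort { $a cmp $b } $n1, $n2;
    my $ids  = $ids_for_pair{$pair};

    if (!defined $ids) {
        $ids_for_pair{$pair} = { $id => 1 };
        return 'new';                      # first time this pair is seen
    }
    return 'redundant' if $ids->{$id};     # same pair, same edge id
    $ids->{$id} = 1;
    return 'duplicate';                    # same pair, different edge id
}

my @results = map { classify_edge(@{$_}) } (
    [ 'E1', 'P1', 'P2' ],
    [ 'E2', 'P2', 'P1' ],   # same pair as E1, different id
    [ 'E1', 'P1', 'P2' ],   # same pair, same id as the first edge
);
print join(', ', @results), "\n";   # prints "new, duplicate, redundant"
```

Keying on the sorted pair of node IDs keeps the check independent of the order in which the two interactors are listed.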

For Developers

In this module, nodes are represented by Bio::Seq::RichSeq objects containing all possible database identifiers but no sequence, as parsed from the interaction files. However, a node represented by a Bio::PrimarySeq object should work fine too.

Edges are represented by Bio::Graph::Edge objects, which implement Bio::IdentifiableI. In order to work with SimpleGraph these objects must be array references, with the first 2 elements being references to the 2 nodes. More data can be added in $e[2] and subsequent elements.

At present edges only have an identifier and a weight() method, to hold confidence data, but subclasses of this could hold all the interaction data held in an XML document.

So, a graph has the following data:

1. A hash of nodes ('_nodes'), where keys are the text representation of a node's memory address and values are the sequence object references.

2. A hash of neighbors ('_neighbors'), where keys are the text representation of a node's memory address and a value is a reference to a list of neighboring node references.

3. A hash of edges ('_edges'), where a key is a text representation of the 2 nodes, e.g. "address1,address2" as a string, and values are Bio::Graph::Edge objects.

4. Look up hash ('_id_map') for finding a node by any of its ids.

5. Look up hash for edges ('_edge_id_map') for retrieving an edge object from its identifier.

6. Hash ('_components').

7. An array of duplicate edges ('_dup_edges').

8. Hash ('_is_connected').
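The layout listed above can be pictured with plain Perl data structures. The sketch below is an assumption-laden illustration, not the module's code: plain hash references stand in for sequence and edge objects, only some of the listed slots are shown, and the order of the two addresses in the '_edges' key is a guess.

```perl
use strict;
use warnings;

# Plain hash references stand in for sequence objects (an assumption for
# illustration; the module stores Bio::Seq-like objects here).
my $p1 = { id => 'P1' };
my $p2 = { id => 'P2' };
my $e1 = { id => 'E1', weight => 0.8, nodes => [ $p1, $p2 ] };

# "$p1" stringifies the reference (something like "HASH(0x55d0c8)"),
# giving the text representation of a node's memory address used as a key.
my %graph = (
    _nodes       => { "$p1" => $p1, "$p2" => $p2 },
    _neighbors   => { "$p1" => [$p2], "$p2" => [$p1] },
    _edges       => { "$p1,$p2" => $e1 },
    _id_map      => { P1 => $p1, P2 => $p2 },
    _edge_id_map => { E1 => $e1 },
    _dup_edges   => [],
);

# Find a node by id, then list its neighbors through the address key.
my $node         = $graph{_id_map}{P1};
my @neighbor_ids = map { $_->{id} } @{ $graph{_neighbors}{"$node"} };
print "P1 neighbors: @neighbor_ids\n";   # prints "P1 neighbors: P2"
```

Because the keys are stringified references, two distinct objects describing the same protein get different keys, which is why the '_id_map' lookup is needed to find a node by any of its IDs.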

REQUIREMENTS

To use this code you will need the Clone.pm module, available from CPAN. You also need Class::AutoClass, available from CPAN as well. To read in XML data you will need XML::Twig, also available from CPAN.

SEE ALSO

Bio::Graph::SimpleGraph Bio::Graph::IO Bio::Graph::Edge Bio::Graph::IO::dip Bio::Graph::IO::psi_xml

FEEDBACK


Mailing Lists

User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.

   bioperl-l@bioperl.org                  - General discussion
   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 
 

Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web:

   http://bugzilla.open-bio.org/
 
 

AUTHORS

  Richard Adams - this module, Graph::IO modules.
 
  Email richard.adams@ed.ac.uk
 
 

AUTHOR2

  Nat Goodman - SimpleGraph.pm, and all underlying graph algorithms.
 
 

has_node

  Name      : has_node
  Purpose   : Is a protein in the graph?
  Usage     : if ($g->has_node('NP_23456')) {....}
  Returns   : 1 if true, 0 if false
  Arguments : A sequence identifier.
 
 

nodes_by_id

  Name      : nodes_by_id
  Purpose   : get node memory address from an id
  Usage     : my @neighbors = $self->neighbors($self->nodes_by_id('O232322'))
  Returns   : a SimpleGraph node representation (a text representation
              of a node, needed for other graph methods, e.g.
              neighbors(), edges())
  Arguments : a protein identifier, e.g. its accession number.
 
 

union

  Name        : union
  Purpose     : To merge two graphs together, flagging interactions as 
                duplicate.
  Usage       : $g1->union($g2), where g1 and g2 are 2 graph objects. 
  Returns     : void, $g1 is modified
  Arguments   : A Graph object of the same class as the calling object. 
  Description : This method merges 2 graphs. The calling graph is modified, 
                the parameter graph ($g2 in the usage) is unchanged. To take 
                account of differing IDs identifying the same protein, all 
                ids are compared. The following rules are used to modify $g1.
 
                First of all both graphs are scanned for nodes that share 
                an id in common. 
 
          1. If 2 nodes (proteins) share an interaction in both graphs,
             the edge in graph 2 is copied to graph 1 and added as a
             duplicate edge to graph 1,
 
          2. If 2 nodes interact in $g2 but not $g1, but both nodes exist
             in $g1, the attributes of the interaction in $g2 are 
             used to make a new edge in $g1.
 
          3. If 2 nodes interact in $g2 but not $g1, and 1 of them is a new
             protein, that protein is put in $g1 and a new edge made to
             it. 
 
          4. At present, if there is an interaction in $g2 composed of a
             pair of interactors that are not present in $g1, they are 
             not copied to $g1. This is rather conservative but prevents
             the problem of having redundant nodes in $g1 due to the same
             protein being identified by different ids in the same graph.
 
          So, for example 
 
               Edge   N1  N2 Comment
 
     Graph 1:  E1     P1  P2
               E2     P3  P4
               E3     P1  P4
 
     Graph 2:  X1     P1  P2 - will be added as duplicate to Graph1
               X2     P1  X4 - X4 added to Graph 1 and new edge made
               X3     P2  P3 - new edge links existing proteins in G1
               X4     Z4  Z5 - not added to Graph1. Are these different
                               proteins or synonyms for proteins in G1?
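The four rules can be traced on the example table with a small plain-Perl sketch. It is a simplification under stated assumptions, not the union() implementation: edges are `[id, node1, node2]` arrays, each node has a single ID (the real method compares all IDs of a node), nodes added by rule 3 count as present for later edges, and `pair_key`, `%g1_nodes`, `%g1_pairs` and `%outcome` are hypothetical names.

```perl
use strict;
use warnings;

# Edges as [edge id, node 1, node 2], taken from the table above.
my @g1 = ( [qw(E1 P1 P2)], [qw(E2 P3 P4)], [qw(E3 P1 P4)] );
my @g2 = ( [qw(X1 P1 P2)], [qw(X2 P1 X4)], [qw(X3 P2 P3)], [qw(X4 Z4 Z5)] );

# Order-independent key for an interaction between two nodes.
sub pair_key {
    my ($n1, $n2) = @_;
    return join ',', sort { $a cmp $b } $n1, $n2;
}

# Index graph 1 by node id and by interacting pair.
my (%g1_nodes, %g1_pairs);
for my $e (@g1) {
    my ($id, $n1, $n2) = @{$e};
    $g1_nodes{$_} = 1 for ($n1, $n2);
    $g1_pairs{ pair_key($n1, $n2) } = $id;
}

my (%outcome, @duplicates);
for my $e (@g2) {
    my ($id, $n1, $n2) = @{$e};
    my $key   = pair_key($n1, $n2);
    my $known = grep { $g1_nodes{$_} } ($n1, $n2);   # endpoints already in graph 1

    if ($g1_pairs{$key}) {          # rule 1: interaction exists in both graphs
        push @duplicates, $id;
        $outcome{$id} = 'duplicate edge';
    }
    elsif ($known == 2) {           # rule 2: both nodes exist, edge is new
        $g1_pairs{$key} = $id;
        $outcome{$id}   = 'new edge';
    }
    elsif ($known == 1) {           # rule 3: one node is new
        $g1_nodes{$_} = 1 for ($n1, $n2);
        $g1_pairs{$key} = $id;
        $outcome{$id}   = 'new node and edge';
    }
    else {                          # rule 4: neither node is in graph 1
        $outcome{$id} = 'not added';
    }
}

print "$_: $outcome{$_}\n" for sort keys %outcome;
```

Run on the example data, the sketch reports X1 as a duplicate edge, X2 as a new node and edge, X3 as a new edge and X4 as not added, matching the comments in the table.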
 
 

edge_count

  Name     : edge_count
  Purpose  : returns number of unique interactions, excluding 
             redundancies/duplicates
  Arguments: void
  Returns  : An integer
  Usage    : my $count  = $graph->edge_count;
 
 

node_count

  Name     : node_count
  Purpose  : returns number of nodes.
  Arguments: void
  Returns  : An integer
  Usage    : my $count = $graph->node_count;
 
 

neighbor_count

  Name      : neighbor_count
  Purpose   : returns number of neighbors of a given node
  Usage     : my $count = $gr->neighbor_count($node)
  Arguments : a node object
  Returns   : an integer
 
 

_get_ids_by_db

  Name     : _get_ids_by_db
  Purpose  : gets all ids for a node, assuming it is a Bio::Seq object
  Arguments: A Bio::SeqI object
  Returns  : A hash: Keys are db ids, values are accessions
  Usage    : my %ids = $gr->_get_ids_by_db($seqobj);
 
 

add_edge

  Name        : add_edge
  Purpose     : adds an interaction to a graph.
  Usage       : $gr->add_edge($edge)
  Arguments   : a Bio::Graph::Edge object, or a reference to a 2 element list. 
  Returns     : void
  Description : This is the method to use to add an interaction to a graph. 
                It contains the logic used to determine if an edge is a 
                new edge, a duplicate (an existing interaction with a 
                different edge id) or a redundant edge (same interaction, 
                same edge id).
 
 

subgraph

  Name      : subgraph
  Purpose   : To construct a subgraph of nodes from the main network. This 
              method overrides that of Bio::Graph::SimpleGraph in its dealings with 
              Edge objects. 
  Usage     : my $sg = $gr->subgraph(@nodes);
  Returns   : A subgraph of the same class as the original graph. Edge objects are 
              cloned from the original graph but node objects are shared, so beware if you 
              start deleting nodes from the parent graph whilst operating on subgraph nodes. 
  Arguments : A list of node objects.
 
 

add_dup_edge

  Name       : add_dup_edge
  Purpose    : to flag an interaction as a duplicate, taking advantage of 
               edge ids. The idea is that interactions from 2 sources with 
               different interaction ids can be used to provide more 
               evidence for an interaction being true, while preventing 
               redundancy of the same interaction being present more than 
               once in the same dataset. 
  Returns    : 1 on successful addition, 0 on there being an existing 
               duplicate. 
  Usage      : $gr->add_dup_edge(Bio::Graph::Edge->new(-nodes  => [$n1, $n2],
                                                       -weight => $score,
                                                       -id     => $id));
  Arguments  : an EdgeI implementing object.
  Description :
 
 

edge_by_id

  Name        : edge_by_id
  Purpose     : retrieve data about an edge from its id
  Arguments   : a text identifier
  Returns     : a Bio::Graph::Edge object or undef
  Usage       : my $edge = $gr->edge_by_id('1000E');
 
 

remove_dup_edges

  Name        : remove_dup_edges
  Purpose     : removes duplicate edges from graph
  Arguments   : none         - removes all duplicate edges
                edge id list - removes specified edges
  Returns     : void
  Usage       :    $gr->remove_dup_edges()
                or $gr->remove_dup_edges($edgeid1, $edgeid2);
 
 

redundant_edge

  Name        : redundant_edge
  Purpose     : adds/retrieves redundant edges to graph
  Usage       : $gr->redundant_edge($edge)
  Arguments   : none (getter) or a Bio::Graph::Edge object (setter). 
  Description : redundant edges are edges in a graph that have the 
                same edge id, i.e. are 2 identical interactions. 
                With an edge argument, adds it to the list; otherwise 
                returns the list as a reference.
 
 

redundant_edges

  Name         : redundant_edges
  Purpose      : alias for redundant_edge
 
 

remove_redundant_edges

  Name        : remove_redundant_edges
  Purpose     : removes redundant edges from graph, used by remove_nodes()
                (may be better as an internal method)
  Arguments   : none         - removes all redundant edges
                edge id list - removes specified edges
  Returns     : void
  Usage       :    $gr->remove_redundant_edges()
                or $gr->remove_redundant_edges($edgeid1, $edgeid2);
 
 

clustering_coefficient

  Name      : clustering_coefficient
  Purpose   : determines the clustering coefficient of a node, a number 
              in range 0-1 indicating the extent to which the neighbors of
              a node are interconnected.
  Arguments : A sequence object (preferred) or a text identifier
  Returns   : The clustering coefficient. 0 is a valid result.
              If the CC is not calculable (if the node has <2 neighbors),
              returns -1.
  Usage     : my $node = $gr->nodes_by_id('P12345');
              my $cc   = $gr->clustering_coefficient($node);
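For readers unfamiliar with the measure, the sketch below computes a clustering coefficient in plain Perl using the usual definition: the number of links among a node's k neighbors divided by the k*(k-1)/2 possible links. It assumes that definition and mirrors the documented -1 return for fewer than 2 neighbors; the `%adj` adjacency hash and the stand-alone `clustering_coefficient` function are illustrative and are not the module's internals.

```perl
use strict;
use warnings;

# Undirected toy graph as a hash of neighbor sets (illustrative data).
my %adj = (
    A => { B => 1, C => 1, D => 1 },
    B => { A => 1, C => 1 },
    C => { A => 1, B => 1 },
    D => { A => 1 },
);

# Returns links-among-neighbors / possible-links, or -1 when the node
# has fewer than 2 neighbors and the ratio is undefined.
sub clustering_coefficient {
    my ($adj, $node) = @_;
    my @nb = sort keys %{ $adj->{$node} || {} };
    my $k  = scalar @nb;
    return -1 if $k < 2;

    # Count each unordered pair of neighbors that is itself connected.
    my $links = 0;
    for my $i (0 .. $#nb - 1) {
        for my $j ($i + 1 .. $#nb) {
            $links++ if $adj->{ $nb[$i] }{ $nb[$j] };
        }
    }
    return 2 * $links / ($k * ($k - 1));
}

for my $n (sort keys %adj) {
    printf "%s: %.3f\n", $n, clustering_coefficient(\%adj, $n);
}
```

In this toy graph, A has three neighbors with one link among them (B-C), giving 1/3; B and C each score 1; and D, with a single neighbor, yields -1.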
 
 

remove_nodes

  Name      : remove_nodes
  Purpose   : to delete a node from a graph, e.g., to simulate effect 
              of mutation
  Usage     : $gr->remove_nodes($seqobj);
  Arguments : a single $seqobj or list of seq objects (nodes)
  Returns   : 1 on success
 
 

unconnected_nodes

  Name      : unconnected_nodes
  Purpose   : return a list of nodes with no connections. 
  Arguments : none
  Returns   : an array or array reference of unconnected nodes
  Usage     : my @ucnodes = $gr->unconnected_nodes();
 
 

articulation_points

  Name      : articulation_points
  Purpose   : to find nodes in a graph that, if removed, will fragment
              the graph into islands.
  Usage     : my @points = $gr->articulation_points();
              for my $node (@points) {
                print $node->accession_number, "\n";
              }
  Arguments : none
  Returns   : a list of references to nodes that will fragment the graph 
              if deleted. 
  Notes     : This is a "slow but sure" method that works with graphs
              up to a few hundred nodes reasonably fast.
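One way to picture a "slow but sure" search is the brute-force approach sketched below in plain Perl: remove each node in turn and check whether the number of connected components goes up. This is an assumption about the general idea, not a description of the module's algorithm; `count_components`, `articulation_points` and the `%adj` adjacency hash are stand-alone illustrative names.

```perl
use strict;
use warnings;

# Undirected toy graph: a triangle A-B-C with a tail C-D-E.
my %adj = (
    A => { B => 1, C => 1 },
    B => { A => 1, C => 1 },
    C => { A => 1, B => 1, D => 1 },
    D => { C => 1, E => 1 },
    E => { D => 1 },
);

# Count connected components by depth-first search, optionally
# pretending that one node ($skip) has been removed.
sub count_components {
    my ($adj, $skip) = @_;
    my %seen;
    my $count = 0;
    for my $start (sort keys %{$adj}) {
        next if $seen{$start} || (defined $skip && $start eq $skip);
        $count++;
        $seen{$start} = 1;
        my @stack = ($start);
        while (@stack) {
            my $n = pop @stack;
            for my $m (keys %{ $adj->{$n} }) {
                next if $seen{$m} || (defined $skip && $m eq $skip);
                $seen{$m} = 1;
                push @stack, $m;
            }
        }
    }
    return $count;
}

# A node is an articulation point if removing it increases the count.
sub articulation_points {
    my ($adj) = @_;
    my $base = count_components($adj);
    return grep { count_components($adj, $_) > $base } sort keys %{$adj};
}

my @points = articulation_points(\%adj);
print "articulation points: @points\n";   # prints "articulation points: C D"
```

Each candidate node costs a full traversal, so the run time grows roughly with the number of nodes times the size of the graph, which is consistent with such a method being practical only for graphs of a few hundred nodes.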
 
 

is_articulation_point

  Name      : is_articulation_point
  Purpose   : to determine if a given node is an articulation point or not. 
  Usage     : if ($gr->is_articulation_point($node)) {.... 
  Arguments : a text identifier for the protein or the node itself
  Returns   : 1 if node is an articulation point, 0 if it is not