mercury.graph.core
mercury.graph.core.Graph(data=None, keys=None, nodes=None)
This is the main class in mercury.graph.
This class seamlessly abstracts the underlying technology used to represent the graph. You can create a graph by passing one of the following objects to the constructor:
- A pandas DataFrame containing edges (with a keys dictionary to specify the columns and possibly a nodes DataFrame)
- A pyspark DataFrame containing edges (with a keys dictionary to specify the columns and possibly a nodes DataFrame)
- A networkx graph
- A graphframes graph
Bear in mind that the graph object is immutable. This means that you can't modify the graph object once it has been created. If you want to modify it, you have to create a new graph object.
The graph object provides:
- Properties to access the graph in different formats (networkx, graphframes, dgl)
- Properties with metrics and summary information that are calculated on demand and technology independent.
- A base class that other graph classes in mercury-graph inherit from to provide ML algorithms such as graph embedding, visualization, etc.
Using this class from the other classes in mercury-graph:
The other classes in mercury-graph define models or functionalities that are based on graphs. They use a Scikit-learn-like API to interact with the graph object. This means that the graph object is passed to the class constructor and the class follows the Scikit-learn conventions. It is recommended to follow the same conventions when creating your own classes to work with mercury-graph.
The conventions can be found here:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | (DataFrame, Graph or DataFrame) | The data to create the graph from. It can be a pandas DataFrame, a networkx Graph, a pyspark DataFrame, or a Graphframe. In case it already contains a graph (networkx or graphframes), the keys and nodes arguments are ignored. | None |
keys | dict | A dictionary with keys to specify the columns in the data DataFrame. The keys are: When the keys argument is not provided or the key is missing, the default values are: | None |
nodes | DataFrame | A pandas DataFrame or a pyspark DataFrame with the nodes data. (Only when | None |
Source code in mercury/graph/core/graph.py
betweenness_centrality
property
Returns the betweenness centrality of each node in the graph as a Python dictionary.
closeness_centrality
property
Returns the closeness centrality of each node in the graph as a Python dictionary.
connected_components
property
Returns the connected components of each node in the graph as a Python dictionary.
degree
property
Returns the degree of each node in the graph as a Python dictionary.
dgl
property
Returns the graph as a DGL graph.
If the graph has not been converted to a DGL graph yet, it will be converted and cached for future use.
Returns:
Type | Description |
---|---|
DGLGraph | The graph represented as a DGL graph. |
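The convert-on-first-access-then-cache behaviour described for this property (and likewise for graphframe and networkx) can be sketched in plain Python. This is only an illustration of the pattern, not the library's code: the class `LazyGraphViews`, its `adjacency` property, and the `conversions` counter are hypothetical names, and a plain adjacency dictionary stands in for the converted graph object.

```python
class LazyGraphViews:
    """Hypothetical stand-in for a graph object that caches a converted view."""

    def __init__(self, edges):
        self._edges = tuple(edges)  # source data, never modified afterwards
        self._adjacency = None      # cache slot, filled on first access
        self.conversions = 0        # counts how often the conversion actually ran

    def _to_adjacency(self):
        # The potentially expensive conversion step.
        self.conversions += 1
        adjacency = {}
        for src, dst in self._edges:
            adjacency.setdefault(src, []).append(dst)
            adjacency.setdefault(dst, [])
        return adjacency

    @property
    def adjacency(self):
        # Convert only when nothing has been cached yet.
        if self._adjacency is None:
            self._adjacency = self._to_adjacency()
        return self._adjacency


g = LazyGraphViews([("a", "b"), ("b", "c")])
first = g.adjacency   # triggers the conversion
second = g.adjacency  # served from the cache
```

Because the graph object is immutable, a cached view of this kind never has to be invalidated.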
edges
property
Returns an iterator over the edges in the graph.
Returns:
Type | Description |
---|---|
EdgeIterator | An iterator object that allows iterating over the edges in the graph. |
edges_colnames
property
Returns the column names of the edges DataFrame.
graphframe
property
Returns the graph as a GraphFrame.
If the graph has not been converted to a GraphFrame yet, it will be converted and cached for future use.
Returns:
Type | Description |
---|---|
GraphFrame | The graph represented as a GraphFrame. |
in_degree
property
Returns the in-degree of each node in the graph as a Python dictionary.
is_directed
property
Returns True if the graph is directed, False otherwise.
Note
Graphs created using graphframes are always directed. The workaround is to add the reverse edges to the graph. This can be done by creating the Graph from a pyspark DataFrame and setting the key 'directed' to False in the keys dictionary argument. Otherwise, the graph is considered directed, even if the reversed edges were added by other means that this class cannot be aware of.
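To make the idea of adding reverse edges concrete, here is a minimal pure-Python sketch. It does not use graphframes or mercury-graph; `add_reverse_edges` is a hypothetical helper, and edges are assumed to be simple `(src, dst)` tuples.

```python
def add_reverse_edges(edges):
    """Return the edge list plus every reversed edge, without duplicates."""
    seen = set()
    result = []
    for src, dst in edges:
        # Emit the edge itself and its mirror image, skipping any already present.
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result


directed_edges = [("a", "b"), ("b", "c"), ("b", "a")]
symmetric_edges = add_reverse_edges(directed_edges)
```

After this step every edge has a counterpart in the opposite direction, so a directed-only representation behaves like an undirected graph for traversal purposes.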
is_weighted
property
Returns True if the graph is weighted, False otherwise.
A graph is considered weighted if its edges DataFrame has a column named 'weight', or if the weight column has a different name and that name is passed in the keys dictionary argument under the 'weight' key.
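The rule above can be sketched as a small predicate. This is an illustration under stated assumptions, not the library's implementation: `is_weighted` here is a hypothetical free function that receives the edge column names and the keys dictionary, and it assumes that when a 'weight' key is given, only that column name is checked.

```python
def is_weighted(edge_columns, keys=None):
    """Return True if the edge columns contain the configured weight column."""
    keys = keys or {}
    # Assumption: fall back to the column name 'weight' when no key is given.
    weight_column = keys.get("weight", "weight")
    return weight_column in edge_columns


default_name = is_weighted(["src", "dst", "weight"])
no_weight = is_weighted(["src", "dst", "amount"])
custom_name = is_weighted(["src", "dst", "amount"], keys={"weight": "amount"})
```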
networkx
property
Returns the graph representation as a NetworkX graph.
If the graph has not been converted to NetworkX format yet, it will be converted and cached for future use.
Returns:
Type | Description |
---|---|
Graph | The graph representation as a NetworkX graph. |
nodes
property
Returns an iterator over all the nodes in the graph.
Returns:
Type | Description |
---|---|
NodeIterator | An iterator that yields each node in the graph. |
nodes_colnames
property
Returns the column names of the nodes DataFrame.
number_of_edges
property
Returns the number of edges in the graph.
Returns:
Type | Description |
---|---|
int | The number of edges in the graph. |
number_of_nodes
property
Returns the number of nodes in the graph.
Returns:
Type | Description |
---|---|
int | The number of nodes in the graph. |
out_degree
property
Returns the out-degree of each node in the graph as a Python dictionary.
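As a rough illustration of the dictionaries returned by degree, in_degree and out_degree, the following sketch computes all three from a plain edge list. The helper `degree_dicts` is hypothetical, and it assumes a directed graph in which the degree of a node is the sum of its in-degree and out-degree.

```python
from collections import Counter


def degree_dicts(nodes, edges):
    """Return (degree, in_degree, out_degree) dictionaries keyed by node."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    # Iterating over `nodes` gives isolated nodes an explicit zero entry.
    return (
        {n: in_degree[n] + out_degree[n] for n in nodes},
        {n: in_degree[n] for n in nodes},
        {n: out_degree[n] for n in nodes},
    )


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]
degree, in_degree, out_degree = degree_dicts(nodes, edges)
```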
pagerank
property
Returns the PageRank of each node in the graph as a Python dictionary.
_calculate_betweenness_centrality()
This internal method handles the logic of a property. It returns the betweenness centrality of each node in the graph as a Python dictionary. NOTE: This method converts the graph to a networkx graph to calculate the betweenness centrality since the algorithm is too computationally expensive to use on large graphs.
Source code in mercury/graph/core/graph.py
_calculate_closeness_centrality()
This internal method handles the logic of a property. It returns the closeness centrality of each node in the graph as a Python dictionary.
Source code in mercury/graph/core/graph.py
_calculate_connected_components()
This internal method handles the logic of a property. It returns the connected components of each node in the graph as a Python dictionary.
Source code in mercury/graph/core/graph.py
_calculate_degree()
This internal method handles the logic of a property. It returns the degree of each node in the graph.
Source code in mercury/graph/core/graph.py
_calculate_edges_colnames()
This internal method returns the column names of the edges DataFrame.
Source code in mercury/graph/core/graph.py
_calculate_in_degree()
This internal method handles the logic of a property. It returns the in-degree of each node in the graph.
Source code in mercury/graph/core/graph.py
_calculate_nodes_colnames()
This internal method returns the column names of the nodes DataFrame.
Source code in mercury/graph/core/graph.py
_calculate_out_degree()
This internal method handles the logic of a property. It returns the out-degree of each node in the graph.
Source code in mercury/graph/core/graph.py
_calculate_pagerank()
This internal method handles the logic of a property. It returns the PageRank of each node in the graph as a Python dictionary.
Source code in mercury/graph/core/graph.py
_fill_node_zeros(d)
This internal method fills the nodes that are not in the dictionary with a zero value. This makes the output obtained from graphframes consistent with the one from networkx.
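The zero-filling step can be pictured with a short sketch. The function below is a hypothetical standalone version that takes the node collection as an explicit second argument, whereas the documented method only receives `d`; how the actual method obtains the nodes is not shown here.

```python
def fill_node_zeros(values, nodes):
    """Return a dict with an entry for every node, using 0 where none exists."""
    return {node: values.get(node, 0) for node in nodes}


# A metric result that omits nodes without a value, e.g. isolated nodes.
partial = {"a": 2, "b": 1}
complete = fill_node_zeros(partial, ["a", "b", "c"])
```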
Source code in mercury/graph/core/graph.py
_from_dataframe(edges, nodes, keys)
This internal method extends the constructor to accept a pyspark DataFrame as input.
It takes the constructor arguments and does not return anything. It sets the internal state of the object.
Source code in mercury/graph/core/graph.py
_from_graphframes(graph, directed=True)
This internal method extends the constructor to accept a graphframes graph as input.
It takes the constructor arguments and does not return anything. It sets the internal state of the object.
Source code in mercury/graph/core/graph.py
_from_networkx(graph)
This internal method extends the constructor to accept a networkx graph as input.
It takes the constructor arguments and does not return anything. It sets the internal state of the object.
Source code in mercury/graph/core/graph.py
_from_pandas(edges, nodes, keys)
This internal method extends the constructor to accept a pandas DataFrame as input.
It takes the constructor arguments and does not return anything. It sets the internal state of the object.
Source code in mercury/graph/core/graph.py
_to_dgl()
This internal method handles the logic of a property. It returns the dgl graph that already exists or converts it from the networkx graph if not.
Source code in mercury/graph/core/graph.py
_to_graphframe()
This internal method handles the logic of a property. It returns the graphframes graph that already exists or converts it from the networkx graph if not.
Source code in mercury/graph/core/graph.py
_to_networkx()
This internal method handles the logic of a property. It returns the networkx graph that already exists or converts it from the graphframes graph if not.
Source code in mercury/graph/core/graph.py
edges_as_dataframe()
Returns the edges as a pyspark DataFrame.
If the graph is represented as a graphframes graph, the edges are extracted from it. Otherwise, the edges are converted from the pandas DataFrame representation. The columns used as the source and destination nodes are always named 'src' and 'dst', respectively, regardless of the original column names passed to the constructor.
Source code in mercury/graph/core/graph.py
edges_as_pandas()
Returns the edges as a pandas DataFrame.
If the graph is represented as a networkx graph, the edges are extracted from it. Otherwise, the graphframes graph is used. The result may differ in column names and column order from any pandas DataFrame passed to the constructor. The columns used as the source and destination nodes are always named 'src' and 'dst', respectively.
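The column renaming described above can be illustrated without pandas by treating each edge as a dictionary row. This is only a sketch: `normalize_edge_rows` is a hypothetical helper, and the 'src' and 'dst' entries of the keys dictionary are assumed here as the way to name the original columns.

```python
def normalize_edge_rows(rows, keys=None):
    """Return edge rows whose endpoint columns are named 'src' and 'dst'."""
    keys = keys or {}
    src_col = keys.get("src", "src")  # assumed key name for the source column
    dst_col = keys.get("dst", "dst")  # assumed key name for the destination column
    normalized = []
    for row in rows:
        # Keep every other column (e.g. weights) unchanged.
        other = {k: v for k, v in row.items() if k not in (src_col, dst_col)}
        normalized.append({"src": row[src_col], "dst": row[dst_col], **other})
    return normalized


rows = [{"origin": "a", "target": "b", "amount": 3}]
normalized = normalize_edge_rows(rows, keys={"src": "origin", "dst": "target"})
```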
Source code in mercury/graph/core/graph.py
nodes_as_dataframe()
Returns the nodes as a pyspark DataFrame.
If the graph is represented as a graphframes graph, the nodes are extracted from it. Otherwise, the nodes are converted from the pandas DataFrame representation. The column used as the node id is always named 'id', regardless of the original column name passed to the constructor.
Source code in mercury/graph/core/graph.py
nodes_as_pandas()
Returns the nodes as a pandas DataFrame.
If the graph is represented as a networkx graph, the nodes are extracted from it. Otherwise, the graphframes graph is used. The result may differ in column names and column order from any pandas DataFrame passed to the constructor. The column used as the node id is always named 'id'.
Source code in mercury/graph/core/graph.py
mercury.graph.core.SparkInterface(config=None, session=None)
A class that provides an interface for interacting with Apache Spark, graphframes and dgl.
Attributes:
Name | Type | Description |
---|---|---|
_spark_session | SparkSession | The shared Spark session. |
_graphframes | module | The shared graphframes namespace. |
Methods:
Name | Description |
---|---|
_create_spark_session | Creates a Spark session. |
spark | Property that returns the shared Spark session. |
pyspark | Property that returns the pyspark namespace. |
graphframes | Property that returns the shared graphframes namespace. |
dgl | Property that returns the shared dgl namespace. |
read_csv | Reads a CSV file into a DataFrame. |
read_parquet | Reads a Parquet file into a DataFrame. |
read_json | Reads a JSON file into a DataFrame. |
read_text | Reads a text file into a DataFrame. |
read | Reads a file into a DataFrame. |
sql | Executes a SQL query. |
udf | Registers a user-defined function (UDF). |
stop | Stops the Spark session. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | dict | A dictionary of Spark configuration options. If not provided, the configuration in the global variable | None |
Source code in mercury/graph/core/spark_interface.py