More about data science: cienciadedatos.net

Introduction¶

One of the latest trends in the field of artificial intelligence is Graph Machine Learning (GML). This discipline focuses on applying machine learning and statistical algorithms to the study of complex networks or graphs.

A graph is a data structure formed by nodes and links that allows representing relationships between objects. For example, in a social network, the nodes could represent users and the links the connection between them.

GML uses graphs to train machine learning models and to be able to make inferences about the nodes or links. The three main types of problems that can be solved with GML are:

Inference on nodes: predicting characteristics of nodes using the characteristics of their neighboring nodes.
Inference on links: predicting characteristics or existence of links, for example, inferring which links are most likely to occur in the future. This is a typical case of recommendation systems.
Inference on sets of nodes: identifying communities of nodes that present similar behaviors. This is a typical case in customer segmentation with similar characteristics.

In this series of articles, the aim is to show the basics of Graph Machine Learning and solve practical cases using the most important graph libraries in Python such as NetworkX PyTorch Geometric, Stellargraph or Graph Neural Networks.

Partial map of the internet based on information obtained from the opte.org site on January 15, 2005. Each line drawn between two nodes represents the link between two IP addresses. The length of the lines is proportional to the waiting time between the nodes. The image represents 30% of the accessible networks in 2005.

Types of graphs¶

Graphs are classified based on their nodes and links. Several types of graphs can be differentiated depending on the type of relationship that exists between their nodes.

Undirected graphs

The links have no direction, that is, if a node $A$ is connected to a node $B$, then $B$ is also connected to $A$. An example of an undirected graph is a social network where users are connected to each other through friendship ties.

Example of a social network, where the nodes are users and the edges indicate if there is a friendship relationship between them. It is an undirected graph because the connections have no directionality.

Directed graphs

The links have a direction, that is, if a node $A$ is connected to a node $B$, $B$ is not necessarily connected to $A$. An example of a directed graph is a telephone call network where the nodes are people and the edges show each call, taking into account who calls whom.

Example of a directed graph.
**Weighted graphs** The links have an associated weight, which represents the importance or intensity of the relationship between the nodes. An example of a weighted graph is a network of cities where the links represent transport routes and the weight represents the distance between cities.

Example of a network of cities, where the nodes are cities and the edges are the distance between them. It is a weighted graph.

Bipartite graphs

The nodes are divided into two disjoint sets (two types of nodes), and only links between nodes of different sets are allowed. An example of a bipartite graph is a network of movies and actors, where the nodes of one set are the movies and the nodes of the other set are the actors; and only links between movies and actors are allowed.

Example of a bipartite graph, where the nodes can be of two types: actors or movies.

Adjacency Matrix¶

To be able to analyze a graph, it is necessary to represent it mathematically. There are two main ways to do this: the edgelist and the adjacency matrix.

The edgelist is simply a list in which all the connections are indicated. For example, in a directed graph with three nodes {$A$, $B$ and $C$}, where $A$ connects to $B$ and $C$, the edgelist is {($A$,$B$), ($A$,$C$)}. If the graph is undirected, the list must include the connections in both directions {($A$,$B$), ($B$,$A$) ($A$,$C$), ($C$,$A$)}.

Another way to represent a graph is by means of what is known as an adjacency matrix. The adjacency matrix is a matrix of dimension $NxN$, where $N$ is the number of nodes and where a 1 appears if the connection between a pair of nodes exists and a 0 otherwise. For weighted graphs, instead of a 1, the matrix presents the value of the weight of the connection. Due to its mathematical properties, the adjacency matrix is the most used method of representing graphs.

The disadvantage of the adjacency matrix is that for very large graphs, it can take up a lot of space. For this reason, a sparse adjacency matrix is frequently used, which internally only stores the information of the existing connections, that is, it does not store the zeros.

For undirected graphs, the adjacency matrix is symmetric since, if the connection between nodes $A$ and $B$ exists, the reciprocal connection will also exist.

Example of a graph and its adjacency matrix. Being an undirected graph, the matrix is symmetric.

With a mathematical formulation, the adjacency matrix of a directed graph of $\mathrm{N}$ nodes has $\mathrm{N}$ rows and $\mathrm{N}$ columns, its elements being:

$A_{i j}=1$ if there is a link pointing from node $j$ to node $i$
$A_{i j}=0$ if nodes $i$ and $j$ are not connected to each other

The adjacency matrix of an undirected network has two entries for each link. The link $(1,2)$ is represented as $A_{12}$ $=1$ and $A_{21}=1$. Therefore, the adjacency matrix of an undirected network is symmetric, $A_{i j}=A_{j i}.$

In the following figure, the different types of graphs are represented along with their adjacency matrices. As can be seen, directed graphs ($B$) are the only ones that do not have a symmetric adjacency matrix. In the case of bipartite graphs ($D$), they can be projected to obtain the indirect connections between each type of node ($E$), the latter will be seen in depth later.

Example of graphs of different types with their corresponding adjacency matrices. A) Undirected graph. B) Directed graph. C) Weighted graph. D) Bipartite graph, with two types of nodes (green and orange). E) Projection of the bipartite graph on the nodes of each type. Source: wikipedia.

Meta-information of nodes and edges¶

When working with graphs, especially if you want to make predictions or other types of analytics, it is very common to have additional information about the nodes and links. This information is called meta-information because it does not strictly belong to the graph and is usually stored as a table, whose rows are the nodes or links and its columns are the available variables for each one. If the social network example is used, the meta-information of the nodes could be the name, age, hobbies, etc...

Although less common than for nodes, links can also have meta-information. Continuing with the social network example, the meta-information of the edges could be the year in which the friendship relationship between two people began.

Bipartite Graphs¶

A bipartite graph is a network whose nodes can be divided into two disjoint sets $U$ and $V$ such that the links connect nodes of $U$ with nodes of $V$. In other words, if the nodes $U$ are colored green and the nodes $V$ are colored purple, there are only links connecting nodes of different colors.

Example of a bipartite graph. Source: Network Science - Albert-László Barabási.

It is possible to generate two projections for each bipartite network. The first projection connects nodes $U$ if they are connected to the same node $V$ in the bipartite representation. The second projection connects nodes $V$ if they are connected to the same node $U$ in the bipartite representation.

Returning to the previous example of a bipartite network, in which one set of nodes corresponds to movies ($U$) and the other to actors ($V$), and each movie is connected to the actors who act in it; the first projection of this network connects the actor nodes if they have acted in the same movie. The other projection connects the movies that share at least one actor in their cast.

Paths and distances¶

Physical distance plays a key role in the interactions between the components of systems. For example, the physical distance between two cities influences the number of visitors traveling from one city to another.

In networks, distance is a different concept from physical distance. If the question is asked: What is the distance between two users of a social network? Physical distance ceases to be relevant because two individuals living in the same building may not know each other and have very good friends in other parts of the world.

In networks, physical distance is replaced by path length. A path between two nodes is a route that starts at the first node and travels along the links of the network, passing through different nodes until it reaches the second node. The path length represents the number of links that the path between two nodes contains.

In network science, paths play a central role. Below are some of their most important properties:

Reciprocity

In an undirected network, the distance between node $i$ and node $j$ is the same as the distance between node $j$ and node $i$ ($d_{ij}$ = $d_{ji}$). On the contrary, in a directed network, often $d_{ij}$ $\neq$ $d_{ji}$. Furthermore, in a directed network, the existence of a path from node $i$ to node $j$ does not guarantee the existence of a path from $j$ to $i$.

Shortest path between two nodes

In networks, determining the distance between two nodes is equivalent to identifying the shortest path between said nodes. The distance (or shortest path) $d_{ij}$ between two nodes $i$ and $j$ can be calculated directly from the adjacency matrix $A_{i j}$. It is enough to multiply the adjacency matrix by itself as many times as the distance between two nodes.

$d_{i j}=1$: If there is a direct link between $i$ and $j$, then $A_{i j}=1$ ( $A_{i j}=0$ otherwise $)$
$d_{i j}=2$: If there is a path of length 2 between $i$ and $j$, then $A_{i k} A_{k j}=1\left(A_{i k} A_{k j}=0\right.$ otherwise). The number of paths $d_{i j}=2$ between $i$ and $j$ is

$$ N_{i j}^{(2)}=\sum_{k=1}^N A_{i k} A_{k j}=(A^2)_{i j} $$

The number of paths of length $d$ between $i$ and $j$ is $$ N_{i j}^{(d)}=(A^d)_{i j} $$

These equations are valid for both directed and undirected networks.

In other words, to obtain the distance of order 2 between any pair of nodes, it is enough to multiply the adjacency matrix by itself. If we multiply it by itself n times, we obtain the distance matrix of order n.

Thus, the distance matrix of order 2 will show us all the pairs of nodes that are connected with an intermediate node. In the following example, nodes 2 and 5 have nodes 4 and 1 in between, that is, there are two possible paths of order 2 that connect them: 2-4-5 and 2-1-5. For this reason, if the adjacency matrix is multiplied by itself, it is obtained that the element [2,5] of the distance matrix is 2. This means that there are two paths of distance 2 between nodes 2 and 5.

Example of a graph and its adjacency matrix. Being an undirected graph, the matrix is symmetric.

Properties of networks and nodes¶

The following are some important properties of networks and nodes in graph theory. In successive articles they will be explained in more detail, including their formulation and mathematical implications.

Degree of a node: the number of links a node has. Directed graphs have two types of degree: "in-degree" for incoming connections and "out-degree" for outgoing connections.

Neighborhood of a node: the nodes to which a node is directly connected.

Average degree: the average of the degree of all nodes in a network.

Density: the number of links in a network divided by the maximum number of possible links. Density provides an idea of how populated with links the network is.

Connected components: groups of nodes connected to each other. Not all connections have to exist, but from any node you can travel to the rest of the nodes.

Paths and cycles: a path is a sequence of links that connect two nodes, while a cycle is a path that begins and ends at the same node.

Centrality: a measure of the importance of a node in a network, according to its degree, neighborhood, paths and cycles, among other factors.

Clustering: a measure of the number of existing links between the neighbors of a node.

Clique: a set of nodes all connected to each other.

Subgraph: the resulting network when selecting only a series of nodes.

Node Centrality¶

Centrality is a concept used to measure the importance of nodes in a graph. The three most important measures of centrality are:

Degree centrality: it is the simplest measure, it is based on the number of links a node has, the sum of links that enter and leave.

Betweenness centrality: measures the number of paths that pass through a node.

Closeness centrality: measures the average distance from a node to all other nodes in the graph.

To learn more about the properties of graphs, you can visit the next chapter of the XX series.

Graphs with NetworkX¶

NetworkX is one of the most used Python libraries for working with graphs and networks. NetworkX allows you to create, manipulate and analyze graphs efficiently.

One of the main advantages of NetworkX is its ability to work with graphs of great size and complexity, allowing you to handle graphs with millions of nodes and links. The library has a wide variety of functionalities that allow you to create, import and export graphs in multiple formats, as well as analyze the properties of these networks (average degree, density, clustering coefficient, the shortest path between two nodes, and many others). In addition, it has a series of algorithms to search for patterns, such as community detection, centrality detection and connected components detection. All these properties will be seen in successive articles.

NetworkX is also compatible with other Python libraries, such as NumPy, SciPy, Matplotlib or Pytorch, which allows you to easily integrate network analysis into a broader data analysis workflow.

Important Functions in NetworkX¶

Below is a list of some of the most used functions in NetworkX.

Graph creation: NetworkX allows you to create empty graphs or graphs with specific nodes and links using functions like Graph() (undirected graphs), DiGraph() (directed graphs), or even for sets of graphs with MultiGraph() and MultiDiGraph().

Adding nodes and links: You can add nodes and links to an existing graph manually, either one by one with add_node() and add_edge(), or from a list with add_nodes_from() and add_edges_from(). They can also be added from a file or DataFrame with from_pandas_edgelist(). For weighted graphs, the add_weighted_edges_from function is used.

Graph information: You can get basic information about the graph, such as the number of nodes and links, using the number_of_nodes() and number_of_edges() functions. Using nodes() and edges() you can access the meta-information of nodes and edges.

Neighbors and degree: You can identify the neighbors and the degree of a node using the functions: neighbors() and degree(); for directed networks, in_degree() and out_degree() are used.

Centrality: You can calculate centrality measures, such as degree centrality, betweenness centrality, and closeness centrality among nodes using the degree_centrality(), betweenness_centrality() and closeness_centrality() functions.

Connected components: You can find the connected components of a graph using the connected_components() function.

Reading and writing files: NetworkX allows you to read and write graphs in various file formats using read_edgelist(), write_edgelist(), read_adjlist() and write_adjlist().

Optimization algorithms: NetworkX has integrated optimization algorithms such as the shortest path and the maximum flow. They are accessed with the shortest_path() and dijkstra_path() functions.

Subgraphs: You can generate subgraphs of a graph using the subgraph() function, giving a list of nodes as input.

Set operations: NetworkX allows you to perform set operations on graphs such as union, intersection and difference using union(), intersection() and difference().

Adjacency matrix: With the nx.adjacency_matrix(G) function you get the adjacency matrix of a graph. By default, it is in sparse format, to see it on the screen you have to use the to_dense() function from numpy.

Graph plot: The draw() function allows you to draw graphs using the Matplotlib library.

Creating Undirected Graphs¶

NetworkX allows you to create networks manually, adding nodes and edges one by one or from a file or a DataFrame containing the connections. The latter is especially useful when working with network data that has already been collected and is in a structured format.

Manual Creation¶

# Libraries
# ======================================================================================
import networkx as nx
import pandas as pd
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

To create a graph in NetworkX, a "Graph" object must be created using the nx.Graph() function. This function creates an empty graph, without nodes or edges, to which elements can be added later.

# Creating a "Graph" type instance
# ======================================================================================
G = nx.Graph()
print(G)

Graph with 0 nodes and 0 edges

It is verified that it is an undirected graph.

G.is_directed()

False

Once the Graph object has been created, it can be populated with nodes and connections. Two methods are used for this:

add_node: adds a single node to the graph.

add_nodes_from: adds multiple nodes to the graph.

add_edge: adds an edge between nodes u and v. If the nodes do not exist, they are created and added to the graph automatically.

add_edges_from: same behavior as add_edge but using a collection of edges. Each edge is defined with a tuple (u, v).

The name of the nodes can be either numeric or characters.

# Add a single node
# ======================================================================================
fig, ax = plt.subplots(figsize=(1,1))
G.add_node("A")
nx.draw(G, with_labels=True, ax=ax)
print(G)

Graph with 1 nodes and 0 edges

# Add multiple nodes
# ======================================================================================
G.add_nodes_from(["B", "C"])

fig, ax = plt.subplots(figsize=(3, 2))
nx.draw(G, with_labels=True, ax=ax)
ax.set_xlim([1.2*x for x in ax.get_xlim()])
ax.set_ylim([1.2*y for y in ax.get_ylim()])
print(G)

Graph with 3 nodes and 0 edges

# Add a single edge
# ======================================================================================
G.add_edge("A", "B")

fig, ax = plt.subplots(figsize=(3, 2))
nx.draw(G, with_labels=True, ax=ax)
ax.set_xlim([1.2*x for x in ax.get_xlim()])
ax.set_ylim([1.2*y for y in ax.get_ylim()])
print(G)

Graph with 3 nodes and 1 edges

# Add multiple edges
# ======================================================================================
G.add_edges_from([("A", "C"), ("B", "C")])

fig, ax = plt.subplots(figsize=(3, 2))
nx.draw(G, with_labels=True, ax=ax)
ax.set_xlim([1.2*x for x in ax.get_xlim()])
ax.set_ylim([1.2*y for y in ax.get_ylim()])
print(G)

Graph with 3 nodes and 3 edges

If a connection is added whose nodes do not exist, they are created automatically.

G.add_edges_from([("D", "E"), ("E", "F")])
fig, ax = plt.subplots(figsize=(4, 4))
nx.draw(G, with_labels=True, ax=ax)
ax.set_xlim([1.2*x for x in ax.get_xlim()])
ax.set_ylim([1.2*y for y in ax.get_ylim()])
print(G)

Graph with 6 nodes and 5 edges

The corresponding adjacency matrix is:

adjM = nx.adjacency_matrix(G)
# The matrix is converted from sparse to dense format to be able to print it
adjM = adjM.todense()
adjM

array([[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 1, 0]])

And the number of nodes and edges:

print(G.number_of_edges())
print(G.number_of_nodes())

5
6

The information of the nodes and edges of the graph is stored in the nodes and edges attributes.

print(f"Graph nodes: {G.nodes}")
print(f"Graph edges: {G.edges}")

Graph nodes: ['A', 'B', 'C', 'D', 'E', 'F']
Graph edges: [('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'E'), ('E', 'F')]

Creation from a DataFrame¶

To create a graph from a pandas dataframe, the information has to be structured in such a way that one column represents the start of each edge and another the destination. For example, to represent that there are two nodes ("A" and "B") connected to each other, a row is needed that contains the value "A" in one column and "B" in another. This information is sufficient for the two nodes and the connection between them to be created.

# Dataframe with the graph connections
# ======================================================================================
connections = pd.DataFrame(
    {
        "source": ["A", "B", "C"],
        "target": ["C", "C", "D"],
    }
)
connections

	source	target
0	A	C
1	B	C
2	C	D

In the from_pandas_edgelist function, the source and target columns are indicated (for undirected graphs, a source is chosen indistinctly).

# Create a graph from a Dataframe
# ======================================================================================
fig, ax = plt.subplots(figsize=(3,2))
G = nx.from_pandas_edgelist(connections, source="source", target="target")
nx.draw(G, with_labels=True, ax=ax)

Creating directed graphs¶

Manual creation¶

As explained previously, the links of directed graphs have a defined direction (the links of these graphs are represented with an arrow). The process of creating a directed graph is equivalent to that of an undirected graph but using DiGraph instead of Graph.

# Creation of a directed graph".
# ======================================================================================
G = nx.DiGraph()

# Connections
G.add_edges_from([(1, 2), (2, 3), (1, 4), (1, 5), (4, 2), (5, 4)])

# Graphic representation
fig, ax = plt.subplots(figsize=(3,2))
nx.draw(G, with_labels=True, ax=ax)

It is verified that it is a directed graph.

G.is_directed()

True

Creation from a DataFrame¶

The process of creating a directed graph from a DataFrame is equivalent to that of undirected graphs shown in the previous section, with the only difference that the argument create_using = nx.DiGraph must be indicated.

# Create a directed graph from a Dataframe
# ======================================================================================
G = nx.from_pandas_edgelist(
    connections,
    source = "source",
    target = "target",
    create_using = nx.DiGraph
)
fig, ax = plt.subplots(figsize=(3,2))
nx.draw(G, with_labels=True, ax=ax)

Since it is a directed graph, its adjacency matrix is not symmetric.

adjM = nx.adjacency_matrix(G)
adjM = adjM.todense()
adjM

array([[0, 1, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 0, 0]])

Distance between two nodes¶

To obtain the distance of order 2 between any pair of nodes, it is enough to multiply the adjacency matrix by itself. If we multiply it by itself n times, we obtain the distance matrix of order n.

# Graph creation
G = nx.Graph()

# Add nodes to the graph
G.add_edges_from([(1, 2), (2, 3), (1, 4), (1, 5), (4, 2), (5, 4)])

# Graphic representation
fig, ax = plt.subplots(figsize=(3,2))
nx.draw(G, with_labels=True, ax=ax)

# Adjacency matrix
adjM = nx.adjacency_matrix(G).todense()
adjM

array([[0, 1, 0, 1, 1],
       [1, 0, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [1, 1, 0, 0, 1],
       [1, 0, 0, 1, 0]])

The distance matrix of order 2 shows all pairs of nodes that are connected with an intermediate node. In the example, nodes 2 and 5 have nodes 4 and 1 in between, that is, there are two possible paths of order 2 that connect them: 2-4-5 and 2-1-5. For that reason, if the adjacency matrix is multiplied by itself, the element [2, 5] of the distance matrix has the value 2. This indicates that there are two paths of distance 2 between nodes 2 and 5.

# Matrix multiplication by itself
distances_order_two = adjM @ adjM
distances_order_two

array([[3, 1, 1, 2, 1],
       [1, 3, 0, 1, 2],
       [1, 0, 1, 1, 0],
       [2, 1, 1, 3, 1],
       [1, 2, 0, 1, 2]])

# Python indices start at 0
print(f"Paths of order two between nodes 2 and 5 = {distances_order_two[1, 4]}")

Paths of order two between nodes 2 and 5 = 2

You can also calculate the shortest path using the nx.shortest_path function directly.

nx.shortest_path(G, source=2, target=5)

[2, 1, 5]

Weighted graph¶

In a weighted graph, the edges of the graph have an associated weight. In the graphical representation of this type of network, the edges are usually shown with a different width depending on their weight.

# Graph creation
G = nx.Graph()

# Nodes and connections
G.add_weighted_edges_from(
    [(1, 2, 0.5),
    (2, 3, 0.9),
    (1, 4, 0.1),
    (1, 5, 0.75),
    (4, 2, 0.01),
    (5, 4, 0.3)]
)
G.edges(data=True)

EdgeDataView([(1, 2, {'weight': 0.5}), (1, 4, {'weight': 0.1}), (1, 5, {'weight': 0.75}), (2, 3, {'weight': 0.9}), (2, 4, {'weight': 0.01}), (4, 5, {'weight': 0.3})])

# It is verified that it is an undirected and weighted graph
G.is_directed()

False

nx.is_weighted(G)

True

# The weight of each edge is shown, as well as the nodes it connects
[a for a in G.edges(data=True)]

[(1, 2, {'weight': 0.5}),
 (1, 4, {'weight': 0.1}),
 (1, 5, {'weight': 0.75}),
 (2, 3, {'weight': 0.9}),
 (2, 4, {'weight': 0.01}),
 (4, 5, {'weight': 0.3})]

To extract the weights, iterate over each edge and access the third element of the tuple.

weights = [a[2]["weight"] for a in G.edges(data=True)]
weights

[0.5, 0.1, 0.75, 0.9, 0.01, 0.3]

To draw a weighted graph with NetworkX, it is necessary to draw nodes and edges separately. First, a graph layout is used that defines the position of the nodes. Once their position is defined, the nodes and edges are represented. For this example, the spring_layout is used, but there are many more.

# Define the position of the nodes using a layout
pos = nx.spring_layout(G)

# Represent nodes and edges
fig, ax = plt.subplots(figsize=(3,2))
nx.draw_networkx_nodes(
    G,
    pos = pos,
    ax = ax
)
nx.draw_networkx_edges(
    G,
    pos = pos,
    edgelist = G.edges,
    width = weights,
    ax = ax
);

Other layout types

Examples of the most used layouts.

Representing nodes and edges separately allows for more control over visual features, for example, the color of nodes and edges.

pos = nx.circular_layout(G)

fig, ax = plt.subplots(figsize=(3,3))
nx.draw_networkx_nodes(G, pos=pos, node_color="red", ax=ax)
nx.draw_networkx_edges(G, pos=pos, edgelist=G.edges, width=weights, edge_color="blue", ax=ax);

Bipartite graph¶

When the nodes of a graph represent entities of a different nature, the term bipartite graph is used. A common example of this type of graph is publication networks where there are "article" type nodes and "writer" type nodes.

In this type of graph, connections can only occur between nodes of a different nature, each writer is connected to the articles they have written. There can be no direct connections between articles or between writers.

To create a bipartite graph in NetworkX, the add_nodes_from() method is used to add the nodes, indicating the type with the bipartite argument, and then the add_edges_from() method is used to add the relationships between them. Below is an example of how to create a bipartite graph with two sets of nodes called "group A" and "group B".

# Create an empty bipartite graph
G_movies_actors = nx.Graph()

# Add the nodes of each group
G_movies_actors.add_nodes_from(["Movie_1", "Movie_2", "Movie_3"], bipartite="Movies")
G_movies_actors.add_nodes_from(["Actor_1", "Actor_2", "Actor_3"], bipartite="Actors")

# Add the relationships between the nodes
G_movies_actors.add_edges_from(
    [
        ("Movie_1", "Actor_1"),
        ("Movie_2", "Actor_2"),
        ("Movie_3", "Actor_3"),
        ("Movie_2", "Actor_3"),
        ("Movie_2", "Actor_1"),
    ]
)

G_movies_actors.nodes(data=True)

NodeDataView({'Movie_1': {'bipartite': 'Movies'}, 'Movie_2': {'bipartite': 'Movies'}, 'Movie_3': {'bipartite': 'Movies'}, 'Actor_1': {'bipartite': 'Actors'}, 'Actor_2': {'bipartite': 'Actors'}, 'Actor_3': {'bipartite': 'Actors'}})

To access the type of each node, iterate through each node and access the "bipartite" attribute.

node_type = [
    G_movies_actors.nodes[i]["bipartite"]
    for i in G_movies_actors.nodes()
]
node_type

['Movies', 'Movies', 'Movies', 'Actors', 'Actors', 'Actors']

# colors for each type of node in the bipartite graph
colors = {"Movies": "red", "Actors": "blue"}
node_colors = [colors[n] for n in node_type]

fig, ax = plt.subplots(figsize=(5, 4))
nx.draw(
    G_movies_actors,
    pos=nx.spring_layout(G_movies_actors),
    with_labels=True,
    node_color=node_colors,
    ax=ax,
)

In the code above, the spring_layout function is used to position the nodes in the graph. With the argument with_labels=True, the node labels are displayed, and with node_color, the color of each node is assigned.

Once the bipartite graph is created, a projection of it can be made to obtain a non-bipartite graph that contains only the nodes of one of the sets. To do this, the project() method is used, specifying the set of nodes to be included in the projection. Below, a projection of the bipartite graph is generated to obtain only the nodes of the 'Actors' group.

# The nodes of the set of interest are identified
nodes_bipartite = [
    n[0]
    for n in G_movies_actors.nodes(data=True)
    if n[1]["bipartite"] == "Actors"
]

# Projection of the bipartite graph to obtain only the nodes of group A
G_actors = nx.bipartite.projected_graph(G_movies_actors, nodes_bipartite)

fig, ax = plt.subplots(figsize=(4, 4))
nx.draw(G_actors, with_labels=True, ax=ax)
ax.set_xlim([1.2 * x for x in ax.get_xlim()])
ax.set_ylim([1.2 * y for y in ax.get_ylim()])

(-1.2399190268447968, 1.275018745745212)

Meta-information: attributes of nodes and edges¶

In most real graph problems, additional information about nodes and edges is available. Node attributes are added with the networkx.set_node_attributes(Graph, dictionary, name) method and edge attributes are added with the networkx.set_edge_attributes() method.

# Graph creation
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 4), (1, 5), (4, 2), (5, 4)])
G.edges(data=True)

EdgeDataView([(1, 2, {}), (1, 4, {}), (1, 5, {}), (2, 3, {}), (2, 4, {}), (4, 5, {})])

# Add node attributes
node_name = {1: "Jaime", 2: "María", 3: "Julio", 4: "Rosa", 5: "Alberto"}

node_hobbies = {
    1: ["Soccer"],
    2: ["Dancing", "Paddle"],
    3: ["Golf", "Dancing"],
    4: ["Cooking"],
    5: ["Cooking", "Ham"],
}

nx.set_node_attributes(G, node_name, name="Name")
nx.set_node_attributes(G, node_hobbies, name="Hobbies")

# add edge attributes
edges_weight = {
    (1, 2): 0.5,
    (2, 3): 0.9,
    (1, 4): 0.1,
    (1, 5): 0.75,
    (4, 2): 0.01,
    (5, 4): 0.3,
}

nx.set_edge_attributes(G, edges_weight, name="weight")

To access the attributes of nodes and edges (meta-information), G.nodes(data=True) or G.edges(data=True) is used. These commands return a dictionary-like structure where the key is the name of the node and the value contains all the attributes.

G.nodes(data=True)

NodeDataView({1: {'Name': 'Jaime', 'Hobbies': ['Soccer']}, 2: {'Name': 'María', 'Hobbies': ['Dancing', 'Paddle']}, 3: {'Name': 'Julio', 'Hobbies': ['Golf', 'Dancing']}, 4: {'Name': 'Rosa', 'Hobbies': ['Cooking']}, 5: {'Name': 'Alberto', 'Hobbies': ['Cooking', 'Ham']}})

G.edges(data=True)

EdgeDataView([(1, 2, {'weight': 0.5}), (1, 4, {'weight': 0.1}), (1, 5, {'weight': 0.75}), (2, 3, {'weight': 0.9}), (2, 4, {'weight': 0.01}), (4, 5, {'weight': 0.3})])

The attributes of edges and nodes can be iterated directly as if it were a dictionary.

for m, n, w in G.edges(data=True):
    print(
        f"Edge connecting node {m} with node {n} and has a weight of {w['weight']}."
    )

Edge connecting node 1 with node 2 and has a weight of 0.5.
Edge connecting node 1 with node 4 and has a weight of 0.1.
Edge connecting node 1 with node 5 and has a weight of 0.75.
Edge connecting node 2 with node 3 and has a weight of 0.9.
Edge connecting node 2 with node 4 and has a weight of 0.01.
Edge connecting node 4 with node 5 and has a weight of 0.3.

The "Social circles: Facebook" database was created by researchers at Stanford University in 2012. This dataset consists of Facebook friendship networks. The available data includes node characteristics (profiles) and their friendship connections. The data was anonymized by replacing names and other identifications with a numerical index. They can be downloaded from the Stanford website (https://snap.stanford.edu/data/ego-Facebook.html).

Libraries¶

# Libraries
# ==============================================================================
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import networkx as nx
warnings.filterwarnings("ignore")

Data reading¶

# Data reading
# ==============================================================================
facebook = pd.read_csv(
    "../data/facebook_combined.txt.gz",
    header=None,
    sep=" ",
    names=["user_1", "user_2"],
)

The data consists of a list of edges between users, each row is a friendship relationship. To reduce computational requirements, this example uses only the first 2000 connections.

facebook = facebook[:2000]
facebook.head()

	user_1	user_2
0	0	1
1	0	2
2	0	3
3	0	4
4	0	5

Since they are friendship relationships and have no directionality, they are represented by an undirected graph.

# Graph creation
# ==============================================================================
G_facebook = nx.from_pandas_edgelist(facebook, source="user_1", target="user_2", create_using=nx.Graph())

Graph information¶

Information about the structure of the graph is displayed.

print("Number of nodes:", G_facebook.number_of_nodes())
print("Number of links:", G_facebook.number_of_edges())

Number of nodes: 713
Number of links: 2000

Network visualization¶

When visualizing the network, it is observed that there are three communities of users:

Two large sets of highly connected users.
A group of a few central nodes, some of which connect the other two large groups.

fig, ax = plt.subplots(figsize=(9, 5))
spring_pos = nx.spring_layout(G_facebook)
nx.draw_networkx(
    G_facebook, pos=spring_pos, with_labels=False, node_size=15, ax=ax
)

Neighbors and degree¶

For any node in the graph, you can know who its neighbors are and its degree (the number of neighbors the node has).

node_id = 4
neighbors = list(G_facebook.neighbors(node_id))
print("Neighbors of node {node_id}:", neighbors)

degree = G_facebook.degree[node_id]
print("Degree of node {node_id}:", degree)

Neighbors of node {node_id}: [0, 78, 152, 181, 195, 218, 273, 275, 306, 328]
Degree of node {node_id}: 10

Using the subgraph function, you can extract the node and its neighbors, and analyze them in more detail.

nodes = neighbors + [node_id]
G_s = nx.subgraph(G_facebook, nodes)

fig, ax = plt.subplots(figsize=(7, 4))
nx.draw_networkx(G_s, pos=spring_pos, with_labels=True, node_size=200, node_color='r', ax=ax)

Centrality¶

As described above, centrality is a concept used to measure the importance of nodes in a graph.

# Centrality metrics for a single node
# ==============================================================================
node_id = 4

# 1. **Degree centrality**
# It is based on the number of links a node has. It is the number of links that enter and leave a node.
degree_centrality_val = nx.degree_centrality(G_facebook)
print("Degree centrality:", degree_centrality_val[node_id])

# 2. **Betweenness centrality**
# Measures the number of paths that pass through a node.
betweenness_centrality_val = nx.betweenness_centrality(G_facebook)
print("Betweenness centrality:", betweenness_centrality_val[node_id])

# 3. **Closeness centrality**
# Measures the average distance from a node to all other nodes in the graph.
closeness_centrality_val = nx.closeness_centrality(G_facebook)
print("Closeness centrality:", closeness_centrality_val[node_id])

Degree centrality: 0.014044943820224719
Betweenness centrality: 4.345833530871221e-05
Closeness centrality: 0.40022484541877457
Betweenness centrality: 4.345833530871221e-05
Closeness centrality: 0.40022484541877457

# Centrality metrics for all nodes
# ==============================================================================
degree = pd.DataFrame.from_dict(degree_centrality_val, orient='index',columns=["degree"])
betweenness = pd.DataFrame.from_dict(betweenness_centrality_val, orient='index',columns=["betweenness"])
closeness = pd.DataFrame.from_dict(closeness_centrality_val, orient='index',columns=["closeness"])
centrality = pd.concat([degree, betweenness, closeness], axis=1)
centrality

	degree	betweenness	closeness
0	0.487360	0.704019	0.661096
1	0.023876	0.000119	0.401806
2	0.014045	0.000026	0.400225
3	0.023876	0.000079	0.401806
4	0.014045	0.000043	0.400225
...	...	...	...
1222	0.001404	0.000000	0.401127
1223	0.001404	0.000000	0.401127
1224	0.001404	0.000000	0.401127
1225	0.001404	0.000000	0.401127
1226	0.001404	0.000000	0.401127

713 rows × 3 columns

It can be observed that, as a general rule, nodes with high degree centrality also have high betweenness centrality. However, some nodes can be found that have high betweenness but low degree.

In this graph, four nodes with high betweenness and low degree can be observed. These four nodes are the four nodes that were between the two large groups of users.

fig, ax = plt.subplots(figsize=(6, 3))
centrality.plot.scatter(x="degree", y="betweenness", ax=ax)
ax.set_xscale('log')
ax.set_yscale('log')

Advanced network visualization¶

There are other libraries with which much larger networks can be visualized very efficiently and interactively. In future articles, the main alternatives will be shown in detail.

Representation generated with Netwulf.

Session Information¶

import session_info
session_info.show(html=False)

-----
matplotlib          3.10.7
networkx            3.5
pandas              2.3.3
session_info        v1.0.1
-----
IPython             9.6.0
jupyter_client      7.4.9
jupyter_core        5.9.1
notebook            6.5.7
-----
Python 3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:16:10) [GCC 11.2.0]
Linux-6.14.0-35-generic-x86_64-with-glibc2.39
-----
Session information updated at 2025-11-21 00:13

Bibliography¶

Network Science - Albert-László Barabási

https://towardsdatascience.com/graph-machine-learning-with-python-pt-1-basics-metrics-and-algorithms-cc40972de113

https://towardsdatascience.com/graph-machine-learning-with-python-pt-2-random-graphs-and-diffusion-models-of-cryptopunks-trading-99cd5170b5ea

https://towardsdatascience.com/graph-machine-learning-with-python-part-3-unsupervised-learning-aa2854fe0ff2

https://ericmjl.github.io/Network-Analysis-Made-Simple/01-introduction/03-viz/

http://networksciencebook.com/

Citation Instructions¶

How to cite this document?

If you use this document or any part of it, we would appreciate it if you would cite it. Thank you very much!

Introduction to graphs and networks with Python by Fernando Carazo and Joaquín Amat Rodrigo, available under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0 DEED) license at https://www.cienciadedatos.net/documentos/pygml01-introduction-graphs-networks-python.html

Did you like the article? Your help is important

Your contribution will help me to continue generating free outreach content. Thank you very much! 😊

This document created by Joaquín Amat Rodrigo is licensed under Attribution-NonCommercial-ShareAlike 4.0 International.

You are free to:

Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.

Under the following terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Introduction to Graphs and Networks with Python

Fernando Carazo, Joaquín Amat

December 2020 (last update: November 2025)

Introduction¶

Types of graphs¶

Adjacency Matrix¶

Meta-information of nodes and edges¶

Bipartite Graphs¶

Paths and distances¶

Properties of networks and nodes¶

Node Centrality¶

Graphs with NetworkX¶

Important Functions in NetworkX¶

Creating Undirected Graphs¶

Manual Creation¶

Creation from a DataFrame¶

Creating directed graphs¶

Manual creation¶

Creation from a DataFrame¶

Distance between two nodes¶

Weighted graph¶

Bipartite graph¶

Meta-information: attributes of nodes and edges¶

Libraries¶

Data reading¶

Graph information¶

Network visualization¶

Neighbors and degree¶

Centrality¶

Advanced network visualization¶

Session Information¶

Bibliography¶

Citation Instructions¶

Introduction¶

Types of graphs¶

Adjacency Matrix¶

Meta-information of nodes and edges¶

Bipartite Graphs¶

Paths and distances¶

Properties of networks and nodes¶

Node Centrality¶

Graphs with NetworkX¶

Important Functions in NetworkX¶

Creating Undirected Graphs¶

Manual Creation¶

Creation from a DataFrame¶

Creating directed graphs¶

Manual creation¶

Creation from a DataFrame¶

Distance between two nodes¶

Weighted graph¶

Bipartite graph¶

Meta-information: attributes of nodes and edges¶

Case study: Facebook social circles (Stanford)¶

Libraries¶

Data reading¶

Graph information¶

Network visualization¶

Neighbors and degree¶

Centrality¶

Advanced network visualization¶

Session Information¶

Bibliography¶

Citation Instructions¶