mercury.graph.embeddings
mercury.graph.embeddings.Embeddings(dimension, num_elements=0, mean=0, sd=1, learn_step=3, bidirectional=False)
Bases: BaseClass
This class holds a matrix object that is interpreted as the embeddings for any list of objects, not only the nodes of a graph. You can see this class as the internal object holding the embedding for other classes, such as GraphEmbedding.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dimension` | int | The number of columns in the embedding. See the note below. | required |
`num_elements` | int | The number of rows in the embedding. You can leave this empty on creation and then use initialize_as() to automatically match the nodes in a graph. | 0 |
`mean` | float | The (expected) mean of the initial values. | 0 |
`sd` | float | The (expected) standard deviation of the initial values. | 1 |
`learn_step` | float | The size of the learning step by which elements are approached or moved away. Units are hexadecimal degrees along an ellipse. | 3 |
`bidirectional` | bool | Whether the changes apply only to the elements in the first column (False) or to both columns (True). | False |
Note
On dimension: Embeddings cannot have dimension zero (that is against the whole concept). Low-dimensional embeddings can only hold a few elements without introducing spurious correlations, by some form of 'birthday attack' phenomenon, as the number of elements increases. Later it is very hard to get rid of that spurious 'knowledge'.
Solution: With many elements, you have to go to a high enough dimension even if the structure is simple. Pretending to fit many embeddings in a low dimension without them being correlated is like pretending to plot a trillion random points in a square centimeter while keeping them 1 mm apart from each other: it is simply impossible!
Source code in mercury/graph/embeddings/embeddings.py
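A minimal usage sketch, assuming only the constructor and as_numpy() documented on this page; the dimension and element count are arbitrary toy values:

```python
from mercury.graph.embeddings import Embeddings

# Toy example: 100 elements (rows) embedded in 8 dimensions (columns),
# initialized with random values of mean 0 and standard deviation 1.
emb = Embeddings(dimension=8, num_elements=100)

print(emb.as_numpy().shape)  # expected: (100, 8)
```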
as_numpy()
Return the embedding as a numpy matrix where each row is an embedding.
Source code in mercury/graph/embeddings/embeddings.py
fit(converge=None, diverge=None)
Apply a learning step to the embedding.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`converge` | numpy matrix of two columns | A matrix of index pairs: the element indexed by the first column should be approached to the element indexed by the second column. | None |
`diverge` | numpy matrix of two columns | A matrix of index pairs: the element indexed by the first column should be moved away from the element indexed by the second column. | None |

Returns:

Type | Description |
---|---|
self | Fitted self (or raises an error) |
Note
Embeddings start out randomly distributed and hold no structure other than spurious correlations. Each time you apply a learning step by calling this method, you are tweaking the embedding to approach some rows and/or move others away. You can use both converge and diverge, or just one of them, and call this as many times as you want with a varying learning step. A proxy of how much an embedding can learn can be estimated by measuring how row correlations converge towards some asymptotic values.
Source code in mercury/graph/embeddings/embeddings.py
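A sketch of a learning loop, assuming toy index pairs; the converge/diverge matrices below are illustrative, not taken from this page:

```python
import numpy as np
from mercury.graph.embeddings import Embeddings

emb = Embeddings(dimension=4, num_elements=10)

# Each row is a pair of element indices: the element in the first column
# should be approached to (converge) or moved away from (diverge) the
# element in the second column.
converge = np.array([[0, 1], [2, 3], [4, 5]])
diverge = np.array([[0, 9], [1, 8]])

# Apply several learning steps; fit() returns self, so calls can be chained.
for _ in range(20):
    emb.fit(converge=converge, diverge=diverge)
```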
get_most_similar_embeddings(index, k=5, metric='cosine')
Given the index of a vector in the embedding matrix, returns the k most similar embeddings in the matrix.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`index` | int | Index of the vector in the matrix for which we want the most similar embeddings. | required |
`k` | int | Number of most similar embeddings to return. | 5 |
`metric` | str | Metric to use as a similarity measure. | 'cosine' |

Returns:

Type | Description |
---|---|
list | A list of the k most similar embeddings as indices and a list of the similarities of those embeddings. |
Source code in mercury/graph/embeddings/embeddings.py
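Continuing the toy Embeddings sketch above, a query example; the unpacking assumes the method returns a list of indices and a list of similarities, as described in the Returns table:

```python
# Find the 5 rows most similar to row 0 using cosine similarity.
indices, similarities = emb.get_most_similar_embeddings(0, k=5, metric="cosine")
print(list(zip(indices, similarities)))
```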
mercury.graph.embeddings.GraphEmbedding(dimension=None, n_jumps=None, max_per_epoch=None, learn_step=3, bidirectional=False, load_file=None)
Bases: BaseClass
Create an embedding mapping the nodes of a graph.
Includes contributions by David Muelas Recuenco.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dimension` | int | The number of columns in the embedding. See the notes in Embeddings. | None |
`n_jumps` | int | Number of random jumps from node to node. | None |
`max_per_epoch` | int | Maximum number of consecutive random jumps without randomly jumping outside the edges. Note that normal random jumps are not going to explore outside a connected component. | None |
`learn_step` | float | The size of the learning step by which elements are approached or moved away. Units are hexadecimal degrees along an ellipse. | 3 |
`bidirectional` | bool | Whether the changes apply only to the elements in the first column (False) or to both columns (True). | False |
`load_file` | str | (optional) The full path to a binary file containing a serialized GraphEmbedding object. This file must be created using GraphEmbedding.save(). | None |
GraphEmbedding class constructor
Source code in mercury/graph/embeddings/graphembeddings.py
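A construction sketch using only the parameters documented above; the numeric values and the file name are illustrative:

```python
from mercury.graph.embeddings import GraphEmbedding

# New, untrained embedding: 64 columns, 10,000 random jumps in total,
# restarting from a random node after at most 100 consecutive jumps.
ge = GraphEmbedding(dimension=64, n_jumps=10_000, max_per_epoch=100)

# Alternatively, reload a previously saved object (hypothetical path):
# ge = GraphEmbedding(load_file="graph_embedding.bin")
```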
__getitem__(arg)
Method to access rows in the embedding by ID.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`arg` | same as node ids in the graph | A node ID in the graph. | required |

Returns:

Type | Description |
---|---|
matrix | A numpy matrix of one row. |
Source code in mercury/graph/embeddings/graphembeddings.py
embedding()
Return the internal Embeddings object.
Returns:

Type | Description |
---|---|
Embeddings | The embedding, which is a dense matrix held in an Embeddings object. |
Source code in mercury/graph/embeddings/graphembeddings.py
fit(g)
Train the embedding by doing random walks.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`g` | mercury.graph Graph asset | A mercury.graph Graph object. | required |

Returns:

Type | Description |
---|---|
self | Fitted self (or raises an error) |
This does a number of random walks starting from a random node and selecting the edges with a probability that is proportional to the weight of the edge. If the destination node also has outgoing edges, the next step will start from it; otherwise, a new random node will be selected. The edges visited (concordant pairs) get some reinforcement in the embedding, while randomly selected non-existent edges get divergence instead (discordant pairs).
Internally, this stores the node IDs of the nodes visited and calls Embeddings.fit() to transfer the structure to the embedding. Of course, it can be called many times on the same GraphEmbedding.
Source code in mercury/graph/embeddings/graphembeddings.py
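A fitting sketch; g is assumed to be a mercury.graph Graph built elsewhere (its construction is not covered on this page) and "some_node" is a placeholder for a node ID present in g:

```python
from mercury.graph.embeddings import GraphEmbedding

# g: a mercury.graph Graph object built elsewhere.
ge = GraphEmbedding(dimension=64, n_jumps=10_000, max_per_epoch=100)
ge.fit(g)  # returns self (or raises an error)

row = ge["some_node"]    # __getitem__: one-row matrix for that node ID
emb = ge.embedding()     # the internal Embeddings object
matrix = emb.as_numpy()  # the full embedding as a numpy matrix
```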
get_most_similar_nodes(node_id, k=5, metric='cosine', return_as_indices=False)
Returns the k most similar nodes and their similarities.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`node_id` | object | Id of the node whose most similar nodes we want to find. | required |
`k` | int | Number of most similar nodes to return. | 5 |
`metric` | str | Metric to use as a similarity measure. | 'cosine' |
`return_as_indices` | bool | Whether to return the nodes as matrix indices (True) or as node ids (False). | False |

Returns:

Type | Description |
---|---|
list | A list of the k most similar nodes and a list of the similarities of those nodes. |
DataFrame | A list of the k most similar nodes as a DataFrame. |
Source code in mercury/graph/embeddings/graphembeddings.py
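A query sketch on the fitted object above; the node ID is a placeholder, and the exact return structure (lists vs. DataFrame) should be checked against the Returns table:

```python
# The 5 nodes most similar to "some_node", identified by node ID (default).
similar = ge.get_most_similar_nodes("some_node", k=5, metric="cosine")

# The same query, but returning matrix row indices instead of node IDs.
similar_idx = ge.get_most_similar_nodes("some_node", k=5, return_as_indices=True)
```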
save(file_name, save_embedding=False)
Saves a GraphEmbedding to a compressed binary file with or without the embedding itself. It saves the graph's node names and the adjacency matrix as a sparse matrix.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`file_name` | str | The name of the file to which the GraphEmbedding will be saved. | required |
`save_embedding` | bool | Since the embedding can be big and, if not trained, it is just a matrix of uniform random numbers, it is possible to skip saving it. If it is not saved, loading the file will create a new random embedding. This parameter controls whether the embedding is saved (it is not saved by default). | False |
Source code in mercury/graph/embeddings/graphembeddings.py
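A save/reload sketch, continuing the example above; the file name is illustrative:

```python
# Persist the trained object, including the embedding matrix itself.
ge.save("graph_embedding.bin", save_embedding=True)

# Reload later through the constructor's load_file argument.
ge2 = GraphEmbedding(load_file="graph_embedding.bin")
```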
mercury.graph.embeddings.SparkNode2Vec(dimension=None, sampling_ratio=1.0, num_epochs=10, num_paths_per_node=1, batch_size=1000000, w2v_max_iter=1, w2v_num_partitions=1, w2v_step_size=0.025, w2v_min_count=5, path_cache=None, use_cached_rw=False, n_partitions_cache=10, load_file=None)
Bases: BaseClass
Create or reload a SparkNode2Vec embedding mapping the nodes of a graph.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dimension` | int | The number of columns in the embedding. See the notes in Embeddings. | None |
`sampling_ratio` | float | The proportion of the total number of nodes used in parallel at each step (whenever possible). | 1.0 |
`num_epochs` | int | Number of epochs. This is the total number of steps the iteration goes through. At each step, sampling_ratio times the total number of nodes paths will be computed in parallel. | 10 |
`num_paths_per_node` | int | The number of random walks to start from each node. | 1 |
`batch_size` | int | This forces caching the random walks computed so far and breaks planning each time this number of epochs is reached. The default value is a high number to avoid this happening at all. In really large jobs, you may want to set this parameter to avoid possible overflows even if it can add some extra time to the process. Note that with a high number of epochs and nodes, resource requirements for the active part of your random walks can be high. This allows a "cache and continue" strategy, so to say. | 1000000 |
`w2v_max_iter` | int | This is the Spark Word2Vec parameter maxIter; the default value is the original default value. | 1 |
`w2v_num_partitions` | int | This is the Spark Word2Vec parameter numPartitions; the default value is the original default value. | 1 |
`w2v_step_size` | float | This is the Spark Word2Vec parameter stepSize; the default value is the original default value. | 0.025 |
`w2v_min_count` | int | This is the Spark Word2Vec parameter minCount; the default value is the original default value (5). It is the minimum number of times a node has to appear to generate an embedding. | 5 |
`path_cache` | str | Folder where random walks will be stored. The default value is None, which means that random walks will not be stored. | None |
`use_cached_rw` | bool | Flag indicating whether random walks should be read from disk (hence, they will not be computed again). Setting this parameter to True requires a valid path_cache. | False |
`n_partitions_cache` | int | Number of partitions used when storing the random walks, to optimize read access. The default value is 10. | 10 |
`load_file` | str | (optional) The full path to a parquet file containing a serialized SparkNode2Vec object. This file must be created using SparkNode2Vec.save(). | None |
Source code in mercury/graph/embeddings/spark_node2vec.py
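A construction sketch with illustrative values (they are not recommendations); the cache folder is a hypothetical path:

```python
from mercury.graph.embeddings import SparkNode2Vec

n2v = SparkNode2Vec(
    dimension=64,
    sampling_ratio=0.5,   # half of the nodes start a path at each step
    num_epochs=20,
    num_paths_per_node=2,
)

# Optionally store the random walks so a later run can reuse them:
# n2v = SparkNode2Vec(dimension=64, path_cache="/tmp/random_walks",
#                     use_cached_rw=True)
```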
embedding()
Return all embeddings.
Returns:

Type | Description |
---|---|
DataFrame | All embeddings as a DataFrame. |
Source code in mercury/graph/embeddings/spark_node2vec.py
fit(G)
Train the embedding by doing random walks.
Random walk paths are available in the attribute paths_.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`G` | Graph | A mercury.graph Graph object. | required |

Returns:

Type | Description |
---|---|
self | Fitted self (or raises an error) |
Source code in mercury/graph/embeddings/spark_node2vec.py
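A fitting sketch; G is assumed to be a mercury.graph Graph backed by Spark data and built elsewhere (its construction is not covered on this page):

```python
from mercury.graph.embeddings import SparkNode2Vec

# G: a mercury.graph Graph object built elsewhere.
n2v = SparkNode2Vec(dimension=64, num_epochs=20)
n2v.fit(G)  # returns self (or raises an error)

paths = n2v.paths_         # the random walk paths used for training
vectors = n2v.embedding()  # all embeddings as a DataFrame
```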
get_most_similar_nodes(node_id, k=5)
Returns the k most similar nodes and a similarity measure.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`node_id` | str | Id of the node whose most similar nodes we want to find. | required |
`k` | int | Number of most similar nodes to return. | 5 |

Returns:

Type | Description |
---|---|
DataFrame | A list of the k most similar nodes (using cosine similarity) as a DataFrame. |
Source code in mercury/graph/embeddings/spark_node2vec.py
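A query sketch on the fitted object above; the node ID is a placeholder, and the .show() call assumes the returned DataFrame is a Spark DataFrame:

```python
# The 10 nodes most similar to "some_node" by cosine similarity.
n2v.get_most_similar_nodes("some_node", k=10).show()
```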
model()
Returns the Spark Word2VecModel object.
Returns:

Type | Description |
---|---|
Word2VecModel | The Spark Word2VecModel of the embedding, to use its API directly. |
Source code in mercury/graph/embeddings/spark_node2vec.py
save(file_name)
Saves the internal Word2VecModel as human-readable (JSON) model metadata plus Parquet formatted data files.
The model may be loaded using SparkNode2Vec(load_file='path/file')
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`file_name` | str | The name of the file to which the Word2VecModel will be saved. | required |
Source code in mercury/graph/embeddings/spark_node2vec.py
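A save/reload sketch, continuing the example above; the path is illustrative:

```python
# Persist the internal Word2VecModel (JSON metadata plus Parquet data).
n2v.save("/tmp/n2v_embedding")

# Reload without retraining.
n2v_reloaded = SparkNode2Vec(load_file="/tmp/n2v_embedding")
```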