User Guide

Modules overview

Arboretum consists of multiple Python modules:

arboretum.algo

  • Intended for typical users.
  • Access point for launching GRNBoost2 or GENIE3 on local or distributed hardware.

arboretum.core

  • Intended for advanced users.
  • Contains the low-level building blocks of the Arboretum framework.

arboretum.utils

  • Contains small utility functions.

Dependencies overview

Arboretum uses well-established libraries from the Python ecosystem. It deliberately avoids being a “batteries-included” library, as such an approach often entails unnecessary complexity and maintenance burden. Arboretum aims to do one thing, and to do it well.

Concretely, the user will be exposed to one or more of the following dependencies:

  • Pandas or NumPy: the user is expected to provide the input data in an expected format. Pandas and NumPy are well equipped with functions for data preprocessing.
  • Dask.distributed: to run Arboretum on a cluster, the user is responsible for setting up a network of a scheduler and workers.
  • scikit-learn: relevant for advanced users only. Arboretum can run “DIY” inference where the user provides their own parameters for the Random Forest or Gradient Boosting regressors (see the sketch after this list).
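
For illustration, a minimal sketch of such “DIY” inference is shown below. It assumes a diy entry point in arboretum.algo and hypothetical keyword arguments (regressor_type, regressor_kwargs); consult the API reference for the exact names.

“DIY” inference with custom regressor parameters (sketch)
import pandas as pd

from arboretum.utils import load_tf_names
from arboretum.algo import diy  # assumption: the advanced "DIY" entry point

ex_matrix = pd.read_csv(<ex_path>, sep='\t')
tf_names = load_tf_names(<tf_path>)

# hypothetical scikit-learn parameters for a Random Forest regressor;
# check the API reference for the exact keyword argument names
network = diy(expression_data=ex_matrix,
              tf_names=tf_names,
              regressor_type='RF',
              regressor_kwargs={'n_estimators': 1000, 'max_features': 'sqrt'})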

Input / Output

Arboretum accepts as input:

  • an expression matrix (rows = observations, columns = genes)
  • (optionally) a list of gene names in the expression matrix
  • (optionally) a list of transcription factors (a.k.a. TFs)

Arboretum returns as output:

  • a Pandas DataFrame containing the regulatory links

Tip

As data for the following code snippets, you can use the data for network 1 from the DREAM5 challenge (included in the resources folder of the GitHub repository).

Expression matrix as a Pandas DataFrame

The input can be specified in a number of ways. Arguably the most straightforward way is to specify the expression matrix as a Pandas DataFrame, which also contains the gene names as the column header.

.-------------.
| expression  |
| matrix (DF) | ---.      .-----------.
'-------------'     \     | GRNBoost2 |     .------------.
                     :--> |    or     | --> | regulatory |
.---------------.   /     | GENIE3    |     | links (DF) |
| transcription | -'      '-----------'     '------------'
| factors       |
'---------------'

In the following code snippet, we launch network inference with grnboost2 by specifying the expression_data as a DataFrame.

Expression matrix as a Pandas DataFrame
import pandas as pd

from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2

# ex_matrix is a DataFrame with gene names as column names
ex_matrix = pd.read_csv(<ex_path>, sep='\t')

# tf_names is read using a utility function included in Arboretum
tf_names = load_tf_names(<tf_path>)

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names)
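
The network returned by grnboost2 is a regular Pandas DataFrame of regulatory links, so it can be inspected and saved with the usual Pandas tools. A minimal sketch (the exact column layout of the links is an assumption to verify against your own output):

Inspecting and saving the inferred network (sketch)
# peek at the top regulatory links
print(network.head())

# write the network to a tab-separated file
network.to_csv('network.tsv', sep='\t', index=False)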

Expression matrix as a NumPy ndarray

Arboretum also supports specifying the expression matrix as a NumPy ndarray (in our case, a 2-dimensional matrix). In this case, the gene names must be specified explicitly.

.-------------.
| expression  |
| matrix (np) | -----.
'-------------'      |    .-----------.
.-------------.      |    | GRNBoost2 |     .------------.
| gene names  | -----+--> |    or     | --> | regulatory |
'-------------'      |    | GENIE3    |     | links (DF) |
.---------------.    |    '-----------'     '------------'
| transcription | ---'
| factors       |
'---------------'

Caution

You must specify the gene names in the same order as their corresponding columns of the NumPy matrix. Getting this right is the user’s responsibility.

Expression matrix as a NumPy ndarray
import numpy as np

from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2

# ex_matrix is a numpy ndarray, which has no notion of column names
ex_matrix = np.genfromtxt(<ex_path>, delimiter='\t', skip_header=1)

# we read the gene names from the first line of the file
with open(<ex_path>) as file:
    gene_names = [gene.strip() for gene in file.readline().split('\t')]

# sanity check: the ndarray's number of columns must equal the length of the gene_names list
assert ex_matrix.shape[1] == len(gene_names)

# tf_names is read using a utility function included in Arboretum
tf_names = load_tf_names(<tf_path>)

network = grnboost2(expression_data=ex_matrix,
                    gene_names=gene_names,  # specify the gene_names
                    tf_names=tf_names)
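
Alternatively, you can wrap the ndarray and the gene names in a DataFrame yourself, reducing the NumPy case to the DataFrame case of the previous section. A minimal sketch using standard Pandas:

Wrapping the ndarray in a DataFrame (sketch)
import pandas as pd

# attach the gene names as column names so they travel with the matrix
ex_df = pd.DataFrame(ex_matrix, columns=gene_names)

network = grnboost2(expression_data=ex_df,
                    tf_names=tf_names)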

Running with a custom Dask Client

Arboretum uses Dask.distributed to parallelize its workloads. When the user doesn’t specify a Dask.distributed Client explicitly, Arboretum creates a LocalCluster and a Client pointing to it.

Alternatively, you can create and configure your own Client instance and pass it to Arboretum. Situations where this is useful include:

  • inferring multiple networks from different datasets
  • inferring multiple networks using different parameters from the same dataset
  • the user requires custom configuration for the LocalCluster (memory limit, number of processes, etc.)

The following snippet illustrates running gene regulatory network inference multiple times, with different initialization seed values. We create one Client and pass it to the different inference steps.

Running with a custom Dask Client
import pandas as pd

from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2
from distributed import LocalCluster, Client

# create custom LocalCluster and Client instances
local_cluster = LocalCluster(n_workers=10,
                             threads_per_worker=1,
                             memory_limit=8e9)
custom_client = Client(local_cluster)

# load the data
ex_matrix = pd.read_csv(<ex_path>, sep='\t')
tf_names = load_tf_names(<tf_path>)

# run GRN inference multiple times
network_666 = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names,
                        client=custom_client,  # specify the custom client
                        seed=666)

network_777 = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names,
                        client=custom_client,  # specify the custom client
                        seed=777)

# close the Client and LocalCluster after use
custom_client.close()
local_cluster.close()
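
Client and LocalCluster also support Python’s context-manager protocol, so the teardown can be handled automatically with a with statement; a minimal sketch:

Custom Client as a context manager (sketch)
from distributed import LocalCluster, Client

with LocalCluster(n_workers=10, threads_per_worker=1, memory_limit=8e9) as local_cluster:
    with Client(local_cluster) as custom_client:
        network = grnboost2(expression_data=ex_matrix,
                            tf_names=tf_names,
                            client=custom_client,
                            seed=666)
# both the Client and the LocalCluster are closed automatically on exit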

Running with a Dask distributed scheduler

Arboretum was designed to run gene regulatory network inference in a distributed setting. In distributed mode, some effort by the user or a systems administrator is required to set up a Dask.distributed scheduler and a number of workers.

Tip

Please refer to the Dask distributed network setup documentation for instructions on how to set up a Dask distributed cluster.

The following diagram illustrates a possible topology of a Dask distributed cluster.

                     .=[node_2]==============.          .=[node_3]=========.
.=[node_1]======.    |  .--------------.     |          |  .------------.  |
|  .--------.   |    |  | Dask         |<----+----------+--| 10 workers |  |
|  | Client |---+----+->| distributed  |<----+--.       |  '------------'  |
|  '--------'   |    |  | scheduler    |<-.  |   \      '=================='
'==============='    |  '--------------'  |  |    \
                     |                    |  |     \    .=[node_4]=========.
                     |  .------------.    |  |      \   |  .------------.  |
                     |  | 10 workers |----'  |       '--+--| 10 workers |  |
                     |  '------------'       |          |  '------------'  |
                     '======================='          '=================='
  • node_1 runs a Python script, console, or Jupyter notebook server; a Client instance is configured with the TCP address of the distributed scheduler running on node_2
  • node_2 runs a distributed scheduler and 10 workers pointing to the scheduler
  • node_3 runs 10 distributed workers pointing to the scheduler
  • node_4 runs 10 distributed workers pointing to the scheduler

With a small modification to the code, we can infer a regulatory network using all workers connected to the distributed scheduler. We specify a Client that is connected to the Dask distributed scheduler and pass it as an argument to the inference function.

Running with a Dask distributed scheduler
import pandas as pd

from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2
from distributed import Client

ex_matrix = pd.read_csv(<ex_path>, sep='\t')
tf_names = load_tf_names(<tf_path>)

scheduler_address = 'tcp://10.118.224.134:8786'  # example address of the remote scheduler
cluster_client = Client(scheduler_address)       # create a custom Client

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    client=cluster_client)  # specify Client connected to the remote scheduler
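
As with the local setup, close the Client once inference is done:

cluster_client.close()  # release the connection to the remote scheduler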