Data Preparation Guide

The package was developed to support the most common data shapes and file formats used in bioinformatics. This guide describes how to structure each file used by the portal.

The expected format for an expression matrix is sample identifiers in columns and gene symbols (or other transcript identifiers) in rows. The optimal file format for the portal is an R object of matrix type saved as an .rds file, because it loads faster, although the matrix can also be saved as a CSV or TSV file. If using a CSV or TSV file, ensure that the first row (with the exception of the first column) contains the sample identifiers and that the first column contains the transcript identifiers.

	S1_01	S1_02	S2_01	S2_02	S3_01	S3_02
ABC	2.0376590	-0.1768367	0.8253947	-0.0705087	0.6926885	-1.2121502
BCD	0.3239560	-1.5068365	-1.2093268	0.8129618	-1.4672367	0.9092987
CDE	1.2046363	-0.3626300	-0.1275494	-2.3511435	-0.9656587	0.9908292
DEF	2.4009177	-1.0866808	0.6950054	0.7699526	0.1906538	1.7070437
EFG	-2.1464461	1.0376751	-1.2693569	-0.7006440	1.0708847	2.0515956
FGH	-1.0192418	0.4986830	1.1139692	-1.2737894	0.2450935	-0.3905544
GHI	0.3026560	0.1924001	-0.6198241	-0.5189834	0.8785241	0.4883541
HIJ	-1.7404140	0.3375572	1.0659999	-0.3770236	-0.8395515	-0.4042909
IJK	0.3681634	-1.1588354	0.2511265	0.1662243	1.1440126	0.7103129
JKL	0.5101581	-1.2043604	-1.8345244	-0.0596510	-0.0566549	0.8097529

The measures table follows the format of one row per subject (even if they have more than one sample collected) with the measures across columns. A data.frame can be saved in an .rds file, but CSV or TSV files are also supported.
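
As a minimal sketch of the matrix layout described in this section, the following R code builds a small matrix with symbols as row names and sample identifiers as column names, then saves it both as an .rds file and as a TSV file. The object names, the file names (expression_matrix.rds, expression_matrix.tsv) and the simulated values are placeholders for illustration, not names required by the package.

```r
# Simulated values standing in for real expression data (placeholder only)
set.seed(1)
samples <- c("S1_01", "S1_02", "S2_01", "S2_02", "S3_01", "S3_02")
symbols <- c("ABC", "BCD", "CDE")
expr <- matrix(
  rnorm(length(symbols) * length(samples)),
  nrow = length(symbols),
  dimnames = list(symbols, samples)
)

# Preferred: save the matrix object itself, which keeps the row names
rds_file <- file.path(tempdir(), "expression_matrix.rds")  # hypothetical file name
saveRDS(expr, rds_file)

# Alternative: a TSV whose first column holds the symbols; col.names = NA
# writes an empty header cell above that column
tsv_file <- file.path(tempdir(), "expression_matrix.tsv")  # hypothetical file name
write.table(expr, tsv_file, sep = "\t", quote = FALSE, col.names = NA)

# Reading the .rds file back returns a matrix with the same dimnames
reloaded <- readRDS(rds_file)
```

Saving the object with saveRDS avoids re-parsing text on load, which is the reason the .rds format is suggested as the faster option.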

Measures collected over time should be represented in separate columns. By convention (enforced by default), a time code is appended to the measure name as a suffix, separated by an underscore (_). This means that an underscore cannot be used anywhere else in a measure name. For example, for disease activity collected over four time points, the expected names are: diseaseActivity_Baseline, diseaseActivity_Week1, diseaseActivity_Week2 and diseaseActivity_Week3. Using the default settings, a name such as Disease_Activity_Baseline is invalid. The time separator can be changed in the configuration file by setting timesep to the desired separator.
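
To make the naming convention concrete, the sketch below splits measure names on the separator and flags names that contain it more than once. The helper split_measure_names is hypothetical and written only for illustration; it is not the package's own parsing code, and it simply assumes the default separator "_".

```r
# Hypothetical helper: split measure names into a measure and a time code
# (an illustration of the convention, not the package's implementation)
split_measure_names <- function(measure_names, timesep = "_") {
  parts <- strsplit(measure_names, timesep, fixed = TRUE)
  data.frame(
    name    = measure_names,
    measure = vapply(parts, `[`, character(1), 1),
    time    = vapply(
      parts,
      function(p) if (length(p) == 2) p[2] else NA_character_,
      character(1)
    ),
    # More than two parts means the separator also appears inside the name
    valid   = lengths(parts) <= 2,
    stringsAsFactors = FALSE
  )
}

checked <- split_measure_names(
  c("diseaseActivity_Baseline", "diseaseActivity_Week1", "Age",
    "Disease_Activity_Baseline")
)

# Names that would clash with the separator
invalid_names <- checked$name[!checked$valid]
```

Running a check like this on the column names of the measures table before loading it into the portal can catch names that need to be renamed, or indicate that a different timesep is a better fit.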

Patient_ID	Platelets_m01	Platelets_m02	Age	drugNaive
p01	182.6504	160.8753	44	Yes
p02	178.4293	231.0919	70	Yes
p03	217.7025	191.2578	75	No

For datasets where a subject has more than one sample (e.g. samples over time, from different tissues or combinations thereof), a lookup table should be constructed and saved as a data.frame in an .rds file, or as a CSV or TSV file.

This table maps subject identifiers to sample identifiers, with the expression matrix containing data for all samples in the dataset. The table should also contain metadata that allows subsetting samples. For example, if subjects and samples vary over time, drug groups and tissues, the lookup table should have one column for each category. The following is an example of such a table:

Sample_ID	Time	Tissue	Drug	Patient_ID
S1_01	m01	A	d1	p01
S1_02	m02	A	d1	p01
S2_01	m01	A	d1	p02
S2_02	m02	A	d1	p02
S3_01	m01	A	d2	p03
S3_02	m02	A	d2	p03

In the case above, patients p01 and p02 belong to drug group d1, while patient p03 belongs to group d2. All patients have samples collected at months 1 and 2 (encoded as m01 and m02), and all samples are from the same tissue (A). In this example, the drug groups contain samples from different subjects (i.e. there is no overlap of subjects between the two drug groups).

The lookup table can also be enriched with other characteristics of subjects that can be used to partition samples, such as age or sex. Outputs of methods such as clustering can also be added to the table: this enables, for example, exploring correlations within different clusters or comparing trajectories across clusters.
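
The example lookup table in this section could be built in R along the following lines. This is a sketch only: the file name lookup_table.rds is a placeholder, and the final lines simply illustrate how the metadata columns can be used to pick out a subset of samples.

```r
# Rebuild the example lookup table: one row per sample
lookup <- data.frame(
  Sample_ID  = c("S1_01", "S1_02", "S2_01", "S2_02", "S3_01", "S3_02"),
  Time       = rep(c("m01", "m02"), times = 3),
  Tissue     = "A",
  Drug       = rep(c("d1", "d1", "d2"), each = 2),
  Patient_ID = rep(c("p01", "p02", "p03"), each = 2),
  stringsAsFactors = FALSE
)

# Save as a data.frame in an .rds file (hypothetical file name)
lookup_file <- file.path(tempdir(), "lookup_table.rds")
saveRDS(lookup, lookup_file)

# The metadata columns allow subsetting samples,
# e.g. samples from drug group d1 collected at month 1
d1_m01 <- lookup$Sample_ID[lookup$Drug == "d1" & lookup$Time == "m01"]
```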

Validation checks

The package performs only a very lightweight validation of the loaded files, checking whether subjects and samples match. It does not ensure that the correct transformations have been applied to the expression data, nor does it warn about or modify missing data: a subject is simply excluded from a calculation if they have a missing value for the measure involved.

The following checks ARE made:

Matching samples and subjects: the package will confirm that every sample in the expression matrix is matched to at least one subject in the lookup table. It also checks that all subjects in the measures table match to at least one sample in the lookup table. That is, there can be no excess of samples or subjects in each table.

Matrix format: if using an .rds file, the package will check that the expression matrix was indeed saved as a matrix object in the .rds file. This is to ensure that the row names are read properly.
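
The checks described above can be approximated before loading data into the portal. The function below is a hypothetical re-implementation for illustration; the package's own code may differ, and the column names Sample_ID and Patient_ID are assumptions taken from the example tables in this guide.

```r
# Hypothetical pre-flight check mirroring the validation described above
check_inputs <- function(expr, lookup, measures,
                         sample_col = "Sample_ID",
                         subject_col = "Patient_ID") {
  problems <- character(0)
  if (!is.matrix(expr)) {
    problems <- c(problems, "expression data is not a matrix")
  }
  # Every sample in the expression matrix should appear in the lookup table
  unmatched_samples <- setdiff(colnames(expr), lookup[[sample_col]])
  if (length(unmatched_samples) > 0) {
    problems <- c(problems, paste("samples missing from lookup table:",
                                  paste(unmatched_samples, collapse = ", ")))
  }
  # Every subject in the measures table should appear in the lookup table
  unmatched_subjects <- setdiff(measures[[subject_col]], lookup[[subject_col]])
  if (length(unmatched_subjects) > 0) {
    problems <- c(problems, paste("subjects missing from lookup table:",
                                  paste(unmatched_subjects, collapse = ", ")))
  }
  problems
}

# Toy inputs with one unmatched sample (S9_01) and one unmatched subject (p02)
expr <- matrix(0, nrow = 2, ncol = 3,
               dimnames = list(c("ABC", "BCD"), c("S1_01", "S1_02", "S9_01")))
lookup <- data.frame(Sample_ID = c("S1_01", "S1_02"), Patient_ID = "p01",
                     stringsAsFactors = FALSE)
measures <- data.frame(Patient_ID = c("p01", "p02"), Age = c(44, 70),
                       stringsAsFactors = FALSE)

problems <- check_inputs(expr, lookup, measures)
```

An empty result means the toy check found no mismatches; it says nothing about transformations or missing values, which the package does not validate either.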

Additional files

Differential expression analysis results

The package includes two modules to showcase results of differential expression analysis (see config for more details). These modules read files created using limma, edgeR or DESeq2. All files should be saved with column names, and the column names must not be changed. The only exception is if you want to mix models from different packages, in which case you should rename the columns so that all results share the same column names (e.g. p-values are identified in the same way across all files).
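
When mixing results from different packages, the renaming step might look like the sketch below. It assumes the usual column names written by limma's topTable (logFC, P.Value, adj.P.Val) and by DESeq2's results (log2FoldChange, pvalue, padj); check the headers of your own files before relying on this mapping. The toy values and the choice of limma names as the common target are illustrative only.

```r
# Toy stand-in for a DESeq2 results table (values are made up)
deseq2_res <- data.frame(
  baseMean       = c(120.5, 33.1),
  log2FoldChange = c(1.2, -0.8),
  pvalue         = c(0.001, 0.2),
  padj           = c(0.01, 0.4),
  row.names      = c("ABC", "BCD")
)

# Map DESeq2 column names onto the limma names used by the other files
rename_map <- c(log2FoldChange = "logFC", pvalue = "P.Value", padj = "adj.P.Val")
matched <- names(deseq2_res) %in% names(rename_map)
names(deseq2_res)[matched] <- unname(rename_map[names(deseq2_res)[matched]])

# Save with column names so the header matches the limma-derived files
out_file <- file.path(tempdir(), "Model_deseq2_renamed.txt")  # hypothetical name
write.table(deseq2_res, out_file, sep = "\t", quote = FALSE, col.names = NA)
```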

These modules require a table that lists all model results. Additional columns can be included in the table to organize results from different types of models or subsets of samples. All model result files should be placed in a models folder within the project folder.

The table should look like the following and be saved in a CSV or TSV file:

Model	Time	Drug	File
Linear	m01	d1	Model_1.txt
Linear	m02	d2	Model_2.txt
Nonlinear	m01	d1	Model_3.txt
Nonlinear	m02	d2	Model_4.txt
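
A table like the one above could be written from R as sketched below, together with a quick check that every listed file is present in the models folder. The project folder path and the table file name (models_table.tsv) are placeholders, not names required by the package.

```r
# Hypothetical project layout: <project>/models holds the result files
project_dir <- file.path(tempdir(), "my_project")
models_dir <- file.path(project_dir, "models")
dir.create(models_dir, recursive = TRUE, showWarnings = FALSE)

# One row per model result, with extra columns used to organize the results
models <- data.frame(
  Model = c("Linear", "Linear", "Nonlinear", "Nonlinear"),
  Time  = c("m01", "m02", "m01", "m02"),
  Drug  = c("d1", "d2", "d1", "d2"),
  File  = paste0("Model_", 1:4, ".txt"),
  stringsAsFactors = FALSE
)
models_table_file <- file.path(project_dir, "models_table.tsv")
write.table(models, models_table_file, sep = "\t", quote = FALSE, row.names = FALSE)

# List any result files named in the table that are not in the models folder
missing_files <- models$File[!file.exists(file.path(models_dir, models$File))]
```

In this sketch no result files have been copied into the models folder yet, so missing_files lists all four; in a real project it should be empty before the portal is launched.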

Gene modules/lists

The heatmap module requires a table containing lists of names such as gene symbols (see config for more details). In this table, each row has a column that contains a gene list, with symbols separated by commas. If you have a table in long format, with one column for the list identifier and one column for the symbol (one row per gene), you can use a group-by operation with paste-collapse to create the required lists, as follows:

Original file:

module	gene
A	ABC
A	DEF
A	GHI
B	JKL
B	MNO
B	PQR

Code to transform:

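library(dplyr)  # provides the %>% pipe

# summarise() returns one row per module; transmute() would keep one row per gene
gene_lists <- data %>%
  dplyr::group_by(module) %>%
  dplyr::summarise(list = paste(gene, collapse = ","))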
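
For cases where dplyr is not available, the same collapse can be sketched in base R with aggregate. The block below rebuilds the example table shown under "Original file" so that it runs on its own; the object name gene_lists is a placeholder, and the output column is named list only to match the example code in this section.

```r
# Example long-format table: one row per module/gene pair
data <- data.frame(
  module = c("A", "A", "A", "B", "B", "B"),
  gene   = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR"),
  stringsAsFactors = FALSE
)

# Collapse to one row per module, joining the gene symbols with commas
gene_lists <- aggregate(gene ~ module, data = data, FUN = paste, collapse = ",")

# Rename the collapsed column to match the name used elsewhere in this section
names(gene_lists)[names(gene_lists) == "gene"] <- "list"
```

The result has two rows, one per module, each holding a single comma-separated string of symbols.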

Asking for help

If you run into any problems with data preparation, please open an issue on the package's GitHub repository.