Data Preparation Guide

The package was developed to support the most common data shapes and file formats used in bioinformatics. This guide describes how to structure each file used by the portal.

Expression matrix

The expected format for the expression matrix is sample identifiers in columns and gene symbols (or other transcript identifiers) in rows. The optimal file format for the portal is an R object of matrix type saved as an .rds file, because it loads faster, although the matrix can also be saved as a CSV or TSV file. In that case, ensure that the first row (with the exception of the first column) contains the sample identifiers and that the first column contains the transcript identifiers.

Saving an RDS file: to export a matrix as an .rds file, run:
saveRDS(matrix_object, "matrix_object.rds")

The following is a valid example of an expression matrix:

S1_01 S1_02 S2_01 S2_02 S3_01 S3_02
ABC 1.2808789 -0.3403144 0.7405878 -1.2682231 1.7612731 0.1989516
BCD 1.5544347 1.1616882 0.1789133 0.2862653 -0.7242087 -0.4038258
CDE -2.0469364 0.8210368 -1.0798965 -1.0908267 1.5410485 -0.6829172
DEF -0.5139870 -0.9428223 -0.1451265 -0.5743365 0.8751267 0.0278288
EFG 0.6853862 1.6749440 -0.6704449 -0.0808082 -0.4222591 -0.9299531
FGH 1.1945784 1.2741069 -0.4209291 -0.6328691 0.5741515 1.7751814
GHI -0.4131409 0.8284786 -0.1080951 -1.3438086 -0.2667228 1.1090296
HIJ 1.4878176 0.4515905 -2.5116917 0.8068692 -1.1254763 0.2984453
IJK -0.9682975 -1.6839452 0.6029759 0.2985707 -0.6114281 1.0904955
JKL -0.8593043 -0.0634085 -1.9162324 -1.2169159 0.3318130 1.4729358
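
As a rough sketch of the layout described above, the snippet below builds a small matrix with symbols as row names and sample identifiers as column names, then saves it in both supported forms and reads it back. The object name expr, the simulated values and the file names are assumptions made for illustration only.

```r
# Illustrative only: simulated values, hypothetical object and file names
set.seed(1)
symbols <- c("ABC", "BCD", "CDE")
samples <- c("S1_01", "S1_02", "S2_01", "S2_02", "S3_01", "S3_02")

expr <- matrix(
  rnorm(length(symbols) * length(samples)),
  nrow = length(symbols),
  dimnames = list(symbols, samples)
)

out_dir  <- tempdir()
rds_path <- file.path(out_dir, "expression_matrix.rds")
csv_path <- file.path(out_dir, "expression_matrix.csv")

# Preferred: save the matrix object itself
saveRDS(expr, rds_path)

# Alternative: CSV with identifiers in the first column and samples in the header
write.csv(expr, csv_path, quote = FALSE)

# The .rds round trip preserves the matrix type and its row and column names
expr_rds <- readRDS(rds_path)

# For CSV, the first column has to be turned back into row names
expr_csv <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
```

Reading the CSV back requires the extra conversion step, which is one practical reason to prefer the .rds form when the data is already in R.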

Measures table

The measures table follows the format of one row per subject (even if they have more than one sample collected) with the measures across columns. A data.frame can be saved in an .rds file, but CSV or TSV files are also supported.

Measures collected over time should be represented in separate columns, with the convention (enforced by default) of a time code as a suffix for measure names, separated by an underscore (_). This means that underscores cannot be used elsewhere in measure names. For example, for disease activity collected over four time points, the expected names are: diseaseActivity_Baseline, diseaseActivity_Week1, diseaseActivity_Week2 and diseaseActivity_Week3. Using the default settings, it is invalid to use a name such as Disease_Activity_Baseline. The time separator can be modified in the configuration file by setting timesep to the desired separator.

The following is a valid example of a measures table:

Patient_ID Platelets_m01 Platelets_m02 Age drugNaive
p01 201.0261 221.2232 32 Yes
p02 230.6424 164.8883 88 Yes
p03 213.7581 209.0771 72 No
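
The naming convention can be checked before loading the file with a few lines of base R. This is a minimal sketch, not the package's own parsing code: it assumes the default separator, and it assumes that the subject identifier column (Patient_ID in the example) is exempt from the rule, since the example table itself uses an underscore there.

```r
# Illustrative only: not the package's own parsing code
measures <- data.frame(
  Patient_ID    = c("p01", "p02", "p03"),
  Platelets_m01 = c(201.0261, 230.6424, 213.7581),
  Platelets_m02 = c(221.2232, 164.8883, 209.0771),
  Age           = c(32, 88, 72),
  drugNaive     = c("Yes", "Yes", "No")
)

timesep <- "_"           # default separator described above
id_col  <- "Patient_ID"  # assumed to be exempt from the naming rule
measure_cols <- setdiff(names(measures), id_col)

# Split each column name on the separator and count the pieces
parts   <- strsplit(measure_cols, timesep, fixed = TRUE)
n_parts <- lengths(parts)

# One piece: no time code; two pieces: measure plus time code; more: ambiguous
ambiguous  <- measure_cols[n_parts > 2]
timed      <- measure_cols[n_parts == 2]
time_codes <- unique(vapply(parts[n_parts == 2], `[`, character(1), 2))

# A name such as Disease_Activity_Baseline splits into three pieces
bad_name_pieces <- lengths(strsplit("Disease_Activity_Baseline", timesep, fixed = TRUE))
```

Any name listed in ambiguous would need to be renamed, or a different timesep chosen in the configuration file.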

Lookup table

For datasets where a subject has more than one sample (e.g. samples over time, from different tissues or combinations thereof), a lookup table should be constructed and saved as a data.frame in an .rds file, or as a CSV or TSV file.

This table maps subject identifiers to sample identifiers, with the expression matrix containing data for all samples in the dataset. The table should also contain metadata that allows subsetting samples. For example, if subjects and samples vary over time, drug groups and tissues, the lookup table should have one column for each category. The following is an example of such a table:

Sample_ID Time Tissue Drug Patient_ID
S1_01 m01 A d1 p01
S1_02 m02 A d1 p01
S2_01 m01 A d1 p02
S2_02 m02 A d1 p02
S3_01 m01 A d2 p03
S3_02 m02 A d2 p03

In the case above, patients p01 and p02 belong to drug group d1, while patient p03 belongs to group d2. All patients have samples collected at months 1 and 2 (encoded as m01 and m02), and all samples are from the same tissue (A). In this example, the drug groups contain samples from different subjects (i.e. there is no overlap of subjects between the two drug groups).
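
To illustrate how such a table allows subsetting, the sketch below rebuilds the example lookup table and uses it to select the matching columns of an expression matrix. The expression values are placeholders and the object names are assumptions; the package may select samples differently internally.

```r
# Lookup table from the example above
lookup <- data.frame(
  Sample_ID  = c("S1_01", "S1_02", "S2_01", "S2_02", "S3_01", "S3_02"),
  Time       = c("m01", "m02", "m01", "m02", "m01", "m02"),
  Tissue     = "A",
  Drug       = c("d1", "d1", "d1", "d1", "d2", "d2"),
  Patient_ID = c("p01", "p01", "p02", "p02", "p03", "p03")
)

# Placeholder expression values, one column per sample
expr <- matrix(
  seq_len(2 * nrow(lookup)),
  nrow = 2,
  dimnames = list(c("ABC", "BCD"), lookup$Sample_ID)
)

# Select the samples from drug group d1 collected at month 1
sel      <- lookup[lookup$Drug == "d1" & lookup$Time == "m01", ]
expr_sel <- expr[, sel$Sample_ID, drop = FALSE]

# With one sample per subject left, columns can be relabelled by subject
colnames(expr_sel) <- sel$Patient_ID
```

After the subset, each subject contributes exactly one sample, so the columns can be aligned with the rows of the measures table.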

The lookup table can also be enriched with other characteristics of subjects that can be used to partition samples, such as age, sex, or others. Outputs of methods such as clustering can also be added to the table: this enables the exploration of correlations in different clusters, or comparing trajectories across different clusters, for example.

Validation checks

The package performs only lightweight validation of the loaded files, checking mainly that subjects and samples match. It does not ensure that the correct transformations have been applied to the expression data, nor does it warn about or modify missing data: a subject is left out of a calculation if they have a missing value for the measure involved.

The following checks ARE made:

Matching samples and subjects: the package will confirm that every sample in the expression matrix is matched to at least one subject in the lookup table. It also checks that every subject in the measures table matches at least one sample in the lookup table. That is, there can be no excess of samples or subjects in each table.

Matrix format: if using an .rds file, the package will check that the expression matrix was indeed saved as a matrix object in the .rds file. This is to ensure that the row names are read properly.
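
The same checks can be run by hand before starting the portal. The helper below is hypothetical and not part of the package; it only mirrors the checks described above, and it assumes the identifier columns are named Sample_ID and Patient_ID as in the examples in this guide.

```r
# Hypothetical helper, not part of the package
check_inputs <- function(expr, lookup, measures,
                         sample_col = "Sample_ID",
                         subject_col = "Patient_ID") {
  problems <- character(0)

  # Matrix format
  if (!is.matrix(expr)) {
    problems <- c(problems, "expression data is not a matrix object")
  }

  # Every sample in the expression matrix should appear in the lookup table
  extra_samples <- setdiff(colnames(expr), lookup[[sample_col]])
  if (length(extra_samples) > 0) {
    problems <- c(problems, paste0("samples not in lookup table: ",
                                   paste(extra_samples, collapse = ", ")))
  }

  # Every subject in the measures table should appear in the lookup table
  extra_subjects <- setdiff(measures[[subject_col]], lookup[[subject_col]])
  if (length(extra_subjects) > 0) {
    problems <- c(problems, paste0("subjects not in lookup table: ",
                                   paste(extra_subjects, collapse = ", ")))
  }

  problems
}

lookup   <- data.frame(Sample_ID = c("S1_01", "S1_02"), Patient_ID = c("p01", "p01"))
measures <- data.frame(Patient_ID = "p01", Age = 32)
expr     <- matrix(0, nrow = 1, ncol = 2, dimnames = list("ABC", c("S1_01", "S1_02")))

ok <- check_inputs(expr, lookup, measures)       # no problems expected

expr_bad <- cbind(expr, S9_01 = 0)               # sample with no lookup entry
bad <- check_inputs(expr_bad, lookup, measures)
```

An empty result means the three objects are consistent with each other; it says nothing about transformations or missing values, which remain the user's responsibility.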

Additional files

Differential expression analysis results

The package includes two modules to showcase results of differential expression analysis (see config for more details). These modules read files created using limma, edgeR or DESeq2. All files should be saved with column names, and the column names must not be changed. The only exception is if you want to mix models from different packages: in that case, rename the columns so that all results share the same column names (e.g. p-values are identified in the same way across all files).
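
When mixing results from different packages, the renaming step might look like the sketch below, which maps the column names produced by DESeq2 onto those produced by limma's topTable. The values are mock data, and choosing the limma names as the common scheme is an assumption: any naming works as long as it is the same across all files.

```r
# Mock result using the column names of a DESeq2 results table (made-up values)
deseq_res <- data.frame(
  baseMean       = c(120.5, 80.2),
  log2FoldChange = c(1.3, -0.7),
  lfcSE          = c(0.2, 0.3),
  stat           = c(6.5, -2.3),
  pvalue         = c(1e-10, 0.02),
  padj           = c(2e-10, 0.02),
  row.names      = c("ABC", "BCD")
)

# DESeq2 name -> limma topTable name (assumed common scheme)
name_map <- c(log2FoldChange = "logFC", pvalue = "P.Value", padj = "adj.P.Val")

# Locate the columns to rename and fail early if any is absent
hits <- match(names(name_map), names(deseq_res))
stopifnot(!anyNA(hits))
names(deseq_res)[hits] <- unname(name_map)
```

Columns with no counterpart in the other package (such as baseMean or lfcSE here) are left untouched.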

These modules require a table that lists all model results, and they support additional columns in the table to organize results from different types of models or subsets of samples. All model results files should be placed in a models folder within the project folder.

The table should look like the following and be saved in a CSV or TSV file:

Model Time Drug File
Linear m01 d1 Model_1.txt
Linear m02 d2 Model_2.txt
Nonlinear m01 d1 Model_3.txt
Nonlinear m02 d2 Model_4.txt
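
A table like the one above can be written from R, together with a quick check that the listed files are present in the models folder. This is only a sketch: the project folder location and the table file name (models.tsv) are hypothetical, since the actual names are set in the configuration.

```r
# Model results table from the example above
models <- data.frame(
  Model = c("Linear", "Linear", "Nonlinear", "Nonlinear"),
  Time  = c("m01", "m02", "m01", "m02"),
  Drug  = c("d1", "d2", "d1", "d2"),
  File  = paste0("Model_", 1:4, ".txt")
)

# Hypothetical project layout: <project>/models holds the result files
project_dir <- file.path(tempdir(), "my_project")
models_dir  <- file.path(project_dir, "models")
dir.create(models_dir, recursive = TRUE, showWarnings = FALSE)

# Save the table as TSV (file name is an assumption)
table_path <- file.path(project_dir, "models.tsv")
write.table(models, table_path, sep = "\t", quote = FALSE, row.names = FALSE)

# List any result files named in the table that are not in the models folder
missing_files <- models$File[!file.exists(file.path(models_dir, models$File))]
```

Running the last line after copying the result files should return an empty vector; anything it lists is named in the table but absent from the folder.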

Gene modules/lists

The heatmap module requires a table containing lists of names such as gene symbols (see config for more details). Each row of this table has a column that contains a gene list, with symbols separated by commas. If your table instead has one column for the list identifier and one for the symbol, with one row per symbol, you can use a group-by operation with paste-collapse to create the required lists, as follows:

Original file:

module gene
A ABC
A DEF
A GHI
B JKL
B MNO
B PQR

Code to transform:

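library(dplyr)  # provides %>%, group_by() and summarise()

# Collapse the genes of each module into one comma-separated string per row
table <- data %>%
  dplyr::group_by(module) %>%
  dplyr::summarise(list = paste(gene, collapse = ","))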
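
The required table, with one row per list, can also be sketched in base R without additional packages. The example below rebuilds the original file shown above; the object name gene_lists is an assumption, and the column name list follows the code above.

```r
# Original long-format table: one row per module and gene
data <- data.frame(
  module = c("A", "A", "A", "B", "B", "B"),
  gene   = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR")
)

# Collapse the genes of each module into one comma-separated string
gene_lists <- aggregate(gene ~ module, data = data, FUN = paste, collapse = ",")
names(gene_lists)[2] <- "list"

# Splitting a list again recovers the individual symbols
genes_A <- strsplit(gene_lists$list[gene_lists$module == "A"], ",", fixed = TRUE)[[1]]
```

The result has one row per module, which is the shape the heatmap module expects according to the description above.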

Asking for help

If you have any problems with data preparation, please open an issue on the package's GitHub repository.