## 1 Introduction

This is a tutorial vignette to provide a quick and easy introduction to the use of MonoPhy. It will make use of the inbuilt example files and lead through the most important functions of the package.

In order to be aware of discrepancies between a newly generated phylogeny and the current taxonomy, it is necessary to assess monophyly of the taxa in the tree. MonoPhy allows doing this in a quick and simple way, both using the phylogeny only and including a file assigning tips to taxonomic groups. Notably, such a file can contain any type of information for which to test monophyly in a tree ( e.g. traits or geographic occurrence), as long as attention is paid on that the assessment makes sense biologically.

## 2 Installation

### 2.1 CRAN

In order to install the stable CRAN version of the package:

install.packages("MonoPhy")

### 2.2 Development Version from GitHub

If - for any reason - you would want to install the development version from GitHub, I suggest doing so by installing it temporarily with the help of the package devtools:

# 1. Install package 'devtools' (if not already installed):
install.packages("devtools")

# 2. Load 'devtools' and temporarily install and load 'MonoPhy' in developers mode:
library(devtools)
dev_mode(on=T)
install_github("oschwery/MonoPhy")  # install the package from GitHub
library(MonoPhy)  # load the package

#3. Leave developers mode after done trying out 'MonoPhy':
dev_mode(on=F)
# the package will not stay on your system permanently like that, which make sense in case of the
# development version, as opposed to the stable version

## 3 Use

When we load the package, all its functions and the other required packages are imported.

library(MonoPhy)

### 3.1 Load Data

#### 3.1.1 Example Files

The package comes with example files for you to try out its functions, which will be used throughout this tutorial. The phylogeny is of the plant family Ericaceae, as published in Schwery et al. (2015) pruned down to 77 taxa. Two taxon files assign tribes or both tribes and subfamilies to the tips.

#load data
data(Ericactree)
data(Ericactribes)
data(Ericacsubfams)

It is generally a good idea (especially with own data) to check whether the format of the data is correct:

#check data
Ericactree
##
## Phylogenetic tree with 77 tips and 76 internal nodes.
##
## Tip labels:
##  Agapetes_buxifolia, Agapetes_hosseana, Agapetes_serpens, Allotropa_virgata, Andromeda_polifolia, Bruckenthalia_spiculifolia, ...
##
## Rooted; includes branch lengths.
head(Ericactribes)
##                           V1          V2
## 1         Agapetes_buxifolia  Vaccinieae
## 2          Agapetes_hosseana  Vaccinieae
## 3           Agapetes_serpens  Vaccinieae
## 4          Allotropa_virgata Monotropeae
## 5        Andromeda_polifolia Andromedeae
## 6 Bruckenthalia_spiculifolia     Ericeae
head(Ericacsubfams)
##                           V1          V2             V3
## 1         Agapetes_buxifolia  Vaccinieae  Vaccinioideae
## 2          Agapetes_hosseana  Vaccinieae  Vaccinioideae
## 3           Agapetes_serpens  Vaccinieae  Vaccinioideae
## 4          Allotropa_virgata Monotropeae Monotropoideae
## 5        Andromeda_polifolia Andromedeae  Vaccinioideae
## 6 Bruckenthalia_spiculifolia     Ericeae     Ericoideae

#### 3.1.2 Own Files

If you want to analyze your own data (which should ultimately be the goal), you can load your phylogeny like this:

phy <- read.tree(file="/your_path/your_phylogeny.tre")

At this point, the package requires the input tree to be rooted, though it may be multifurcating; also, it can only handle single input trees so far. This may change in the future. Polytomies will be dealt with in a conservative way: if different taxa occur in a polytomy, it will be assumed that they are non-monophyletic.

The taxonomy file should be a data frame with or without header and one row per tip in the phylogeny. The first column should be the tip names, any further column should assign the respective tips to taxonomic groups, in the format of one column per group (testing monophyly on several taxonomic levels is possible). If the file has a header, the names therein will be used to name the taxlevels within the function. If it is unknown or not applicable what group a tip belongs to on a certain taxonomic level, it should be encoded as ‘NA’. You can load the taxonomy file like this:

your_clades <- read.csv(file="/your_path/your_taxonomy_file.csv", header=FALSE)

Again to check whether the format of the data is correct:

#check data
phy
head(your_clades)

### 3.2 Run Main Analysis

The command for the main analysis-step is AssessMonophyly. This will check the monophyly of all taxa in the tree, identify which intruders and outliers are responsible for the non-monophyly of a taxon and determine their status for later plotting. Depending on the number of tips in the tree and the number of monophyly-issues, this step will take more or less time. For the example files, this step should only take seconds, for a few thousand tips a couple of minutes, but for a conflict-rich tree with several 10k tips it could easily end up being a few hours. Accessing the information afterwards however is fairly quick.

#### 3.2.1 Taxonomy Based on Tip Labels

If we run the analysis using only the tree, the function will check whether the tip labels are in the format Genus_species, and if so, will extract the genus name from each tip label and use it as the taxonomic group associated to this tip.

solution0 <- AssessMonophyly(Ericactree)

The resulting output object will be a list, which will contain all the output information for all taxon levels in a nested fashion. We will explore in the following how to access this information effectively.

#### 3.2.2 Taxonomy Based on File

If the tip labels have a format different from Genus_species, or if we want to assess a different taxonomic level, we can do that by adding a taxonomy file to the command. The taxonomy file should be a simple table (data frame, maybe easiest based off a .csv file) with or without header, at least two columns and one row per tip. The first column contains the tip labels and all further columns (one or more) the names of the taxa associated to the respective tip, whereas every column stands for a different taxon level (the file has a header, its entries will be used to name the levels). The order of tip names in the file can be different from the order of the tip labels in the tree, but they have to contain the exact same names. If taxonomic levels are unknown for certain tips, they can be coded as NAs (but they cannot stay empty). Those tips will be considered when assessing monophyly of other groups, but the monophyly of the NA group will not be assessed.

solution1 <- AssessMonophyly(Ericactree, Ericacsubfams)

If a taxonomy file cannot be generated easily, specifying taxonomy as taxize allows to use the package with the same name to download taxonomic names of groups indicated under taxizelevel from the online databases 'ncbi', 'itis' or 'both', as set under taxizedb and taxizepref. The downloaded records will then be used like a taxonomy file.

Note: While the function is implemented in a way to remove duplicates from both databases and only retain the ones that yielded a record, specifying 'both' can lead to errors by the taxize package. Also be aware that - compared to the rest of the function - downloading taxonomic information can take a fair amount of time for medium sized trees already.

#### 3.2.4 Other Options

There are a few remaining specifications for this function. verbosity allows to customize the maximal number of intruders/outliers to be displayed in the results table. outliercheck, if set to TRUE, will let the function consider taxon members which are ‘too far away’ in the tree from the rest of their taxon as outliers, instead of all tips ‘in between them’ as intruders, thereby hopefully making the result biologically more meaningful. It will start to look for outliers if the ratio of taxon members in a clade is below the value specified as outlierlevel and will try and find a ‘core clade’ with a ratio higher than that. The arguments taxask and taxverbose allow to adjust the settings of the called taxize function tax_name: taxverbose allows to set whether the function should display which taxon is currently queried and whether an entry is found; taxask allows to set whether the function prompts the user to decide among several potential matches in the database, or to just pick the most likely one instead.

The default values for this command are as follows:

AssessMonophyly <-
function(tree, taxonomy=NULL, verbosity=5, outliercheck=TRUE, outlierlevel=0.5,
taxizelevel= NULL, taxizedb='ncbi', taxizepref='ncbi', taxask=FALSE, taxverbose=FALSE)

### 3.3 Accessing the Results

#### 3.3.1 First Impression - Summary and Results Tables

To get a first look on the results, we can pull the summary table out of the output object:

GetSummaryMonophyly(solution0)  # pull out summary table
## $Genera ## Taxa Tips ## Total 32 77 ## Monophyletic 9 25 ## Non-Monophyletic 6 35 ## Monotypic 17 17 ## Intruder 7 11 ## Outlier 1 2 We can now see in the first column how many taxa are either monophyletic, non-monophyletic or monotypic (consisting of only one representative), as well as how many have outliers or are intruders to another taxon. In the second column, we can see how many of the tips fall in those respective categories. If we used a taxonomy file and had several taxonomic levels, the summary and result functions will by default return all of them. If we only want to look at a specific one, we can specify the number of the one we want under taxlevels. GetResultMonophyly(solution1, taxlevels=2) # pull out summary only for taxlevel 2, subfamilies in this case ##$Taxlevel_2
##                Monophyly MRCA #Tips Delta-Tips #Intruders
## Vaccinioideae         No   89    34         28          2
## Monotropoideae       Yes   80    10          0          0
## Ericoideae            No  123    23          1          1
## Cassiopoideae        Yes  146     5          0          0
## Enkianthoideae       Yes  150     5          0          0
##                                Intruders #Outliers Outliers
## Vaccinioideae  Ericoideae, Cassiopoideae         0
## Monotropoideae                                  NA
## Ericoideae                 Vaccinioideae         0
## Cassiopoideae                                   NA
## Enkianthoideae                                  NA

The result table shows us the analysis outcomes for each taxonomic unit considered. The row ‘Monophyly’ states for each taxon whether it is monophyletic, ‘MRCA’ gives the node number of the most recent common ancestor of all its members, ‘#Tips’ gives the number of tips assigned to this taxon, and ‘Delta-Tips’ gives the number of tips that do not belong to that taxon, but which are also descendants of its MRCA. The columns ‘#Intruders’ and ‘#Outliers’ tell us how many other taxa or tips are intruders or outliers for this taxon respectively. ‘Intruders’ and ‘Outliers’ give us their names.

#### 3.3.2 Further Look - Customized Access to Detailed Results

In order to really dig into the analysis outputs, we can access information on which taxa and tips cause the monophyly issues as either intruders or outliers and which nodes have been inferred as MRCA of each taxon.

When exploring intruders, we can either list them for all taxa or limit the list to taxa we’re interested in, e.g. only for the genus Erica:

GetIntruderTaxa(solution0, taxa="Erica")  # get list of genera that cause monophyly-issues for Erica
## $Genera ##$Genera$Erica ## [1] "Bruckenthalia" We learn that the genus Bruckenthalia invades Erica, which turns out not to be problematic, as Bruckenthalia is now classified to belong to Erica as well. Similarly, if we have multiple taxonomic levels, we can limit the output to the one of interest: GetIntruderTips(solution1, taxa="Ericoideae", taxlevels=2) # get list of species causing monophyly-issues for the subfamily Ericoideae ##$Taxlevel_2
## $Taxlevel_2$Ericoideae
## [1] "Rhodothamnus_chamaecistus"

This works similarly for MRCA-nodes, outlier tips and outlier taxa (note that the later cannot be constrained to focal taxa, because a tip can only be an outlier to its own taxon). Instead of numbers or ‘ALL’, we could allso select taxlevels using their names.

GetAncNodes(solution1, taxa = NULL, taxlevels='ALL')
## $Taxlevel_1 ## MRCA ## Vaccinieae 91 ## Monotropeae 81 ## Andromedeae 91 ## Ericeae 125 ## Bryantheae 137 ## Cassiopoideae 146 ## Pyroleae 82 ## Enkianthoideae 150 ## Phyllodoceae 139 ## Oxydendreae NA ## ##$Taxlevel_2
##                MRCA
## Vaccinioideae    89
## Monotropoideae   80
## Ericoideae      123
## Cassiopoideae   146
## Enkianthoideae  150

Note: The NA for Oxydendreae reflects that it is monotypic and thus doesn’t have an MRCA.

GetOutlierTaxa(solution0, taxlevels='ALL')
## $Genera ## [1] "Vaccinium" GetOutlierTips(solution1, taxa = NULL, taxlevels='ALL') ##$Taxlevel_1
## $Taxlevel_1$Andromedeae
## [1] "Vaccinium_vitis_idaea"
##
##
## \$Taxlevel_2
## list()

Note: list() of course means that there aren’t any outliers in Taxlevel_2.

### 3.4 Plotting

Needless to say, the most intuitive way to get an impression of the issues in the monophyly within a phylogeny is to visually inspect it. Thus, there are a number of ways to look at our results as well.

#### 3.4.1 Monophyly Plot

The monophyly plot is the standard plot for this package. It will colour tips (and clades) according to their monophyly status, allowing to quickly spot areas of the tree where problems occur. The standard colours are:

• green: monophyletic
• light purple: non-monophyletic
• dark purple: intruder/outlier

A number of other ColorBrewer palettes are available (as listed in helpfile) and using customized colours is possible as well. By default, the tree will be ladderized.

The simplest way of plotting monophyly is like this:

PlotMonophyly(solution0, Ericactree, plot.type='monophyly', ladderize=TRUE, cex=0.5)

This shows us the monophyly situation on the genus level. It is easy to see, how Orthilia invades Pyrola, Kalmiopsis invades Phyllodoce, Bruckenthalia invades Erica and how Dimorphanthera and Paphia reciprocally invade each other. We further see how Vaccinium is invaded by Agapetes and Gaylussacia, while having two outlier species of its own as well. Note how a more conservative ( i.e. higher) outlierlevel would tell us that only Gaylussacia invades Vaccinium, which would then have Vaccinium bracteatum as additional outlier instead.

#### 3.4.2 Taxonomy Plot

This rather auxiliary plot type allows viewing clades coloured by the taxon they belong to. Here we want to look at a different taxonomic level, the subfamilies, hence we can specify in the command to display taxlevels=2 (as subfamilies was the second taxonomy column in our input file) and pick the output object solution1, which contains the results of the analysis incorporating the taxonomy file. The default colours are rainbow, which means that every taxon will get a different colour from the rainbow spectrum.

PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='taxonomy', cex=0.5)

This plot shows us nicely how the subfamilies are mostly monophyletic, apart from one rogue in red, which is Rhodothamnus chamaecystus, which notably was mislabeled as a member of the Vaccinioideae (whereas in reality - belonging to Ericoideae - it is in the right place).

#### 3.4.3 Monophyly-Taxonomy Mirror Plot

This plot type simply combines the two previous ones and displays them on two mirrored trees. We use the subfamilies again in this case.

PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='monoVStax', cex=0.4, label.offset=18)

Seeing the previous plot mirrored with the corresponding monophyly plot lets us understand how the whole Ericoideae and Cassiopoideae come out as intruders to Vaccinioideae: The intruder Rhodothamnus chamaecystus pulls the MRCA node of the Vaccinioideae down to inlcude both other subfamilies. Once again, a more conservative outlierlevel would have led to the biologically more meaningful result that only Rhodothamnus is an outlier to Vaccinioideae (and an intruder to Ericoideae), while everything else is well.

#### 3.4.4 Intruder Plot

For a more condensed view on the monophyly issues in your tree and who is causing them, you might want to choose the intruder plot, which will especially highlight intruders and outliers and which taxon they belong to. In order to focus the plot on the issues even more, we set monocoll=TRUE, to collapse monophyletic taxa to be represented by one tip each. This time, we display tribes. The default colouring is:

• gray: monophyletic
• black: non-monophyletic
• rainbow: Intruders/Outliers
PlotMonophyly(solution1, Ericactree, taxlevels=1, plot.type='intruders', monocoll=TRUE, cex=0.5, label.offset=5)

Collapsing all monophyletic tribes slimmed this plot down and increased the overview. We immediately see that the issues are confined to the Vaccinieae and their sister clade, the Andromedeae, and the colouration reveals that Andromeda polifolia, Vaccinium vitis-ideae and Zenobia pulverulenta are causing it. Apparently, Andromeda and the Vaccinium are supposed to belong to the same tribe (Andromedeae), whereas Zenobia should belong to a different one (being an intruder to Andromedeae from Vaccinieae). The solution behind this issue is, that the tribe names assigned to Vaccinium vitis-ideae and Zenobia pulverulenta where (intentionally) switched.

Note: Intruders are coloured using rainbow as well, but the palette will be newly created and the colours will not correspond to the ones in the taxonomy plot.

#### 3.4.5 Solutions for Large Trees

A difficulty for viewing phylogenies in R is that trees can be too large to see much detail in an R plotting window. To this end, we can choose to print the plot directly to a PDF file instead, in which we can then easily search and zoom into the desired portions of the tree. To achieve this, we simply set PDF=TRUE and set our desired filename in PDF_filename='ourdesiredfilename.pdf'. Space can additionally be saved by using monocoll=TRUE. The size of the PDF can be specified using PDF_width and PDF_height, the default for which is "auto".

Note: Be aware that plotting very large trees can take quite some time, in some cases longer than running the main analysis.

Here two examples that are printed to PDF, first a monophyly plot with collapsed monophyletic clades on genus level:

PlotMonophyly(solution0, Ericactree, taxlevels=1, plot.type='monophyly', monocoll=TRUE,
ladderize=TRUE, PDF=TRUE, PDF_filename='Monophylyplot_Ericac_Example.pdf')

and second an intruder plot on subfamily level. A good idea for large trees is, to display them in circular shape using type="fan":

PlotMonophyly(solution1, Ericactree, taxlevels=2, plot.type='intruders', ladderize=TRUE, PDF=TRUE,
PDF_filename='Intruderplot_Ericac_Example.pdf', type="fan")

Note: If PDF_height and PDF_width are set to "auto", the command will automatically scale the PDF document to be created according to the number of taxa in the tree. However, PDF files seem to be constrained to a maximum size of 200 inches, and the function will thus cap the size to that (both for automatic and manual sizes). If your tree is very large, the resulting PDF might be difficult to inspect and you might want to consider chopping your tree up into subtrees to plot ( e.g. using treeSlice from the package phytools).

Also, if the automatic scaling moves tip labels etc. in a way you don’t like, you can adjust most parameters by specifying the respective arguments yourself. More details on that can be found in the helpfiles.

## 4 Final Notes

Further information on all the functions of the package and their use can be found in the helpfiles help(package=MonoPhy). For any further issues and questions send an email with subject ‘MonoPhy support’ to oschwery@vols.utk.edu.

The data for the examples is based on:

Schwery, O., Onstein, R. E., Bouchenak-Khelladi, Y., Xing, Y., Carter, R. J. and Linder, H. P. (2015), As old as the mountains: the radiations of the Ericaceae. New Phytologist. doi: 10.1111/nph.13234 Link to PDF