Cluster genomes at 4 levels from a pangenome gene presence/absence (PA) matrix
Source:R/pangenome_tools.R
cluster_genomes.Rd
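Clusters genomes at four levels from a pangenome gene presence/absence (PA) matrix. Requires the parallelDist and igraph packages. The input matrix should have genomes as rows and genes as columns, because distances are computed between rows (see Examples).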
Usage
cluster_genomes(
  dat_mat,
  pcut = 0,
  scut = 0.5,
  tcut = 0.75,
  qcut = 0.75,
  DIST_METHOD = "binary",
  output_directory = NULL,
  write_dist = TRUE,
  write_graph = TRUE
)
Arguments
- dat_mat
pangenome gene presence/absence matrix with genomes as rows and genes as columns (distances are computed between rows; see the t() call in Examples)
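- pcut
edge-weight cutoff for pruning the graph at the 1st (primary) clustering level
- scut
edge-weight cutoff for pruning the graph at the 2nd (secondary) clustering level
- tcut
edge-weight cutoff for pruning the graph at the 3rd (tertiary) clustering level
- qcut
edge-weight cutoff for pruning the graph at the 4th (quaternary) clustering level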
- DIST_METHOD
distance method for building graph (passed to parallelDist())
- output_directory
output directory to save the graph in
- write_dist
logical, should the dist object be written?
- write_graph
logical, should the graph be written to an RDS file?
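The progress messages in the example output ("calculating distances", "converting to similarities", "building graph") suggest a distance, similarity, graph pipeline. The sketch below illustrates that idea in base R only. It assumes that similarity is 1 minus the "binary" distance, that pruning drops edges whose weight falls below the cutoff, and that clusters are the connected components of the pruned graph. None of these details is confirmed on this page, and the real function uses parallelDist and igraph instead. `toy_mat` and `prune_and_cluster()` are hypothetical names introduced for illustration.

```r
# Toy presence/absence matrix (hypothetical data): genomes as rows, genes as columns
toy_mat <- rbind(
  g1 = c(1, 1, 1, 0, 0, 0),
  g2 = c(1, 1, 1, 1, 0, 0),
  g3 = c(0, 0, 0, 1, 1, 1),
  g4 = c(0, 0, 1, 1, 1, 1)
)

# Base R "binary" distance between rows, then converted to a similarity (assumed: 1 - d)
d <- as.matrix(dist(toy_mat, method = "binary"))
sim <- 1 - d

# Hypothetical helper: keep edges with similarity >= cutoff, then label
# connected components with a simple breadth-first search.
prune_and_cluster <- function(sim, cutoff) {
  adj <- sim >= cutoff
  diag(adj) <- FALSE
  n <- nrow(adj)
  membership <- rep(NA_integer_, n)
  names(membership) <- rownames(sim)
  next_id <- 0L
  for (start in seq_len(n)) {
    if (!is.na(membership[start])) next
    next_id <- next_id + 1L
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      if (!is.na(membership[node])) next
      membership[node] <- next_id
      # Enqueue unvisited neighbours of the current node
      neighbours <- which(adj[node, ] & is.na(membership))
      queue <- c(queue, neighbours)
    }
  }
  membership
}

# Two cutoffs, loosely analogous to two of the four levels
loose <- prune_and_cluster(sim, cutoff = 0.5)
strict <- prune_and_cluster(sim, cutoff = 0.8)
```

In this toy case the 0.5 cutoff groups g1 with g2 and g3 with g4, while the 0.8 cutoff leaves every genome in its own cluster, which mirrors how higher cutoffs in the example give finer clusters.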
Examples
# Note the t(): distances are computed between the rows of a matrix,
# so the genomes must be the rows.
cluster_genomes(dat_mat = t(example_pangenome_matrix),
                pcut = 0.80,
                scut = 0.85,
                tcut = 0.90,
                qcut = 0.95,
                write_dist = FALSE,
                write_graph = FALSE)
#> [1] "calculating distances"
#> [1] "converting to similarities"
#> [1] "building graph"
#> # A tibble: 100 × 5
#>    asm_acc   primary_cluster secondary_cluster tertiary_cluster quat_cluster
#>    <chr>     <membrshp>      <membrshp>        <membrshp>       <membrshp>
#>  1 genome_1  1               1                 1                1
#>  2 genome_2  2               2                 2                2
#>  3 genome_3  3               3                 3                3
#>  4 genome_4  1               4                 4                4
#>  5 genome_5  4               5                 5                5
#>  6 genome_6  4               6                 6                6
#>  7 genome_7  2               7                 7                7
#>  8 genome_8  3               8                 8                8
#>  9 genome_9  3               9                 9                9
#> 10 genome_10 2               10                10               10
#> # ℹ 90 more rows
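A result of this shape can be summarised per level with base R. The sketch below re-types only the ten rows printed above into a hypothetical data frame, `res_head`, with the cluster columns stored as plain numbers; the real columns print as `<membrshp>` and may need `as.integer()` first, which is an assumption here.

```r
# First ten rows of the printed output, typed by hand (the other 90 rows are not shown)
res_head <- data.frame(
  asm_acc = paste0("genome_", 1:10),
  primary_cluster = c(1, 2, 3, 1, 4, 4, 2, 3, 3, 2),
  secondary_cluster = 1:10
)

# Number of genomes per primary cluster among these ten rows
primary_sizes <- table(res_head$primary_cluster)

# Genome accessions grouped by primary cluster
members <- split(res_head$asm_acc, res_head$primary_cluster)
```

Counts taken from these ten rows describe only the printed head of the table, not the full 100-genome result.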