Cluster genomes at 4 levels from a pangenome gene presence/absence (PA) matrix

Requires the parallelDist and igraph packages. The input matrix must have genomes as rows and genes as columns.

Usage

cluster_genomes(
  dat_mat,
  pcut = 0,
  scut = 0.5,
  tcut = 0.75,
  qcut = 0.75,
  DIST_METHOD = "binary",
  output_directory = NULL,
  write_dist = TRUE,
  write_graph = TRUE
)

Arguments

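dat_mat: pangenome presence/absence matrix. Distances are computed between rows, so genomes must be rows and genes columns; if your matrix has genes as rows, transpose it with t() first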
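pcut: edge-weight cutoff for graph pruning at the 1st (primary) level
scut: edge-weight cutoff for graph pruning at the 2nd (secondary) level
tcut: edge-weight cutoff for graph pruning at the 3rd (tertiary) level
qcut: edge-weight cutoff for graph pruning at the 4th (quaternary) level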
DIST_METHOD: distance method for building graph (passed to parallelDist())
output_directory: output directory to save the graph in
write_dist: logical, should the dist object be written?
write_graph: logical, should the graph be written to an rds file?
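
The arguments above suggest a pipeline of distances, similarities, a weighted graph, and pruning at successive cutoffs. The internals of cluster_genomes() are not shown on this page, so the following is only a minimal sketch under stated assumptions: it uses base dist() in place of parallelDist(), treats similarity as 1 - distance, drops edges whose weight falls below the cutoff, and uses connected components as a stand-in for whatever clustering step the function actually applies. The toy matrix and the helper cluster_at() are hypothetical.

```r
library(igraph)

# Toy presence/absence matrix (made-up data): 4 genomes as rows, 6 genes as columns
pa <- matrix(
  c(1, 1, 1, 0, 0, 0,
    1, 1, 1, 1, 0, 0,
    0, 0, 0, 1, 1, 1,
    0, 0, 1, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("genome_", 1:4), paste0("gene_", 1:6))
)

# "binary" distance between rows (genomes), then convert to similarity.
# Assumption: similarity = 1 - distance.
d   <- dist(pa, method = "binary")
sim <- 1 - as.matrix(d)

# Weighted, undirected graph with similarities as edge weights
g <- graph_from_adjacency_matrix(sim, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

# Hypothetical helper: prune edges below `cutoff`, then label each
# connected component of the pruned graph as one cluster.
cluster_at <- function(graph, cutoff) {
  weak   <- which(E(graph)$weight < cutoff)
  pruned <- delete_edges(graph, weak)
  as.integer(components(pruned)$membership)
}

clusters <- data.frame(
  asm_acc           = V(g)$name,
  primary_cluster   = cluster_at(g, 0.5),
  secondary_cluster = cluster_at(g, 0.8)
)
print(clusters)
```

In this sketch a higher cutoff removes more edges, so later levels can only split clusters further, which matches the increasing cutoffs used in the example below.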

Value

A tibble with one row per genome (asm_acc) and its cluster membership at each of the 4 levels (primary_cluster, secondary_cluster, tertiary_cluster, quat_cluster).

Examples

# note the t() because the dist() function finds dists between rows of a matrix
cluster_genomes(dat_mat=t(example_pangenome_matrix),
                pcut=.80,
                scut=.85,
                tcut=.90,
                qcut=.95,
                write_dist=FALSE,
                write_graph=FALSE)
#> [1] "calculating distances"
#> [1] "converting to similarities"
#> [1] "building graph"
#> # A tibble: 100 × 5
#>    asm_acc   primary_cluster secondary_cluster tertiary_cluster quat_cluster
#>    <chr>     <membrshp>      <membrshp>        <membrshp>       <membrshp>  
#>  1 genome_1  1                1                 1                1          
#>  2 genome_2  2                2                 2                2          
#>  3 genome_3  3                3                 3                3          
#>  4 genome_4  1                4                 4                4          
#>  5 genome_5  4                5                 5                5          
#>  6 genome_6  4                6                 6                6          
#>  7 genome_7  2                7                 7                7          
#>  8 genome_8  3                8                 8                8          
#>  9 genome_9  3                9                 9                9          
#> 10 genome_10 2               10                10               10          
#> # ℹ 90 more rows
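
A common next step is to summarise the returned table, for instance by counting genomes per primary cluster or keeping one representative per cluster. The sketch below is illustrative only: rather than calling cluster_genomes(), it rebuilds the first ten printed rows above as a plain data frame (the name res is hypothetical) and uses base R.

```r
# First ten rows of the printed result above, rebuilt by hand as a data frame
res <- data.frame(
  asm_acc           = paste0("genome_", 1:10),
  primary_cluster   = c(1, 2, 3, 1, 4, 4, 2, 3, 3, 2),
  secondary_cluster = 1:10
)

# Number of genomes in each primary cluster
primary_sizes <- table(res$primary_cluster)
print(primary_sizes)

# One representative per primary cluster: the first genome seen in each
representatives <- res[!duplicated(res$primary_cluster), ]
print(representatives$asm_acc)
```

Taking the first genome per cluster is an arbitrary choice made here for brevity; in practice a representative might be picked by assembly quality or another criterion.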