
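Cluster genomes at 4 levels from a pangenome gene presence/absence (PA) matrix

Requires the parallelDist and igraph packages. Distances are computed between rows, so the input is expected to have genomes as rows and genes as columns (see Examples).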

Usage

cluster_genomes(
  dat_mat,
  pcut = 0,
  scut = 0.5,
  tcut = 0.75,
  qcut = 0.75,
  DIST_METHOD = "binary",
  output_directory = NULL,
  write_dist = TRUE,
  write_graph = TRUE
)

Arguments

dat_mat

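pangenome presence/absence matrix. Distances are computed between rows, so a matrix whose rows are genes must be transposed with t() before it is passed in (see Examples)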

pcut

edge-weight cutoff used to prune the graph at the 1st (primary) clustering level

scut

edge-weight cutoff used to prune the graph at the 2nd (secondary) clustering level

tcut

edge-weight cutoff used to prune the graph at the 3rd (tertiary) clustering level

qcut

edge-weight cutoff used to prune the graph at the 4th (quaternary) clustering level

DIST_METHOD

distance method for building graph (passed to parallelDist())

output_directory

output directory to save the graph in

write_dist

logical, should the dist object be written?

write_graph

logical, should the graph be written to an rds file?
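
The expected orientation of `dat_mat` and the meaning of the `"binary"` distance can be illustrated on a toy matrix. This is a minimal sketch that uses base R's `dist()` as a stand-in for `parallelDist()`. The toy matrix and its names are made up for illustration, and the conversion to similarity as `1 - distance` is an assumption about what the function does internally.

```r
# Toy presence/absence matrix with genes as rows and genomes as columns
# (hypothetical data, for illustration only).
pa <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4), paste0("genome_", 1:3))
)

# dist() compares rows, so transpose to put genomes in rows.
genome_mat <- t(pa)

# "binary" distance: among genes present in at least one of the two genomes,
# the proportion present in only one of them (Jaccard distance).
d <- dist(genome_mat, method = "binary")

# Assumed conversion to similarity: 1 - distance.
sim <- 1 - as.matrix(d)

# genome_1 and genome_2 share 2 of the 3 genes present in either genome.
sim["genome_1", "genome_2"]  # 2/3
```

Under this reading, cutoffs closer to 1 keep only links between genomes with nearly identical gene content.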

Value

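A tibble with one row per genome, containing the genome identifier (asm_acc) and its cluster designation at each of the 4 levels (primary_cluster, secondary_cluster, tertiary_cluster, quat_cluster)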
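
One way to inspect the returned table is to count the distinct clusters at each level. The sketch below uses a plain data frame as a stand-in for the returned tibble, filled with the first five rows of the example output on this page. The real columns hold igraph membership objects rather than plain numbers, so an `as.integer()` conversion may be needed first (an assumption, not verified here).

```r
# Stand-in for the returned tibble: first five rows of the example output,
# stored as plain numbers instead of membership objects.
clusters <- data.frame(
  asm_acc           = paste0("genome_", 1:5),
  primary_cluster   = c(1, 2, 3, 1, 4),
  secondary_cluster = 1:5,
  tertiary_cluster  = 1:5,
  quat_cluster      = 1:5,
  stringsAsFactors  = FALSE
)

# Number of distinct clusters at each of the four levels.
level_cols <- c("primary_cluster", "secondary_cluster",
                "tertiary_cluster", "quat_cluster")
n_clusters <- sapply(clusters[level_cols], function(x) length(unique(x)))

# Genomes that share a primary cluster with genome_1.
target <- clusters$primary_cluster[clusters$asm_acc == "genome_1"]
same_primary <- clusters$asm_acc[clusters$primary_cluster == target]
```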

Examples

# Note the t(): distances are computed between rows of a matrix, so genomes must be rows
cluster_genomes(dat_mat=t(example_pangenome_matrix),
                pcut=.80,
                scut=.85,
                tcut=.90,
                qcut=.95,
                write_dist=FALSE,
                write_graph=FALSE)
#> [1] "calculating distances"
#> [1] "converting to similarities"
#> [1] "building graph"
#> # A tibble: 100 × 5
#>    asm_acc   primary_cluster secondary_cluster tertiary_cluster quat_cluster
#>    <chr>     <membrshp>      <membrshp>        <membrshp>       <membrshp>  
#>  1 genome_1  1                1                 1                1          
#>  2 genome_2  2                2                 2                2          
#>  3 genome_3  3                3                 3                3          
#>  4 genome_4  1                4                 4                4          
#>  5 genome_5  4                5                 5                5          
#>  6 genome_6  4                6                 6                6          
#>  7 genome_7  2                7                 7                7          
#>  8 genome_8  3                8                 8                8          
#>  9 genome_9  3                9                 9                9          
#> 10 genome_10 2               10                10               10          
#> # ℹ 90 more rows
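
The "building graph" step and the role of the cutoffs can be sketched with igraph on a small similarity matrix. This is an illustration under stated assumptions, not the function's actual implementation: the similarity values are made up, edges with weight below the cutoff are assumed to be the ones removed, and connected components stand in for whatever community detection `cluster_genomes()` applies. The helper `prune_and_cluster()` is hypothetical.

```r
library(igraph)

# Hypothetical pairwise similarities between three genomes.
ids <- paste0("genome_", 1:3)
sim <- matrix(
  c(1,    2/3,  0.25,
    2/3,  1,    0.5,
    0.25, 0.5,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(ids, ids)
)

# Weighted, undirected graph with similarities as edge weights (no self-loops).
g <- graph_from_adjacency_matrix(sim, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

# Hypothetical helper: drop edges whose weight is below the threshold,
# then label each genome by its connected component.
prune_and_cluster <- function(graph, threshold) {
  weak <- which(E(graph)$weight < threshold)
  pruned <- delete_edges(graph, weak)
  components(pruned)$membership
}

loose  <- prune_and_cluster(g, 0.2)  # all edges kept: one cluster
strict <- prune_and_cluster(g, 0.6)  # only genome_1 and genome_2 stay linked
```

Raising the cutoff from one level to the next, as the example call does with .80, .85, .90 and .95, removes more edges, which is consistent with finer clusters at the later levels.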