Included Plugins

CountESS is designed to make writing plugins easy, and it ships with plugins for common file formats and simple data transformations.

Source code for the included plugins can be found in the repository at countess/plugins.

CountESS is not limited to these included plugins: anyone can write and publish CountESS plugins. See Other Plugins for some examples or Writing CountESS Plugins to write your own.

File Formats

CSV Reader

Loads data from CSV (and similar delimited text formats) into a dataframe. Unfortunately CSV is not well standardized so there are quite a few options.

Note that pandas’ default integer types can’t represent null values, so columns containing null values and set to integer data type will be promoted to float columns.

Parameters

delimiter: whether to use comma (,), semicolon (;), pipe (|), tab or whitespace as the column delimiter
comment: are comments allowed, and if so is # or ; used to delimit them?
header: is the first row a header row? If so it will be used to populate default column names
filename_column: if this is set, the filename will be added as a separate column. This is useful when reading many CSV files at once
columns: for each column, specify a name and data type, and optionally index this column
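
The integer-to-float promotion described above can be seen with pandas directly. This is a minimal sketch using plain `pandas.read_csv` with its default type inference, not the CSV Reader plugin itself; the column names and data are made up for illustration.

```python
import io

import pandas as pd

# Two data rows; the first has an empty value in the "count" column.
csv_text = "name,count\nA,\nB,3\n"

df = pd.read_csv(io.StringIO(csv_text))

# The missing value is stored as NaN, so pandas infers float64 rather than int64.
count_dtype = str(df["count"].dtype)
print(count_dtype)
```

If integer values matter downstream, the nulls need to be filled or the affected rows dropped before converting the column back to an integer type.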

CSV Writer

Writes dataframe data out to CSV (or TSV, etc). Unfortunately CSV is not well standardized so there are quite a few options.

Parameters

header: include a header row with column headers
filename: the filename to write (if this ends with “.gz”, data will be gzipped as it is written)
delimiter: whether to use comma (,), semicolon (;), pipe (|), tab or whitespace as the column delimiter
quoting: whether to quote all string values, even if they don’t appear to need it. This can reduce the chance of data corruption with software which can otherwise confuse string values for dates or numbers.
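
To illustrate what the filename and quoting options mean, here is a small sketch using only the Python standard library. It is not the CSV Writer's own code: the `write_csv` helper, its `quote_strings` argument and the sample rows are assumptions made for this example.

```python
import csv
import gzip
import os
import tempfile

# "1E5" and "MAR1" are strings that some software may read as a number or a date.
rows = [["variant", "count"], ["1E5", 12], ["MAR1", 7]]


def write_csv(filename, rows, quote_strings=True):
    """Write rows to filename, gzipping the output if the name ends with '.gz'."""
    opener = gzip.open if filename.endswith(".gz") else open
    # QUOTE_NONNUMERIC quotes every non-numeric field, needed or not.
    quoting = csv.QUOTE_NONNUMERIC if quote_strings else csv.QUOTE_MINIMAL
    with opener(filename, "wt", newline="") as fh:
        csv.writer(fh, quoting=quoting).writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "out.csv.gz")
write_csv(path, rows)

# Read the gzipped file back to inspect what was written.
with gzip.open(path, "rt", newline="") as fh:
    written = fh.read()
print(written)
```

With quoting enabled, every string value is wrapped in double quotes while the numeric counts are left bare.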

Data Table

Sometimes you need a little bit of extra data joined into your tables but don’t really want to have a separate CSV file. Data Table lets you embed small amounts of data into the configuration directly.

Parameters

Columns: Add columns, giving them names and types. Columns may also be indexed to improve join performance.
Rows: Add rows to the data table.

Regex Reader

If the CSV Reader isn’t flexible enough for your file format, the Regex Reader can be used to parse each line of your file with a Python regular expression, splitting each line into multiple columns.

Note that pandas’ default integer types can’t represent null values, so columns containing null values and set to integer data type will be promoted to float columns.

Parameters

regex: a Python regular expression which describes the expected format
skip: how many lines to skip at the start of reading, useful for header rows.
output: for each regex group, the column name and data type it should be stored as.
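
The three parameters can be pictured with Python's `re` module. This is a sketch only: the line format, the column names and the choice to ignore non-matching lines are assumptions for illustration, not documented behaviour of the plugin.

```python
import re

# Hypothetical file: one header line, then "<barcode>:<count>" per line.
lines = ["barcode:count", "ACGT:12", "TTAG:3", "not a match"]

pattern = re.compile(r"([ACGT]+):(\d+)")     # the "regex" parameter
skip = 1                                     # the "skip" parameter
output = [("barcode", str), ("count", int)]  # one (name, type) pair per group

records = []
for line in lines[skip:]:
    match = pattern.match(line)
    if match is None:
        continue  # assumption: this sketch ignores lines that do not match
    records.append(
        {name: cast(value) for (name, cast), value in zip(output, match.groups())}
    )

print(records)
```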

Data Manipulation

Collate

Group and sort records, keeping only the first N records in each group.

Filter

Filter records using simple expressions. Multiple filters can be chained together. Rows which match a filter can have column values set before being passed on to the next plugin. Rows which do not match the filter are sent to the next filter. Rows which do not match any filter are dropped.

Each filter consists of one or more expressions, each of which compares a column to a value, using the following operators:

operator	type(s)
`equals`	numbers, strings, booleans
`greater than`	numbers, strings
`less than`	numbers, strings
`starts with`	strings
`ends with`	strings
`matches regex`	strings

Each operator can also be negated, so for example a filter can select values which do not start with a substring by negating the starts with operator.

The expressions in a filter are then combined by requiring either All or Any of them to match.
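
The chaining rules can be sketched in plain Python. This is an illustration of the behaviour described above, not the plugin's implementation; the dictionary layout of a filter, the helper names, and the use of `re.search` for "matches regex" are all assumptions.

```python
import operator
import re

OPERATORS = {
    "equals": operator.eq,
    "greater than": operator.gt,
    "less than": operator.lt,
    "starts with": lambda value, arg: value.startswith(arg),
    "ends with": lambda value, arg: value.endswith(arg),
    "matches regex": lambda value, arg: re.search(arg, value) is not None,
}


def expression_matches(row, column, op, arg, negate=False):
    """Compare one column to a value, optionally negating the result."""
    return OPERATORS[op](row[column], arg) != negate


def apply_filters(rows, filters):
    """Each row goes to the first filter it matches; unmatched rows are dropped."""
    out = []
    for row in rows:
        for f in filters:
            combine = all if f["combine"] == "All" else any
            if combine(expression_matches(row, *e) for e in f["expressions"]):
                out.append({**row, **f.get("set", {})})  # set column values
                break
    return out


filters = [
    {"combine": "All", "expressions": [("variant", "starts with", "p.")],
     "set": {"kind": "protein"}},
    {"combine": "Any", "expressions": [("count", "greater than", 10),
                                       ("variant", "equals", "WT")],
     "set": {"kind": "other"}},
]
rows = [
    {"variant": "p.Ala107His", "count": 2},
    {"variant": "c.1A>G", "count": 50},
    {"variant": "c.2T>C", "count": 1},
]
filtered = apply_filters(rows, filters)
print(filtered)
```

The first row matches the first filter, the second row falls through to the second filter, and the third row matches neither and is dropped.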

Group By

Groups data by one or more columns, and aggregates the rest of the columns.

For each column of the input, select either “Index” or zero or more aggregation functions.

`name`	`score`
`A`	`7`
`A`	`8`
`B`	`9`

For example, indexing the table above by name and selecting both count and mean for score results in:

`name`	`score__count`	`score__mean`
`A`	`2`	`7.5`
`B`	`1`	`9`
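
The same aggregation can be reproduced in a few lines of plain Python. This is only a sketch of the semantics, using the example table above; it is not how the plugin is implemented.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"name": "A", "score": 7},
    {"name": "A", "score": 8},
    {"name": "B", "score": 9},
]

# Collect the score values for each distinct value of the index column.
groups = defaultdict(list)
for row in rows:
    groups[row["name"]].append(row["score"])

# One output row per group, with one column per aggregation function.
result = [
    {"name": name, "score__count": len(scores), "score__mean": mean(scores)}
    for name, scores in groups.items()
]
print(result)
```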

Join

Join two datatables on indexes or a column.

Parameters

For each input datatable, there are three parameters:

Join On: Select index to use the table index, or a column
Required: Is a match on this table required to output the row?
Drop Column: Select if you no longer need the joining column and want to drop it.

The “required” flag on the input datatables lets you select the type of join:

Input 1 Required	Input 2 Required	Type of Join
✓	✓	Inner
✓	✗	Left
✗	✓	Right
✗	✗	Full / Outer
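
The effect of the two "required" flags can be sketched as follows. The `join` function is hypothetical and assumes each key appears at most once per table; a column missing from one side is simply absent from the output row rather than set to null.

```python
def join(left, right, key, left_required, right_required):
    """Join two lists of row dicts on `key`, honouring the 'required' flags."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    out = []
    for k in keys:
        l_row, r_row = left_by_key.get(k), right_by_key.get(k)
        # Skip the row if a required input has no match for this key.
        if (left_required and l_row is None) or (right_required and r_row is None):
            continue
        out.append({**(l_row or {}), **(r_row or {})})
    return out


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]

inner = join(left, right, "id", True, True)
left_join = join(left, right, "id", True, False)
right_join = join(left, right, "id", False, True)
outer = join(left, right, "id", False, False)
print([row["id"] for row in outer])
```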

Regex Tool

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” – Jamie Zawinski

Splits a string value into multiple parts using a regular expression.

Specify a pattern to match and columns to output the matching parts to.

In “multi” mode, the regex can match multiple times, each of which will generate its own output row.
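
The difference between the two modes can be illustrated with Python's `re` module. The pattern, the column names and the use of `re.search` for the single-match case are assumptions for this sketch.

```python
import re

pattern = re.compile(r"([A-Z])(\d+)")
row = {"id": 1, "codes": "A12 B7 C300"}

# Single mode: only the first match produces output.
first = pattern.search(row["codes"])
single = [{"id": row["id"], "letter": first.group(1), "number": int(first.group(2))}]

# "Multi" mode: every match generates its own output row.
multi = [
    {"id": row["id"], "letter": m.group(1), "number": int(m.group(2))}
    for m in pattern.finditer(row["codes"])
]
print(multi)
```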

Pivot Tool

Performs a “pivot” operation on the data table.

Each column can be specified as an “Index”, “Pivot” or “Expand” column, or dropped.

The result will have one row per Index value, with each of the “Expand” columns expanded to include each distinct value of the “Pivot” column(s). Duplicate values are summed and missing values are set to 0.

For example this data set:

`variant`	`replicate`	`count`
`1`	`1`	`1`
`1`	`2`	`2`
`2`	`1`	`3`
`2`	`2`	`4`
`2`	`2`	`5`
`3`	`2`	`6`

when pivoted with variant as the Index column, replicate as the Pivot column and count as the Expand column, becomes:

`variant`	`count__replicate_1`	`count__replicate_2`
`1`	`1`	`2`
`2`	`3`	`9`
`3`	`0`	`6`
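
The example above can be reproduced with a short plain-Python sketch, which makes the summing of duplicates and the zero-filling of missing values explicit. It illustrates the operation only and is not the plugin's code.

```python
from collections import defaultdict

# (variant, replicate, count) rows from the example table
rows = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (3, 2, 6)]

replicates = sorted({replicate for _, replicate, _ in rows})

totals = defaultdict(int)
for variant, replicate, count in rows:
    totals[(variant, replicate)] += count  # duplicate values are summed

pivoted = []
for variant in sorted({v for v, _, _ in rows}):
    out = {"variant": variant}
    for replicate in replicates:
        # missing (variant, replicate) combinations are set to 0
        out[f"count__replicate_{replicate}"] = totals.get((variant, replicate), 0)
    pivoted.append(out)
print(pivoted)
```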

Expression

This lets you embed simple Python-like expressions into your data processing.

Each line of the “Expressions” block is a separate expression. Expressions can either be statements, like:

new_column = old_column * 2

or filters, like:

variant_class == 'M'

Basic arithmetic operations are available, as well as many useful functions:

Operators

+, -, *, **, /, and, or, not, <, <=, >, >=, !=, ==

Constants

True, False: Boolean values (displayed as —T and —F in table views)
None, NULL: Represent a missing value (displayed as — in table views)

Math Functions

ABS: Absolute value
COS, SIN, TAN, ACOS, ASIN, ATAN, ATAN2: Trigonometric functions and their inverses
EXP, LOG, LOG2, LOG10: Exponentials and Logarithms
AVG(...), MEDIAN(...), MIN(...), MAX(...): Average, Median, Min and Max of multiple values
STD_POP(...), STD_SAMP(...), VAR_POP(...), VAR_SAMP(...): Population or Sample Standard Deviation / Variance of multiple values
CEIL, FLOOR: Round up or down to nearest integer

String Functions

CONCAT: Concatenate two strings
CONTAINS(string1, string2): True if string1 contains string2
ENDS_WITH(string1, string2): True if string1 ends with string2
STARTS_WITH(string1, string2): True if string1 starts with string2
LEN(string): Length of the string
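
The difference between statement and filter expressions can be sketched with Python's `ast` module. This is a rough illustration of the idea, not the plugin's evaluator: the `apply_expressions` helper is hypothetical, it supports none of the functions listed above, and it assumes that a filter expression drops rows for which it is false. It also relies on `eval`, so it should not be used on untrusted input.

```python
import ast


def apply_expressions(rows, lines):
    """Apply each expression line, in order, to a list of row dicts."""
    for line in lines:
        node = ast.parse(line.strip()).body[0]
        # Both ast.Assign and ast.Expr keep their expression in .value
        code = compile(ast.Expression(body=node.value), "<expression>", "eval")
        if isinstance(node, ast.Assign):
            # Statement: compute a new column from the existing ones.
            target = node.targets[0].id
            rows = [{**row, target: eval(code, {"__builtins__": {}}, row)} for row in rows]
        else:
            # Filter: keep only rows for which the expression is true.
            rows = [row for row in rows if eval(code, {"__builtins__": {}}, row)]
    return rows


rows = [
    {"old_column": 2, "variant_class": "M"},
    {"old_column": 5, "variant_class": "S"},
]
processed = apply_expressions(rows, ["new_column = old_column * 2", "variant_class == 'M'"])
print(processed)
```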

Bioinformatics

FASTQ Load

Read one (or more) FASTQ files, optionally gzipped. Returns a datatable with columns:

sequence: the sequence, as a DNA or RNA string of A, C, G, T.
header: the header string from the FASTQ read (the line starting with @)

A minimum average quality filter can be applied.

If “Group by sequence?” is selected, an additional column “count” is added and the sequences are grouped. The header is reduced to the common prefix of all headers in the FASTQ file.
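
The steps above can be sketched with the standard library. Several details here are assumptions rather than documented behaviour: four-line FASTQ records, Phred+33 quality encoding, stripping the leading @ from the header, and computing the common prefix over the reads that pass the quality filter.

```python
import io
import os
from collections import Counter

FASTQ_TEXT = """@run1:1
ACGT
+
IIII
@run1:2
ACGT
+
IIII
@run1:3
TTTT
+
!!!!
"""


def read_fastq(handle, min_avg_quality=0):
    """Yield (header, sequence) for reads meeting the minimum average quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        avg = sum(ord(c) - 33 for c in quality) / len(quality)  # Phred+33 assumed
        if avg >= min_avg_quality:
            yield header[1:], sequence


reads = list(read_fastq(io.StringIO(FASTQ_TEXT), min_avg_quality=20))

# "Group by sequence?": count identical sequences, reduce headers to a common prefix.
counts = Counter(sequence for _, sequence in reads)
prefix = os.path.commonprefix([header for header, _ in reads])
grouped = [{"sequence": s, "header": prefix, "count": n} for s, n in counts.items()]
print(grouped)
```

The third read has an average quality of 0 and is filtered out; the two remaining reads share a sequence and are grouped into one row.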

Mutagenize

Takes a sequence given in the configuration and returns all possible mutations of that sequence, of the selected types. Mutation types are not combined: each output sequence results from a single mutation of one kind.

Parameters

All Single Mutations?: Include all SNVs of the sequence
All Single Deletes?: Include all single-base deletions of the sequence
All Triple Deletes?: Include all triple-base deletions, e.g. aligned deletions, of the sequence
All Single Inserts?: Include all single-base insertions of the sequence
All Triple Inserts?: Include all triple-base insertions, e.g. aligned insertions, of the sequence
Remove Duplicates?: Deletions and Insertions into the sequence can end up with the same sequence from multiple operations, for example inserting A into CAAT can produce CAAAT in three different ways, and single deletions can produce CAT in two different ways. If “Remove Duplicates” is selected, duplicates are removed but an additional column “count” is added to indicate how many ways this sequence can have been produced.
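
The duplicate counts in the CAAT example can be checked with a short sketch. The function names are made up for illustration and this is not the plugin's code.

```python
from collections import Counter


def single_inserts(seq, alphabet="ACGT"):
    """Yield every sequence made by inserting one base at any position."""
    for pos in range(len(seq) + 1):
        for base in alphabet:
            yield seq[:pos] + base + seq[pos:]


def single_deletes(seq):
    """Yield every sequence made by deleting one base."""
    for pos in range(len(seq)):
        yield seq[:pos] + seq[pos + 1:]


# Counting duplicates gives the value of the extra "count" column.
insert_counts = Counter(single_inserts("CAAT"))
delete_counts = Counter(single_deletes("CAAT"))

print(insert_counts["CAAAT"], delete_counts["CAT"])
```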

Variant Caller

Turns a DNA sequence into an HGVS variant code by comparing it to a reference sequence. The reference sequence can either be provided directly as a configuration parameter or in a selected column.

See also: countess-minimap2 plugin, a variant caller which uses ‘minimap2’ to find sequences within a genome.

Parameters

Input Column: the input column with the variant sequence
Reference Sequence: Select the column which contains the reference sequence, or enter a reference sequence as a string.
DNA Variant: Column name for HGVS string of the nucleotide variant
Protein Variant: Column name for HGVS string of the protein variant
Classification Output Column: Column name for a classification of the variant (see Variant Classifier for details)
Max Mutations: Maximum number of mutations. If no variant with this many mutations or fewer is found, a null value is returned for the output
Drop: Drop rows which would have null values for output

Variant Classifier

Takes a column of protein variants and classifies them into broad types. This is easier than writing the regular expression yourself.

Short format:

format	type	explanation
`WT` `_WT`	`W`	Wild type
`A107H`	`M`	Missense
`A107A` `A107=`	`S`	Synonymous
`A107*` `A107X`	`N`	Nonsense
`A107-`	`D`	Deletion

There is currently no support for insertions in short format. Invalid amino acid letters will generate a warning and the type will be set to ?.
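
For comparison, this is roughly the kind of regular expression the classifier saves you from writing for the short format. It is a sketch based only on the table above: the `classify_short` function and the list of 20 standard amino acid letters are assumptions, and the warning on invalid input is omitted.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SHORT_RE = re.compile(rf"([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}=*X-])")


def classify_short(variant):
    """Classify a short-format protein variant into W, M, S, N, D or ?."""
    if variant in ("WT", "_WT"):
        return "W"  # wild type
    match = SHORT_RE.fullmatch(variant)
    if match is None:
        return "?"  # unrecognized format or invalid amino acid letter
    ref, _position, alt = match.groups()
    if alt in ("=", ref):
        return "S"  # synonymous
    if alt in ("*", "X"):
        return "N"  # nonsense
    if alt == "-":
        return "D"  # deletion
    return "M"  # missense


examples = ["WT", "A107H", "A107A", "A107=", "A107*", "A107X", "A107-", "B107H"]
classes = [classify_short(v) for v in examples]
print(classes)
```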

HGVS format:

format	type	explanation
`p.=`	`W`	Wild type
`p.Ala107His`	`M`	Missense
`p.Ala107=`	`S`	Synonymous
`p.Ala107Ter`	`N`	Nonsense
`p.Ala107del` `p.Ala107_His109del`	`D`	Deletion
`p.Ala107dup` `p.Ala107_Glu108dup` `p.Ala107_Glu108insHis`	`I`	Insertion / Duplication

Invalid amino acid codes will generate a warning and the type will be set to ?. Other issues such as invalid locations will not necessarily be detected. More complicated scenarios, like delins and multiple changes, are not currently supported.

Unrecognized Formats

Other variant formats will generate a warning and the type will be set to ?.

Scoring

The scoring plugin implements the same scoring algorithms as used in Enrich2 and described in A statistical framework for analyzing deep mutational scanning data.

Replicate Column: Used to split the results into multiple replicates; each replicate has its frequencies counted separately.
Input Columns: Select a group of columns which provide the counts for each variant at each time point.
Wildtype: Choose either a boolean or string column to identify wild types for use in the scoring equation. If it’s a string column, the values W, p.= or g.= are regarded as marking wild types. If no column is chosen, the sum of all variants is used in the scoring equation instead.
Score Column: A name for the column to output the score in.
Standard Deviation Column: A name for the column to output the standard deviation of the score in.
Drop Input Columns?: If selected, drop all the input count columns from the output.

Random Effects Model

The Random Effects Model plugin combines scores using their score values and standard deviations, as implemented in Enrich2 and described in A statistical framework for analyzing deep mutational scanning data.

Score Columns: Columns containing scores
Stddev Columns: Columns containing corresponding standard deviations