Included Plugins
CountESS is designed to be easy to write plugins for, but plugins are included for common file formats and simple data transformations.
Source code for the included plugins can be found in the repository at countess/plugins
CountESS is not limited to these included plugins: anyone can write and publish CountESS plugins. See Other Plugins for some examples or Writing CountESS Plugins to write your own.
File Formats
CSV Reader
Loads data from CSV (and similar delimited text formats) into a dataframe. Unfortunately CSV is not well standardized so there are quite a few options.
Note that pandas can’t represent null values in integer columns, so columns containing null values and set to integer data type will be promoted to float columns.
Parameters
- delimiter
- whether to use
,
,;
,|
, tab or whitespace as the column delimiter - comment
- are comments allowed, and if so is
#
or;
used to delimit them? - header
- is the first row a header row? If so it will be used to populate default column names
- filename_column
- if this is set, the filename will be added as a separate column. This is useful when reading many CSV files at once
- columns
- for each column, specify a name and data type, and optionally index this column
CSV Writer
Writes dataframe data out to CSV (or TSV, etc). Unfortunately CSV is not well standardized so there are quite a few options.
Parameters
- header
- include a header row with column headers
- filename
- the filename to write (if this ends with “.gz”, data will be gzipped as it is written)
- delimiter
- whether to use
,
,;
,|
, tab or whitespace as the column delimiter - quoting
- whether to quote all string values, even if they don’t appear to need it. This can reduce the chance of data corruption with software which can otherwise confuse string values for dates or numbers.
Data Table
Sometimes you need a little bit of extra data joined into your tables but don’t really want to have a separate CSV file. Data Table lets you embed small amounts of data into the configuration directly.
Parameters
- Columns
- Add columns, giving them names and types. Columns may also be indexed to improve join performance.
- Rows
- Add rows to the data table.
Regex Reader
If the CSV Reader isn’t flexible enough for your file format, the Regex Reader can be used to parse each line of your file with a python regular expression, splitting each line into multiple columns.
Note that pandas can’t represent null values in integer columns, so columns containing null values and set to integer data type will be promoted to float columns.
Parameters
- regex
- a Python Regular Expression which describes the expected format
- skip
- how many lines to skip at the start of reading, useful for header rows.
- output
- for each regex group, what colum name and data type should it be stored as.
Data Manipulation
Collate
Group and sort records, keeping only the first N records in each group.
Filter
Filter records using simple expressions. Multiple filters can be chained together. Rows which match a filter can have column values set before being passed on to the next plugin. Rows which do not match the filter are sent to the next filter. Rows which do not match any filter are dropped.
Each filter consists of one or more expressions, each of which compares a column to a value, using the following operators:
operator | type(s) |
---|---|
equals |
numbers, strings, booleans |
greater than |
numbers, strings |
less than |
numbers, strings |
starts with |
strings |
ends with |
strings |
matches regex |
strings |
Each operator can also be negated, so for example a filter can select values which do
not start with a substring by negating the starts with
operator.
Operators are then combined with either All
or Any
filters matching.
Group By
Groups data by one or more columns, and aggregates the rest of the columns.
For each column of the input, select either “Index” or zero or more aggregation functions.
name |
score |
---|---|
A |
7 |
A |
8 |
B |
9 |
For example the above can be indexed by name
and both count and mean evaluated for score
, resulting in:
name |
score__count |
score__mean |
---|---|---|
A |
2 |
7.5 |
B |
1 |
9 |
Join
Join two datatables on indexes or a column.
Parameters
For each input datatable, there are three parameters:
- Join On
- Select index to use the table index, or a column
- Required
- Is a match on this table required to output the row?
- Drop Column
- Select if you no longer need the joining column and want to drop it.
The “required” flag on the input datatables lets you select the type of join:
Input 1 Required | Input 2 Required | Type of Join |
---|---|---|
✓ | ✓ | Inner |
✓ | ✗ | Left |
✗ | ✓ | Right |
✗ | ✗ | Full / Outer |
Regex Tool
“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinski
Splits a string value into multiple parts using a regular expression.
Specify a pattern to match and columns to output the matching parts to.
In “multi” mode, the regex can match multiple times, each of which will generate its own output row.
Pivot Tool
Performs a “pivot” operation on the data table.
Each column can be specified as an “Index”, “Pivot” or “Expand” column, or dropped.
The result will have one row per Index value, with each of the “Expand” columns expanded to include each distinct value of the “Pivot” column(s). Duplicate values are summed and missing values are set to 0
.
For example this data set:
variant |
replicate |
count |
---|---|---|
1 |
1 |
1 |
1 |
2 |
2 |
2 |
1 |
3 |
2 |
2 |
4 |
2 |
2 |
5 |
3 |
2 |
6 |
when pivoted with index on Variant, pivot on Replicate and expanding Count becomes:
variant |
count__replicate_1 |
count__replicate_2 |
---|---|---|
1 |
1 |
2 |
2 |
3 |
9 |
3 |
0 |
6 |
Python Code
This lets you embed simple Python expressions into your data processing.
Add one or more Python expressions to the “Python Code”. Each row is processed separately, with column values appearing as local variables.
Functions available:
abs acos acosh all any ascii asin asinh atan atan2 atanh
bin bool cbrt ceil chr comb compile copysign cos cosh
degrees dist erf erfc escape exp exp2 expm1
fabs factorial findall finditer float floor fmod
format frexp frozenset fsum fullmatch gamma gcd
hash hex hypot inf int isclose isfinite isinf isinf
isnan isnan isqrt lcm ldexp len lgamma log log10 log1p log2
match max mean median min modf nan nextafter ord
perm pow prod purge radians range remainder round
search sin sinh sorted split sqrt std str sub subn
sum sumprod tan tanh template trunc type ulp var zip
Bioinformatics
FASTQ Load
Read one (or more) FASTQ files, optionally gzipped. Returns a datatable with columns:
- sequence
- the sequence, as a DNA or RNA string of A, C, G, T.
- header
- the header string from the FASTQ read (the line starting with
@
)
A minimum average quality filter can be applied.
If “Group by sequence?” is selected, an additional column “count” is added and the sequences are grouped. The header is reduced to the common prefix of all headers in the FASTQ file.
Mutagenize
Takes a sequence in configuration and returns all possible mutations of that sequence of various types. Only one kind is applied at a time.
Parameters
- All Single Mutations?
- Include all SNVs of the sequence
- All Single Deletes?
- Include all single base deletions of the sequence
- All Triple Deletes?
- Include all triple base deletions, eg: aligned deletions, of the sequence
- All Single Inserts?
- Include all single base insertions of the sequence
- All Triple Inserts?
- Include all triple base insertions, eg: aligned insertions, of the sequence
- Remove Duplicates?
- Deletions and Insertions into the sequence can end up with the same sequence from multiple operations, for example inserting
A
intoCAAT
can produceCAAAT
in three different ways, and single deletions can produceCAT
in two different ways. If “Remove Duplicates” is selected, duplicates are removed but an additional column “count” is added to indicate how many ways this sequence can have been produced.
Variant Caller
Turns a DNA sequence into an HGVS variant code by comparing it to a reference sequence. The reference sequence can either be provided directly as a configuration parameter or in a selected column.
See also: countess-minimap2 plugin, a variant caller which uses ‘minimap2’ to find sequences within a genome.
Parameters
- Input Column
- the input column with the variant sequence
- Reference
- (optional) select column which contains the reference sequence …
- Sequence
- (optional) … or supply a reference sequence as a value
- Output Column
- Column name for HGVS string
- Max Mutations
- Maximum number of mutations, if no variant with this number or less mutations is found then return a null value for the output
- Drop
- Drop rows which would have null values for output