Sequence Filtering

The sequence filtering method works at the sequence level, whereas the SEGUL alignment filtering feature works at the alignment level. Alignment filtering removes an entire alignment that does not meet the filtering criteria. Sequence filtering instead removes only the sequences that do not meet the criteria and keeps the alignment as long as at least one sequence remains in it. The feature works on many alignments simultaneously and never overwrites your original datasets; it writes the filtered sequences to new files.

Available filtering methods:

  1. Sequence length
  2. Proportion of gaps

The command is structured as follows:

segul sequence filter <input-option> [alignment-path] <filtering-option> <value>

Filtering based on sequence length

Given a collection of alignments, SEGUL removes from each alignment every sequence whose number of non-gap characters is below the specified length. For example, consider an alignment with three sequences:

>seq_1
agtctgatc
>seq_2
agtc-----
>seq_3
agtcgatct

We want to keep only sequences that contain at least 5 bp, excluding gaps. We can use the --min-length option:

segul sequence filter --dir alignments/ --min-length 5

This removes seq_2 because it contains only 4 bp once gaps are excluded. The result will be:

>seq_1
agtctgatc
>seq_3
agtcgatct
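
The length rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not SEGUL's actual implementation: the function names are hypothetical, and it assumes that both - and ? count as gaps when measuring length.

```python
# Illustrative sketch only; not SEGUL's implementation.
# Assumption: '-' and '?' are both treated as gaps when counting length.
GAP_CHARS = {"-", "?"}


def ungapped_length(seq: str) -> int:
    """Count the characters in a sequence that are not gaps."""
    return sum(1 for ch in seq if ch not in GAP_CHARS)


def filter_by_min_length(alignment: dict, min_length: int) -> dict:
    """Keep only sequences with at least `min_length` non-gap characters."""
    return {
        name: seq
        for name, seq in alignment.items()
        if ungapped_length(seq) >= min_length
    }


alignment = {
    "seq_1": "agtctgatc",
    "seq_2": "agtc-----",
    "seq_3": "agtcgatct",
}

filtered = filter_by_min_length(alignment, 5)
print(list(filtered))  # ['seq_1', 'seq_3']
```

Here seq_2 has 4 non-gap characters, which is below the threshold of 5, so it is dropped while the other two sequences are kept unchanged.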

Filtering based on the maximum proportion of gaps

SEGUL considers - and ? as gap characters. It removes sequences whose proportion of gaps is greater than the specified value. To filter sequences this way, use the --max-gap option with a value between 0 and 1:

segul sequence filter --dir alignments/ --max-gap 0.3
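
To make the threshold concrete, the following Python sketch applies the same rule to the three-sequence alignment from the previous section. It is an illustration under stated assumptions, not SEGUL's own code: the function names are hypothetical, and the proportion is assumed to be the number of gap characters divided by the full sequence length, gaps included.

```python
# Illustrative sketch only; not SEGUL's implementation.
# Assumption: proportion of gaps = gap characters / total sequence length.
GAP_CHARS = {"-", "?"}


def gap_proportion(seq: str) -> float:
    """Return the fraction of characters in `seq` that are gaps."""
    if not seq:
        return 0.0
    return sum(1 for ch in seq if ch in GAP_CHARS) / len(seq)


def filter_by_max_gap(alignment: dict, max_gap: float) -> dict:
    """Remove sequences whose gap proportion is greater than `max_gap`."""
    return {
        name: seq
        for name, seq in alignment.items()
        if gap_proportion(seq) <= max_gap
    }


alignment = {
    "seq_1": "agtctgatc",
    "seq_2": "agtc-----",
    "seq_3": "agtcgatct",
}

kept = filter_by_max_gap(alignment, 0.3)
print(list(kept))  # ['seq_1', 'seq_3']
```

Under these assumptions, seq_2 has 5 gaps in 9 positions, a proportion of about 0.56, which exceeds 0.3, so it is removed. The other two sequences contain no gaps and are kept.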