atactk package

Submodules

atactk.command module

Code used in command-line applications.

atactk.command.check_bins_for_overlap(bins)[source]

Make sure bins don’t overlap.

Parameters:bins (list) – A list of tuples containing (start, end, resolution).
Raises:argparse.ArgumentTypeError – If any of the bins overlap.
atactk.command.parse_bins(bins_string)[source]

Parse the string representing template size groups.

The bins are specified as a list of groups, each group comprising one or more bins, and ending with a resolution value, which controls how many individual cuts in the extended region around the feature are aggregated. Within the feature itself, we always count the cut points for each base. A complete example:

(36-149 1) (150-224 225-324 2) (325-400 5)

With a resolution of 1, every base in the extended region around motifs overlapping templates of length 36-149 would be scored independently; each base’s cut count would be added to the matrix.

The second group, for templates of length 150-224 or 225-324, with a resolution of 2, would result in every two bases in the extended region around motifs being added together. Then the aggregate scores of the two bins in the group would be summed, and the result would be added to the matrix.

The last bin group, (325-400 5), with a resolution of 5, would also produce aggregate scores in the extended region, each being the sum of five bases’ cut counts.

To illustrate, assume these settings and an imaginary motif 5 bases long, with a 10-base extended region on either side, and for the sake of example pretend that each template length bin had the same counts of cut points around the motif, shown here:

extended region     motif     extended region
------------------- --------- -------------------
0 1 2 3 3 4 4 4 4 5 9 2 0 2 7 5 4 4 4 4 3 3 2 1 0

The scores for the first bin group, (36-149 1):

extended region     motif     extended region
------------------- --------- -------------------
0 1 2 3 3 4 4 4 4 5 9 2 0 2 7 5 4 4 4 4 3 3 2 1 0

The scores for the first bin group, (150-224 225-324 2):

e.r.      motif     e.r.
--------- --------- ---------
1 5 7 8 9 9 2 0 2 7 9 8 7 5 1

The scores for the last bin group, (325-400 5):

e.r. motif     e.r.
---- --------- ----
9 21 9 2 0 2 7 21 9
Parameters:bins_string (str) – A list of S-expressions representing groups of bin start and end positions and resolutions.
Returns:A list of lists of tuples of (start, end, resolution).
Return type:list

atactk.data module

Code for reading and manipulating data commonly used in ATAC-seq pipelines.

class atactk.data.ExtendedFeature(reference=None, start=None, end=None, name=None, score=0, strand=None, extension=100)[source]

Bases: object

A feature plus a fixed extended region.

You can define the region by passing the extension parameter to the constructor, e.g.:

feature = ExtendedFeature(extension=100, **bed_record)

Most of ExtendedFeature‘s attributes map to the first six fields in a BED file. Where our names for the fields differ, the BED format name from https://genome.ucsc.edu/FAQ/FAQformat.html is included in parentheses below.

reference

str

The reference sequence on which the feature is located. (chrom)

feature_start

int

The starting position of the feature in the reference sequence, zero-based. (chromStart)

feature_end

int

The ending position of the feature in the reference sequence, which is one past the last base in the feature. (chromEnd)

name

str

The name of the feature.

score

float

A numeric score.

strand

str

Either + or -.

feature_length
region_length
atactk.data.complement(seq)[source]

Return the complement of the supplied nucleic sequence.

Nucleic of course implies that the only recognized bases are A, C, G, T and N. Case will be preserved.

Parameters:seq (str) – A nucleic sequence.
Returns:The complement of the given sequence.
Return type:str
atactk.data.count_features(filename)[source]
atactk.data.filter_aligned_segments(aligned_segments, include_flags, exclude_flags, quality)[source]

Filter aligned segments using SAM flags and mapping quality.

Parameters:
  • aligned_segments (list) – Aligned reads to filter.
  • include_flags (list) – Reads matching any include flag will be returned.
  • exclude_flags (list) – Reads matching any exclude flag will not be returned.
  • quality (int) – Only reads with at least this mapping quality will be returned.
Returns:

filtered_aligned_segments – The set of the aligned segments supplied to the function which meet the specified criteria.

Return type:

list

Examples

You probably want include_flags of [83, 99, 147, 163] and exclude_flags of [4, 8].

Flag 4 means the read is unmapped, 8 means the mate is unmapped.

Properly paired and mapped forward aligned segments have flags in [99, 163]

99:
  • 1: read paired
  • 2: read mapped in proper pair
  • 32: mate reverse strand
  • 64: first in pair
163:
  • 1: read paired
  • 2: read mapped in proper pair
  • 32: mate reverse strand
  • 128: second in pair

Properly paired and mapped reverse aligned segments have flags in [83, 147].

83:
  • 1: read paired
  • 2: read mapped in proper pair
  • 16: read reverse strand
  • 64: first in pair
147:
  • 1: read paired
  • 2: read mapped in proper pair
  • 16: read reverse strand
  • 128: second in pair
atactk.data.make_fastq_pair_reader(fastq_file1, fastq_file2)[source]

Return a generator producing pairs of records from two FASTQ files.

The intent is to produce read pairs from paired-end sequence data.

Parameters:
  • fastq_file1 (str) – The name of the first FASTQ file.
  • fastq_file2 (str) – The name of the second FASTQ file.
Yields:

tuple – A tuple containing two 4-element lists, one for each FASTQ record, representing the ID, sequence, comment, and quality lines.

atactk.data.open_alignment_file(alignment_filename)[source]
atactk.data.open_maybe_gzipped(filename)[source]

Open a possibly gzipped file.

Parameters:filename (str) – The name of the file to open.
Returns:An open file object.
Return type:file
atactk.data.read_features(filename, extension=100, feature_class=<class 'atactk.data.ExtendedFeature'>)[source]

Return a generator of ExtendedFeature instances from the named tab-separated value file.

Most BED-like files should work; we read the three required and first three optional BED fields to get coordinates, and any extra fields are ignored.

Parameters:
  • filename (str) – The (optionally gzipped) tab-separated value file from which to read features. Use ‘-‘ to read from standard input.
  • extension (int) – The number of bases to score on either side of each feature.
  • feature_class (class) – Each row of the file will be instantiated with this class.
Yields:

feature – An ExtendedFeature instance for each row of the file.

atactk.data.reverse_complement(seq)[source]

Return the reverse complement of the supplied nucleic sequence.

Parameters:seq (str) – A nucleic sequence.
Returns:The reverse complement of the given sequence.
Return type:str

See also

complement()

atactk.metrics module

Code for making quantitative observations about ATAC-seq experiments.

atactk.metrics.add_cut_points_to_region_tree(region_tree, group_key, strand, cut_points)[source]

Record cut points by position, template size, and strand.

The tree consists of nested dictionaries. The first level keys are positions in a region. The second level keys are the names of template size groups. The third level keys are the strand, and the values are the cut point counts for that position, template size group, and strand.

Parameters:
  • region_tree (dict) – The tree to which the cut point counts will be added.
  • group_key (str) – The key corresponding to the template size group.
  • strand (str) – The strand of the cut point’s aligned segment: + or -.
  • cut_points (list) – The count of cut points at each position in the region that matched the template size group and strand.

Notes

Only positive counts are recorded. It takes a lot of time and space to record so many zeroes, and it’s better to produce them on demand via collections.defaultdict. So instead, collect all the scores, and after that work is done, update a tree built with collections.defaultdict, then work with that. See the make_cut_matrix script included with atactk for an example.

atactk.metrics.aggregate_scores(scores, extension, resolution)[source]

Adjust scores in the extended region around a feature.

Given a sequence containing the score at each base in a region, the size of the extended region around the feature, and the desired resolution in that extended region, reduce the extended scores.

Parameters:
  • scores (list) – A list containing a score for each base in a region around a feature.
  • extension (int) – The number of bases at the beginning and end of the list considered the extended region.
  • resolution (int) – The desired scoring resolution in the extended region.

See also

reduce_scores()
Reduce scores by summing every resolution values.
atactk.metrics.count_cut_points(aligned_segments, start, end, cut_point_offset=4)[source]

Return any cut points in the region from the aligned segments.

Parameters:
  • aligned_segments (list) – A list of pysam.AlignedSegment.
  • start (int) – The start of the region of interest.
  • end (int) – The end of the region of interest.
Returns:

A list of counts, one for each position from start to end, of cut points in the aligned segments that fell between the start and end..

Return type:

list

atactk.metrics.find_cut_point(aligned_segment, cut_point_offset=4)[source]

Return the position of the given aligned segment’s ATAC-seq cut point.

Parameters:aligned_segment (pysam.AlignedSegment) – https://pysam.readthedocs.org/en/latest/api.html#pysam.AlignedSegment
Returns:Position of the ATAC-seq cut point.
Return type:int
atactk.metrics.reduce_scores(scores, resolution)[source]

Reduce a sequence of scores by summing every resolution values.

Called with scores of [0, 1, 1, 4, 2], you’d get the following results at various resolutions:

Resolution Result
1 [0, 1, 1, 4, 2]
2 [1, 5, 2]
3 [2, 6]
4 [6, 2]
10 [8]
atactk.metrics.score_feature(alignment_filename, bin_groups, include_flags, exclude_flags, quality, cut_point_offset, feature)[source]

Count the number of transposition events around the given feature.

Parameters:
  • alignment_filename (str) – The BAM file containing aligned reads.
  • bin_groups (iterable) – A sequence of iterables containing bins and the resolution with which they should be scored.
  • include_flags (iterable) – The SAM flags to use when selecting aligned segments to score.
  • exclude_flags (iterable) – The SAM flags to use when excluding aligned segments to score; any flag present on a read excludes it.
  • quality (int) – The minimum mapping quality a read must have to be scored.
  • feature (ExtendedFeature) – The feature to score.
Returns:

A tuple of (row, tree) where

  • row is a tab-separated list of scores in the region around the feature

  • tree is a three-level dict holding a score for each position, in each of the template size bins given, on each strand, e.g.:

    >>> tree[0]['36_149']['F']
    22
    >>> tree[0]['36_149']['R']
    15
    

Return type:

tuple

See also

add_cut_points_to_region_tree()
Where the tree for the aggregate matrix is described more fully.

atactk.util module

Utility code used in atactk.

atactk.util.add_lists(l1, l2)[source]

Adds the values of two lists, entrywise.

>>> add_lists([0, 1, 2], [3, 4, 5])
[3, 5, 7]
Parameters:
  • l1 (list) – The first list.
  • l2 (list) – The second list.
Returns:

sum – The list of the entrywise sums of the two lists’ elements.

Return type:

list

atactk.util.partition(count, seq)[source]

Create a generator of lists of count elements from seq.

>>> list(partition(3, range(1, 10)))
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

If seq isn’t a multiple of count, the last list will contain the remaining items.

>>> list(partition(3, range(1, 9)))
[[1, 2, 3], [4, 5, 6], [7, 8]]
Parameters:
  • count (int) – The number of elements of seq to put in each partition.
  • seq (iterator-or-iterable) – An iterator or iterable to be partitioned.
Yields:

list – A list representing a partition of count elements.

atactk.util.take(count, seq)[source]

Return a list of up to count elements from the iterable seq.

Parameters:
  • count (int) – The number of elements to take from seq.
  • seq (iterator-or-iterable) – An iterator or iterable from which to take elements.
Returns:

A list of up to count elements. There may be fewer if seq has been exhausted.

Return type:

list

Module contents