atactk package¶

Submodules¶

atactk.command module¶

Code used in command-line applications.

atactk.command.check_bins_for_overlap(bins)[source]¶

Make sure bins don’t overlap.

Parameters:	bins (list) – A list of tuples containing (start, end, resolution).
Raises:	`argparse.ArgumentTypeError` – If any of the bins overlap.

atactk.command.parse_bins(bins_string)[source]¶

Parse the string representing template size groups.

The bins are specified as a list of groups, each group comprising one or more bins, and ending with a resolution value, which controls how many individual cuts in the extended region around the feature are aggregated. Within the feature itself, we always count the cut points for each base. A complete example:

(36-149 1) (150-224 225-324 2) (325-400 5)

With a resolution of 1, every base in the extended region around motifs overlapping templates of length 36-149 would be scored independently; each base’s cut count would be added to the matrix.

The second group, for templates of length 150-224 or 225-324, with a resolution of 2, would result in every two bases in the extended region around motifs being added together. Then the aggregate scores of the two bins in the group would be summed, and the result would be added to the matrix.

The last bin group, (325-400 5), with a resolution of 5, would also produce aggregate scores in the extended region, each being the sum of five bases’ cut counts.

To illustrate, assume these settings and an imaginary motif 5 bases long, with a 10-base extended region on either side, and for the sake of example pretend that each template length bin had the same counts of cut points around the motif, shown here:

extended region     motif     extended region
------------------- --------- -------------------
0 1 2 3 3 4 4 4 4 5 9 2 0 2 7 5 4 4 4 4 3 3 2 1 0

The scores for the first bin group, (36-149 1):

extended region     motif     extended region
------------------- --------- -------------------
0 1 2 3 3 4 4 4 4 5 9 2 0 2 7 5 4 4 4 4 3 3 2 1 0

The scores for the first bin group, (150-224 225-324 2):

e.r.      motif     e.r.
--------- --------- ---------
1 5 7 8 9 9 2 0 2 7 9 8 7 5 1

The scores for the last bin group, (325-400 5):

e.r. motif     e.r.
---- --------- ----
9 21 9 2 0 2 7 21 9

Parameters:	bins_string (str) – A list of S-expressions representing groups of bin start and end positions and resolutions.
Returns:	A list of lists of tuples of (start, end, resolution).
Return type:	list

atactk.data module¶

Code for reading and manipulating data commonly used in ATAC-seq pipelines.

class atactk.data.ExtendedFeature(reference=None, start=None, end=None, name=None, score=0, strand=None, extension=100)[source]¶

Bases: object

A feature plus a fixed extended region.

You can define the region by passing the extension parameter to the constructor, e.g.:

feature = ExtendedFeature(extension=100, **bed_record)

Most of ExtendedFeature‘s attributes map to the first six fields in a BED file. Where our names for the fields differ, the BED format name from https://genome.ucsc.edu/FAQ/FAQformat.html is included in parentheses below.

reference¶

str

The reference sequence on which the feature is located. (chrom)

feature_start¶

int

The starting position of the feature in the reference sequence, zero-based. (chromStart)

feature_end¶

int

The ending position of the feature in the reference sequence, which is one past the last base in the feature. (chromEnd)

name¶

str

The name of the feature.

score¶

float

A numeric score.

strand¶

str

Either + or -.

feature_length¶

region_length¶

atactk.data.complement(seq)[source]¶

Return the complement of the supplied nucleic sequence.

Nucleic of course implies that the only recognized bases are A, C, G, T and N. Case will be preserved.

Parameters:	seq (str) – A nucleic sequence.
Returns:	The complement of the given sequence.
Return type:	str

atactk.data.count_features(filename)[source]¶

atactk.data.filter_aligned_segments(aligned_segments, include_flags, exclude_flags, quality)[source]¶

Filter aligned segments using SAM flags and mapping quality.

Parameters:	aligned_segments (list) – Aligned reads to filter. include_flags (list) – Reads matching any include flag will be returned. exclude_flags (list) – Reads matching any exclude flag will not be returned. quality (int) – Only reads with at least this mapping quality will be returned.
Returns:	filtered_aligned_segments – The set of the aligned segments supplied to the function which meet the specified criteria.
Return type:	list

Examples

You probably want include_flags of [83, 99, 147, 163] and exclude_flags of [4, 8].

Flag 4 means the read is unmapped, 8 means the mate is unmapped.

Properly paired and mapped forward aligned segments have flags in [99, 163]

99:

1: read paired
2: read mapped in proper pair
32: mate reverse strand
64: first in pair

163:

1: read paired
2: read mapped in proper pair
32: mate reverse strand
128: second in pair

Properly paired and mapped reverse aligned segments have flags in [83, 147].

83:

1: read paired
2: read mapped in proper pair
16: read reverse strand
64: first in pair

147:

1: read paired
2: read mapped in proper pair
16: read reverse strand
128: second in pair

atactk.data.make_fastq_pair_reader(fastq_file1, fastq_file2)[source]¶

Return a generator producing pairs of records from two FASTQ files.

The intent is to produce read pairs from paired-end sequence data.

Parameters:	fastq_file1 (str) – The name of the first FASTQ file. fastq_file2 (str) – The name of the second FASTQ file.
Yields:	tuple – A tuple containing two 4-element lists, one for each FASTQ record, representing the ID, sequence, comment, and quality lines.

atactk.data.open_alignment_file(alignment_filename)[source]¶

atactk.data.open_maybe_gzipped(filename)[source]¶

Open a possibly gzipped file.

Parameters:	filename (str) – The name of the file to open.
Returns:	An open file object.
Return type:	file

atactk.data.read_features(filename, extension=100, feature_class=<class 'atactk.data.ExtendedFeature'>)[source]¶

Return a generator of ExtendedFeature instances from the named tab-separated value file.

Most BED-like files should work; we read the three required and first three optional BED fields to get coordinates, and any extra fields are ignored.

Parameters:	filename (str) – The (optionally gzipped) tab-separated value file from which to read features. Use ‘-‘ to read from standard input. extension (int) – The number of bases to score on either side of each feature. feature_class (class) – Each row of the file will be instantiated with this class.
Yields:	feature – An `ExtendedFeature` instance for each row of the file.

atactk.data.reverse_complement(seq)[source]¶

Return the reverse complement of the supplied nucleic sequence.

Parameters:	seq (str) – A nucleic sequence.
Returns:	The reverse complement of the given sequence.
Return type:	str

atactk.metrics module¶

Code for making quantitative observations about ATAC-seq experiments.

atactk.metrics.add_cut_points_to_region_tree(region_tree, group_key, strand, cut_points)[source]¶

Record cut points by position, template size, and strand.

The tree consists of nested dictionaries. The first level keys are positions in a region. The second level keys are the names of template size groups. The third level keys are the strand, and the values are the cut point counts for that position, template size group, and strand.

Parameters:	region_tree (dict) – The tree to which the cut point counts will be added. group_key (str) – The key corresponding to the template size group. strand (str) – The strand of the cut point’s aligned segment: `+` or `-`. cut_points (list) – The count of cut points at each position in the region that matched the template size group and strand.

Notes

Only positive counts are recorded. It takes a lot of time and space to record so many zeroes, and it’s better to produce them on demand via collections.defaultdict. So instead, collect all the scores, and after that work is done, update a tree built with collections.defaultdict, then work with that. See the make_cut_matrix script included with atactk for an example.

atactk.metrics.aggregate_scores(scores, extension, resolution)[source]¶

Adjust scores in the extended region around a feature.

Given a sequence containing the score at each base in a region, the size of the extended region around the feature, and the desired resolution in that extended region, reduce the extended scores.

Parameters:	scores (list) – A list containing a score for each base in a region around a feature. extension (int) – The number of bases at the beginning and end of the list considered the extended region. resolution (int) – The desired scoring resolution in the extended region.

atactk.util module¶

Utility code used in atactk.

atactk.util.add_lists(l1, l2)[source]¶

Adds the values of two lists, entrywise.

>>> add_lists([0, 1, 2], [3, 4, 5])
[3, 5, 7]

Parameters:	l1 (list) – The first list. l2 (list) – The second list.
Returns:	sum – The list of the entrywise sums of the two lists’ elements.
Return type:	list

atactk.util.partition(count, seq)[source]¶

Create a generator of lists of count elements from seq.

>>> list(partition(3, range(1, 10)))
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

If seq isn’t a multiple of count, the last list will contain the remaining items.

>>> list(partition(3, range(1, 9)))
[[1, 2, 3], [4, 5, 6], [7, 8]]

Parameters:	count (int) – The number of elements of seq to put in each partition. seq (iterator-or-iterable) – An iterator or iterable to be partitioned.
Yields:	list – A list representing a partition of count elements.

atactk.util.take(count, seq)[source]¶

Return a list of up to count elements from the iterable seq.

Parameters:	count (int) – The number of elements to take from seq. seq (iterator-or-iterable) – An iterator or iterable from which to take elements.
Returns:	A list of up to count elements. There may be fewer if seq has been exhausted.
Return type:	list

Resolution	Result
1	[0, 1, 1, 4, 2]
2	[1, 5, 2]
3	[2, 6]
4	[6, 2]
10	[8]

atactk package¶

Submodules¶

atactk.command module¶

atactk.data module¶

atactk.metrics module¶

atactk.util module¶

Module contents¶