In [1]:
import khmer

Exploring graphs and graph structure

Graphs can be a tremendously powerful way to look at sequencing data, but unlike k-mers, operating on graphs can be very computationally expensive. This is largely because of combinatorics: for example, forks and loops in the graph make simple traversal expensive. Some of khmer’s functionality is aimed at simplifying, or at least making possible, some basic traversal approaches.

Some basic traversal functions.

Let’s start with some initial code that builds us a linear path in a graph:

In [2]:
>>> K = 21
>>> ng = khmer.Countgraph(K, 1e6, 4)
>>> seq = 'CCCTGTTAGCTACGTCCGTCTAAGGATATTAACATAGTTGCGACTGCGTCCTGTGCTCA'
>>> ng.consume(seq)
Out[2]:
39

In general, we use reads or sequences as an index into the De Bruijn graph, and most of the graph functions take either a k-mer or a sequence as input.

For example, if you wanted to get the degree (number of neighbors) for each k-mer in that sequence, you could ask:

In [3]:
>>> for i in range(len(seq) - K + 1):
...    kmer = seq[i:i + K]
...    print(ng.kmer_degree(kmer), end=' ')
1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1

This tells you that the sequence is linear and disconnected - no node has more than two neighbors, and the end nodes have one neighbor.

Question:

How would you introduce a higher-degree node into this graph?

You can also ask for the identity of the neighbors for each k-mer with the function ‘ng.neighbors(kmer)’.

We don’t actually do too much exploration of graphs at the Python level, because, well, for large data sets it’s slow.

Instead, we provide a few slightly higher level functions, such as calc_connected_graph_size, which tells you how many k-mers connect to the seed kmer in the graph:

In [4]:
>>> ng.calc_connected_graph_size(seq[:K])
Out[4]:
39

Keep in mind that for really large graphs, this may take a really, really long time to run.

Question:

How would you increase the size of the connected graph?

Building waypoints in De Bruijn graphs

Sometimes we have to deal with extremely large graphs.

khmer provides a technique called ‘tagging’, in which a subset of k-mers in the graph are chosen as representatives of the whole graph - and we choose these representatives so that every k-mer in the entire graph is within distance D of at least one representative.

(khmer does this by using the reads as a guide into the graph;khmer makes sure that each read has a tag every D k-mers, and, since the graph is entirely constructed from reads, then the graph ends up being tagged at distance D as well.)

Tagging has a few nice features:

  • there are many fewer tags than there are k-mers (2D fold fewer, approximately.)
  • the structure and connectivity of the tags serves as a good proxy for the connectivity of the overall graph; in this way they serve as a sort of sparse graph.
  • tags are somewhat refractory to coverage: highly covered regions will get many fewer tags than the number of reads from that region.
  • tags provide a good entry point into the graph for traversal; in essense, they provide waypoints for other work.
  • tags could, in theory, be used as an index into the reads. (@CTB expand.)

Graph theory friends tell me that this is actually a D-dominating set, and we’ll revisit this concept later on (c.f. spacegraphcats).

Right now we don’t do a very efficient job of creating the tags. I have a few ideas for improving this using streaming approaches.

Our tag data structure implementation is pretty bad - we use an STL set to store the tags...

To make use of tags, the tags need to be built first. Here’s an example.

First, construct a Nodegraph:

In [5]:
>>> K = 21
>>> ng = khmer.Nodegraph(K, 1e6, 4)

Create a sequence longer than K:

In [6]:
>>> seq = 'CCCTGTTAGCTACGTCCGTCTAAGGATATTAACATAGTTGCGACTGCGTCCTGTGCTCA'

Now, instead of adding it to ng with consume, use consume_and_tag:

In [7]:
>>> ng.consume_and_tag(seq)
Out[7]:
39

You can now look at the tags with get_tags_and_positions:

In [8]:
>>> ng.get_tags_and_positions(seq)
Out[8]:
[(19, 156582046028), (39, 1984875433400)]

Here, the first element of each tuple is the position of the tag within the sequence, and the second element is the actual tag – you can convert those back into DNA with reverse_hash:

In [9]:
>>> x = ng.get_tags_and_positions(seq)
>>> ng.reverse_hash(x[0][1])
Out[9]:
'AACTATGTTAATATCCTTAGA'

Points to make:

  • you can use any query sequence you want, not just the one you used to create the tags ;).
  • the tagging distance D is 40 by default, and can be adjusted with _set_tag_density.

Partitioning

The motivation behind tags came from a desire to partition data sets: back when, our intution was that metagenome data sets should split up into disconnected components based on their origin species. (You can read more in Pell et al., 2012.)

This is challenging because you essentially want an all-by-all connectivity map for the graph - you want to know, does read A connect (transitively) to read Z, by any path, and you want to know this for all the reads in the data set. This is hard!

We initially developed tagging for the purpose of partitioning, and we do the following:

  • tag the graph systematically (as explained above);
  • explore to a distance D from each tag, and identify all neighboring tags within distance D;
  • assign partition IDs to each set of connected tags;
  • split the reads up based on their partition IDs.

This does work, mostly, and it turns out to split the graph up into bins that group reads by species. (See Howe et al., 2014.)

For an example of partitioning, let’s construct an artifical sequence.

In [10]:
>>> longseq = ("AGGAGGAGCATTTGTGCGATGCCCTATGGGGAGACCTATCTGCCGGGGAAATGCGCACA",
...    "TAACATAATCTAATCTACCACATTATGAACCCCCAGTGGGCACGTGTTCATTGCGTACGATCGCATTC",
...    "TACTTGATTCCCGCAGTGGTACGACGCTATGTA")
>>> longseq = "".join(longseq)

If we break this 160-bp sequence up into three bits,

In [11]:
>>> begin = longseq[:70]
>>> middle = longseq[40:120]
>>> end = longseq[90:]

we can see that begin and end shouldn’t connect without middle. Let’s see if partitioning agrees!

First, let’s build a graph and then add the beginning and end bits:

In [12]:
>>> K = 21
>>> ng = khmer.Nodegraph(K, 1e6, 4)
>>> ng.consume_and_tag(begin)
>>> ng.consume_and_tag(end)
Out[12]:
50

Now, run the partitioning code, get back a partition object, and ask for a partition count:

In [13]:
>>> p = ng.do_subset_partition()
>>> n_partitions, _ = p.count_partitions()
>>> print(n_partitions)
2

Yep - 2 partitions!

Now let’s add the middle bit, which will connect the two isolated partitions:

In [14]:
>>> ng.consume_and_tag(middle)
Out[14]:
40

If we redo the partitioning, we should now get 1 partition:

In [15]:
>>> p = ng.do_subset_partition()
>>> n_partitions, _ = p.count_partitions()
>>> print(n_partitions)
1

...and we do!

Partitioning hasn’t, in the end, proven to be that useful in practice (also see blog post). The downsides of the approach are:

  • it turns out that any source of systematic bias in sequencing will result in a completely connected graph (Howe et al., 2012; such biases are common.
  • it also turns out that most metagenome graphs are, in reality, completely connected.

However, tags are still potentially useful for other things than straight up partitioning.

Camille Scott extended tagging with labels - the idea is that each tag can be labelled with 1 or more arbitrary identifiers. This lets you do things like build colored De Bruijn graphs, equiv. do pangenome analysis.

References:

http://ivory.idyll.org/blog/2015-wok-labelhash.html

https://github.com/dib-lab/2015-khmer-wok4-multimap/blob/master/do-counting.py

For an example, let’s take the sequence we used in partitioning, and add labels – here, ‘1’ and ‘2’.

In [16]:
>>> K = 21
>>> ng = khmer.Nodegraph(K, 1e6, 4)
>>> lh = khmer._GraphLabels(ng)
>>> lh.consume_sequence_and_tag_with_labels(begin, 1)
>>> lh.consume_sequence_and_tag_with_labels(end, 2)
Out[16]:
50

We can now ask for labels by overlap with longseq:

In [17]:
>>> tags = ng.get_tags_and_positions(longseq)
>>> for (_, tag) in tags:
...    print(lh.get_tag_labels(tag))
[1]
[1]
[2]
[2]

This lets us add arbitrary information to the graph by indexing using the tags or the labels.

One particular use case for this is comparing several different data sets by looking at how the graphs overlap, or do not. This is explored a bit in a blog post.

We’ll show another prototype use case belong, for labeling longer paths for use in assembly.

De Bruijn graphs themselves are often quite large, because they have as many nodes in them as there are unique k-mers in the data set. Many of these k-mers may be collapsable into linear paths. The result of doing this graph contraction is called a “compact De Bruijn graph”, and we have some functionality for building them in khmer.

This functionality consists of two functions on graphs: find_high_degree_nodes and traverse_and_mark_linear_paths.

There’s a script sandbox/extract-compact-dbg.py that you can use with genomes or reads. For this, you’ll need a copy of the script from khmer’s sandbox directory –

In [18]:
>>> !curl -L -O https://github.com/dib-lab/khmer/raw/zaok/sandbox/extract-compact-dbg.py
>>> !curl -L -O https://github.com/dib-lab/khmer/raw/zaok/sandbox/graph_writer.py

>>> !chmod +x extract-compact-dbg.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   149  100   149    0     0    155      0 --:--:-- --:--:-- --:--:--   155
100  5697  100  5697    0     0   3358      0  0:00:01  0:00:01 --:--:-- 11918
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   142  100   142    0     0    194      0 --:--:-- --:--:-- --:--:--   194
100  2546  100  2546    0     0   2663      0 --:--:-- --:--:-- --:--:--  2663

This script produces GML files that can be viewed with Gephi.

In [19]:
>>> !./extract-compact-dbg.py -x 1e6 data/genome.fa.gz -o genome.gml
building graphs and loading files
finding high degree nodes
traversing linear segments from 4 nodes
... 0 of 4
13 segments, containing 10094 nodes
saving to genome.gml
In [20]:
>>> !./extract-compact-dbg.py -x 1e8 data/reads.fa.gz -o reads.gml
building graphs and loading files
... data/reads.fa.gz 10000
... data/reads.fa.gz 20000
finding high degree nodes
...2 data/reads.fa.gz 10000
...2 data/reads.fa.gz 20000
traversing linear segments from 16554 nodes
... 0 of 16554
... 10000 of 16554
29382 segments, containing 737769 nodes
saving to reads.gml
In [21]:
>>> !./extract-compact-dbg.py -x 1e6 data/mixed-species.fa.gz -o mixed.gml
building graphs and loading files
finding high degree nodes
traversing linear segments from 3 nodes
... 0 of 3
11 segments, containing 10033 nodes
saving to mixed.gml

Visualizing compact De Bruijn graphs with Gephi

(Gephi demo here)

Assembling linear sequences

For a long time, we avoided anything that smacked of actual assembly. But, well, inspired by the compact DBG stuff, we recently added some assembly functionality – you’ve already seen one of the functions:

In [22]:
>>> input_seq = "TGAGCACAGGACGCAGTCGCAACTATGTTAATATCCTTAGACGGACGTAGCTAACAGGG"
>>> ng = khmer.Nodegraph(21, 1e6, 4)
>>> ng.consume(input_seq)

>>> ng.assemble_linear_path('AGACGGACGTAGCTAACAGGG')
Out[22]:
'TGAGCACAGGACGCAGTCGCAACTATGTTAATATCCTTAGACGGACGTAGCTAACAGGG'

but this only works on linear segments of graphs – for example, if we add a neighbor in the middle, it confounds the graph and you get a short assembly:

In [23]:
>>> ng.consume('G' + input_seq[30:30+K - 1])
>>> ng.assemble_linear_path('AGACGGACGTAGCTAACAGGG')
Out[23]:
'ATATCCTTAGACGGACGTAGCTAACAGGG'

There are several ways to handle this, but they all require simplifying the graph (by trimming errors, or finding short branches, or what have you). However you decide to simplify the graph, you can delete nodes by creating a “stopgraph” and passing that in - this tells the assembly function to ignore that node completely.

In [24]:
>>> stopgraph = khmer.Nodegraph(21, 1e6, 4)
>>> stopgraph.consume('G' + input_seq[30:30+K - 1])
>>> ng.assemble_linear_path('AGACGGACGTAGCTAACAGGG', stopgraph)
Out[24]:
'TGAGCACAGGACGCAGTCGCAACTATGTTAATATCCTTAGACGGACGTAGCTAACAGGG'

Assembling across high-degree nodes with labeled paths

You can also use labels to identify situations where you have evidence for crossing a branch point. For example,

In [25]:
>>> input_seq = "TGAGCACAGGACGCAGTCGCAACTATGTTAATATCCTTAGACGGACGTAGCTAACAGGG"
>>> ng = khmer.Nodegraph(21, 1e6, 4)
>>> ng.consume(input_seq)
>>> ng.consume('G' + input_seq[30:30+K - 1])

Out[25]:
1

First, find high degree nodes:

In [26]:
>>> hdn = ng.find_high_degree_nodes(input_seq)

Then, use labels to build paths across high degree nodes:

In [27]:
>>> lh = khmer._GraphLabels(ng)
>>> lh.label_across_high_degree_nodes(input_seq, hdn, 1)

And then assemble using the labels:

In [28]:
>>> lh.assemble_labeled_path('AGACGGACGTAGCTAACAGGG')
Out[28]:
['TGAGCACAGGACGCAGTCGCAACTATGTTAATATCCTTAGACGGACGTAGCTAACAGGG']

Next: Summarizing the basics