Publications

Below you can find my publications (including preprints), with corresponding links in the grey boxes. Toggle the abstract (resp. bibtex) button to see the abstract (resp. bibtex information).
At the bottom of the page my MSc and BSc theses are listed.

Clicke here for a list of all co-authors.

You can also browse my Google Scholar profile.

2026

[11]

      
        preprint

Limits of Kernelization and Parametrization for Phylogenetic Diversity with Dependencies
Niels Holtgrefe, Jannik Schestag, Norbert Zeh

arXiv:2602.12959, 2026. arXiv

In the Maximize Phylogenetic Diversity problem, we are given a phylogenetic tree that represents the genetic proximity of species, and we are asked to select a subset of species of maximum phylogenetic diversity to be preserved through conservation efforts, subject to budgetary constraints that allow only \( k \) species to be saved. This neglects that it is futile to preserve a predatory species if we do not also preserve at least a subset of the prey it feeds on. Thus, in the Optimizing PD with Dependencies (\( \varepsilon \)-PDD) problem, we are additionally given a food web that represents the predator–prey relationships between species. The goal is to save a set of \( k \) species of maximum phylogenetic diversity such that for every saved species, at least one of its prey is also saved. This problem is NP-hard even when the phylogenetic tree is a star.
The \( \alpha \)-PDD problem alters \( \varepsilon \)-PDD by requiring that at least some fraction \( \alpha \) of the prey of every saved species are also saved. In this paper, we study the parameterized complexity of \( \alpha \)-PDD. We prove that the problem is W[1]-hard and in XP when parameterized by the solution size \( k \), the diversity threshold \( D \), or their complements. When parameterized by the vertex cover number of the food web, \( \alpha \)-PDD is fixed-parameter tractable (FPT). A key measure of the computational difficulty of a problem that is FPT is the size of the smallest kernel that can be obtained. We prove that, when parameterized by the distance to clique, 1-PDD admits a linear kernel. Our main contribution is to prove that \( \alpha \)-PDD does not admit a polynomial kernel when parameterized by the vertex cover number plus the diversity threshold \( D \), even if the phylogenetic tree is a star. This implies the non-existence of a polynomial kernel for \( \alpha \)-PDD also when parameterized by a range of structural parameters of the food web, such as its distance to cluster, treewidth, pathwidth, feedback vertex set, and others.

@misc{holtgrefe2026limits,
  author = {Holtgrefe, Niels and Schestag, Jannik and Zeh, Norbert},
  title = {Limits of Kernelization and Parametrization for Phylogenetic Diversity with Dependencies},
  year={2026},
  archivePrefix={arXiv},
  eprint={2602.12959}
}

[10]

      
        journal article

Characterizing semi-directed phylogenetic networks and their multi-rootable variants
Niels Holtgrefe, Katharina T. Huber, Leo van Iersel, Mark Jones, Vincent Moulton

Theory in Biosciences, 145(1):4, 2026. journal arXiv

In evolutionary biology, phylogenetic networks are graphs that provide a flexible framework for representing complex evolutionary histories that involve reticulate evolutionary events. Recently phylogenetic studies have started to focus on a special class of such networks called semi-directed networks. These graphs are defined as mixed graphs that can be obtained by de-orienting some of the arcs in some rooted phylogenetic network, that is, a directed acyclic graph whose leaves correspond to a collection of species and that has a single source or root vertex. However, this definition of semi-directed networks is implicit in nature since it is not clear when a mixed-graph enjoys this property or not. In this paper, we introduce novel, explicit mathematical characterizations of semi-directed networks, and also multi-semi-directed networks, that is, mixed graphs that can be obtained from directed phylogenetic networks that may have more than one root. In addition, through extending foundational tools from the theory of rooted networks into the semi-directed setting---such as cherry picking sequences, omnians, and path partitions---we characterize when a (multi-)semi-directed network can be obtained by de-orienting some rooted network that is contained in one of the well-known classes of tree-child, orchard, tree-based or forest-based networks. These results address structural aspects of (multi-)semi-directed networks and pave the way to improved theoretical and computational analyses of such networks, for example, within the development of algebraic evolutionary models that are based on such networks.

@article{holtgrefe2025charachterizing,
  author = {Holtgrefe, Niels and Huber, Katharina T and van Iersel, Leo and Jones, Mark and Moulton, Vincent},
  title = {Characterizing semi-directed phylogenetic networks and their multi-rootable variants},
  journal={Theory in Biosciences},
  doi={10.1007/s12064-025-00453-8},
  volume = {145},
  pages = {4},
  number = {1},
  year = {2026}
}

2025

[9]

      
        preprint

Bounds on the sequence length sufficient to reconstruct level-1 phylogenetic networks
Martin Frohn, Niels Holtgrefe, Leo van Iersel, Mark Jones, Steven Kelk

arXiv:2511.22736, 2025. arXiv

Phylogenetic trees and networks are graphs used to model evolutionary relationships, with trees representing strictly branching histories and networks allowing for events in which lineages merge, called reticulation events. While the question of data sufficiency has been studied extensively in the context of trees, it remains largely unexplored for networks. In this work we take a first step in this direction by establishing bounds on the amount of genomic data required to reconstruct binary level-1 semi-directed phylogenetic networks, which are binary networks in which reticulation events are indicated by directed edges, all other edges are undirected, and cycles are vertex-disjoint. For this class, methods have been developed recently that are statistically consistent. Roughly speaking, such methods are guaranteed to reconstruct the correct network assuming infinitely long genomic sequences. Here we consider the question whether networks from this class can be uniquely and correctly reconstructed from finite sequences. Specifically, we present an inference algorithm that takes as input genetic sequence data, and demonstrate that the sequence length sufficient to reconstruct the correct network with high probability, under the Cavender-Farris-Neyman model of evolution, scales logarithmically, polynomially, or polylogarithmically with the number of taxa, depending on the parameter regime. As part of our contribution, we also present novel inference rules for quartet data in the semi-directed phylogenetic network setting.

@misc{frohn2025bounds,
  author = {Frohn, Martin and Holtgrefe, Niels and van Iersel, Leo and Jones, Mark and Kelk, Steven},
  title = {Bounds on the sequence length sufficient to reconstruct level-1 phylogenetic networks},
  year={2025},
  archivePrefix={arXiv},
  eprint={2511.22736}
}

[8]

      
        preprint

PaNDA: Efficient Optimization of Phylogenetic Diversity in Networks
Niels Holtgrefe, Leo van Iersel, Ruben Meuwese, Yukihiro Murakami, Jannik Schestag

bioRxiv:10.1101/2025.11.14.688467, 2025. bioRxiv software

Phylogenetic diversity plays an important role in biodiversity, conservation, and evolutionary studies by measuring the diversity of a set of taxa based on their phylogenetic relationships. In phylogenetic trees, a subset of \( k \) taxa with maximum phylogenetic diversity can be found by a simple and efficient greedy algorithm. However, this algorithmic tractability is lost when considering phylogenetic networks, which incorporate reticulate evolutionary events such as hybridization and horizontal gene transfer. To address this challenge, we introduce PaNDA (Phylogenetic Network Diversity Algorithms), the first software package and interactive graphical user-interface for exploring, visualizing and maximizing diversity in phylogenetic networks. PaNDA includes a novel algorithm to find a subset of \( k \) taxa with maximum diversity, running in polynomial time for networks of bounded scanwidth, a measure of tree-likeness of a network that grows slower than the well-known level measure. This algorithm considers the variant of phylogenetic diversity on networks in which the branch lengths of all paths from the root to the selected taxa contribute towards their diversity. We demonstrate the scalability of this algorithm on simulated networks, successfully analyzing level-15 networks with up to 200 taxa in seconds. We also provide a proof-of-concept analysis using a phylogenetic network on Xiphophorus species, illustrating how the tool can support diversity studies based on real genomic data. The software is easily installable and freely available at https://github.com/nholtgrefe/panda. Additionally, we extend the definition of phylogene- tic diversity to semi-directed phylogenetic networks, which are mixed graphs increasingly used in phylogenetic analysis to model uncertainty of the root location. We prove that finding a subset of k taxa with maximum diversity remains NP-hard on semi-directed networks, but do present a polynomial-time algorithm for networks with bounded level.

@misc{holtgrefe2025panda,
  author = {Holtgrefe, Niels and van Iersel, Leo and Meuwese, Ruben and Murakami, Yukihiro and Schestag, Jannik},
  title = {PaNDA: Efficient Optimization of Phylogenetic Diversity in Networks},
  year={2025},
  archivePrefix={bioRxiv},
  eprint={10.1101/2025.11.14.688467}
}

[7]

      
        journal article

Distinguishing Phylogenetic Level-2 Networks with Quartets and Inter-Taxon Quartet Distances
Niels Holtgrefe, Elizabeth S. Allman, Hector Baños, Leo van Iersel, Vincent Moulton, John A. Rhodes, Kristina Wicke

Bulletin of Mathematical Biology, 87(12):168, 2025. journal arXiv

The inference of phylogenetic networks, which model complex evolutionary processes including hybridization and gene flow, remains a central challenge in evolutionary biology. Until now, statistically consistent inference methods have been limited to phylogenetic level-1 networks, which allow no interdependence between reticulate events. In this work, we establish the theoretical foundations for a statistically consistent inference method for a much broader class: semi-directed level-2 networks that are outer-labeled planar and galled. We precisely characterize the features of these networks that are distinguishable from the topologies of their displayed quartet trees. Moreover, we prove that an inter-taxon distance derived from these quartets is circular decomposable, enabling future robust inference of these networks from quartet data, such as concordance factors obtained from gene tree distributions under the Network Multispecies Coalescent model. Our results also have novel identifiability implications across different data types and evolutionary models, applying to any setting in which displayed quartets can be distinguished.

@article{holtgrefe2025distinguishing,
  author = {Holtgrefe, Niels and Allman, Elizabeth S and Baños, Hector and van Iersel, Leo and Moulton, Vincent and Rhodes, John A and Wicke, Kristina},
  title = {Distinguishing Phylogenetic Level-2 Networks with Quartets and Inter-Taxon Quartet Distances},
  journal={Bulletin of Mathematical Biology},
  doi={10.1007/s11538-025-01549-4},
  volume = {87},
  pages = {168},
  number = {12},
  year = {2025}
}

[6]

      
        journal article

Algebraic Invariants for Inferring 4-leaf Semi-Directed Phylogenetic Networks
Samuel Martin, Niels Holtgrefe, Vincent Moulton, Richard M. Legget

Systematic Biology:syaf071, 2025. journal bioRxiv software

A core goal of phylogenomics is to determine the evolutionary history of a set of species from biological sequence data. Phylogenetic networks are able to describe more complex evolutionary phenomena than phylogenetic trees but are more difficult to accurately reconstruct. Recently, there has been growing interest in developing methods to infer semi-directed phylogenetic networks. As computing such networks can be computationally intensive, one approach to building such networks is to puzzle together smaller networks. Thus, it is essential to have robust methods for inferring semi-directed phylogenetic networks on small numbers of taxa. In this paper, we investigate an algebraic method for performing phylogenetic network inference from nucleotide sequence data on 4-leaf semi-directed phylogenetic networks by analysing the distribution of leaf-pattern probabilities. On simulated data, we found that we can correctly identify with high accuracy the undirected phylogenetic network for sequences of length at least 10kbp. We found that identifying the semi-directed network is more challenging and requires sequences of length approaching 10Mbp. We are also able to use our approach to identify tree-like evolution and determine the underlying tree. Finally, we employ our method on a real dataset from Xiphophorus species and use the results to build a phylogenetic network.

@article{martin2025algebraic,
  title={Algebraic Invariants for Inferring 4-leaf Semi-Directed Phylogenetic Networks},
  author={Martin, Samuel and Holtgrefe, Niels and Moulton, Vincent and Legget, Richard M},
  journal={Systematic Biology},
  doi={10.1093/sysbio/syaf071},
  pages = {syaf071},
  year = {2025}
}

[5]

      
        preprint

Identifiability of Phylogenetic Level-2 Networks under the Jukes-Cantor Model
Aviva K. Englander, Martin Frohn, Elizabeth Gross, Niels Holtgrefe, Leo van Iersel, Mark Jones, Seth Sullivant

bioRxiv:10.1101/2025.04.18.649493, 2025. bioRxiv

We consider the fundamental question of which evolutionary histories can potentially be reconstructed from sufficiently long DNA sequences, by studying the identifiability of phylogenetic networks from data generated under Markov models of DNA evolution. This topic has previously been studied for phylogenetic trees and for phylogenetic networks that are level-1, which means that reticulate evolutionary events were restricted to be independent in the sense that the corresponding cycles in the network are non-overlapping.
In this paper, we study the identifiability of phylogenetic networks from DNA sequence data under Markov models of DNA evolution for more general classes of networks that may contain pairs of tangled reticulations. Our main result is generic identifiability, under the Jukes-Cantor model, of binary semi-directed level-2 phylogenetic networks that satisfy two additional conditions called triangle-free and strongly tree-child.
We also consider level-1 networks and show stronger identifiability results for this class than what was known previously. In particular, we show that the number of reticulations in a level-1 network is identifiable under the Jukes-Cantor model.
Moreover, we prove general identifiability results that do not restrict the network level at all and hold for the Jukes-Cantor as well as for the Kimura-2-Parameter model. We show that any two binary semi-directed phylogenetic networks are distinguishable if they do not display exactly the same 4-leaf subtrees, called quartets. This has direct consequences regarding the blobs of a network, which are its reticulated components. We show that the tree-of-blobs of a network, the global branching structure of the network, is always identifiable, as well as the circular ordering of the subnetworks around each blob, for networks in which edges do not cross and taxa are on the outside.

@misc{englander2025identifiability,
  author = {Englander, Aviva K and Frohn, Martin and Gross, Elizabeth and Holtgrefe, Niels and van Iersel, Leo and Jones, Mark and Sullivant, Seth},
  title = {Identifiability of Phylogenetic Level-2 Networks under the Jukes-Cantor Model},
  year={2025},
  archivePrefix={bioRxiv},
  eprint={10.1101/2025.04.18.649493}
}

[4]

      
        journal article

Reconstructing semi-directed level-1 networks using few quarnets
Martin Frohn, Niels Holtgrefe, Leo van Iersel, Mark Jones, Steven Kelk

Journal of Computer and System Sciences, 152:103655, 2025. journal arXiv

Semi-directed networks are partially directed graphs that model evolution where the directed edges represent reticulate evolutionary events. We present an algorithm that reconstructs binary \(n\)-leaf semi-directed level-1 networks in \( O( n^2) \) time from its quarnets (4-leaf subnetworks). Our method assumes we have direct access to all quarnets, yet uses only an asymptotically optimal number of \( O(n \log n) \) quarnets. When the network is assumed to contain no triangles, our method instead relies only on four-cycle quarnets and the splits of the other quarnets. A variant of our algorithm works with quartets rather than quarnets and we show that it reconstructs most of a semi-directed level-1 network from an asymptotically optimal \( O(n \log n) \) of the quartets it displays. Additionally, we provide an \( O(n^3) \) time algorithm that reconstructs the tree-of-blobs of any binary \( n\)-leaf semi-directed network with unbounded level from \( O(n^3) \) splits of its quarnets.

@article{frohn2025reconstructing,
  title={Reconstructing semi-directed level-1 networks using few quarnets},
  author={Frohn, Martin and Holtgrefe, Niels and van Iersel, Leo and Jones, Mark and Kelk, Steven},
  journal={Journal of Computer and System Sciences},
  doi={10.1016/j.jcss.2025.103655},
  volume = {152},
  pages = {103655},
  year = {2025}
}

[3]

      
        journal article

Squirrel: Reconstructing Semi-directed Phylogenetic Level-1 Networks from Four-Leaved Networks or Sequence Alignments
Niels Holtgrefe, Katharina T. Huber, Leo van Iersel, Mark Jones, Samuel Martin, Vincent Moulton

Molecular Biology and Evolution, 42(4):msaf067, 2025. journal bioRxiv software

With the increasing availability of genomic data, biologists aim to find more accurate descriptions of evolutionary histories influenced by secondary contact, where diverging lineages reconnect before diverging again. Such reticulate evolutionary events can be more accurately represented in phylogenetic networks than in phylogenetic trees. Since the root location of phylogenetic networks cannot be inferred from biological data under several evolutionary models, we consider semi-directed (phylogenetic) networks: partially directed graphs without a root in which the directed edges represent reticulate evolutionary events. By specifying a known outgroup, the rooted topology can be recovered from such networks. We introduce the algorithm Squirrel (Semi-directed Quarnet-based Inference to Reconstruct Level-1 Networks), which constructs a semi-directed level-1 network from a full set of quarnets (four-leaf semi-directed networks). Our method also includes a heuristic to construct such a quarnet set directly from sequence alignments. We demonstrate Squirrel's performance through simulations and on real sequence datasets, the largest of which contains 29 aligned sequences close to 1.7 Mbp long. The resulting networks are obtained on a standard laptop within a few minutes. Lastly, we prove that Squirrel is combinatorially consistent: given a full set of quarnets coming from a triangle-free semi-directed level-1 network, it is guaranteed to reconstruct the original network. Squirrel is implemented in Python, has an easy-to-use graphical user interface that takes sequence alignments or quarnets as input, and is freely available at https://github.com/nholtgrefe/squirrel.

@article{holtgrefe2025squirrel,
  title={Squirrel: Reconstructing semi-directed phylogenetic level-1 networks from four-leaved networks or sequence alignments},
  author={Holtgrefe, Niels and Huber, Katharina T and van Iersel, Leo and Jones, Mark and Martin, Samuel and Moulton, Vincent},
  journal={Molecular Biology and Evolution},
  doi={10.1093/molbev/msaf067},
  volume = {42},
  number = {4},
  pages = {msaf067},
  year = {2025}
}

2024

[2]

      
        preprint

Invariants for level-1 phylogenetic networks under the random walk 4-state Markov model
Martin Frohn, Niels Holtgrefe, Leo van Iersel, Mark Jones, Steven Kelk

arXiv:2407.1172, 2024. arXiv

Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees, such as hybridization, introgression, and lateral gene transfer. Studying phylogenetic networks under a statistical model of DNA sequence evolution can aid the inference of phylogenetic networks. Most notably Markov models like the Jukes-Cantor or Kimura-3 model can been employed to infer a phylogenetic network using phylogenetic invariants. In this article we determine all quadratic invariants for sunlet networks under the random walk 4-state Markov model, which includes the aforementioned models. Taking toric fiber products of trees and sunlet networks, we obtain a new class of invariants for level-1 phylogenetic networks under the same model. Furthermore, we apply our results to the identifiability problem of a network parameter. In particular, we prove that our new class of invariants of the studied model is not sufficient to derive identifiability of quarnets (4-leaf networks). Moreover, we provide an efficient method that is faster and more reliable than the state-of-the-art in finding a significant number of invariants for many level-1 phylogenetic networks.

@misc{frohn2024invariants,
  author = {Frohn, Martin and Holtgrefe, Niels and van Iersel, Leo and Jones, Mark and Kelk, Steven},
  title = {Invariants for level-1 phylogenetic networks under the random walk 4-state Markov model},
  year={2024},
  archivePrefix={arXiv},
  eprint={2407.11720}
}

[1]

      
        preprint

Exact and Heuristic Computation of the Scanwidth of Directed Acyclic Graphs
Niels Holtgrefe, Leo van Iersel, Mark Jones

arXiv:2403.12734, 2024. arXiv software

To measure the tree-likeness of a directed acyclic graph (DAG), a new width parameter that considers the directions of the arcs was recently introduced: scanwidth. We present the first algorithm that efficiently computes the exact scanwidth of general DAGs. For DAGs with one root and scanwidth \( k \), it runs in \( O(k \cdot n^k \cdot m) \) time. The algorithm also functions as an FPT algorithm with complexity \( O(2^{4\ell - 1} \cdot \ell \cdot n + n^2) \) for phylogenetic networks of level-\( \ell \), a type of DAG used to depict evolutionary relationships among species. Our algorithm performs well in practice, being able to compute the scanwidth of synthetic networks up to 30 reticulations and 100 leaves within 500 seconds. Furthermore, we propose a heuristic that obtains an average practical approximation ratio of 1.5 on these networks. While we prove that the scanwidth is bounded from below by the treewidth of the underlying undirected graph, experiments suggest that for networks the parameters are close in practice.

@misc{holtgrefe2024exact,
  author = {Holtgrefe, Niels and van Iersel, Leo and Jones, Mark},
  title = {Exact and Heuristic Computation of the Scanwidth of Directed Acyclic Graphs},
  year={2024},
  archivePrefix={arXiv},
  eprint={2403.12734}
}

Theses

•

      
        thesis

Computing the Scanwidth of Directed Acyclic Graphs
Niels Holtgrefe

MSc Thesis, Delft University of Technology, Delft, The Netherlands, 2023. repository software

Phylogenetic networks are a specific type of directed acyclic graph (DAG), used to depict evolutionary relationships among, for example, species or other groups of organisms. To solve computationally hard problems, treewidth has been used to parametrize algorithms in phylogenetics. In the hope of simplifying the algorithmic design process, Berry, Scornavacca and Weller [BSW20] recently proposed a new measure of tree-likeness that takes into account the directions of the arcs: scanwidth. They showed that the corresponding decision problem of this parameter - which can be seen as a variant of directed cutwidth, using a tree instead of a linear ordering - is NP-complete. This thesis aims to widen the structural knowledge of scanwidth and to find efficient ways of computing it on general DAGs, both by exact and heuristic algorithms.
With the help of reduction rules, we construct an explicit dynamic programming algorithm that computes scanwidth exactly, along with its corresponding tree extension, in \( O(k \cdot n^k \cdot m) \) time for rooted DAGs of scanwidth \( k \). This slicewise polynomial algorithm proves that computing the scanwidth is in the complexity class XP. The algorithm also functions as an FPT algorithm for networks of level-\(\ell\), with the complexity bounded by \(O(2^{4 \ell - 1} \cdot \ell \cdot n + n^2) \). It performs well in practice, being able to compute the scanwidth of networks up to 30 reticulations and 100 leaves within 500 seconds.
On the heuristic side, an algorithm that repeatedly splits at a specific type of smallest cut is proposed. Enhanced with simulated annealing, this heuristic shows promising results, obtaining an average approximation ratio of 1.5 for large synthetic networks of 30 reticulations and 100 leaves. Applied to a real-world dataset of networks, the heuristic performs near-optimal. Although we prove that the scanwidth is always greater than or equal to the treewidth, experiments show that they are close to each other in practice. This further motivates the use of scanwidth over treewidth as a parameter in algorithms.

@masterthesis{holtgrefe2023computing,
  title={Computing the Scanwidth of Directed Acyclic Graphs},
  author={Niels Holtgrefe},
  year={2023},
  address={Delft, The Netherlands},
  note={Available at \url{http://resolver.tudelft.nl/uuid:9c82fd2a-5841-4aac-8e40-d4d22542cdf5}},
  school={Delft University of Technology},
  type= {MSc thesis}
}

•

      
        thesis

Maximum Coverage Problems: Theory and Application in Optimal Sensor Selection
Niels Holtgrefe

BSc Thesis, Delft University of Technology, Delft, The Netherlands, 2021. repository

In this thesis we analyse the class of maximum coverage problems. For all discussed problems, linear programs are formulated. Using the notion of submodularity, we prove that for the weighted version of the basic Maximum Coverage problem, where the weights differ per set, a polynomial-time greedy algorithm guarantees a \((1 - 1/e)\)-approximation of the optimal solution. This improves the already known bound of \((1 - 1/e - \varepsilon)\), for all \(\varepsilon > 0) \). We then show that the same result holds true, if we allow elements to be covered by multiple sets. Furthermore, a completely novel extension is introduced, where weights differ per combination of sets. It is proved that, under the assumption that the weights are submodular and increasing, a greedy algorithm still provides a \((1 – 1/e)\)-approximation.
The latter algorithm is tested in the framework of optimal sensor selection. To this end, we consider all official weather stations in the Netherlands as our sensor group. We test the performance of the approximation algorithm, if some of the assumptions do not hold and no theoretical bounds exists. Corresponding weights are calculated, using two approaches: inverse distance weighting and multiple linear regression. For both approaches the in practice performance of the greedy algorithm is shown to be even higher than \((1 – 1/e)\), even though not all assumptions hold. Finally, the corresponding selection of weather stations is shown.

@bachelorthesis{holtgrefe2021maximum,
  title={Maximum Coverage Problems: Theory and Application in Optimal Sensor Selection},
  author={Niels Holtgrefe},
  year={2021},
  address={Delft, The Netherlands},
  note={Available at \url{https://resolver.tudelft.nl/uuid:9d222076-f3e6-4a36-9325-40de3dde0e9d}},
  school={Delft University of Technology},
  type= {BSc thesis}
}