Large pharmacological databases are valuable because they preserve complex relationships between drugs, targets, pathways, products, adverse effects and interactions. They are difficult to analyse for exactly the same reason. DrugBank arrives as deeply nested XML; OnSIDES as relational CSV files; TWOSIDES as compressed interaction data. dbparser converts those different sources into consistent R objects and traceable integration workflows.
Data access and licensing
dbparser parses databases that the researcher is authorised to access. It does not redistribute restricted DrugBank content. Reproducibility still requires recording the source database release, access conditions and the exact package version used.
01 The problem is structural, not cosmetic
A pharmacological database is not a spreadsheet with too many columns. Drug records connect to targets, enzymes, carriers, transporters, pathways, products, references and external identifiers. A parser that only flattens the file can make the result easier to load while silently destroying the relationships that give the data meaning.
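The difference between flattening and preserving relationships can be sketched with a toy example in base R. The records, identifiers and column names below are invented for illustration; they are not DrugBank content and not dbparser output.

```r
# Toy nested records standing in for a parsed XML hierarchy (all values hypothetical).
records <- list(
  list(id = "D1", name = "drug_a", targets = c("T1", "T2")),
  list(id = "D2", name = "drug_b", targets = c("T2"))
)

# Naive flattening: one row per drug, targets collapsed into a single string.
flat <- data.frame(
  drug_id = vapply(records, function(r) r$id, character(1)),
  name    = vapply(records, function(r) r$name, character(1)),
  targets = vapply(records, function(r) paste(r$targets, collapse = ";"), character(1)),
  stringsAsFactors = FALSE
)

# Relational form: a drug table plus a drug-target link table sharing drug_id.
drugs <- flat[, c("drug_id", "name")]
drug_targets <- do.call(rbind, lapply(records, function(r) {
  data.frame(drug_id = r$id, target_id = r$targets, stringsAsFactors = FALSE)
}))

# The link table answers relational questions directly,
# for example which drugs share target T2.
shared_t2 <- drug_targets$drug_id[drug_targets$target_id == "T2"]
```

In the flattened table the same question requires string parsing of the `targets` column, which is where relationships tend to get lost or misread.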
The sources also disagree on formats and identifiers. DrugBank uses a large XML hierarchy. OnSIDES distributes related CSV tables derived from drug labels. TWOSIDES uses a compressed flat representation of adverse events associated with drug pairs. Ad-hoc scripts can bridge one analysis, but they usually hide assumptions about joins, versions and missing values.
DrugBank (XML hierarchy): mechanisms, drug records, targets, pathways and identifiers.
OnSIDES (CSV tables): adverse drug events extracted from FDA drug labels.
TWOSIDES (CSV.GZ): adverse events associated with pairs of drugs.
02 A common object without erasing the source
dbparser introduces the dvobject—a drugverse object implemented as an R list with consistent access patterns. It retains tidy tables for analysis, metadata about the database release and parse process, and mappings that describe how tables relate to one another.
For a single DrugBank release, the object can expose drug information, salts, products, references and the connected carrier–enzyme–target–transporter structures. When sources are merged, the same object gains nested database components and integrated tables rather than becoming an undocumented collection of joins.
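One way to picture this is a plain R list that keeps tables, metadata and mappings side by side. The sketch below is a toy stand-in: the field names `tables`, `metadata` and `mappings`, and every value in them, are hypothetical and do not describe the package's actual dvobject layout.

```r
# Toy stand-in for the idea of a dvobject; field names are hypothetical,
# not the package's actual structure.
toy_dv <- list(
  tables = list(
    drugs = data.frame(drug_id = c("D1", "D2"),
                       name = c("drug_a", "drug_b"),
                       stringsAsFactors = FALSE),
    targets = data.frame(drug_id = c("D1", "D1", "D2"),
                         target_id = c("T1", "T2", "T2"),
                         stringsAsFactors = FALSE)
  ),
  # Release and parse details travel with the data (placeholder values).
  metadata = list(source = "example_db", release = "0.0-example",
                  parsed_on = Sys.Date()),
  # Each row records how one table links to another.
  mappings = data.frame(from = "targets", to = "drugs", key = "drug_id",
                        stringsAsFactors = FALSE)
)

# Use the recorded mapping, not a hard-coded column name, to join the tables.
m <- toy_dv$mappings[1, ]
linked <- merge(toy_dv$tables[[m$from]], toy_dv$tables[[m$to]], by = m$key)

# Inspect the top two levels of the object.
str(toy_dv, max.level = 2)
```

Because the join key is read from the object itself, the relationship between the two tables is documented in the data rather than in the analyst's memory.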
[Figure: What a dvobject keeps together, centred on the analysis-ready object]
03 From parser to integration engine
The current package uses DrugBank as the mechanistic hub. OnSIDES contributes adverse drug events extracted from FDA labels, while TWOSIDES contributes adverse events associated with drug combinations. The hub-and-spoke decision reduces the number of identifier mappings that must be maintained and makes the integration path explicit.
That design is a trade-off: multi-database workflows depend on DrugBank identifiers and mappings. But it is a visible, testable trade-off rather than an implicit assumption buried inside a one-off notebook.
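The arithmetic behind the design is simple: with n sources, a hub needs n - 1 identifier mappings, whereas linking every pair directly needs n(n - 1)/2. For three sources that is two mappings rather than three, and the gap widens as sources are added. The sketch below illustrates what "visible and testable" can mean for a single spoke, using base R and invented tables; the column names and values are assumptions and do not reflect how dbparser's merge functions are implemented.

```r
# Hypothetical hub table keyed by a hub identifier (values are invented).
hub <- data.frame(hub_id = c("H1", "H2", "H3"),
                  name = c("drug_a", "drug_b", "drug_c"),
                  stringsAsFactors = FALSE)

# Hypothetical spoke table with its own identifiers plus a mapping to the hub.
# The last row has no hub mapping.
spoke <- data.frame(spoke_id = c("S10", "S11", "S12"),
                    hub_id = c("H1", "H3", NA),
                    event = c("nausea", "headache", "rash"),
                    stringsAsFactors = FALSE)

# Left join from the spoke so unmapped rows stay visible instead of vanishing.
joined <- merge(spoke, hub, by = "hub_id", all.x = TRUE)

# Report rows that could not be linked to the hub.
unmatched <- joined[is.na(joined$name), "spoke_id"]
message(length(unmatched), " of ", nrow(spoke), " spoke rows have no hub mapping")
```

An inner join would have dropped the unmapped row without comment; keeping it and counting it turns the dependency on hub identifiers into something a test can assert on.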
04 The software engineering around the parser
A useful research package needs more than working parsing functions. It needs a stable public interface, tests, documentation, metadata, examples, versioned releases and a process for reviewing changes. dbparser is distributed through CRAN, documented through rOpenSci, released under the MIT licence and maintained in a public repository.
The package was peer-reviewed through rOpenSci and its software paper was published in the Journal of Open Source Software in February 2026. That review record matters because it makes quality claims inspectable: users can see the repository, review thread, archived release, documentation and issue tracker.
CRAN archives make package evolution and the exact release used in an analysis visible.
rOpenSci review and the JOSS record expose documentation, testing and software-design decisions.
Metadata keeps source versions and parse details alongside the analysis-ready object.
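The same habit can be applied on the analysis side. The helper below is a minimal sketch, not part of dbparser: `record_provenance` and its arguments are hypothetical names, and the source release and access note are placeholders that the analyst must fill in from the database they actually downloaded.

```r
# Record the software side of provenance for an analysis.
# record_provenance() is a hypothetical helper, not a dbparser function.
record_provenance <- function(pkg, source_release, access_note) {
  list(
    package = pkg,
    package_version = as.character(utils::packageVersion(pkg)),
    r_version = R.version.string,
    source_release = source_release,  # supplied by the analyst
    access_note = access_note,        # e.g. licence or access conditions
    recorded_at = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z")
  )
}

# "stats" ships with R, so this runs without dbparser installed;
# in a real analysis, pass "dbparser" instead.
prov <- record_provenance("stats",
                          source_release = "release not recorded",
                          access_note = "access terms not recorded")
str(prov)
```

Saving such a record next to the results, for instance with `saveRDS()`, keeps the package version and source release attached to the output they produced.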
05 A compact, reproducible workflow
The code below shows the architectural idea without hiding it behind a graphical interface. Each source is parsed independently. The resulting objects are then merged through explicit, chainable operations. The code is deliberately unchanged from the package documentation.
library(dbparser)
library(dplyr)
drugbank_db <- parseDrugBank("data/drugbank.xml")
onsides_db <- parseOnSIDES("data/onsides/")
twosides_db <- parseTWOSIDES("data/TWOSIDES.csv.gz")
final_db <- drugbank_db %>%
  merge_drugbank_onsides(onsides_db) %>%
  merge_drugbank_twosides(twosides_db)
head(final_db$integrated_data$drug_drug_interactions)
06 From student project to research infrastructure
CRAN records the first dbparser release in December 2018. The public archive then shows a sequence of maintained releases rather than a one-off upload. By version 2.2.1, published in January 2026, the package had moved beyond DrugBank-only parsing to support integrated pharmacovigilance workflows across three sources.
The project documentation identifies use in more than ten peer-reviewed publications spanning drug repurposing, biomarkers, pathway modelling and clinical-trial analysis. The stronger story is therefore not that a student wrote a parser. It is that the work survived contact with other researchers, changing source databases, package review and long-term maintenance.
07 The alumnus behind the open-source project
Mohammed Ali is a DSTI alumnus, author and maintainer of dbparser. Ali Ezzat is co-author of the package and the JOSS paper. Their public record lets readers inspect the software at several levels: the stable CRAN release, the full reference manual, the rOpenSci documentation, the source repository, the software-review discussion and the archived JOSS publication.
That openness is part of the engineering result. A reproducible research tool should make it possible to trace not only the output of an analysis, but also the software version, the source data release and the decisions that transformed one into the other.
Mohammed Ali
DSTI alumnus, author and maintainer of dbparser. His work brings R software engineering, pharmacological data integration and reproducible research infrastructure together.
Source and editorial note: Article developed from DSTI’s former student-project record, a supplied manuscript and the current CRAN, rOpenSci, repository and JOSS sources. Technical names, package functions, code and publication titles are preserved exactly.