dbparser: from complex drug databases to reproducible R workflows

A student project became a maintained, peer-reviewed piece of research infrastructure. DSTI alumnus Mohammed Ali built dbparser to turn incompatible pharmacological databases into consistent R objects and reproducible integration workflows.

Tags: R, dbparser, pharmacovigilance, bioinformatics, open-source, reproducible-research

Large pharmacological databases are valuable because they preserve complex relationships between drugs, targets, pathways, products, adverse effects and interactions. They are difficult to analyse for exactly the same reason. DrugBank arrives as deeply nested XML; OnSIDES as relational CSV files; TWOSIDES as compressed interaction data. dbparser converts those different sources into consistent R objects and traceable integration workflows.

Data access and licensing: dbparser parses databases that the researcher is authorised to access. It does not redistribute restricted DrugBank content. Reproducibility still requires recording the source database release, access conditions and the exact package version used.

The useful abstraction is not merely a flatter file. It is a stable object model that preserves relationships, release information and provenance while giving analysts one consistent way to work.

01 The problem is structural, not cosmetic

A pharmacological database is not a spreadsheet with too many columns. Drug records connect to targets, enzymes, carriers, transporters, pathways, products, references and external identifiers. A parser that only flattens the file can make the result easier to load while silently destroying the relationships that give the data meaning.
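The loss described above can be reproduced on a toy example. The sketch below uses base R only and entirely hypothetical records (`D1`, `T1` and the action labels are made up, not taken from DrugBank). It contrasts a naive one-row-per-drug flattening with a keyed relationship table.

```r
# Hypothetical toy records; none of these values come from DrugBank.
drugs <- data.frame(
  drug_id = c("D1", "D2"),
  name    = c("drug_a", "drug_b"),
  stringsAsFactors = FALSE
)

# One row per drug-target relationship, keyed by drug_id.
targets <- data.frame(
  drug_id   = c("D1", "D1", "D2"),
  target_id = c("T1", "T2", "T1"),
  action    = c("inhibitor", "agonist", "inhibitor"),
  stringsAsFactors = FALSE
)

# Naive flattening: collapse each drug's targets and actions into strings.
collapse_for <- function(id, column, dedupe = FALSE) {
  values <- targets[[column]][targets$drug_id == id]
  if (dedupe) values <- sort(unique(values))
  paste(values, collapse = ";")
}

flat <- drugs
flat$targets <- vapply(drugs$drug_id, collapse_for, character(1),
                       column = "target_id", USE.NAMES = FALSE)
flat$actions <- vapply(drugs$drug_id, collapse_for, character(1),
                       column = "action", dedupe = TRUE, USE.NAMES = FALSE)

# Relational form: every row still pairs one target with its own action.
linked <- merge(drugs, targets, by = "drug_id")

print(flat)
print(linked)
```

In `flat`, drug `D1` carries the targets `T1;T2` and the actions `agonist;inhibitor`, but nothing records which action belongs to which target. In `linked`, that pairing survives as one row per relationship.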

The sources also disagree on formats and identifiers. DrugBank uses a large XML hierarchy. OnSIDES distributes related CSV tables derived from drug labels. TWOSIDES uses a compressed flat representation of adverse events associated with drug pairs. Ad-hoc scripts can bridge one analysis, but they usually hide assumptions about joins, versions and missing values.

DrugBank (XML hierarchy): Mechanisms, drug records, targets, pathways and identifiers.

OnSIDES (CSV tables): Adverse drug events extracted from FDA drug labels.

TWOSIDES (CSV.GZ): Adverse events associated with pairs of drugs.
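The point about hidden join assumptions can be made concrete with a minimal sketch. It uses base R and hypothetical identifiers and events; it does not reflect the actual contents or column names of any of the three sources.

```r
# Hypothetical identifiers and events, for illustration only.
mechanisms <- data.frame(
  drug_id = c("D1", "D2", "D3"),
  target  = c("T1", "T2", "T3"),
  stringsAsFactors = FALSE
)
label_events <- data.frame(
  drug_id = c("D1", "D3", "D4"),
  event   = c("nausea", "headache", "rash"),
  stringsAsFactors = FALSE
)

# merge() defaults to an inner join: D2 and D4 disappear without a warning.
inner <- merge(mechanisms, label_events, by = "drug_id")

# Making the assumption explicit: keep every mechanism record,
# then report what failed to match on either side.
left <- merge(mechanisms, label_events, by = "drug_id", all.x = TRUE)
unmatched_mechanisms <- left$drug_id[is.na(left$event)]
unmatched_events     <- setdiff(label_events$drug_id, mechanisms$drug_id)

cat("rows after inner join:", nrow(inner), "\n")
cat("mechanism records without events:", unmatched_mechanisms, "\n")
cat("event records without a mechanism:", unmatched_events, "\n")
```

An ad-hoc script that only keeps `inner` would proceed with two drugs and never mention the two it dropped. Reporting the unmatched identifiers turns that silent choice into something a reviewer can check.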

02 A common object without erasing the source

dbparser introduces the dvobject, a drugverse object implemented as an R list with consistent access patterns. It retains tidy tables for analysis, metadata about the database release and parse process, and mappings that describe how tables relate to one another.

For a single DrugBank release, the object can expose drug information, salts, products, references and the connected carrier–enzyme–target–transporter structures. When sources are merged, the same object gains nested database components and integrated tables rather than becoming an undocumented collection of joins.

What a dvobject keeps together (analysis-ready object)

drugs: core drug tables
cett: carriers, enzymes, targets, transporters
products: commercial products
references: articles, links and books
metadata: release and provenance
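The idea of one list holding tables and provenance together can be sketched in base R. The object below is a hypothetical, simplified stand-in: the top-level names mirror the labels above, but the inner tables, column names and the `records_for()` helper are assumptions made for illustration, not the package's actual layout or API.

```r
# A hypothetical, simplified stand-in for a dvobject; not the real structure.
toy_dv <- list(
  drugs = data.frame(
    drug_id = c("D1", "D2"),
    name    = c("drug_a", "drug_b"),
    stringsAsFactors = FALSE
  ),
  cett = list(
    targets = data.frame(
      drug_id   = c("D1", "D1", "D2"),
      target_id = c("T1", "T2", "T1"),
      stringsAsFactors = FALSE
    ),
    enzymes = data.frame(
      drug_id   = "D2",
      enzyme_id = "E1",
      stringsAsFactors = FALSE
    )
  ),
  products = data.frame(
    drug_id = "D1",
    product = "product_x",
    stringsAsFactors = FALSE
  ),
  # Provenance travels with the tables instead of living in a separate note.
  metadata = list(
    source    = "toy_source",
    release   = "0.0-example",
    parsed_on = Sys.Date()
  )
)

# One access pattern for every relationship table: filter by the shared key.
records_for <- function(dv, id) {
  lapply(dv$cett, function(tbl) tbl[tbl$drug_id == id, , drop = FALSE])
}

d1 <- records_for(toy_dv, "D1")
str(d1)
toy_dv$metadata$release
```

Because every relationship table shares the `drug_id` key, the same helper works for targets and enzymes alike, and the release string is always one `$` away from the data it describes.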

03 From parser to integration engine

The current package uses DrugBank as the mechanistic hub. OnSIDES contributes adverse drug events extracted from FDA labels, while TWOSIDES contributes adverse events associated with drug combinations. The hub-and-spoke decision reduces the number of identifier mappings that must be maintained and makes the integration path explicit.

That design is a trade-off: multi-database workflows depend on DrugBank identifiers and mappings. But it is a visible, testable trade-off rather than an implicit assumption buried inside a one-off notebook.

OnSIDES (spoke): adverse drug events extracted from FDA drug labels.
DrugBank (hub): the mechanistic hub.
TWOSIDES (spoke): adverse events associated with pairs of drugs.
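The claim that a hub reduces the number of identifier mappings is simple arithmetic, sketched below in base R. The function names are illustrative. The count assumes one mapping per pair of sources in the pairwise design and one mapping per spoke in the hub design.

```r
# Identifier mappings needed to connect n sources.
pairwise_mappings <- function(n) choose(n, 2)  # every source mapped to every other
hub_mappings      <- function(n) n - 1         # every spoke mapped to the hub only

n_sources <- 3:6
comparison <- data.frame(
  sources  = n_sources,
  pairwise = pairwise_mappings(n_sources),
  hub      = hub_mappings(n_sources)
)
print(comparison)
```

With three sources the saving is modest: 3 pairwise mappings against 2 hub mappings. At six sources it is 15 against 5, so the design pays off mainly as sources are added. The cost is that every integration path runs through the hub's identifiers.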

04 The software engineering around the parser

A useful research package needs more than working parsing functions. It needs a stable public interface, tests, documentation, metadata, examples, versioned releases and a process for reviewing changes. dbparser is distributed through CRAN, documented through rOpenSci, released under the MIT licence and maintained in a public repository.

The package was peer-reviewed through rOpenSci and its software paper was published in the Journal of Open Source Software in February 2026. That review record matters because it makes quality claims inspectable: users can see the repository, review thread, archived release, documentation and issue tracker.

Versioned releases

CRAN archives make package evolution and the exact release used in an analysis visible.

Peer review

rOpenSci review and the JOSS record expose documentation, testing and software-design decisions.

Reproducible inputs

Metadata keeps source versions and parse details alongside the analysis-ready object.
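A provenance record of the kind described here can be assembled with base R alone. The sketch below is an assumption about what such a record might contain, not a description of what dbparser stores. `provenance_record()` is a hypothetical helper, the release string is supplied by the analyst, and `"stats"` stands in for the package of interest only because it ships with every R installation.

```r
# Hypothetical helper: build a small provenance record for one input file.
# `release` is supplied by the analyst; it is not read from the file.
provenance_record <- function(path, release, pkg) {
  list(
    file        = basename(path),
    md5         = unname(tools::md5sum(path)),          # checksum of the exact input
    release     = release,                              # source database release
    package     = pkg,
    pkg_version = as.character(utils::packageVersion(pkg)),
    r_version   = R.version.string,
    recorded_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC")
  )
}

# Demonstration on a throwaway file with made-up content.
tmp <- tempfile(fileext = ".csv")
writeLines(c("drug_id,event", "D1,nausea"), tmp)

rec <- provenance_record(tmp, release = "example-release", pkg = "stats")
str(rec)
```

In a real analysis the `pkg` argument would name the parsing package, and the record would be saved next to the results so that the input file, source release and software version can be checked together.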

05 A compact, reproducible workflow

The code below shows the architectural idea without hiding it behind a graphical interface. Each source is parsed independently. The resulting objects are then merged through explicit, chainable operations. The code is deliberately unchanged from the package documentation.

R: Integration pipeline
library(dbparser)
library(dplyr)

drugbank_db <- parseDrugBank("data/drugbank.xml")
onsides_db  <- parseOnSIDES("data/onsides/")
twosides_db <- parseTWOSIDES("data/TWOSIDES.csv.gz")

final_db <- drugbank_db %>%
  merge_drugbank_onsides(onsides_db) %>%
  merge_drugbank_twosides(twosides_db)

head(final_db$integrated_data$drug_drug_interactions)

06 From student project to research infrastructure

CRAN records the first dbparser release in December 2018. The public archive then shows a sequence of maintained releases rather than a one-off upload. By version 2.2.1, published in January 2026, the package had moved beyond DrugBank-only parsing to support integrated pharmacovigilance workflows across three sources.

The project documentation identifies use in more than ten peer-reviewed publications spanning drug repurposing, biomarkers, pathway modelling and clinical-trial analysis. The stronger story is therefore not that a student wrote a parser. It is that the work survived contact with other researchers, changing source databases, package review and long-term maintenance.

2018: dbparser 1.0.0 enters CRAN.
2023–24: The 2.x series modernises the package and its data model.
2026: Version 2.2.1 supports DrugBank, OnSIDES and TWOSIDES integration.
2026: The software paper is published in JOSS after open review.

07 The alumnus behind the open-source project

Mohammed Ali is a DSTI alumnus, author and maintainer of dbparser. Ali Ezzat is co-author of the package and the JOSS paper. Their public record lets readers inspect the software at several levels: the stable CRAN release, the full reference manual, the rOpenSci documentation, the source repository, the software-review discussion and the archived JOSS publication.

That openness is part of the engineering result. A reproducible research tool should make it possible to trace not only the output of an analysis, but also the software version, the source data release and the decisions that transformed one into the other.

Mohammed Ali

DSTI alumnus, author and maintainer of dbparser. His work brings R software engineering, pharmacological data integration and reproducible research infrastructure together.

Source and editorial note: Article developed from DSTI’s former student-project record, a supplied manuscript and the current CRAN, rOpenSci, repository and JOSS sources. Technical names, package functions, code and publication titles are preserved exactly.