Analyzing Debian packages with Neo4j
- Overview of the blog series
- The Ultimate Debian Database (UDD) collects a variety of data about Debian and Ubuntu: packages and sources, bugs, and the history of uploads, to name a few.
The database schema reveals a highly denormalized relational database. In this ongoing work we extract (some) data from UDD and represent it as a graph database.
In the following series of blog entries we will report on this work. Part 1 (this one) gives a short introduction to Debian and to the life cycle and structure of Debian packages. Part 2 will develop the graph database schema (nodes and relations) from the inherent properties of Debian packages. The final part 3 will describe how to get the data from the UDD into Neo4j, give some sample queries, and discuss further work.
This work has been presented at the Neo4j Online Meetup and a video recording of the presentation is available on YouTube.
- Part 1 – Debian
- Debian is an open source Linux distribution, developed mostly by volunteers. With a history of more than 20 years, Debian is one of the oldest Linux distributions. It sets itself apart from many other Linux distributions by a strict set of license rules that guarantees that everything within Debian is free according to the Debian Free Software Guidelines.
Debian has also given rise to a large number of derivative distributions, the most widely known being Ubuntu.
Debian contains not only the underlying operating system (Linux) and the necessary tools, but also a huge set of programs and applications, currently about 50000 software packages. All of these packages come with full source code but are already pre-compiled for easy consumption.
To understand what information we have transferred into Neo4j we need to take a look at how Debian is structured, and how a package lives within this environment.
Debian employs release-based software management, that is, a new Debian version is released at more or less regular intervals. The current stable release is Debian stretch (Debian 9.2); it was first released in June 2017, with the latest point release on October 7th, 2017.
To prepare packages for the next stable release, they have to go through a set of suites to make sure they conform to quality assurance criteria. These suites are:
・Development (sid, also called unstable): the entry point for all packages, where the main development takes place;
・Testing: packages that are ready to be released as the next stable release;
・Stable: the status of the current stable release.
There are a few other suites, such as experimental or those targeting security updates, but we leave them out of this discussion.
Package and suite transitions
Packages have a certain life cycle within Debian. Consider the following image:
Packages and Suites (Youhei Sasaki, CC-NC-SA)
Packages are normally uploaded into the unstable suite and remain there for at least 5 days. If no release critical bug has been reported, after these 5 days the package transitions automatically from unstable into the testing suite, which will be released as stable by the release managers at some point in the future.
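The transition rule just described can be written down as a small predicate. This is only a sketch covering the two conditions named in the text (age in unstable and release critical bugs); the actual migration software may apply further checks, and the names `Upload` and `may_migrate_to_testing` are hypothetical.

```python
from dataclasses import dataclass

# Minimum stay in unstable, as quoted in the text above.
MIN_DAYS_IN_UNSTABLE = 5


@dataclass
class Upload:
    """Hypothetical record for one package upload into unstable."""
    package: str
    days_in_unstable: int
    release_critical_bugs: int


def may_migrate_to_testing(upload: Upload) -> bool:
    """Return True if the upload is old enough and has no release critical bug."""
    old_enough = upload.days_in_unstable >= MIN_DAYS_IN_UNSTABLE
    bug_free = upload.release_critical_bugs == 0
    return old_enough and bug_free
```

For example, `may_migrate_to_testing(Upload("asymptote", 6, 0))` evaluates to `True`, while the same upload after only 3 days, or with one open release critical bug, would be held back.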
Structure of Debian packages
Debian packages come as source packages and binary packages. Binary packages are available in a variety of architectures: amd64, i386, and powerpc, just to name a few.
Debian developers upload source packages (and often the binary package for their own architecture), and for the other architectures auto-builders compile and package the binary packages.
Debian auto-builders (from Debian Administrator’s Handbook, GPL)
Components of a package
Debian packages are not only a set of files, but contain a lot more information. Let us list a few important fields:
・Maintainer: the entity (person, mailing list) responsible for the package
・Uploaders: other developers who can upload a new version of the package
・Version: a Debian version number (see below)
・Dependency declarations (see below)
There are many further fields, but we want to concentrate here on the fields that we represent in the graph database.
The Maintainer and Uploaders fields hold standard email addresses, most commonly including a name. In the case of the packages I maintain, the maintainer is set to a mailing list (debian-tex-maint AT ...) and I am listed in the Uploaders field. This way bug reports go not only to me but to the whole list, a very common pattern in Debian.
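Because these fields follow the usual "Name <address>" email convention, they can be split into name and address with Python's standard library. The following is a minimal sketch; the function names and the example.org addresses are made up for illustration, and names that themselves contain commas would need more careful handling.

```python
from email.utils import getaddresses, parseaddr
from typing import List, Tuple


def parse_maintainer(field: str) -> Tuple[str, str]:
    """Split a Maintainer value into (name, email address)."""
    return parseaddr(field)


def parse_uploaders(field: str) -> List[Tuple[str, str]]:
    """Split a comma-separated Uploaders value into (name, email) pairs."""
    return getaddresses([field])


# Hypothetical field values, not taken from a real package.
maintainer = parse_maintainer("Example TeX List <tex-list@example.org>")
uploaders = parse_uploaders(
    "Jane Doe <jane@example.org>, John Roe <john@example.org>"
)
```

In a graph model, each resulting (name, email) pair is a natural candidate for a node that packages point to, once as maintainer and possibly several times as uploader.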
Next let us look at the version numbers. Since for a specific upstream release we sometimes build several packages in Debian (to fix packaging bugs, or for different suites), the Debian version string is a bit more complicated than just the simple upstream version:
[epoch:]upstream_version[-debian_revision]
Here the upstream_version is the usual version under which a program is released. Taking for example one of the packages I maintain, asymptote, it currently has version number 2.41-4, indicating that the upstream version is 2.41 and that this is the fourth Debian revision of it. A slightly more complicated example is musixtex, which currently has the version 1:1.20.ctan20151216-4, where the leading 1: is the epoch, 1.20.ctan20151216 the upstream version, and 4 the Debian revision.
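The two example versions can be taken apart mechanically: an optional epoch ends at the first colon, and an optional Debian revision starts after the last hyphen. The sketch below only splits the string and does not implement Debian's version comparison rules; `DebianVersion` and `split_version` are names chosen for this illustration.

```python
from typing import NamedTuple, Optional


class DebianVersion(NamedTuple):
    epoch: int               # 0 when no epoch is given
    upstream: str            # the upstream_version part
    revision: Optional[str]  # None when there is no Debian revision


def split_version(version: str) -> DebianVersion:
    """Split a Debian version string into epoch, upstream version, revision."""
    epoch = 0
    rest = version
    if ":" in version:
        # The epoch is everything before the first colon.
        epoch_text, rest = version.split(":", 1)
        epoch = int(epoch_text)
    if "-" in rest:
        # The revision is everything after the last hyphen.
        upstream, revision = rest.rsplit("-", 1)
    else:
        upstream, revision = rest, None
    return DebianVersion(epoch, upstream, revision)


asymptote = split_version("2.41-4")
musixtex = split_version("1:1.20.ctan20151216-4")
```

Here `asymptote` has epoch 0, upstream version 2.41 and revision 4, while `musixtex` has epoch 1, upstream version 1.20.ctan20151216 and revision 4.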
Some caveats concerning source and binary packages, and versions:
・one source package can build many different binary packages
・the names of source package and binary package are not necessarily the same (and are necessarily different when building multiple binary packages)
・binary packages of the same name (but different version) can be built from different source packages
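These caveats mean that the source/binary relationship is many-to-many once versions are taken into account. A small sketch with invented rows illustrates this; the package names foo, foo-ng, libfoo1 and foo-doc and all versions are hypothetical, as are the function names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical rows: (source package, binary package, binary version).
Row = Tuple[str, str, str]
ROWS = [
    ("foo", "foo", "1.0-1"),
    ("foo", "libfoo1", "1.0-1"),
    ("foo", "foo-doc", "1.0-1"),
    ("foo-ng", "foo-doc", "2.0-1"),
]


def binaries_by_source(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Map each source package to the binary package names it builds."""
    result = defaultdict(set)
    for source, binary, _version in rows:
        result[source].add(binary)
    return dict(result)


def sources_by_binary(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Map each binary package name to the source packages that build it."""
    result = defaultdict(set)
    for source, binary, _version in rows:
        result[binary].add(source)
    return dict(result)
```

With these rows, the source foo builds three binary packages, and the binary name foo-doc is built from two different source packages at different versions.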
Let us finally look at the most complicated part of the package meta-fields, the dependencies: There are two different sets of dependencies, one for source packages and one for binary packages:
・source package relations: Build-Depends, Build-Depends-Indep, Build-Depends-Arch, Build-Conflicts
・binary package relations: Depends, Pre-Depends, Recommends, Suggests, Enhances, Breaks, Conflicts
The former specify package relations during the package build, while the latter specify package dependencies on the installed system.
A single package relation can take a variety of different forms providing various constraints on the relation:
・Relation: pkg: no constraints at all
・Relation: pkg (<< version): constraints on the version, which can be strictly less (<<), less or equal (<=), equal (=), greater or equal (>=), or strictly greater (>>)
・Relation: pkg | pkg: alternative relations
・Relation: pkg [arch1 arch2]: constraints on the architectures
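The forms listed above can be combined within one field, with commas separating the individual relations. The following sketch parses such a field into a list of relations, each being a list of alternatives. It covers only the forms shown here and rejects anything else; the names `Alternative` and `parse_relations` and the example packages foo and bar are assumptions made for this illustration.

```python
import re
from typing import List, NamedTuple, Optional


class Alternative(NamedTuple):
    name: str
    operator: Optional[str]    # one of <<, <=, =, >=, >> or None
    version: Optional[str]
    architectures: List[str]   # empty list means no architecture constraint


_ALTERNATIVE = re.compile(
    r"""
    ^\s*
    (?P<name>[a-z0-9][a-z0-9+.\-]+)
    (?:\s*\(\s*(?P<op><<|<=|=|>=|>>)\s*(?P<version>[^)\s]+)\s*\))?
    (?:\s*\[(?P<archs>[^\]]+)\])?
    \s*$
    """,
    re.VERBOSE,
)


def parse_relations(field: str) -> List[List[Alternative]]:
    """Parse a relation field value into relations made of alternatives."""
    relations = []
    for clause in field.split(","):
        if not clause.strip():
            continue  # tolerate a trailing comma
        alternatives = []
        for part in clause.split("|"):
            match = _ALTERNATIVE.match(part)
            if match is None:
                raise ValueError(f"unsupported relation syntax: {part!r}")
            archs = (match.group("archs") or "").split()
            alternatives.append(
                Alternative(
                    match.group("name"),
                    match.group("op"),
                    match.group("version"),
                    archs,
                )
            )
        relations.append(alternatives)
    return relations


parsed = parse_relations("libc6 (>= 2.14), foo | bar (<< 2.0) [amd64 i386]")
```

The nested result keeps the two levels apart: all outer entries must be satisfied, while within one entry any single alternative is enough. That distinction matters when the relations are later turned into graph edges.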
When properly registered for a package, these relations allow Debian to provide smooth upgrades between releases and guarantee functionality if a package is installed.
- Next is ...
- This concludes the short introduction to Debian and its packages. In the next blog entry we will describe the Ultimate Debian Database UDD and how to map the information presented here from the UDD into a Graph Database.