03-5211-7750 平日|09:30~18:00

Analyzing Debian packages with Neo4j

           

サービス資料や
ホワイトペーパーはこちら

           資料を【無料】ダウンロードFREE

Overview on the blog series

The Ultimate Debian Database UDD collects a variety of data around Debian and Ubuntu: Packages and sources, bugs, history of uploads, just to name a few. (Ultimate Debian Database UDD)

The database scheme reveals a highly de-normalized RDB. In this on-going work we extract (some) data from UDD and represent it as a graph database.

In the following series of blog entries we will report on this work. Part 1 (this one) will give a short introduction to Debian and the life time and structure of Debian packages. Part 2 will develop the graph database scheme (nodes and relations) from the inherent properties of Debian packages. The final part 3 will describe how to get the data from the UDD into Neo4j, give some sample queries, and discuss further work.

This work has been presented at the Neo4j Online Meetup and a video recording of the presentation is available on YouTube.

Part 1 – Debian

Debian is an open source Linux distribution, developed mostly by volunteers. With a history of already more than 20 years, Debian is one of the oldest Linux distributions. It sets itself apart from many other Linux distributions by a strict set of license rules that guarantees that everything within Debian is free according to the Debian Free Software Guidelines.

Debian also gave rise to a large set of off-springs, most widely known one is Ubuntu.

Debian contains not only the underlying operating system (Linux) and the necessary tools, but also a huge set of programs and applications, currently about 50000 software packages. All of these packages come with full source code but are already pre-compiled for easy consumption.

To understand what information we have transferred into Neo4j we need to take a look at how Debian is structured, and how a packages lives within this environment.


Debian releases
Debian employs release based software management, that is, a new Debian version is released in more or less regular intervals. The current stable release is Debian stretch (Debian 9.2) and was released first in June 2017, with the latest point release on October 7th, 2017.

To prepare packages for the next stable release, they have to go through a set of suites to make sure they conform to quality assurance criteria. These suites are:

 ・Development (sid): the entrance point for all packages, where the main development takes place;
 ・Testing: packages that are ready to be released as the next stable release;
 ・Stable: the status of the current stable release.

There are a few other suites like experimental or targeting security updates, but we leave their discussion out here.


Package and suite transitions
Packages have a certain life cycle within Debian. Consider the following image:

Packages and Suites (Youhei Sasaki, CC-NC-SA)

Packages normally are uploaded into the unstable suite and remain there at least for 5 days. If no release critical bug has been reported, after these 5 days the package transitions automatically from unstable into the testing suite, which will be released as stable by the release managers at some point in the future.


Structure of Debian packages
Debian packages come as source packages and binary packages. Binary packages are available in a variety of architectures: amd64, i386, powerpc just to name a few.

Debian developers upload source packages (and often own’s own architecture’s binary package), and for other architectures auto-builders compile and package binary packages.


Debian auto-builders (from Debian Administrator’s Handbook, GPL)



Components of a package
Debian packages are not only a set of files, but contain a lot more information, let us listen a few important ones:

 ・Maintainer: the entity (person, mailing list) responsible for the package
 ・Uploaders: other developers who can upload a new version of the package
 ・Version: a Debian version number (see below)
 ・Dependency declarations (see below)

There are many further fields, but we want to concentrate here on the fields that we are representing the in the Graph database.

The Maintainer and Uploaders are standard emails, most commonly including a name. In the case of the packages I maintain the maintainer is set to a mailing list (debian-tex-maint AT ...) and myself put into the Uploaders field. This way bug reports will go not only to me but to the whole list – a very common pattern in Debian.

Next let us look at the version numbers: Since for a specific upstream release we sometimes do several packages in Debian (to fix packaging bugs, for different suites), the Debian version string is a bit more complicated then just the simple upstream version:

[epoch:]upstream_version[-debian_revision]

Here the upstream_version is the usual version under which a program is released. Taking for example one of the packages I maintain, asymptote, it currently has version number 2.41-4, indicating that upstream version is 2.41, and there have been four Debian revisions for it. A bit more complicated example would be musixtex which currently has the version 1:1.20.ctan20151216-4.
Some caveats concerning source and binary packages, and versions:

 ・one source package can build many different binary packages
 ・the names of source package and binary package are not necessary the same (necessarily different
   when building multiple binary packages)
 ・binary packages of the same name (but different version) can be built from different source packages

Let us finally look at the most complicated part of the package meta-fields, the dependencies: There are two different sets of dependencies, one for source packages and one for binary packages:

 ・source package relations: Build-Depends, Build-Depends-Indep, Build-Depends-Arch, Build-Conflicts,
  Build-Conflicts-Indep, Build-Conflicts-Arch
 ・binary package relations: Depends, Pre-Depends, Recommends, Suggests, Enhances, Breaks, Conflicts

The former one specify package relations during package build, while the later package dependencies on the installed system.

A single package relation can take a variety of different forms providing various constraints on the relation:

 ・Relation: pkg: no constraints at all
 ・Relation: pkg (<< version): constraints on the version, can be strictly less, less or equal, etc
 ・Relation: pkg | pkg: alternative relations
 ・Relation: pkg [arch1 arch2]: constraints on the architectures

When properly registered for a package, these relations allow Debian to provide smooth upgrades between releases and guarantee functionality if a package is installed.

Next is ...

This concludes the short introduction to Debian and its packages. In the next blog entry we will describe the Ultimate Debian Database UDD and how to map the information presented here from the UDD into a Graph Database.

■関連ページ
【アクセリアのサービス一覧】
 ・サービスNAVI

Norbert Preining

アクセリア株式会社 研究開発部社員
北陸先端科学技術大学院大学ソフトウェア検証研究センター 研究員
ウィーン工科大学 研究員
デビアン開発者
TeX User Group (取締役会員)、Kurt Godel Society (取締役会員)
ACM, ACM SigLog, 日本数式処理学会、ドイツ数学論理学会
Contact usお問い合わせ

サービスにご興味をお持ちの方は
お気軽にお問い合わせください。

Webからお問い合わせ

お問い合わせ

お電話からお問い合わせ

03-5211-7750

平日09:30 〜 18:00

Download資料ダウンロード

製品紹介やお役立ち資料を無料でご活用いただけます。

Magazineメルマガ登録

最新の製品情報などタイムリーな情報を配信しています。

Free Service

PageSpeed Insights シミュレータ

CDNによるコンテンツの最適化を行った場合のPageSpeed Insightsのスコアをシミュレートしてレポートします。