`portindexdb` — Doing SQL to the FreeBSD Ports Index

This is the result of a little personal project of mine, to explore how useful using a full-blown RDBMS would be when building an INDEX for the FreeBSD ports, plus teach myself something about designing database schemas and writing SQL in general. I have previous form in this area, so fiddling about with data from ports seemed an ideal model system. One eventual result might be to add alternate backing stores for the FreeBSD::Portindex cache, but that's something for another day. It has resulted in some interesting ideas about how I can improve FreeBSD::Portindex in other ways though.

Requirements

A PostgreSQL database. portindexdb was developed using version 9.1, but it should work for anything version 8.4 or above. Earlier than 8.4 definitely won't work, as it needs recursive SQL queries to create some important tables.
Perl. Tested with 5.12.4 and 5.14.2 but anything 5.10.1 or above should work.
The DBI and DBD::Pg perl modules.

Download

The current version is 1090 released on 11^th June 2012.

[Download]

[Checksum]

Changes

Revision 1090 (2012-06-11) [Download] [Checksum]

Fix dumb mistake in the previous release. This gives correct UNIQUENAME results now.

Revision 1087 (2012-06-09) [Download] [Checksum]

Track the UNIQUENAME of ports. Add uniquename and uniquename_dependency tables, plus a sample SQL query listing all the ports using non-unique UNIQUENAME values.

Revision 1069 (2012-06-01) [Download] [Checksum]

Some make variables are lists of keywords: occasionally a keyword appears multiple times in a list (which is an error, really.) This causes the scan phase to fail with a uniqueness constraint violation. Unique the elements of some list-variables that have been seen to be affected.

Revision 972 (2012-01-29) [Download] [Checksum]

Note: if you are upgrading, you will need to drop and recreate the database schema as described below.

Order makefile_variables_type ENUM alphabetically.
Add two new tables -- distfile and distfile_dependency containing size and sha256 checksum data derived from distinfo files, plus SQL code to populate those tables. As part of this, add DISTINFO_FILE, _size and _sha256 to the makefile_variables_type ENUM. _size and _sha256 are read from distinfo files rather than being actual Makefile variables.
Add new flatten subroutine to turn an array of array_refs into the concatenation of the referenced arrays. Apply this to _makefile_mtime (ie. don't join() what will almost immediately be split() apart again), _size and _sha256 data.
New sample query showing how to use distfile data to detect conflicting size or sha256 values between different ports.

Revision 962 (2012-01-20) [Download] [Checksum]

Initial release.

Note: I'm just being lazy and using the SVN revision number as the version, hence the inflated value and irregular sequence.

Setup and Use

Environment

Optionally set some environment variables. These allow you to chose which ports tree to analyse and which database to populate with the results. If left unset, the default values are as shown.

% export PORTSDIR=/usr/ports
% export PGDATABASE=portindex

Create database

% createdb -U pgsql -O ${USER} -l C -T template0 -E SQL_ASCII $PGDATABASE

You don't have to use SQL_ASCII encoding — UTF8 (which the usual default) will be fine. There are rumours that SQL_ASCII is faster though. In any case, using the C locale is important so you get the correct collation order.

Edit ${PGDATA}/data/pg_hba.conf if necessary to set access rights — the scripts default to connecting to a postgresql server on localhost via unix domain socket using the same username as the Unix account and no password. However, all of those defaults are modifiable from the command line or by various standard methods used by PostgreSQL. See perldoc portindexdb for details.

Load the Schema

Note: relies on the PGDATABASE variable to connect to the intended database.

% psql < schema.sql

Run the Script

% portindexdb

This will scan your ports tree and then build the database from the information it collects. It takes about 1.2ks (20 minutes) on my system, most of which time is spent scanning through all the ports tree extracting the raw data.

Enjoy!

There are a number of sample queries in queries.sql, including one that will generate a standard INDEX file, if you turn off the column headings and result padding in psql(1).

Feedback

If you find this useful, or have suggestions for improvements or bug reports, or in fact just want to comment on it in any way, I would be glad to hear from you.

portindexdb — Doing SQL to the FreeBSD Ports Index