gcooper: Modifications to pkg_install — the positive and (slightly) negative implications

Commentary:

What’s being profiled:
In my simple profiling Perl scripts I count the number of iterations I see a given header tag, along with count the time reported by clock_gettime(3). I’ve placed simple printf’s around the following subroutines/sections of code, as I believe they are critical (high number of iterations, large amount of disk access time) pkg_install sections:

add/perform.c:

  • setenv(_TOP)
  • +CONTENTS fopen(3)

lib/file.c:

  • copy_file(..)
  • move_file(..)
  • copy_hierarchy(..)
  • unpack(..)

lib/plist.c:

  • real_plist(..)
  • write_plist(..)
  • delete_package(..)
  • delete_hierarchy(..)
  • plist_cmd(..)
  • add_plist(..)
  • add_plist_top(..)

Changes made over last couple weeks:
A couple of weeks ago I tried to put a 10 line buffer in read_plist(..). This is what I refer to as the “old results” in the graph.
Over this last week I took the read_plist(..) function and converted it from a function like this (“old”):

while(!feof(file_p)) {
    /** fgets(file_p) **/
    /** process line in plist_cmd(..) **/
}

to a function like this (“new”):

/** malloc buffer large enough to hold +CONTENTS **/
while(!feof(file_p)) {
    /** add content via fgetc **/
}

/** split lines via strtok("n") and process in plist_cmd(..) **/

I also took the liberty of converting the plist_cmd(..) to what I consider a faster branch selection process. Instead of doing something like this (“old”):

if(!strcmp({command string directive 1}, command_str_p)) {
    return PLIST_{command string directive 1};/** command string directives 2 through n-1 are here **/
} else if(!strcmp({command string directive n}, command_str_p)) {
    return PLIST_{command string directive n};
} else
    return PLIST_FAIL;

I tried to unroll strcmp a bit by checking for the first character of the command, then run strcmp on a shortened version of the string, decreasing the search radix of the string accordingly.

This results in code similar to this (I’m using adapted bits of the real thing because it’s easier) (“new”):

if(command_str_p[0] == 'c') {
    if(command_str_p[1] == 'o') {
        if( !strcmp("mment", &command_str_p[2]) ) { /* comment */
            /** In the real code it checks for DEPORIGIN and ORIGIN data here. **/
            return PLIST_COMMENT;
        } else if(!strcmp("nflicts", &command_str_p[2])) /* conflicts */
            return PLIST_CONFLICTS;
    }
}

/** process the rest of the command string directives here **/
return PLIST_FAIL;

I also modified the heuristics so that what I thought the higher probability directives were processed first, then the lower probability heuristics last.

Results interpretation:
I’ve taken the rounds from my old modifications (10 line buffer, then process line by line), and compared them against my new modifications (buffer all lines, then process line by line; modified directive processing heuristics).

Most of the values appear to be constant and the mins and maxes are remaining in the same ballpark as one another (*) (which I would expect). However, the amount of time for read_plist has been decreased a large amount 1-0.000957/0.000793 = ~20%, and the other raw amounts of time are consistent with this value (see throughput ratios off to right side of the HTML file).

So, while my overall results appear to have benefited little in the sections that I’ve modified (mostly read_plist), I appear to have positively affected all other sections of code that I’m profiling by the same ratio (20% decrease, or Old/New = ~1.2 in the Throughput Ratio column). Also, although my debug statements have blown up the size of plist_cmd iterations and time spent in that section, the Time Spent Throughput Ratio hasn’t increased significantly (~17%), and I believe once I solve the bug I’m seeing (*), the numbers encountered should drop as well (as long as gcc isn’t doing some smart loop unrolling or other optimizations with strcmp when compiled with -O2 -march=pentium4; if true, I’ll need to revise my heuristics).

Summary:
My results, although small appear to have positively affected all profiled instructions apart from plist_cmd (*). Once that issue is solved there should be a decrease across the board for all profiled functions.

Similar optimizations may be possible in other sections of code. I’ll need to research whether or not that’s possible.

Notes:
(*) Unfortunately I’ve come across some sort of bug where my debug/profiling statements are making it into plist_cmd(..) and I’ve estimated that it’s tripled the amount of iterations. I think that the bug is being caused by piped commands read in via STDIN to pkg_add. I plan to solve this by using libarchive and specifying packages via arguments on the command line, instead of via STDIN like it’s done currently.

All results are available at http://students.washington.edu/youshi10/posted/atk-results.tgz, and more results will be forthcoming (in particular after I come up with a more streamlined means of converting data from CSV to Excel / HTML formatted files).

gcooper: The inefficiencies of pkgdb.db…

Continuing on with my analysis of the Ruby tools, I find that pkgdb.db is really inefficiently mapping files to packages, and packages to origins.

In the given pkgdb.db I have (which is toy in comparison to my server and other desktops’ pkgdb.db files — only 25 installed packages), I am seeing the following pattern (this is completely random because the BDB Hash DB is dumped in sequential order, and I assume the metadata contents are completely unordered, to some extent):

{?Package Origin

Package Name}

{File attached to package

Package Name}

:origins

{Package Origins}

:pkgnames

{Package Names}

:db_version

{Stamped database version}

:mtime

{The time when the database was made}

Now, that’s all fine and dandy, except for the fact that the pkgdb.db file is extremely bloated. I quickly made a Perl script to analyze the contents of the raw pkgdb.db file. (Link).

The output I obtained was:

analyze_pkgdb.pl < pkgdb-hash.raw

Number of origins found: 48

Package SUMMARY:
autoconf-2.59_2, File count: 63
bash-3.1.17, File count: 86
compat5x-amd64-5.4.0.8_8, File count: 198
db41-4.1.25_4, File count: 910
gamin-0.1.8, File count: 10
gettext-0.16.1_1, File count: 354
glib-2.12.12, File count: 180
gmake-3.81_1, File count: 26
help2man-1.36.4_1, File count: 12
libiconv-1.9.2_2, File count: 19
libtool-1.5.22_4, File count: 25
localedata-5.4, File count: 327
m4-1.4.9, File count: 36
p5-gettext-1.05_1, File count: 5
pciutils-2.2.3, File count: 7
perforce-06.1_6,1, File count: 7
perl-5.8.8, File count: 1970
pkg-config-0.21, File count: 3
pkg_cutleaves-20061113, File count: 2
portupgrade-2.2.6_3,2, File count: 44
ruby-1.8.6,1, File count: 769
ruby18-bdb1-0.2.3, File count: 1
samba-3.0.24,1, File count: 626
vim-lite-7.0.224, File count: 1306

Metadata SUMMARY:
Key: db_version,
Data: 48[7:cFreeBSDic
Key: mtime,
Data: 48u:9Timed\8c\d2\1a\80@O\d8$
Key: origins,
Data: converters/libiconv databases/db41 databases/ruby-bdb1 devel/autoconf259 devel/gamin devel/gettext devel/glib20 devel/gmake devel/libtool15 devel/m4 devel/p5-Locale-gettext devel/perforce devel/pkg-config editors/vim-lite lang/perl5.8 lang/ruby18 misc/compat5x misc/help2man misc/localedata net/samba3 ports-mgmt/pkg_cutleaves ports-mgmt/portupgrade shells/bash sysutils/pciutils
Key: pkgnames,
Data: autoconf-2.59_2 bash-3.1.17 compat5x-amd64-5.4.0.8_8 db41-4.1.25_4 gamin-0.1.8 gettext-0.16.1_1 glib-2.12.12 gmake-3.81_1 help2man-1.36.4_1 libiconv-1.9.2_2 libtool-1.5.22_4 localedata-5.4 m4-1.4.9 p5-gettext-1.05_1 pciutils-2.2.3 perforce-06.1_6,1 perl-5.8.8 pkg-config-0.21 pkg_cutleaves-20061113 portupgrade-2.2.6_3,2 ruby-1.8.6,1 ruby18-bdb1-0.2.3 samba-3.0.24,1 vim-lite-7.0.224
Done

Take a gander at the Perl reference though under Package SUMMARY though. 1970 lines had “perl-5.8.8″ in them. That’s 1969 lines more than there should be!!! Add up large ports like OpenOffice, the (Sun) JDK, or all of the X.org ports, and you have one bloated pkgdb.db file.

Of course, it does become an issue if 2 ports own the same file, but that shouldn’t occur and the ports should be setup to conflict with one another in my opinion, instead of causing such a “deadlock”…

Time to ask the port gurus what the next best course of action might be…

Note: A raw copy of  the hash version of my pkgdb can be found (here).

gcooper: Behind INDEX-*.db

Seizing the opportunity of having more free time from my internship, I’ve been able to unravel a bit more behind INDEX-*.db’s format.

Based on the information dumped from MySQL 5.0.x’s db_dump185 source, I can see that the Ruby generated the following formatted database:

:categories{Categories}
:db_version
[db specific version string]
:origins

{Package name
Package origin}
:pkgnames
{Package name
Package origin}
:virtual categories
{
{?Virtual Category Name
{complete Virtual Category Package Origins}
}
{Package name
Package origin}
{Origin
Package Name|Path|Prefix|Comment|Description|Maintainer|{Categories}|{Build Dependencies}|{Run Dependencies}|Website|{Extract Dependencies}|{Patch Dependencies}|{Fetch Dependencies}
a}
}
Observations:

The reason for:

{Package name

Package origin}

being injected every once in a while is probably due to the overflow facility of the BDB database format, combined with the fact that my output was from a raw database dump.

Also, it does appear as if the:

{Package name

Package origin}

set is sorted by “Package name”, which I find a) interesting and b) inefficient, depending on the algorithm used to extract the fields from the INDEX-* file.

Summary:

So, the ruby ports management scripts tack on an additional metadata to the existing INDEX-* file most likely for what the author considered to be wise for looking up ports / packages. It may decrease the search time, but serves only to increase the overall raw INDEX-* data by 2.1MB, which results in greater pre-/post-processing.

Notes:

gcooper@optimus ~/gcooper/Desktop
$ ls -lh INDEX-7 INDEX-7.raw
-rwx------+ 1 gcooper None 9.9M Apr 14 15:34 INDEX-7
-rwx------+ 1 gcooper None 12M May 19 00:17 INDEX-7.raw

  • The

{Origin

Package Name|Path|Prefix|Comment|Description|Maintainer|{Categories}|{Build Dependencies}|{Run Dependencies}|Website|{Extract Dependencies}|{Patch Dependencies}|{Fetch Dependencies}

a}
set appears to have been taken verbatim from the INDEX-* file. See http://www.lpthe.jussieu.fr/~talon/freebsdports.html#htoc11 for more details.

Notation:

  • “{ }” : denotes pattern repeating list, typically space delimited.
  • code blocks denote verbatim string segments or characters.
  • Italicized text denotes package metadata fields.

gcooper: /usr/lib/libc/db/ and friends, and INDEX’es

So, over the past couple days I’ve been looking over the db code under libc, trying to get a grasp on what needs to be done to get the packaging stuff up and running, and it turns out that my results have been fairly successful. In order to remain fully compatible with the ruby packaging tools I plan on first implementing BDB btree(3)’s (plus they’re simpler to quickly implement from what I’ve seen), and then take that info and work my way up to BDB hash(3) table.

I noticed that a little bit of the btree code could be reduced by using C macros, so I’ve done that. I need to run make on the db directory though and ensure that stuff passes on my machine before integrating the source into my remote tree. Anybody know of any software that uses BDB 1.85 and is in ports :)? Or I suppose I could use diff with the assembly code >_>..

Anyhow, after that I’ve been taking a look at the INDEX format currently in play with /usr/ports/INDEX-[5-7]. After some discussion and helpful pointers from Mark Linimon, Matthew Seaman, and Michel Talon, I’ve figured out that the formatting of the fields are as follows:

  1. pkgname
  2. path
  3. prefix
  4. comment
  5. descr
  6. maintainer
  7. categories
  8. build_deps
  9. run_deps
  10. website
  11. extract_deps
  12. patch_deps
  13. fetch_deps

Note: The above list was derived from http://www.lpthe.jussieu.fr/~talon/freebsdports.html#htoc11.

So, now all I need to do is write simple sprintf and sscanf one liners converting the strings in the DBT.data fields into the appropriate fields and pass the data along to the appropriate package utils. Easier said than done, but it shouldn’t be too bad.

Until next week :).

gcooper: Work, work, and more work

I apologize for the lack of info, but work at Intel is proving to be a bit challenging right now (pending internal tool release soon at the end of the financial Quarter).

I’ve looked at the internals of db.h, pkg_which, pkgdb, /usr/ports/INDEX-7* and played around with perforce a bit, but I haven’t had a lot of time to sit down with everything and work over an extended period of time.

I will update further, once I have the opportunity to sit down and work on this stuff long term.

gcooper: First Post

So, this is my first post to my FreeBSD blog. In here I’ll discuss my triumphs, tribulations, and the entire learning process of my SoC 2007 project, of which it’s comprised of several mini-projects:

  1. Revise FreeBSD’s package tools (pkg_*) to use a BDB hash table backend.
  2. Write a common Bourne shell frontend for all pkg_* tools to help unify ports and package management.
  3. Create virtuals for packages included in the base system.
  4. Backport all completed items to 5.x and 6.x if possible.

1. is definitely my first priority goal and will probably take me around 2 months to complete, but I don’t plan on skimping on the other two goals.

For now I need to play with INDEX.7, and db.h / BDB 1.85 documentation (included in /usr/src/lib/libc/db/docs I discovered thanks to Kris@ :)…).

Until next I blog..