So far we’ve been walking only commits. But Git has more types of objects than
that! Let’s see if we can walk all objects, and find out some information
about each one.
We can base our work on an example. git pack-objects prepares all kinds of
objects for packing into a bitmap or packfile. The work we are interested in
resides in builtin/pack-objects.c:get_object_list(); examination of that
function shows that the all-object walk is being performed by
traverse_commit_list() or traverse_commit_list_filtered(). Those two
functions reside in list-objects.c; examining the source shows that, despite
the name, these functions traverse all kinds of objects. Let’s have a look at
the arguments to traverse_commit_list().
- struct rev_info *revs: This is the rev_info used for the walk. If
  its filter member is not NULL, then filter contains information for
  how to filter the object list.
- show_commit_fn show_commit: A callback which will be used to handle each
  individual commit object.
- show_object_fn show_object: A callback which will be used to handle each
  non-commit object (so each blob, tree, or tag).
- void *show_data: A context buffer which is passed in turn to show_commit
  and show_object.
In addition, traverse_commit_list_filtered() takes one more parameter:
- struct oidset *omitted: A set of object IDs which the provided filter
  caused to be omitted.
It looks like these functions use callbacks we provide instead of needing us
to call them repeatedly ourselves. Cool! Let’s add the callbacks first.
For the sake of this tutorial, we’ll simply keep track of how many of each kind
of object we find. At file scope in builtin/walken.c add the following
tracking variables:
static int commit_count;
static int tag_count;
static int blob_count;
static int tree_count;
Commits are handled by a different callback than other objects; let’s do that
one first:
static void walken_show_commit(struct commit *cmt, void *buf)
{
	commit_count++;
}
The cmt argument is fairly self-explanatory. But it’s worth mentioning that
the buf argument is actually the context buffer that we can provide to the
traversal calls - show_data, which we mentioned a moment ago.
Since we have the struct commit object, we can look at all the same parts that
we looked at in our earlier commit-only walk. For the sake of this tutorial,
though, we’ll just increment the commit counter and move on.
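To illustrate how a context buffer like show_data can carry state through the callbacks, here is a minimal standalone sketch. It does not use Git's headers: toy_object, toy_counts, toy_show_fn, and toy_traverse() are hypothetical names invented for this example, standing in for the real object types and traversal calls. The tutorial itself keeps using file-scope counters.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's object types; not Git's real definitions. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
};

/* Counters that travel with the walk instead of living at file scope. */
struct toy_counts {
	int commits;
	int others;
};

/* Callback shape modelled loosely on show_commit_fn and show_object_fn. */
typedef void (*toy_show_fn)(const struct toy_object *obj, void *data);

static void toy_show_commit(const struct toy_object *obj, void *data)
{
	struct toy_counts *counts = data; /* recover the typed context */

	(void)obj;
	counts->commits++;
}

static void toy_show_object(const struct toy_object *obj, void *data)
{
	struct toy_counts *counts = data;

	(void)obj;
	counts->others++;
}

/*
 * Hand every object to the matching callback and pass show_data through
 * untouched, mirroring how the tutorial describes the traversal calls.
 */
static void toy_traverse(const struct toy_object *objs, size_t nr,
			 toy_show_fn show_commit, toy_show_fn show_object,
			 void *show_data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (objs[i].type == TOY_COMMIT)
			show_commit(&objs[i], show_data);
		else
			show_object(&objs[i], show_data);
	}
}
```

A caller would declare a struct toy_counts, pass its address as show_data, and read the totals after toy_traverse() returns. The traversal never needs to know what the pointer refers to.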
The callback for non-commits is a little different, as we’ll need to check
which kind of object we’re dealing with:
static void walken_show_object(struct object *obj, const char *str, void *buf)
{
	switch (obj->type) {
	case OBJ_TREE:
		tree_count++;
		break;
	case OBJ_BLOB:
		blob_count++;
		break;
	case OBJ_TAG:
		tag_count++;
		break;
	case OBJ_COMMIT:
		BUG("unexpected commit object in walken_show_object\n");
	default:
		BUG("unexpected object type %s in walken_show_object\n",
			type_name(obj->type));
	}
}
Again, obj is fairly self-explanatory, and we can guess that buf is the same
context pointer that walken_show_commit() receives: the show_data argument
to traverse_commit_list() and traverse_commit_list_filtered(). Finally,
str contains the name of the object, which ends up being something like
foo.txt (blob), bar/baz (tree), or v1.2.3 (tag).
To help assure us that we aren’t double-counting commits, we’ll include some
complaining if a commit object is routed through our non-commit callback; we’ll
also complain if we see an invalid object type. Since those two cases should be
unreachable, and would only change in the event of a semantic change to the Git
codebase, we complain by using BUG() - which is a signal to a developer that
the change they made caused unintended consequences, and the rest of the
codebase needs to be updated to understand that change. BUG() is not intended
to be seen by the public, so it is not localized.
Our main object walk implementation is substantially different from our commit
walk implementation, so let’s make a new function to perform the object walk. We
can also perform the setup which applies to all objects here, to keep it separate
from the setup which applies only to commit walks.
We’ll start by enabling all types of objects in the struct rev_info. We’ll
also turn on tree_blobs_in_commit_order, which means that we will walk a
commit’s tree and everything it points to immediately after we find each commit,
as opposed to waiting for the end and walking through all trees after the commit
history has been discovered. With the appropriate settings configured, we are
ready to call prepare_revision_walk().
static void walken_object_walk(struct rev_info *rev)
{
	rev->tree_objects = 1;
	rev->blob_objects = 1;
	rev->tag_objects = 1;
	rev->tree_blobs_in_commit_order = 1;
	if (prepare_revision_walk(rev))
		die(_("revision walk setup failed"));
	commit_count = 0;
	tag_count = 0;
	blob_count = 0;
	tree_count = 0;
Let’s start by calling just the unfiltered walk and reporting our counts.
Complete your implementation of walken_object_walk().
We’ll also need to include the list-objects.h header.
#include "list-objects.h"
...
	traverse_commit_list(rev, walken_show_commit, walken_show_object, NULL);
	printf("commits %d\nblobs %d\ntags %d\ntrees %d\n", commit_count,
		blob_count, tag_count, tree_count);
}
Note: This output is intended to be machine-parsed. Therefore, we are not
sending it to trace_printf(), and we are not localizing it - we need scripts
to be able to count on the formatting to be exactly the way it is shown here.
If we were intending this output to be read by humans, we would need to localize
it with _().
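To see why a fixed, unlocalized format matters, consider a consumer of this output. The following standalone sketch parses the four counts with sscanf(); walk_counts and parse_walk_counts() are hypothetical names made up for this illustration and are not part of Git.

```c
#include <stdio.h>

/* Hypothetical container for the four counts printed by the walk. */
struct walk_counts {
	int commits, blobs, tags, trees;
};

/*
 * Parse text laid out like the output of the printf() call above.
 * Returns 0 on success and -1 if the text does not match that layout.
 */
static int parse_walk_counts(const char *text, struct walk_counts *out)
{
	int n = sscanf(text, "commits %d\nblobs %d\ntags %d\ntrees %d",
		       &out->commits, &out->blobs, &out->tags, &out->trees);

	return n == 4 ? 0 : -1;
}
```

A parser like this depends on the literal field names, so a translated or reworded message would make it fail. That is the kind of breakage the note is guarding against.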
Finally, we’ll ask cmd_walken() to use the object walk instead. Discussing
command line options is out of scope for this tutorial, so we’ll just hardcode
a branch we can change at compile time. Where you call final_rev_info_setup()
and walken_commit_walk(), instead branch like so:
	if (1) {
		add_head_to_pending(&rev);
		walken_object_walk(&rev);
	} else {
		final_rev_info_setup(argc, argv, prefix, &rev);
		walken_commit_walk(&rev);
	}
Note: For simplicity, we’ve avoided all the filters and sorts we applied in
final_rev_info_setup() and simply added HEAD to our pending queue. If you
want, you can certainly use the filters we added before by moving
final_rev_info_setup() out of the conditional and removing the call to
add_head_to_pending().
Now we can try to run our command! It should take noticeably longer than the
commit walk, but an examination of the output will give you an idea why. Your
output should look similar to this example, but with different counts:
commits 55733
blobs 100274
tags 0
trees 104210
This makes sense. We have more trees than commits because the Git project has
lots of subdirectories which can change, plus at least one tree per commit. We
have no tags because we started on a commit (HEAD) and while tags can point to
commits, commits can’t point to tags.
Note: You will have different counts when you run this yourself! The number of
objects grows along with the Git project.
Adding a Filter
There are a handful of filters that we can apply to the object walk; they are laid
out in Documentation/rev-list-options.txt. These filters are typically useful for
operations such as creating packfiles or performing a partial clone. They are
defined in list-objects-filter-options.h. For the purposes of this tutorial we
will use the "tree:1" filter, which causes the walk to omit all trees and blobs
which are not directly referenced by commits reachable from the commit in
pending when the walk begins. (pending is the list of objects which need to
be traversed during a walk; you can imagine a breadth-first tree traversal to
help understand. In our case, that means we omit trees and blobs not directly
referenced by HEAD or HEAD's history, because we begin the walk with only
HEAD in the pending list.)
For now, we are not going to track the omitted objects, so we can keep calling
traverse_commit_list(), which takes no omitted parameter. For the sake of
simplicity, we’ll add a simple build-time branch to use our filter or not.
Preface the line calling traverse_commit_list() with the following, which will
remind us which kind of walk we’ve just performed:
	if (0) {
		/* Unfiltered: */
		trace_printf(_("Unfiltered object walk.\n"));
	} else {
		trace_printf(
			_("Filtered object walk with filterspec 'tree:1'.\n"));
		CALLOC_ARRAY(rev->filter, 1);
		parse_list_objects_filter(rev->filter, "tree:1");
	}
	traverse_commit_list(rev, walken_show_commit,
			     walken_show_object, NULL);
The rev->filter member is usually built directly from a command
line argument, so the module provides an easy way to build one from a string.
Even though we aren’t taking user input right now, we can still build one with
a hardcoded string using parse_list_objects_filter().
With the filter spec "tree:1", we are expecting to see only the root tree for
each commit; therefore, the tree object count should be less than or equal to
the number of commits. (For an example of why that’s true: a commit created by
git revert to undo its parent points to the same tree object as its grandparent.)
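The depth rule behind "tree:1" can be pictured with a small standalone model. This is only a sketch of the idea as described above, not Git's filter implementation: toy_entry, toy_tally, and toy_walk() are hypothetical names, and the root tree of a commit is assumed to sit at depth 0.

```c
#include <stddef.h>

/* Hypothetical in-memory model of a tree or blob; not Git's struct object. */
struct toy_entry {
	int is_tree;                      /* 1 for a tree, 0 for a blob */
	const struct toy_entry *children; /* entries of a tree, else NULL */
	size_t nr_children;
};

struct toy_tally {
	int trees;
	int blobs;
};

/*
 * Visit `entry`, located `depth` levels below the commit's root tree
 * (the root itself is at depth 0). Anything at depth >= limit is
 * skipped together with everything beneath it.
 */
static void toy_walk(const struct toy_entry *entry, int depth, int limit,
		     struct toy_tally *tally)
{
	size_t i;

	if (depth >= limit)
		return;

	if (entry->is_tree)
		tally->trees++;
	else
		tally->blobs++;

	for (i = 0; i < entry->nr_children; i++)
		toy_walk(&entry->children[i], depth + 1, limit, tally);
}
```

With a limit of 1, a root tree holding one blob and one subdirectory yields one tree and no blobs, which matches the expectation of at most one tree per commit. Raising the limit to 2 would also count the root's direct entries.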
Counting Omitted Objects
We also have the capability to enumerate all objects which were omitted by a
filter, like with git rev-list --objects --filter=<spec> --filter-print-omitted. Asking
traverse_commit_list_filtered() to populate the omitted list means that our
object walk does not perform any better than an unfiltered object walk; all
reachable objects are walked in order to populate the list.
First, add the struct oidset and related items we will use to iterate it:
#include "oidset.h"
...
static void walken_object_walk(
	...
	struct oidset omitted;
	struct oidset_iter oit;
	struct object_id *oid = NULL;
	int omitted_count = 0;
	oidset_init(&omitted, 0);
	...
Replace the call to traverse_commit_list() with a call to
traverse_commit_list_filtered(), passing in your omitted set:
	...
	traverse_commit_list_filtered(rev,
		walken_show_commit, walken_show_object, NULL, &omitted);
	...
Then, after your traversal, the oidset traversal is pretty straightforward.
Count all the objects within and modify the print statement:
	/* Count the omitted objects. */
	oidset_iter_init(&omitted, &oit);
	while ((oid = oidset_iter_next(&oit)))
		omitted_count++;
	printf("commits %d\nblobs %d\ntags %d\ntrees %d\nomitted %d\n",
		commit_count, blob_count, tag_count, tree_count, omitted_count);
By running your walk with and without the filter, you should find that the total
object count in each case is identical. You can also time each invocation of
the walken subcommand, with and without omitted being passed in, to confirm
to yourself the runtime impact of tracking all omitted objects.
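The following standalone sketch models why tracking omitted objects removes the speedup from filtering. It is a toy under stated assumptions and not how Git implements filters: toy_node, toy_result, and toy_filtered_walk() are hypothetical names, and the filter is reduced to a simple depth limit.

```c
#include <stddef.h>

/* Hypothetical in-memory model of a tree or blob; not Git's struct object. */
struct toy_node {
	const struct toy_node *children; /* entries of a tree, NULL for a blob */
	size_t nr_children;
};

struct toy_result {
	int shown;   /* objects handed to the callbacks */
	int omitted; /* objects the filter held back */
	int visited; /* objects the walk had to touch */
};

/*
 * Depth-limited walk. When track_omitted is zero the walk stops at the
 * limit; when it is nonzero the walk keeps descending so that every
 * held-back object can be recorded.
 */
static void toy_filtered_walk(const struct toy_node *node, int depth,
			      int limit, int track_omitted,
			      struct toy_result *res)
{
	size_t i;

	if (depth >= limit) {
		if (!track_omitted)
			return;
		res->omitted++;
	} else {
		res->shown++;
	}
	res->visited++;

	for (i = 0; i < node->nr_children; i++)
		toy_filtered_walk(&node->children[i], depth + 1, limit,
				  track_omitted, res);
}
```

In this model, shown plus omitted always equals the number of reachable objects once tracking is on, and visited grows to that same number. That mirrors the identical totals and the lost performance benefit described above.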
Changing the Order
Finally, let’s demonstrate that you can also reorder walks of all objects, not
just walks of commits. First, we’ll make our handlers chattier - modify
walken_show_commit() and walken_show_object() to print the object as they
go:
#include "hex.h"
...
static void walken_show_commit(struct commit *cmt, void *buf)
{
	trace_printf("commit: %s\n", oid_to_hex(&cmt->object.oid));
	commit_count++;
}
static void walken_show_object(struct object *obj, const char *str, void *buf)
{
	trace_printf("%s: %s\n", type_name(obj->type), oid_to_hex(&obj->oid));
	...
}
Note: Since we will be examining this output directly as humans, we’ll use
trace_printf() here. Additionally, since this change introduces a significant
number of printed lines, using trace_printf() will allow us to easily silence
those lines without having to recompile.
(Leave the counter increment logic in place.)
With only that change, run again (but save yourself some scrollback):
$ GIT_TRACE=1 ./bin-wrappers/git walken 2>&1 | head -n 10
Take a look at the top commit with git show and the object ID you printed; it
should be the same as the output of git show HEAD.
Next, let’s change a setting on our struct rev_info within
walken_object_walk(). Find where you’re changing the other settings on rev,
such as rev->tree_objects and rev->tree_blobs_in_commit_order, and add the
reverse setting at the bottom:
...
	rev->tree_objects = 1;
	rev->blob_objects = 1;
	rev->tag_objects = 1;
	rev->tree_blobs_in_commit_order = 1;
	rev->reverse = 1;
...
Now, run again, but this time, let’s grab the last handful of objects instead
of the first handful:
$ make
$ GIT_TRACE=1 ./bin-wrappers/git walken 2>&1 | tail -n 10
The last commit object given should have the same OID as the one we saw at the
top before, and running git show <oid> with that OID should give you again
the same results as git show HEAD. Furthermore, if you run and examine the
first ten lines again (with head instead of tail like we did before applying
the reverse setting), you should see that now the first commit printed is the
initial commit, e83c5163.