
Querying Git History with Datalog - Part 2: Parsing Git Objects with JGit

20 February 2018

In Part 1, we covered the Datomic schema for storing git objects. Now we need to actually parse those objects out of a git repository. This means dealing with JGit, Java’s git implementation, and walking the git object graph.

The challenge: git objects reference each other by SHA. A commit points to a tree SHA, trees contain blob SHAs. We need to resolve all these references and load everything into Datomic in the right order.


JGit Basics

JGit represents git objects with Java classes:

(import '[org.eclipse.jgit.revwalk RevCommit RevTree]
        '[org.eclipse.jgit.lib Repository ObjectId]
        '[org.eclipse.jgit.treewalk TreeWalk])

;; Load a repository
(def repo (load-repo "/path/to/repo"))

;; Get all commits
(def commits (rev-list repo))  ;; Returns seq of RevCommit objects

Each RevCommit is a handle to the commit object. To get its data, you call methods:

(.getId commit)            ;; ObjectId; call .name on it for the hex SHA string
(.getShortMessage commit)  ;; First line of commit message
(.getFullMessage commit)   ;; Full commit message
(.getCommitTime commit)    ;; Commit time as an int, in seconds since the Unix epoch
(.getParents commit)       ;; Array of parent RevCommits
(.getTree commit)          ;; RevTree object
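
Because getCommitTime reports seconds while java.util.Date expects milliseconds, the value has to be scaled before it can be stored as an instant. A minimal sketch of that conversion; the helper name commit-seconds->date is hypothetical and the timestamp is just a sample value:

```clojure
(import 'java.util.Date)

(defn commit-seconds->date
  "Converts a commit time in seconds since the epoch to a java.util.Date."
  [seconds]
  ;; Widen to long before multiplying so the arithmetic is done on longs.
  (Date. (* 1000 (long seconds))))

;; Sample value: midnight UTC on the day this post was published.
(str (.toInstant (commit-seconds->date 1519084800)))
;; => "2018-02-20T00:00:00Z"
```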


The Protocol Approach

Rather than dealing with JGit classes directly, we define a protocol for git objects:

(defprotocol IGitObject
  (parse [object repo]
    "Parse a JGit object into a Clojure record")

  (nodes [object]
    "Returns list of child object SHAs for graph traversal")

  (serialize [object]
    "Serializes object into Datomic entity map"))

This lets us treat all git objects uniformly - commits, trees, blobs all implement the same interface.
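
To make the dispatch idea concrete, here is a toy sketch that has nothing to do with JGit. The protocol IDescribe, the record Point, and the function describe are all hypothetical names chosen for illustration. It shows one protocol function working across a Java class and a Clojure record:

```clojure
;; Toy protocol; all names here are made up for the example.
(defprotocol IDescribe
  (describe [x] "Returns a short description of x"))

(defrecord Point [x y])

(extend-protocol IDescribe
  String                       ;; a Java class
  (describe [s] (str "string of length " (count s)))

  Point                        ;; a Clojure record
  (describe [p] (str "point at " (:x p) "," (:y p))))

;; The caller does not need to know which type it is holding.
(mapv describe ["abc" (->Point 1 2)])
;; => ["string of length 3" "point at 1,2"]
```

The same mechanism is what allows a single call site to handle commits, trees, and blobs.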


Records for Git Objects

We define Clojure records to hold parsed data:

(defrecord Commit [sha msg message time tree parents])
(defrecord Tree [sha nodes])
(defrecord Blob [sha uri])
(defrecord Node [type sha filename mode])

These are plain Clojure data - no JGit dependencies. Once parsed, we work with these records instead of JGit objects.
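
A quick REPL-style sketch of what "plain data" buys us. The SHA and URI values below are invented for the example. Records support keyword lookup like maps, yet remain distinct types that instance? can tell apart, which the later steps rely on:

```clojure
(defrecord Blob [sha uri])

;; Sample values, not taken from a real repository.
(def b (->Blob "abcd1234" "file:///tmp/blobs/ab/cd/abcd1234"))

(:sha b)             ;; => "abcd1234"   keyword lookup, like a map
(instance? Blob b)   ;; => true         still a distinct type
(into {} b)          ;; => {:sha "abcd1234", :uri "file:///tmp/blobs/ab/cd/abcd1234"}

;; assoc returns a new record and leaves b unchanged.
(def moved (assoc b :uri "file:///elsewhere"))
(:uri moved)         ;; => "file:///elsewhere"
```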


Parsing Commits

Implementing the protocol for commits:

(extend-protocol IGitObject
  ;; parse dispatches on the JGit class and produces a Commit record
  RevCommit
  (parse [commit repo]
    (let [sha (.name (.getId commit))
          msg (.getShortMessage commit)
          message (.getFullMessage commit)
          ;; getCommitTime is in seconds; java.util.Date wants milliseconds
          time (java.util.Date. (* 1000 (long (.getCommitTime commit))))
          tree (.name (.getId (.getTree commit)))
          parents (mapv #(.name (.getId %)) (.getParents commit))]
      (->Commit sha msg message time tree parents)))

  ;; nodes and serialize dispatch on the parsed Commit record, because the
  ;; record (not the RevCommit) is what the rest of the pipeline passes around
  Commit
  (nodes [commit]
    ;; Return the tree SHA - we need to parse the tree next
    [(:tree commit)])

  (serialize [commit]
    {:git/sha (:sha commit)
     :git/type :git.types/commit
     :git.commit/msg (:msg commit)
     :git.commit/message (:message commit)
     :git.commit/time (:time commit)}))

Notice nodes returns the tree SHA. This is how we know what to parse next. When walking the object graph, we start with commits, then follow the SHAs in nodes to parse trees and blobs.


Parsing Trees

Trees are trickier. A tree contains multiple entries (nodes), each pointing to either another tree or a blob:

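(extend-protocol IGitObject
  RevTree
  (parse [tree repo]
    (let [sha (.name (.getId tree))]
      ;; TreeWalk/forPath resolves a single path inside a tree. To list the
      ;; tree's own entries, add the tree to a fresh TreeWalk instead.
      (with-open [walk (TreeWalk. repo)]
        (.addTree walk tree)
        (.setRecursive walk false)  ;; Don't recurse into subtrees
        (let [nodes (loop [acc []]
                      (if (.next walk)
                        (recur
                         (conj acc
                               (->Node
                                (if (.isSubtree walk)
                                  :git.types/tree
                                  :git.types/blob)
                                (.name (.getObjectId walk 0))
                                (.getNameString walk)
                                ;; Raw mode bits as an octal string, e.g. "100644"
                                (Integer/toOctalString (.getRawMode walk 0)))))
                        acc))]
          (->Tree sha nodes)))))

  ;; nodes and serialize dispatch on the parsed Tree record
  Tree
  (nodes [tree]
    ;; Return SHAs of all child nodes
    (map :sha (:nodes tree)))

  (serialize [tree]
    {:git/sha (:sha tree)
     :git/type :git.types/tree
     ;; serialize-node (not shown in this post) turns a Node into an entity map
     :git.tree/nodes (mapv serialize-node (:nodes tree))}))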

TreeWalk is JGit’s API for iterating tree entries. We call .next repeatedly to walk through all entries, creating a Node record for each.


Parsing Blobs

Blobs are simple - they’re just content. In Muramasa, we store a URI instead of the raw bytes:

(extend-protocol IGitObject
  ObjectId  ;; Blobs are referenced by ObjectId, not RevBlob
  (parse [object-id repo]
    ;; Close the ObjectReader once the content has been read
    (with-open [reader (.newObjectReader repo)]
      (let [sha (.name object-id)
            loader (.open reader object-id)
            bytes (.getBytes loader)  ;; Loads the whole blob into memory
            uri (write-blob-to-disk sha bytes *blob-dir*)]
        (->Blob sha uri))))

  ;; nodes and serialize dispatch on the parsed Blob record
  Blob
  (nodes [blob]
    [])  ;; Blobs have no children

  (serialize [blob]
    {:git/sha (:sha blob)
     :git/type :git.types/blob
     :git.blob/uri (:uri blob)}))

write-blob-to-disk stores the blob content on disk with a directory structure:

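(require '[clojure.java.io :as io])

(defn write-blob-to-disk
  "Writes bytes under blob-dir, sharded by the first four characters of sha.
  Returns a file:// URI string pointing at the written file."
  [sha bytes blob-dir]
  (let [prefix1 (subs sha 0 2)   ;; first directory level, e.g. "ab"
        prefix2 (subs sha 2 4)   ;; second directory level, e.g. "cd"
        dir (io/file blob-dir prefix1 prefix2)
        file (io/file dir sha)]
    (.mkdirs dir)                ;; no-op if the directories already exist
    (with-open [out (io/output-stream file)]
      (.write out bytes))
    (str "file://" (.getAbsolutePath file))))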

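This creates paths like blobs/ab/cd/abcd1234..., a two-level fan-out loosely modeled on git’s own loose-object layout, where the first two hex characters of the SHA name a directory and the remaining characters name the file. Unlike git, we keep the full SHA as the filename.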
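
To see the sharding on its own, here is a small sketch that only computes the target path and never touches the filesystem. The function name blob-path and the sample SHA are hypothetical:

```clojure
(require '[clojure.java.io :as io])

(defn blob-path
  "Returns the java.io.File where a blob with this SHA would be stored."
  [blob-dir sha]
  (io/file blob-dir (subs sha 0 2) (subs sha 2 4) sha))

;; Shortened sample SHA for readability.
(.getPath (blob-path "blobs" "abcd1234"))
;; => "blobs/ab/cd/abcd1234" on Unix-like systems
```

Spreading files over two directory levels keeps any single directory from growing very large when a repository has many blobs.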


Walking the Object Graph

The sync process walks the entire object graph, starting from commits:

(defn walk-objects [repo commits db]
  ;; A bare ObjectId carries no type information, so ask a RevWalk for the
  ;; typed object (RevCommit, RevTree or RevBlob) before dispatching to parse.
  (with-open [rev-walk (org.eclipse.jgit.revwalk.RevWalk. repo)]
    (loop [to-visit (set (map #(.name (.getId %)) commits))
           visited #{}
           objects {}]

      (if (empty? to-visit)
        objects  ;; Done, return all parsed objects

        (let [sha (first to-visit)
              remaining (disj to-visit sha)]

          ;; Skip if already visited or in database
          (if (or (visited sha) (db-has-sha? db sha))
            (recur remaining (conj visited sha) objects)

            ;; Parse the object
            (let [obj-id (ObjectId/fromString sha)
                  parsed (parse (.parseAny rev-walk obj-id) repo)
                  ;; nodes is called on the parsed record
                  child-shas (nodes parsed)]

              ;; Add parsed object and queue children
              (recur
               (into remaining child-shas)
               (conj visited sha)
               (assoc objects sha parsed)))))))))

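Strictly speaking this is a worklist traversal rather than a breadth-first walk: to-visit is a set, so (first to-visit) picks SHAs in no particular order. The result is the same. We start with commit SHAs, parse each commit, queue its tree SHA, parse the tree, queue the SHAs of its entries, and parse those in turn. Git’s object graph has no cycles, but many commits share the same trees and blobs, and the visited set makes sure each object is parsed only once.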
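
The traversal is easier to follow without JGit in the picture. Below is a sketch over a hypothetical in-memory graph (a map from SHA to child SHAs, with made-up short SHAs) in which two commits share one tree. The function name walk-graph is an assumption for this example:

```clojure
;; Hypothetical object graph: commits c1 and c2 both point at tree t1.
(def graph
  {"c1" ["t1"]
   "c2" ["t1"]
   "t1" ["b1" "b2"]
   "b1" []
   "b2" []})

(defn walk-graph
  "Visits every SHA reachable from roots exactly once.
  Returns the SHAs in the order they were visited."
  [graph roots]
  (loop [to-visit (set roots)
         visited  #{}
         order    []]
    (if (empty? to-visit)
      order
      (let [sha       (first to-visit)
            remaining (disj to-visit sha)]
        (if (visited sha)
          ;; Already handled: drop it and move on.
          (recur remaining visited order)
          ;; First visit: record it and queue its children.
          (recur (into remaining (get graph sha))
                 (conj visited sha)
                 (conj order sha)))))))

(def result (walk-graph graph ["c1" "c2"]))

(count result)  ;; => 5, t1 is visited once although two commits reference it
(set result)    ;; => #{"c1" "c2" "t1" "b1" "b2"}
```

The visit order itself is not guaranteed, since the worklist is a set.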


Dependency Order

Git objects have dependencies: a commit references a tree (and its parent commits), and a tree references blobs and other trees.

When transacting to Datomic, we need to ensure dependencies exist first. Otherwise, a commit referencing a tree that doesn’t exist yet will fail.

(defn prepare-transaction [objects]
  (let [blobs (filter #(instance? Blob (val %)) objects)
        trees (filter #(instance? Tree (val %)) objects)
        commits (filter #(instance? Commit (val %)) objects)]

    ;; Serialize in dependency order
    (vec (concat
          (map #(serialize (val %)) blobs)
          (map #(serialize (val %)) trees)
          (map #(serialize (val %)) commits)))))

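By placing blobs first, then trees, then commits, the entity vector mirrors the dependency order. Note that ordering alone does not make lookup refs work inside one transaction: Datomic resolves a lookup ref against the database as it was before the transaction, so it cannot point at an entity created in that same transaction. That is why the references between entities are added in a separate pass.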
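
The same ordering can also be expressed as a sort over already-serialized entity maps. This is only a sketch of an alternative formulation, not the approach used above. The names type-rank and order-by-dependency are hypothetical, and the sample entities carry made-up SHAs:

```clojure
;; Lower rank means fewer dependencies, so it should come first.
(def type-rank
  {:git.types/blob   0
   :git.types/tree   1
   :git.types/commit 2})

(defn order-by-dependency
  "Sorts entity maps so blobs precede trees and trees precede commits."
  [entities]
  (sort-by (comp type-rank :git/type) entities))

(def sample-entities
  [{:git/sha "c1" :git/type :git.types/commit}
   {:git/sha "b1" :git/type :git.types/blob}
   {:git/sha "t1" :git/type :git.types/tree}])

(mapv :git/sha (order-by-dependency sample-entities))
;; => ["b1" "t1" "c1"]
```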


Handling References

Entity references in Datomic use lookup refs - [:git/sha "abc123"] means “the entity whose :git/sha is abc123.”

When serializing a commit, we need to convert SHA strings to lookup refs:

(defn prepare-references [objects]
  (let [commits (filter #(instance? Commit (val %)) objects)]
    (mapcat
     (fn [[sha commit]]
       (concat
        ;; Add tree reference
        (when-let [tree-sha (:tree commit)]
          [{:db/id [:git/sha sha]
            :git.commit/tree [:git/sha tree-sha]}])

        ;; Add parent references
        (map (fn [parent-sha]
               {:db/id [:git/sha sha]
                :git.commit/parents [:git/sha parent-sha]})
             (:parents commit))))
     commits)))

This creates a second transaction that adds the references after entities exist. It’s a two-pass approach: first create entities, then wire them together.
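
It helps to see what the second-pass data looks like for one commit. The sketch below uses a plain map with the same :sha, :tree and :parents keys as the Commit record, and made-up short SHAs. The function name commit-ref-maps is hypothetical. No Datomic connection is needed, since the output is ordinary Clojure data:

```clojure
;; A hypothetical merge commit with two parents.
(def sample-commit
  {:sha "c1" :tree "t1" :parents ["p1" "p2"]})

(defn commit-ref-maps
  "Builds one entity map per outgoing reference of a commit."
  [{:keys [sha tree parents]}]
  (concat
   ;; One map linking the commit to its tree.
   (when tree
     [{:db/id [:git/sha sha]
       :git.commit/tree [:git/sha tree]}])
   ;; One map per parent commit.
   (for [parent parents]
     {:db/id [:git/sha sha]
      :git.commit/parents [:git/sha parent]})))

(commit-ref-maps sample-commit)
;; => ({:db/id [:git/sha "c1"], :git.commit/tree [:git/sha "t1"]}
;;     {:db/id [:git/sha "c1"], :git.commit/parents [:git/sha "p1"]}
;;     {:db/id [:git/sha "c1"], :git.commit/parents [:git/sha "p2"]})
```

Every lookup ref here names an entity by SHA, so all of those entities must already be in the database when this data is transacted.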


The Complete Flow

Putting it all together:

(defn sync! [conn repo-path]
  (let [repo (load-repo repo-path)
        db (d/db conn)]

    ;; 1. Get all commits
    (let [commits (rev-list repo)

          ;; 2. Walk object graph, parse everything
          objects (walk-objects repo commits db)

          ;; 3. Prepare entities in dependency order
          entities (prepare-transaction objects)

          ;; 4. Transact entities
          _ @(d/transact conn entities)

          ;; 5. Add references
          refs (prepare-references objects)
          _ @(d/transact conn refs)]

      {:commits-synced (count commits)
       :objects-synced (count objects)})))

Next time: advanced queries and what you can actually do with git data in Datomic.
