
Querying Git History with Datalog - Part 1: Why and How

13 January 2018

Git history is effectively a graph database: commits point to trees, trees contain blobs and other trees, and commits reference their parent commits. But git’s native query interface is terrible - you’re stuck with shell commands and grep.

What if you could query git history with a real query language? Not git log | grep, but actual relational queries. “Show me all commits that touched file X and were authored by person Y in date range Z.” Or “find files that exist in commit A but not in commit B.”

That’s what Muramasa does. It syncs a git repository into Datomic, letting you query git history with Datalog.

Why Datomic?

Datomic is a database of immutable facts: new transactions add to the history rather than overwriting it. That fits git, where objects never change - commits are permanent, every object is content-addressed, and the SHA is the identity.

Datomic also has Datalog, a declarative query language in which joins are expressed as pattern matches over facts. That makes graph traversal read naturally, and git history is a graph. The match is natural.

The Schema

A git repository has four object types: commits, trees, blobs, and tags. Muramasa models these as Datomic entities:

;; Every git object has a type and SHA
{:db/ident :git/type
 :db/valueType :db.type/keyword
 :db/cardinality :db.cardinality/one}

{:db/ident :git/sha
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/identity}  ;; SHA is unique

The :db/unique :db.unique/identity is key. It means we can upsert by SHA - if we transact an object whose SHA already exists, Datomic resolves it to the existing entity instead of creating a duplicate. This makes incremental sync trivial.
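To see the upsert behaviour in isolation, here is a minimal sketch against an in-memory Datomic database. It assumes the Datomic peer library (datomic.api) is on the classpath; the database URI, the SHA value, and the var names are made up for illustration and are not part of Muramasa.

```clojure
(require '[datomic.api :as d])

;; Hypothetical in-memory database, used only for this sketch
(def uri "datomic:mem://upsert-sketch")
(d/create-database uri)
(def conn (d/connect uri))

;; Install the two attributes from the schema above
@(d/transact conn [{:db/ident :git/type
                    :db/valueType :db.type/keyword
                    :db/cardinality :db.cardinality/one}
                   {:db/ident :git/sha
                    :db/valueType :db.type/string
                    :db/cardinality :db.cardinality/one
                    :db/unique :db.unique/identity}])

;; Transact the same object twice; "abc123" is an illustrative SHA
(def obj {:git/sha "abc123" :git/type :git.types/commit})
@(d/transact conn [obj])
@(d/transact conn [obj])

;; Count the entities that carry that SHA
(def sha-count
  (d/q '[:find (count ?e) .
         :in $ ?sha
         :where [?e :git/sha ?sha]]
       (d/db conn) "abc123"))
```

If the upsert behaves as described, sha-count should be 1 rather than 2, because the second transaction resolves to the entity created by the first.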

Commits

Commits have messages, timestamps, and references to parents and tree:

{:db/ident :git.commit/msg
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}

{:db/ident :git.commit/message  ;; Full message
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/fulltext true}  ;; Enable fulltext search

{:db/ident :git.commit/time
 :db/valueType :db.type/instant
 :db/cardinality :db.cardinality/one}

{:db/ident :git.commit/tree
 :db/valueType :db.type/ref  ;; Reference to tree entity
 :db/cardinality :db.cardinality/one}

{:db/ident :git.commit/parents
 :db/valueType :db.type/ref  ;; References to parent commits
 :db/cardinality :db.cardinality/many}

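:db.type/ref is how you model relationships. :git.commit/tree points to another entity (the tree), and :git.commit/parents points to other commits. Because references are ordinary attributes, a query can follow them from one entity to the next and traverse the graph.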
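As a small illustration of following a ref in a query, the sketch below looks up the direct parents of a commit. The function name parent-shas and the sample facts are hypothetical; the example relies on d/q accepting a plain collection of [entity attribute value] tuples as a data source, so no database is needed to try it, though the Datomic peer library must be on the classpath.

```clojure
(require '[datomic.api :as d])

(defn parent-shas
  "Returns the SHAs of the direct parents of the commit with the given SHA."
  [db sha]
  (set (d/q '[:find [?parent-sha ...]
              :in $ ?sha
              :where [?c :git/sha ?sha]
                     [?c :git.commit/parents ?p]   ;; follow the ref
                     [?p :git/sha ?parent-sha]]
            db sha)))

;; Illustrative facts: a merge commit "ccc" with two parents
(def facts
  [[1 :git/sha "aaa"]
   [2 :git/sha "bbb"]
   [3 :git/sha "ccc"]
   [3 :git.commit/parents 1]
   [3 :git.commit/parents 2]])

(parent-shas facts "ccc")
;; => #{"aaa" "bbb"}
```

The same function should work unchanged against a real database value from (d/db conn), since the query only uses attributes defined in the schema.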

Trees and Blobs

Trees contain nodes (file entries), and blobs hold the file content:

{:db/ident :git.tree/nodes
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/isComponent true}  ;; Nodes are owned by trees

{:db/ident :git.node/filename
 :db/valueType :db.type/ref  ;; Reference to file entity
 :db/cardinality :db.cardinality/one}

{:db/ident :git.blob/uri
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}

{:db/ident :file/name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/identity
 :db/fulltext true}  ;; Search filenames
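To make the shape of the data concrete, here is a sketch of transaction data for a single tree using these attributes. The helper tree->tx is hypothetical and not Muramasa's actual code, and :git.types/tree is assumed by analogy with the :git.types/commit keyword used in the queries later in the post. Datomic accepts nested maps in transaction data when the nested entity is a component or carries a unique attribute, which holds for both levels here.

```clojure
(defn tree->tx
  "Builds a transaction map for one tree from its SHA and the filenames
  it lists. Nodes are nested because :git.tree/nodes is a component ref;
  files are nested because :file/name is a unique identity."
  [sha filenames]
  {:git/sha        sha
   :git/type       :git.types/tree   ;; assumed keyword
   :git.tree/nodes (mapv (fn [filename]
                           {:git.node/filename {:file/name filename}})
                         filenames)})

(tree->tx "t1" ["README.md" "project.clj"])
;; => {:git/sha "t1"
;;     :git/type :git.types/tree
;;     :git.tree/nodes [{:git.node/filename {:file/name "README.md"}}
;;                      {:git.node/filename {:file/name "project.clj"}}]}
```

Because :file/name is a unique identity, two trees that list the same filename would end up pointing at the same file entity.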

The sync! Function

The main API is a single function - sync!:

(defn sync! [conn repo-path]
  ;; Ensure the schema exists before taking a db snapshot: a db value
  ;; is immutable, so one taken earlier would not see new attributes
  (ensure-schema! conn)
  (let [repo    (load-repo repo-path)
        db      (d/db conn)
        ;; Collect all commits
        commits (rev-list repo)
        ;; Parse commits and the objects they reference
        objects (parse-objects repo commits db)]

    ;; Transact in dependency order
    (transact-objects! conn objects)

    {:commits-synced (count commits)
     :objects-synced (count objects)}))

The process:

1. Load the git repo using JGit
2. Get all commits (git log --all)
3. Parse each commit and its referenced objects (tree, blobs)
4. Filter out objects already in the database (by SHA)
5. Transact to Datomic
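The filtering and ordering steps can be sketched in plain Clojure. This is an illustration under stated assumptions, not Muramasa's implementation: the function name, the ranking, and every type keyword other than :git.types/commit are hypothetical, and objects are represented as simple maps.

```clojure
;; Assumed ordering: blobs before trees before commits, so that every
;; reference points at something transacted earlier.
(def type-rank
  {:git.types/blob   0
   :git.types/tree   1
   :git.types/commit 2
   :git.types/tag    3})

(defn objects-to-transact
  "Drops objects whose SHA is already known, then sorts the rest by type rank."
  [known-shas objects]
  (->> objects
       (remove #(contains? known-shas (:git/sha %)))
       (sort-by #(type-rank (:git/type %)))))

(map :git/sha
     (objects-to-transact
      #{"b1"}   ;; SHAs already in the database
      [{:git/sha "c1" :git/type :git.types/commit}
       {:git/sha "t1" :git/type :git.types/tree}
       {:git/sha "b1" :git/type :git.types/blob}
       {:git/sha "b2" :git/type :git.types/blob}]))
;; => ("b2" "t1" "c1")
```

A real sync would also need parents to be transacted before their children; this sketch leaves commits in their input order.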

Incremental Sync

Because SHAs are unique identities, sync is idempotent:

(defn db-has-sha? [db sha]
  (boolean
   (d/q '[:find ?e .
          :in $ ?sha
          :where [?e :git/sha ?sha]]
        db sha)))

Before parsing an object, we check if it exists. If it does, skip it. This means you can run sync! repeatedly, and each run only adds objects that are not already in the database.
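When there are many objects to check, one query per SHA adds up. A possible variant, sketched here with a hypothetical function name, passes all candidate SHAs in at once using a collection binding and returns the ones that already exist. The sample facts are illustrative; d/q accepts a plain collection of tuples as a data source, and the Datomic peer library is assumed to be on the classpath.

```clojure
(require '[datomic.api :as d])

(defn known-shas
  "Returns the subset of shas that already exist in db, as a set."
  [db shas]
  (set (d/q '[:find [?sha ...]
              :in $ [?sha ...]          ;; bind each candidate SHA in turn
              :where [_ :git/sha ?sha]]
            db shas)))

;; Illustrative facts standing in for a database value
(def facts
  [[1 :git/sha "aaa"]
   [2 :git/sha "bbb"]])

(known-shas facts ["aaa" "zzz"])
;; => #{"aaa"}
```

Anything not in the returned set still needs to be parsed and transacted.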

Basic Queries

Once synced, you can query with Datalog:

;; Count total commits
(d/q '[:find (count ?c) .
       :where [?c :git/type :git.types/commit]]
     db)
;; => 142

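;; Get recent commits with messages.
;; Datomic's Datalog has no ordering or limit clauses, so sort and
;; truncate the result set in Clojure.
(->> (d/q '[:find ?msg ?time
            :where [?c :git/type :git.types/commit]
                   [?c :git.commit/msg ?msg]
                   [?c :git.commit/time ?time]]
          db)
     (sort-by second #(compare %2 %1))  ;; newest first
     (take 10))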

;; Find all unique filenames
(d/q '[:find ?name
       :where [?f :file/name ?name]]
     db)

;; Fulltext search on commit messages
(d/q '[:find ?msg
       :where [?c :git.commit/message ?msg]
              [(fulltext $ :git.commit/message "bugfix") [[?c]]]]
     db)

Each clause in :where is a pattern matched against [entity attribute value] facts. [?c :git/type :git.types/commit] means “find entities where the :git/type attribute equals :git.types/commit, bind the entity to ?c.”

Variables (prefixed with ?) unify across clauses. If ?c appears in multiple clauses, it must be the same entity in all of them.
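Unification is easiest to see on a handful of facts. In this sketch the tuples are made up for illustration, :git.types/tree is an assumed keyword, and d/q is given a plain collection as its data source (the Datomic peer library is assumed to be on the classpath).

```clojure
(require '[datomic.api :as d])

;; Illustrative facts: two commits with messages and one tree
(def facts
  [[1 :git/type :git.types/commit]
   [1 :git.commit/msg "Initial commit"]
   [2 :git/type :git.types/tree]        ;; assumed type keyword
   [3 :git/type :git.types/commit]
   [3 :git.commit/msg "Add README"]])

;; ?c appears in both clauses, so only entities that are commits
;; AND have a message contribute a result
(def commit-msgs
  (set (d/q '[:find ?msg
              :where [?c :git/type :git.types/commit]
                     [?c :git.commit/msg ?msg]]
            facts)))
;; => #{["Initial commit"] ["Add README"]}
```

Entity 2 matches neither clause with a consistent binding for ?c, so it never appears in the result.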

Why This Matters

Git’s own query capabilities are limited. Finding commits in a date range that touched a specific file is still manageable with git log flags, but anything that joins several conditions ends up as a shell script. Want to analyze commit patterns over time? Export to CSV and load into pandas.

With Muramasa, it’s just Datalog:

;; Commits in date range touching specific file
(d/q '[:find ?msg ?time
       :in $ ?start ?end ?filename
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/msg ?msg]
              [?c :git.commit/time ?time]
              [(>= ?time ?start)]
              [(<= ?time ?end)]
              [?c :git.commit/tree ?tree]
              [?tree :git.tree/nodes ?node]
              [?node :git.node/filename ?file]
              [?file :file/name ?filename]]
     db
     #inst "2024-01-01"
     #inst "2024-12-31"
     "README.md")

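The query walks the graph - from commit to tree to node to file - and Datalog handles the joins. As written, it matches every commit in the range whose tree contains the file, which is broader than the commits that actually changed it; narrowing to changes would mean comparing each commit’s tree with its parents’ trees.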
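The other question from the introduction, files that exist in commit A but not in commit B, can be sketched with the same graph walk plus a set difference. The function names and sample facts are hypothetical, and the sketch assumes every file in a commit is reachable through a single level of :git.tree/nodes, which the schema shown here does not confirm. The Datomic peer library is assumed to be on the classpath.

```clojure
(require '[datomic.api :as d]
         '[clojure.set :as set])

(defn files-at
  "Returns the set of filenames listed in the tree of the given commit."
  [db sha]
  (set (d/q '[:find [?name ...]
              :in $ ?sha
              :where [?c :git/sha ?sha]
                     [?c :git.commit/tree ?tree]
                     [?tree :git.tree/nodes ?node]
                     [?node :git.node/filename ?file]
                     [?file :file/name ?name]]
            db sha)))

(defn files-only-in
  "Filenames present in commit sha-a but absent from commit sha-b."
  [db sha-a sha-b]
  (set/difference (files-at db sha-a) (files-at db sha-b)))

;; Illustrative facts: commit "aaa" has two files, "bbb" has one
(def facts
  [[1 :git/sha "aaa"]            [1 :git.commit/tree 10]
   [2 :git/sha "bbb"]            [2 :git.commit/tree 20]
   [10 :git.tree/nodes 100]      [10 :git.tree/nodes 101]
   [20 :git.tree/nodes 200]
   [100 :git.node/filename 1000] [101 :git.node/filename 1001]
   [200 :git.node/filename 1000]
   [1000 :file/name "README.md"] [1001 :file/name "LICENSE"]])

(files-only-in facts "aaa" "bbb")
;; => #{"LICENSE"}
```

Doing the difference in Clojure keeps each query simple; a single query with a not clause is another option.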

Next time: how we parse git objects with JGit and handle the object graph.