
Querying Git History with Datalog - Part 3: Advanced Queries and Analytics

17 March 2018

In Part 1, we covered the schema. In Part 2, we built the sync engine. Now let’s actually use this thing - what kind of questions can we answer with git data in Datomic?

The full code is at github.com/navgeet/muramasa.

File History

Git’s native commands for file history are awkward. git log -- filename shows commits that touched a file, but extracting structured data from the output is painful. With Datalog, it’s straightforward:

;; All commits that touched README.md, newest first
(->> (d/q '[:find ?msg ?time
            :in $ ?filename
            :where [?c :git/type :git.types/commit]
                   [?c :git.commit/msg ?msg]
                   [?c :git.commit/time ?time]
                   [?c :git.commit/tree ?tree]
                   [?tree :git.tree/nodes ?node]
                   [?node :git.node/filename ?file]
                   [?file :file/name ?filename]]
          db
          "README.md")
     ;; Datomic's Datalog has no :order-by clause, so sort the result in Clojure
     (sort-by second #(compare %2 %1)))

This walks the graph: commit → tree → node → file. Each step is a join. Datalog handles it.
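The same commit → tree → node → file join recurs in most of the queries in this post. One way to avoid repeating it is a Datalog rule. The following is a minimal sketch rather than code from Muramasa: the rule name `commit-file` is made up for illustration, and it assumes the attribute names used in the query above.

```clojure
;; A rule set is plain data: a vector of rules, each a head followed by clauses.
;; `commit-file` is a hypothetical name, not part of Muramasa.
(def rules
  '[[(commit-file ?c ?name)
     [?c :git.commit/tree ?tree]
     [?tree :git.tree/nodes ?node]
     [?node :git.node/filename ?file]
     [?file :file/name ?name]]])

;; The file-history query rewritten to use the rule.
;; The rule set is passed in as the `%` input.
(def file-history-query
  '[:find ?msg ?time
    :in $ % ?filename
    :where [?c :git/type :git.types/commit]
           [?c :git.commit/msg ?msg]
           [?c :git.commit/time ?time]
           (commit-file ?c ?filename)])

;; Usage, assuming `d` is datomic.api and `db` is a database value:
;; (d/q file-history-query db rules "README.md")
```

Because both the rule set and the query are ordinary data, they can be defined once and reused by the later queries that walk the same path.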

Want to know when a file was added or deleted?

;; Find when README.md first appeared
(d/q '[:find (min ?time) .
       :in $ ?filename
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/time ?time]
              [?c :git.commit/tree ?tree]
              [?tree :git.tree/nodes ?node]
              [?node :git.node/filename ?file]
              [?file :file/name ?filename]]
     db
     "README.md")
;; => #inst "2016-04-15T10:23:00.000-00:00"

;; Find the most recent commit that still contains it
;; (approximates the deletion time if the file is gone)
(d/q '[:find (max ?time) .
       :in $ ?filename
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/time ?time]
              [?c :git.commit/tree ?tree]
              [?tree :git.tree/nodes ?node]
              [?node :git.node/filename ?file]
              [?file :file/name ?filename]]
     db
     "README.md")

Commit Patterns

Analyze commit activity over time. When do people commit most? What days of the week?

;; Commits per month
(d/q '[:find ?month (count ?c)
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/time ?time]
              [(format "%1$tY-%1$tm" ?time) ?month]]
     db)
;; => [["2024-01" 42] ["2024-02" 38] ...]

;; Commits by day of week
(d/q '[:find ?dow (count ?c)
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/time ?time]
              [(format "%ta" ?time) ?dow]]
     db)
;; => [["Mon" 98] ["Tue" 112] ["Fri" 45] ...]

The [(format ...)] clause is a function expression: you can call arbitrary Clojure functions in queries. Here clojure.core/format with its %t date conversions turns each commit time into a bucket key, using the JVM's default time zone and locale.
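How commits fall into month or weekday buckets depends on the time zone and locale used when formatting the timestamp. If that should not vary between machines, a small helper can pin both. This is a sketch using java.time; the name `format-inst` is hypothetical, and the choice of UTC and English is an assumption you may want to change.

```clojure
(import '(java.time ZoneOffset)
        '(java.time.format DateTimeFormatter)
        '(java.util Date Locale))

(defn format-inst
  "Formats a java.util.Date with a DateTimeFormatter pattern, in UTC and English."
  [pattern ^Date inst]
  (-> (DateTimeFormatter/ofPattern pattern Locale/ENGLISH)
      (.withZone ZoneOffset/UTC)
      (.format (.toInstant inst))))

(format-inst "yyyy-MM" #inst "2024-01-15T12:00:00Z") ;; => "2024-01"
(format-inst "E" #inst "2024-01-15T12:00:00Z")       ;; => "Mon"
```

Inside a query, a function like this has to be referenced by its fully qualified name, for example [(my.ns/format-inst "yyyy-MM" ?time) ?month], where my.ns stands for whatever namespace you define it in.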

Repository Structure

What does the codebase look like? How many files, what types?

;; Total unique files that have ever existed
(d/q '[:find (count ?f) .
       :where [?f :file/name _]]
     db)
;; => 247

;; Files by extension
(d/q '[:find ?ext (count ?f)
       :where [?f :file/name ?name]
              [(re-find #"\.([^.]+)$" ?name) [_ ?ext]]]
     db)
;; => [["clj" 42] ["md" 5] ["txt" 3] ...]

The re-find extracts file extensions using regex. Again, Datalog lets you call Clojure functions directly.
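An extension regex anchored only on the last dot treats a dotfile such as .gitignore as having the extension "gitignore". If that matters for your counts, a stricter helper is easy to write. The sketch below is an illustration, not part of Muramasa, and `file-ext` is a hypothetical name.

```clojure
(defn file-ext
  "Returns the extension of a file name or path, or nil if there is none.
  Leading-dot names such as .gitignore count as having no extension."
  [name]
  ;; Require a character other than '.' or '/' before the final dot,
  ;; and no '.' or '/' in the extension itself.
  (second (re-find #"[^./]\.([^./]+)$" name)))

(file-ext "README.md")       ;; => "md"
(file-ext "archive.tar.gz")  ;; => "gz"
(file-ext ".gitignore")      ;; => nil
(file-ext "Makefile")        ;; => nil
```

Applied to the names returned by a plain [:find ?name] query, `(frequencies (keep file-ext names))` gives the same kind of per-extension count.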

Finding Deleted Files

Git doesn’t make it easy to find files that used to exist but were deleted. With Muramasa:

;; Files that exist in any commit
(def all-files
  (d/q '[:find ?name
         :where [?f :file/name ?name]]
       db))

;; Files in the most recent commit.
;; Datomic's Datalog has no :order-by or :limit, so pick the newest commit in Clojure.
(def current-commit-sha
  (->> (d/q '[:find ?time ?sha
              :where [?c :git/type :git.types/commit]
                     [?c :git/sha ?sha]
                     [?c :git.commit/time ?time]]
            db)
       (sort-by first)
       last
       second))

(def current-files
  (d/q '[:find ?name
         :in $ ?sha
         :where [?c :git/sha ?sha]
                [?c :git.commit/tree ?tree]
                [?tree :git.tree/nodes ?node]
                [?node :git.node/filename ?file]
                [?file :file/name ?name]]
       db
       current-commit-sha))

;; Difference = deleted files
(clojure.set/difference (set all-files) (set current-files))
;; => #{["old_config.yml"] ["deprecated.clj"] ...}

This is tedious with git commands. With Datalog, it’s a set operation.

Commit Message Analysis

Fulltext search is built into the schema:

;; Find all bug fixes
(d/q '[:find ?msg ?time
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/msg ?msg]
              [?c :git.commit/time ?time]
              [(fulltext $ :git.commit/msg "fix") [[?c]]]]
     db)

;; Commits mentioning specific issues
(d/q '[:find ?msg ?time
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/msg ?msg]
              [?c :git.commit/time ?time]
              [(re-find #"#\d+" ?msg)]]
     db)
;; Finds commits with "#123" style issue references

You could build a dashboard showing which issues were addressed in which commits, when bugs were fixed, what features were added in each release.
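As a starting point for that kind of dashboard, the [msg time] tuples returned by the issue query can be turned into an index from issue number to commits in plain Clojure. This is a minimal sketch; `issue-index` is a hypothetical name, and it assumes issue references always look like "#123".

```clojure
(defn issue-index
  "Takes [msg time] tuples and returns a map of issue number -> vector of
  [time msg] pairs, each vector ordered oldest first."
  [rows]
  (->> (for [[msg time] rows
             [_ n] (re-seq #"#(\d+)" msg)] ; one entry per issue reference
         [(Long/parseLong n) [time msg]])
       (sort-by (comp first second))       ; order by commit time
       (reduce (fn [idx [issue entry]]
                 (update idx issue (fnil conj []) entry))
               {})))

;; Sample data shaped like the query result above
(def sample-rows
  [["Fix crash, closes #12" #inst "2024-02-01"]
   ["Start on #12 and #7"   #inst "2024-01-01"]
   ["Update docs"           #inst "2024-03-01"]])

(map second (get (issue-index sample-rows) 12))
;; => ("Start on #12 and #7" "Fix crash, closes #12")
```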

Tree Similarity

Find commits with similar file sets:

(defn files-in-commit [db sha]
  (d/q '[:find ?name
         :in $ ?sha
         :where [?c :git/sha ?sha]
                [?c :git.commit/tree ?tree]
                [?tree :git.tree/nodes ?node]
                [?node :git.node/filename ?file]
                [?file :file/name ?name]]
       db
       sha))

(defn commit-similarity [db sha1 sha2]
  (let [files1 (set (files-in-commit db sha1))
        files2 (set (files-in-commit db sha2))
        intersection (clojure.set/intersection files1 files2)
        union (clojure.set/union files1 files2)]
    ;; Jaccard similarity; 0 when neither commit has files, to avoid dividing by zero
    (if (empty? union)
      0
      (/ (count intersection) (count union)))))

;; Find commits similar to a given commit
(defn similar-commits [db target-sha threshold]
  (let [all-commits (d/q '[:find ?sha
                           :where [?c :git/type :git.types/commit]
                                  [?c :git/sha ?sha]]
                         db)]
    (filter
     (fn [[sha]]
       (> (commit-similarity db target-sha sha) threshold))
     all-commits)))

This could identify refactoring commits, merges that brought in similar changes, or releases with comparable scope.
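As written, similar-commits runs the file query for the target commit once per comparison. A sketch of one alternative is to separate the set arithmetic from the querying, so file sets can be fetched once and reused. The names `jaccard` and `similar-to` are hypothetical, and the input is assumed to be a map of sha → set of file names built beforehand.

```clojure
(require '[clojure.set :as set])

(defn jaccard
  "Jaccard similarity of two sets; defined here as 0 when both are empty."
  [a b]
  (let [u (count (set/union a b))]
    (if (zero? u)
      0
      (/ (count (set/intersection a b)) u))))

(defn similar-to
  "Given a map of sha -> file-name set, returns the shas (other than target)
  whose similarity to target exceeds threshold."
  [files-by-sha target threshold]
  (let [target-files (get files-by-sha target #{})]
    (for [[sha files] files-by-sha
          :when (and (not= sha target)
                     (> (jaccard target-files files) threshold))]
      sha)))

(similar-to {"a" #{"x" "y" "z"}
             "b" #{"x" "y"}
             "c" #{"q"}}
            "a"
            1/2)
;; => ("b")
```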

Object Type Distribution

Understand repository composition:

;; Count objects by type
(d/q '[:find ?type (count ?e)
       :where [?e :git/type ?type]]
     db)
;; => [[:git.types/commit 142]
;;     [:git.types/tree 580]
;;     [:git.types/blob 1240]
;;     [:git.types/node 2450]]

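;; Average tree size (number of nodes per tree).
;; Datalog can't nest aggregates, so count per tree in the query
;; and average in Clojure.
(let [sizes (map second
                 (d/q '[:find ?t (count ?node)
                        :where [?t :git/type :git.types/tree]
                               [?t :git.tree/nodes ?node]]
                      db))]
  (/ (reduce + sizes) (double (count sizes))))
;; => 4.2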

Time-Based Queries

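Datomic is temporal: you can query the database “as of” a specific time. That axis tracks when data was transacted (here, when the repo was synced), not when commits were made. For git history, the commit timestamps stored in the data are the more useful axis: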

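;; What files existed on 2024-01-01?
(let [date   #inst "2024-01-01T00:00:00"
      ;; Time of the most recent commit at or before the date
      latest (d/q '[:find (max ?time) .
                    :in $ ?date
                    :where [?c :git/type :git.types/commit]
                           [?c :git.commit/time ?time]
                           [(<= ?time ?date)]]
                  db
                  date)]
  ;; Files in that commit's tree (nil if no commit is that old)
  (when latest
    (d/q '[:find ?name
           :in $ ?latest
           :where [?c :git/type :git.types/commit]
                  [?c :git.commit/time ?latest]
                  [?c :git.commit/tree ?tree]
                  [?tree :git.tree/nodes ?node]
                  [?node :git.node/filename ?file]
                  [?file :file/name ?name]]
         db
         latest)))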

This reconstructs the repository state at a point in time. You could build a “time machine” viewer that shows what the repo looked like at any historical moment.
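For a viewer like that, the "latest commit at or before a date" step can also be done in plain Clojure over [time sha] tuples fetched once, which avoids a query per date. This is a sketch under that assumption; `latest-at-or-before` is a hypothetical name.

```clojure
(defn latest-at-or-before
  "Given [time sha] tuples, returns the sha of the most recent commit whose
  time is at or before cutoff, or nil if there is none."
  [rows ^java.util.Date cutoff]
  (->> rows
       (remove (fn [[^java.util.Date time _]] (.after time cutoff)))
       (sort-by first)
       last
       second))

;; Sample data shaped like the result of a [:find ?time ?sha] query
(def sample-commits
  [[#inst "2023-06-01" "aaa"]
   [#inst "2023-12-31" "bbb"]
   [#inst "2024-02-01" "ccc"]])

(latest-at-or-before sample-commits #inst "2024-01-01") ;; => "bbb"
(latest-at-or-before sample-commits #inst "2020-01-01") ;; => nil
```

The resulting sha can then be passed to a function like files-in-commit from the Tree Similarity section to list the files at that point.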

What This Enables

The killer feature isn’t any single query - it’s that you can answer questions you didn’t anticipate. With git log, you’re limited to what the commands expose. With Datalog, you have the full data model.

Want to find commits that added files matching a regex pattern, authored in a specific month, that touched fewer than 5 files total? That’s a Datalog query. Want to track how file count grew over time, segmented by directory? Datalog. Want to identify commits that might be “big refactorings” based on number of files touched and lack of new files? Datalog.

The limit is your Datalog skills, not git’s interface.

Performance

Muramasa isn’t optimized for huge repos. Syncing the Linux kernel (1M+ commits) would take a while. But for most projects (hundreds to low thousands of commits), sync is under a minute and queries are instant.

The bottleneck is usually parsing objects from JGit, not Datomic. Blob persistence to disk can be skipped for faster sync if you don’t need file contents.

For analytics, you sync once, query many times. That tradeoff works.

What’s Missing

Entity references for tree/parent relationships are currently disabled (see Part 2). Re-enabling them would allow queries like “find all files reachable from commit X” by walking tree → subtree → blob references.

Author/committer information isn’t captured. Adding :git.commit/author would enable “who touched this file” queries.
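To make that concrete, here is a sketch of what such an extension might look like. Everything here is hypothetical: the string value type, the doc string, and the query are assumptions for illustration, not part of Muramasa's schema.

```clojure
;; Hypothetical attribute definition for commit authors.
;; A string value is an assumption; a ref to an author entity would also work.
(def author-schema
  [{:db/ident       :git.commit/author
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/doc         "Author of the commit, e.g. \"Name <email>\""}])

;; "Who touched this file": commit counts per author for one file name.
;; Assumes the attribute above has been transacted and populated during sync.
(def who-touched-query
  '[:find ?author (count ?c)
    :in $ ?filename
    :where [?c :git.commit/author ?author]
           [?c :git.commit/tree ?tree]
           [?tree :git.tree/nodes ?node]
           [?node :git.node/filename ?file]
           [?file :file/name ?filename]])

;; Usage, assuming `d` is datomic.api and `conn`/`db` come from your setup:
;; @(d/transact conn author-schema)
;; (d/q who-touched-query db "README.md")
```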

Branch/tag support would let you query “what’s on main vs what’s on feature-branch.”

The foundation is there. These are extensions, not fundamental rethinks.