Git history is a graph database. Commits point to trees, trees contain blobs, commits reference parent commits. But git’s native query interface is terrible - you’re stuck with shell commands and grep.
What if you could query git history with a real query language? Not git log | grep, but actual relational queries. “Show me all commits that touched file X and were authored by person Y in date range Z.” Or “find files that exist in commit A but not in commit B.”
That’s what Muramasa does. It syncs a git repository into Datomic, letting you query git history with Datalog.
Datomic is a database built on immutable facts: new data is added, and old data is never overwritten. That suits git, where objects never change: commits are permanent, trees are content-addressed, and the SHA is the identity.
Datomic also has Datalog, a declarative query language that fills the role of SQL but expresses joins as pattern matching, which suits graph-shaped data. Git history is a graph. The match is natural.
A git repository has four object types: commits, trees, blobs, and tags. Muramasa models these as Datomic entities:
;; Every git object has a type and SHA
{:db/ident       :git/type
 :db/valueType   :db.type/keyword
 :db/cardinality :db.cardinality/one}
{:db/ident       :git/sha
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique      :db.unique/identity} ;; SHA is unique
The :db/unique :db.unique/identity is key. It means we can upsert by SHA - if we try to insert an object that already exists, Datomic merges it. This makes incremental sync trivial.
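To make the upsert behaviour concrete, here is a minimal plain-Clojure model of it. This is not Datomic and not Muramasa's code: the "database" is just a map keyed by SHA, and the `upsert` function and the sample SHA are made up for illustration. It only shows why writing the same object twice leaves one entity behind.

```clojure
;; Toy model of upsert-by-identity: a map from SHA to entity map.
;; Assumption: this stands in for what :db.unique/identity does inside Datomic.
(defn upsert
  "Merge obj into the entity with the same :git/sha, creating it if absent."
  [db obj]
  (update db (:git/sha obj) merge obj))

(def commit {:git/sha "a1b2c3" :git/type :git.types/commit})

(def db-after
  (-> {}
      (upsert commit)
      (upsert commit) ; same SHA again: merged, no second entity
      (upsert {:git/sha "a1b2c3" :git.commit/msg "Initial commit"})))

;; db-after holds a single entity that carries all three attributes.
```

In Datomic the same effect comes from transacting a map that includes :git/sha, with the unique attribute resolving it to the existing entity.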
Commits have messages, timestamps, and references to parents and tree:
{:db/ident       :git.commit/msg
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one}
{:db/ident       :git.commit/message ;; Full message
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db/fulltext    true} ;; Enable fulltext search
{:db/ident       :git.commit/time
 :db/valueType   :db.type/instant
 :db/cardinality :db.cardinality/one}
{:db/ident       :git.commit/tree
 :db/valueType   :db.type/ref ;; Reference to tree entity
 :db/cardinality :db.cardinality/one}
{:db/ident       :git.commit/parents
 :db/valueType   :db.type/ref ;; References to parent commits
 :db/cardinality :db.cardinality/many}
:db.type/ref is how you model relationships. :git.commit/tree points to another entity (the tree). This is what makes the model useful: queries can follow these references from one entity to the next.
Trees contain nodes (file entries), and blobs hold file content:
{:db/ident       :git.tree/nodes
 :db/valueType   :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/isComponent true} ;; Nodes are owned by trees
{:db/ident       :git.node/filename
 :db/valueType   :db.type/ref ;; Reference to file entity
 :db/cardinality :db.cardinality/one}
{:db/ident       :git.blob/uri
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one}
{:db/ident       :file/name
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique      :db.unique/identity
 :db/fulltext    true} ;; Search filenames
The main API is a single function - sync!:
(defn sync! [conn repo-path]
  (let [repo (load-repo repo-path)]
    ;; Ensure schema exists
    (ensure-schema! conn)
    ;; Take the db value after the schema is installed; a value taken
    ;; earlier would not know :git/sha on a first run
    (let [db      (d/db conn)
          ;; Collect all commits
          commits (rev-list repo)
          objects (parse-objects repo commits db)]
      ;; Transact in dependency order
      (transact-objects! conn objects)
      {:commits-synced (count commits)
       :objects-synced (count objects)})))
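The process: ensure the schema exists, collect every commit (the equivalent of git log --all), parse the objects those commits reference, and transact them in dependency order.
Because SHAs are unique identities, sync is idempotent: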
(defn db-has-sha? [db sha]
  (boolean
    (d/q '[:find ?e .
           :in $ ?sha
           :where [?e :git/sha ?sha]]
         db sha)))
Before parsing an object, we check if it exists. If it does, skip it. This means you can run sync! repeatedly and it only adds new commits.
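The skip-and-order step can be sketched without a database. The version below is an assumption about the shape of the logic, not Muramasa's actual implementation: a plain set of SHAs stands in for db-has-sha?, the function name new-objects is hypothetical, and the :git.types/blob, :git.types/tree and :git.types/tag keywords are guessed by analogy with :git.types/commit. "Dependency order" is taken here to mean that referenced objects are written before the objects that point at them.

```clojure
;; Assumed ordering: blobs first, then trees, then commits, then tags.
(def type-rank
  {:git.types/blob   0
   :git.types/tree   1
   :git.types/commit 2
   :git.types/tag    3})

(defn new-objects
  "Drop objects whose SHA is already known, then sort the rest so that
  referenced object types come before the types that reference them.
  sort-by is stable, so objects of the same type keep their input order."
  [known-shas objects]
  (->> objects
       (remove #(contains? known-shas (:git/sha %)))
       (sort-by #(type-rank (:git/type %)))))

(def pending
  (new-objects #{"b0"}
               [{:git/sha "c1" :git/type :git.types/commit}
                {:git/sha "b1" :git/type :git.types/blob}
                {:git/sha "t1" :git/type :git.types/tree}
                {:git/sha "b0" :git/type :git.types/blob}]))

;; pending lists b1, then t1, then c1; b0 is skipped as already synced.
```

Because the sort is stable, feeding commits in oldest-first keeps parents ahead of their children within the commit group.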
Once synced, you can query with Datalog:
;; Count total commits
(d/q '[:find (count ?c) .
       :where [?c :git/type :git.types/commit]]
     db)
;; => 142
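;; Get the 10 most recent commits with messages.
;; Datomic's Datalog has no :order-by or :limit, so sort and truncate in Clojure.
(->> (d/q '[:find ?msg ?time
            :where [?c :git/type :git.types/commit]
                   [?c :git.commit/msg ?msg]
                   [?c :git.commit/time ?time]]
          db)
     (sort-by second #(compare %2 %1)) ;; newest first
     (take 10))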
;; Find all unique filenames
(d/q '[:find ?name
       :where [?f :file/name ?name]]
     db)
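;; Fulltext search on commit messages.
;; fulltext returns the entity and the matched value, so bind both here
;; instead of scanning every message with a separate pattern clause.
(d/q '[:find ?msg
       :where [(fulltext $ :git.commit/message "bugfix") [[?c ?msg]]]]
     db)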
The :where clause is a pattern. [?c :git/type :git.types/commit] means “find entities where the :git/type attribute equals :git.types/commit, bind the entity to ?c.”
Variables (prefixed with ?) unify across clauses. If ?c appears in multiple clauses, it must be the same entity in all of them.
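Unification is easier to see when spelled out by hand. The sketch below is plain Clojure, not how Datomic executes queries: it treats a few made-up [entity attribute value] tuples as the data and writes the two-clause join as a nested loop. The sample data and the commit-messages function are illustrative only.

```clojure
;; Made-up facts in [entity attribute value] form.
(def datoms
  [[1 :git/type :git.types/commit]
   [1 :git.commit/msg "Initial commit"]
   [2 :git/type :git.types/commit]
   [2 :git.commit/msg "Fix typo"]
   [3 :file/name "README.md"]])

(defn commit-messages
  "Hand-written equivalent of:
     [?c :git/type :git.types/commit]
     [?c :git.commit/msg ?msg]
  The (= c2 c) check is the unification: both clauses must name the same entity."
  [datoms]
  (for [[c a1 v1] datoms
        :when (and (= a1 :git/type) (= v1 :git.types/commit))
        [c2 a2 msg] datoms
        :when (and (= a2 :git.commit/msg) (= c2 c))]
    msg))

(def messages (commit-messages datoms))
;; Entity 3 never matches the first clause, so only the two commit messages remain.
```

The query engine does the same matching for you, and picks a much better strategy than a nested loop.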
Git’s own query capabilities are limited. Simple filters exist (git log with --since, --until, and a path), but combine them with anything git log has no flag for and you’re writing shell scripts. Want to analyze commit patterns over time? Export to CSV and load into pandas.
With Muramasa, it’s just Datalog:
;; Commits in a date range whose tree contains a specific file
(d/q '[:find ?msg ?time
       :in $ ?start ?end ?filename
       :where [?c :git/type :git.types/commit]
              [?c :git.commit/msg ?msg]
              [?c :git.commit/time ?time]
              [(>= ?time ?start)]
              [(<= ?time ?end)]
              [?c :git.commit/tree ?tree]
              [?tree :git.tree/nodes ?node]
              [?node :git.node/filename ?file]
              [?file :file/name ?filename]]
     db
     #inst "2024-01-01"
     #inst "2024-12-31" ;; midnight UTC, so commits later on Dec 31 are excluded
     "README.md")
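The query walks the graph, from commit to tree to node to file, and Datalog handles the joins. One caveat: as written it matches every commit whose tree contains README.md, not only the commits that changed it. Narrowing to actual changes would mean comparing each commit’s entry for the file against its parent’s.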
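The “commit patterns over time” case needs no export step either, because query results are ordinary Clojure data. As a rough sketch, assuming rows shaped like the [msg time] tuples returned above with java.util.Date timestamps, a per-month count could look like this. The function name commits-per-month and the sample rows are made up for illustration.

```clojure
(import '(java.time YearMonth ZoneOffset))

(defn commits-per-month
  "Count rows per calendar month, in UTC.
  Each row is a [msg time] tuple where time is a java.util.Date."
  [rows]
  (->> rows
       (map (fn [[_ ^java.util.Date t]]
              ;; Date -> Instant -> UTC date-time -> \"yyyy-MM\" string
              (str (YearMonth/from (.atZone (.toInstant t) ZoneOffset/UTC)))))
       frequencies
       (into (sorted-map))))

(def per-month
  (commits-per-month
    [["Initial commit" #inst "2024-01-05"]
     ["Fix typo"       #inst "2024-01-20"]
     ["Add sync"       #inst "2024-03-01"]]))
```

Months with no commits do not appear in the result; fill them in separately if you need a continuous series.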
Next time: how we parse git objects with JGit and handle the object graph.