This is a specification of Digital Succession Identifiers implemented over git.
This is a technical specification of
A digital succession contains digital objects. A digital object in
this context is a fixed finite collection of bits. A computer file is
an example of a digital object. A file system directory (folder) can
also be represented as a digital object by both
A digital succession is also digitally signed. In the simplest
case, a single user signs using a digital key that is verified to be
for that user. In this specification, the digital signature is made
using a
When a digital succession is created, it consists of only a
digitally signed genesis record and no digital object editions. The
In the git implementation, the genesis record is a signed initial git commit with an empty tree (and no parent).
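As an illustration of the empty tree mentioned above, the following minimal sketch shows how git derives an object identifier, assuming the SHA-1 object format that yields the 20-byte hashes this specification refers to. The function name git_object_id is hypothetical and is not part of git or of this specification. The hash of the genesis commit itself is not reproduced here, because it also depends on the author, the timestamp, and the signature.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the hexadecimal SHA-1 id git assigns to an object.

    Git hashes a header of the form "<type> <length>\\0" followed by
    the raw object content.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The tree of a genesis record is empty, so its content is zero bytes.
EMPTY_TREE_ID = git_object_id("tree", b"")
print(EMPTY_TREE_ID)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Every genesis record therefore points at the same well-known empty tree. What distinguishes one digital succession from another is the rest of the signed commit.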
To expand the digital succession, additional commits are made
using the same signing key. The git tree associated with each
additional commit is not a new edition of the digital object. The
git tree of a commit is a full succession record. The full
succession record is a directory of all editions to date. The top
level directory consists of only subdirectories named as
non-negative numbers. Within each number named subdirectory is an
entry named
For example, when a single file is added as edition 1, the full
succession record is a directory with one path of
Given a digital succession genesis record, git and open-source
software that emulates git can calculate a 20-byte binary hash which
identifies that digital succession. In most uses of git, this 20-byte
binary hash is represented textually as a 40-digit hexadecimal
number. In this specification, the textual identifier of the 20-byte
binary hash is a 27-character representation in standard base64url
format (RFC
4648)
It is worth noting that when a new digital succession (git initial commit) is created, a user can choose not to use it and instead immediately create a new one with a different base64url text representation. There is little cost in discarding empty digital successions and recreating genesis records (git initial commits) until an acceptable base64url identifier is found.
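The relationship between the 40-digit hexadecimal form and the 27-character base64url form can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names hex_to_dsi and dsi_to_hex and the sample hash are hypothetical, and dropping the trailing "=" padding is inferred from the stated length of 27 characters (padded base64url of 20 bytes is 28 characters).

```python
import base64


def hex_to_dsi(hex_id: str) -> str:
    """Convert a 40-digit hexadecimal git hash to 27-character base64url."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 20:
        raise ValueError("expected a 20-byte (40 hex digit) hash")
    # 20 bytes encode to 28 characters including one '=' pad; strip the pad.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def dsi_to_hex(dsi: str) -> str:
    """Convert a 27-character base64url identifier back to hexadecimal."""
    if len(dsi) != 27:
        raise ValueError("expected a 27-character identifier")
    # Restore the single pad character before decoding.
    return base64.urlsafe_b64decode(dsi + "=").hex()


# Hypothetical hash value, used only to exercise the conversion.
sample_hex = "0123456789abcdef0123456789abcdef01234567"
sample_dsi = hex_to_dsi(sample_hex)
print(len(sample_dsi), dsi_to_hex(sample_dsi) == sample_hex)
```

Because both forms encode the same 20 bytes, the conversion is lossless in either direction.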
The term edition is used to refer to a digital object in the digital succession. Part of the motivation is to avoid confusion with the many uses of the term version. For instance, version can be used to refer to versions of a digital object that are not added as editions. One can also talk about versions of the entire succession record.
This language is partly due to the initial application of digital
successions for scientific articles. Although there are rarely new
versions of scientific articles after they are published, it is well
established that scientific textbooks are published in multiple
In the initial application of digital successions, documents are
the digital object. A single edition of a document might be
presented in different HTML and PDF
A significant design choice is to not rely directly on git commit history to determine the succession of editions. Git commit history is conducive to accurately and faithfully recording the actions performed with git. However, it can often be inflexible and confusing for establishing a clean linear history. That Software Heritage automatically preserves git commits compounds the risk that git commits do not correspond well to, or perhaps even prevent, an intended clean linear history. It seems likely that there will be situations where having merge commits and non-linear git commit history will be convenient.
Having edition history separate from git history also provides a potential path for an enhancement akin to retractions of specific editions.
The textual identifier is in base64url (Base64 with URL and
filename safe alphabet) as specified in RFC 4648
As mentioned in the introduction to the textual representation, there is no obligation to use the Digital Succession Identifier (DSI) generated by a brand new digital succession. If a DSI is unacceptable to the creator of a new digital succession, a new DSI can be easily generated. Helper tools could perform this function for users by default.
The main drawback of base64url is that it can contain characters that are more prone to copy errors when transcribed by sight. Certain fonts and handwriting make poor or no distinction between certain character pairs. For example, some popular sans serif fonts make no visual distinction between capital I and lowercase l.
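A helper tool that regenerates identifiers could apply an acceptance check along these lines. This is only a sketch under stated assumptions: the function name is_acceptable and the default set of confusable characters (capital I and lowercase l, the one pair named above) are hypothetical, and this specification does not define what makes an identifier acceptable.

```python
def is_acceptable(dsi: str, confusable: str = "Il") -> bool:
    """Return True if the identifier contains no easily confused characters.

    The default set is an assumption of this sketch; callers may pass a
    larger set suited to the fonts they expect.
    """
    return not any(ch in confusable for ch in dsi)


# Hypothetical candidate strings, used only to exercise the check.
print(is_acceptable("abcXYZ-_09"))  # True
print(is_acceptable("abIcd"))       # False
```

A tool could keep creating fresh genesis records until the resulting identifier passes such a check.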
The main advantage of base64url is that it is 27 characters rather than 40. Future upgrades to DSI will likely make those lengths even larger, and the difference greater. Given that DSIs are used in contexts similar to DOIs, it is likely more acceptable to adopters to use 27 characters, which is very typical for a long DOI, rather than 40 characters. Displaying an ID of 27 characters rather than 40 also fits better on web pages rendered on mobile devices. An ID of 27 characters is less likely to be truncated with an ellipsis, placed in a horizontal scroll region, or shrunk into an extremely small font.
This design decision is partly made on the belief that the copy-via-human-sight issue is increasingly mitigated by a few technology trends:
Computer systems are relied upon more and more, and they offer more reliable copying mechanisms such as clicking on hyperlinks, copy-and-paste, and QR codes instead of visual copying and handwriting.
Websites tend to switch to appropriate fonts for code, such as base64url IDs, in HTML output. The same applies to tools that automatically produce PDFs from computer data.
Human-to-computer interfaces, such as input fields of Internet-connected software, tend to add features like autocomplete and search disambiguation for typos when sufficient need arises.
An edition number must be a positive integer strictly less than 65,536 (2 to the 16th power). Edition number zero is reserved for special future use.
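The range rule above can be expressed as a small check. This is a minimal sketch: the names MAX_EDITION_EXCLUSIVE, RESERVED_EDITION, and is_valid_edition_number are hypothetical and do not come from this specification.

```python
MAX_EDITION_EXCLUSIVE = 2 ** 16  # 65,536; edition numbers must be below this
RESERVED_EDITION = 0             # reserved for special future use


def is_valid_edition_number(n: int) -> bool:
    """Return True if n is a positive integer strictly less than 65,536."""
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(n, bool) or not isinstance(n, int):
        return False
    return RESERVED_EDITION < n < MAX_EDITION_EXCLUSIVE


print(is_valid_edition_number(1))      # True
print(is_valid_edition_number(0))      # False, zero is reserved
print(is_valid_edition_number(65536))  # False, out of range
```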