February 1, 2024

Software Heritage in 2023: a perspective

As we enter 2024, we are publishing, as usual, our annual report on the past year. Like last year, it is available as a standalone document, making it easier to grasp the breadth of the mission, follow the progress made and share it with a broader audience.

The start of 2023 saw the Software Heritage symposium and summit held at UNESCO’s headquarters in Paris, France. Organized in collaboration with UNESCO, the event centered on an international conference on the theme “Software Source Code as documentary heritage and an enabler for sustainable development.” The program explored five main dimensions:

  • Understanding software source code as documentary heritage and its role in digital skills education;
  • Considering software source code as a research object in open science;
  • Examining software source code’s impact on innovation and sharing in industry and administration;
  • Discussing long-term preservation perspectives; and
  • Reviewing technological advances in software source code analysis.

Software Heritage Symposium and Summit – UNESCO

The event gathered our community, including team members, ambassadors, grantees, partners, and contributors who discussed the Software Heritage Archive and various aspects of its mission. The dedicated blog post offers a summary of the workshop’s key points, and our annual report, presented as a standalone document for the first time, gives an overview of our progress.

We suggest reading UNESCO’s article, “Positioning software source code as digital heritage for sustainable development”; the complete transcript is accessible in PDF format.

The event’s recording is also available online for those who couldn’t attend.

In 2023, we welcomed 10 new ambassadors to our cause, 5 women and 5 men, bringing our team of ambassadors to 33 worldwide. We featured several ambassador articles this year: one by Simon Phipps titled “Open Source ensures code remains a part of culture”, advocating for the preservation of software as a cultural element through Open Source; one by Agustin Bethencourt titled “Why did I become a Software Heritage Ambassador?”, which delves into the significance of Software Heritage within the industry; and one titled “Viewpoints on software in research at the Gustave Eiffel University”, an interview with Céline Rousselot and Joenio Marques da Costa.

Throughout the year, the ambassador community held two plenary sessions, in close contact with the Software Heritage core team. One key topic was software metadata, a complex but essential issue that is detailed in the article “Deep Dive into the archival of Software Metadata”. A special effort was made to present the broad lines of the 2023 Software Heritage technical roadmap, which was published in the first quarter of 2023.

Supporting Open Source

At Software Heritage, we remain committed to advocating for the importance of open-source software and its role in shaping the future of technology. This is why we co-signed an open letter with the Eclipse Foundation on the Cyber Resilience Act. The objective of this new regulation is to ensure the safety and security of our digital infrastructure, including software, but we must make sure that it does not hinder the progress and innovation of open-source software as an unintended side effect. You can read the open letter and learn more about this important topic on the Eclipse Foundation’s website.

Building a collaboration infrastructure

We know that to succeed in the humbling mission we have undertaken, we need to enable a large community to contribute and collaborate. This year we are happy to report several key advances in this direction.

We concluded a multi-year effort, conducted with the help of Open Tech Strategies, to transition our development and operations from our previous system to our own GitLab instance, which is more familiar to external contributors.

We opened a new documentation landing page at docs.softwareheritage.org to make it easier for newcomers to find their way in the vast amount of documentation available.

We have been working to make it easier for developers to regularly archive their software in Software Heritage by introducing dedicated save code webhooks in the API for several popular forges and technologies: Bitbucket, Gitea, GitHub, GitLab and SourceForge.
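For readers curious about what such a webhook automates: an archival request can also be sent by hand through the Save Code Now endpoint of the REST API. The following is a minimal sketch, not a reference client. It assumes the endpoint has the shape `/api/1/origin/save/<visit_type>/url/<origin_url>/` on archive.softwareheritage.org, and the names `API_ROOT`, `save_request_url` and `request_archival` are ours, chosen for illustration; please check the API documentation before relying on it.

```python
import urllib.parse
import urllib.request

# Assumed base URL of the public REST API; verify against the API documentation.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(visit_type: str, origin_url: str) -> str:
    """Build the URL of a Save Code Now request (illustrative helper)."""
    # Keep ':' and '/' unescaped so the origin URL stays recognizable in the path.
    quoted_origin = urllib.parse.quote(origin_url, safe=":/")
    return f"{API_ROOT}/origin/save/{visit_type}/url/{quoted_origin}/"


def request_archival(visit_type: str, origin_url: str) -> int:
    """Send the request as a POST with an empty body and return the HTTP status."""
    request = urllib.request.Request(
        save_request_url(visit_type, origin_url), data=b"", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `save_request_url("git", "https://github.com/user/repo")` only builds the URL; nothing is sent until `request_archival` is called. With a webhook configured on the forge, this step is triggered automatically instead of being run by hand.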

Last, but not least, we have introduced a GraphQL API that greatly simplifies programmatic access to the archive: users can try it out using the Software Heritage GraphQL Explorer. It complements the traditional Software Heritage REST API and enables clients to craft robust queries and seamlessly retrieve data from the server.
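To give an idea of what a GraphQL client looks like, here is a minimal sketch that prepares a query as a standard GraphQL-over-HTTP POST request. Several things in it are assumptions to be checked in the GraphQL Explorer: the endpoint URL, the `origin(url: ...)` field and its `url` sub-field. The helper name `build_graphql_request` is hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm the actual URL in the GraphQL Explorer.
GRAPHQL_ENDPOINT = "https://archive.softwareheritage.org/graphql/"

# Assumed schema: an `origin` field taking a `url` argument.
ORIGIN_QUERY = """
query OriginInfo($url: String!) {
  origin(url: $url) {
    url
  }
}
"""


def build_graphql_request(origin_url: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the query as JSON."""
    payload = json.dumps(
        {"query": ORIGIN_QUERY, "variables": {"url": origin_url}}
    ).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send the query. The point of the sketch is that the client names exactly the fields it wants in a single request, where a REST client would typically chain several calls.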

SWHID sees growing adoption and becomes the Software Hash Identifier

A key part of the Software Heritage infrastructure is the persistent identifier known as SWHID, which makes it possible to guarantee the integrity of software artefacts without relying on third parties, enabling better scientific reproducibility.

This year, SWHID adoption has been growing in academia. Close collaboration with CCSD and IES-INRIA led to SWHID deposit on HAL being open to all French researchers since January 2023, which greatly simplifies referencing research software in French institutional portals and generating the many reports often requested over an academic career. At the international level, the Computer Graphic Replicability Stamp Initiative (GRSI) now uses Software Heritage to archive the software associated with research articles and uses SWHIDs to reference it: when code is accepted for the Replicability Stamp, the initiative relies on Software Heritage to create a snapshot of the project and references the accepted version with the corresponding SWHID.

The SWHID identifier was developed at Software Heritage, where it has been in use in our archive for almost a decade. Since it can be computed independently and used in a variety of other applications, the time has come to create an independent specification, so that all stakeholders can benefit from it. To this end, after almost two years of intense work, an open working group has released the publicly available specification of the SWHID, whose name now stands for “Software Hash Identifier” rather than “Software Heritage Identifier” (pronounce it /ˈswɪd/).
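To illustrate what “computed independently” means in practice, here is a minimal sketch for the simplest case, the identifier of a single file content. It relies on the fact that a content SWHID is the prefix `swh:1:cnt:` followed by the same SHA1 that Git computes for a blob. The function name `content_swhid` is ours, and the other object types (directories, revisions, releases, snapshots) require a more elaborate serialization that is not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file content from its raw bytes."""
    # Git-style blob header: the word "blob", the length in bytes, a NUL byte.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # Identifier of the empty content.
    print(content_swhid(b""))
    # → swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Anyone holding a copy of a file can recompute its identifier this way and compare it with the one cited in an article, with no need to query the archive or any other third party.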

Software Heritage in European Research Projects

At Software Heritage, we have a long tradition of participating in collaborative research projects when we can help improve the way research software is archived, referenced, described and cited. On the infrastructural side of Open Science, groundbreaking work is ongoing in a dedicated work package in the FAIRCORE4EOSC European project to connect scholarly infrastructures with the Software Heritage archive. The first visible outcome is the partnership initiated with the swMATH portal to bridge mathematical publications with comprehensive software records, enriching the scholarly landscape. This year, we also contributed to a collaborative effort by two such projects, FAIR-IMPACT and FAIRCORE4EOSC, during the RDA P20 plenary in Gothenburg.

Software Heritage is also part of the SoFAIR project, recently awarded through the CHISTERA Open Research Data & Software Call, whose goal is to improve the discoverability and reusability of open research software, in line with our commitment to advancing the accessibility of software source code artifacts.

Research on Software Heritage

Campus Cyber – Paris | © Inria / Photo B. Fourrier

Software Heritage is an archive, but also an exceptional infrastructure to enable research on software development. This year, we embarked on the SWHSec project, announced during the launch of a new national research and innovation program on cybersecurity (PTCC). This initiative brings together eight expert research teams specializing in security, software engineering, and open-source software to build cybersecurity tools on top of the Software Heritage infrastructure.

Software Heritage and Large Language Models for Code

We acknowledge the huge potential of the Software Heritage archive for the training of machine learning models, particularly large language models (LLMs) that can automatically generate code to assist with software development tasks. In alignment with our mission, we advocate for a transparent and respectful approach to the development of these models, as detailed in our statement for acceptable machine learning use of the Software Heritage archive.

Saving Inria’s software legacy

To safeguard Inria’s software legacy, we started a collaboration with the Inria alumni network and the Direction of Culture and Scientific Information (DCIS) to reach out to people who formerly worked at Inria and invite them to help enrich the inventory of software heritage created at Inria since its inception.

A first result of this effort is the publication of the story of the web browser and editor Amaya. It leverages the Software Stories interface, created in 2021 in collaboration with the Science Stories team and the University of Pisa with UNESCO’s support.

Software, a pillar of Open Science

Software, and its source code, is a pillar of Open Science, and Software Heritage has been recognized by the Global Sustainability Coalition for Open Science Services (SCOSS) for its key role in ensuring continuous access to software as a research output. We look forward to seeing many new members join the newly created Archives and Libraries Interest Group (ALIG) that will bring together academic stakeholders worldwide.

Thanks to our sponsors

We’re grateful to our sponsors, including our new additions Hugging Face, ServiceNow, and Scanoss: it is their continued support that enables us to make progress in this long-term mission.

First international mirror

And we finished this intense year with the launch of the first international mirror of the Software Heritage Mirror Network by ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development. This is a key milestone in the long-term preservation strategy for all our software commons. It is the result of long years of technical and organisational development efforts that will make it much easier for the other forthcoming mirrors to go into production.

Roberto Di Cosmo
Director, Software Heritage
