update : Added a few comments on making the commons tangible
In this post, we bring together two ideas which have helped to inform how we build the Africa-Arabia Regional Operations Centre - “executable infrastructure” and the “e-infrastructure commons”.
Beginning to worry that "commons" like "excellence" is a word we can all agree on, by avoiding having to admit we don't know what it means..— CⓐmeronNeylon (@CameronNeylon) June 22, 2017
We would like to put some form to these vague concepts, and hopefully this form will provide better encouragement for those involved in the ROC - resource owners, operations specialists, developers, contributors of all kinds - to continue their support. At a bare minimum, we would like to have a public statement of intent, which contributors can decide whether or not to endorse. A web page that says something like :
Infrastructure is for all, and it takes form in code. We recognise the following principles … If you do too, sign the contributor agreement and contribute your infrastructure code to the commons.
Then we would have a form of non-binding adherence to a code of conduct which encourages the contribution of code for building infrastructure, to a repository.
The commons, the commoners
Underpinning all of this is the “infrastructure as code” paradigm - a way of representing the physical hardware, the operating systems, the applications, and all of the connections between them, in software. This paradigm goes hand-in-hand with the “DevOps” movement, and implies a growing set of tools and workflows for expressing, testing, and delivering fully-functional platforms. This is what we are all about in the ROC, with the added complication that we are a federated, distributed infrastructure aiming to offer resources to a wide range of scientific groups via more or less unique interfaces. We want everyone using the services in the ROC to be able to do so in the same way (or at least, however they prefer), across all of the sites.
We use code for this - configuration and orchestration software. It may be “meta”, but it’s still code, and it is just as tangible a representation of our actual infrastructure and services, once it is executed, as the hardware itself.
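To make "infrastructure as code" a little more concrete, here is a minimal sketch in Python. It is not the tooling used in the ROC: the `Service` type, the `plan` and `apply` functions, and the example service are all hypothetical, chosen only to show the idea of describing a desired state and computing the actions needed to reach it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Desired state of one service at one site (illustrative fields only)."""
    name: str
    version: str
    port: int


def plan(desired: list[Service], current: dict[str, Service]) -> list[str]:
    """Return the actions needed to move `current` towards `desired`.

    A site that already matches the description yields an empty plan,
    which is the idempotence that configuration tools aim for.
    """
    actions = []
    for svc in desired:
        have = current.get(svc.name)
        if have is None:
            actions.append(f"install {svc.name} {svc.version} on port {svc.port}")
        elif have != svc:
            actions.append(f"update {svc.name} to {svc.version} on port {svc.port}")
    return actions


def apply(desired: list[Service], current: dict[str, Service]) -> dict[str, Service]:
    """Pretend to execute the plan by recording the desired state."""
    for svc in desired:
        current[svc.name] = svc
    return current
```

Calling `plan` on an empty site lists an install action for every service; calling it again after `apply` returns an empty list. The same description can therefore be run at every site without anyone having to remember what was already done.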
However, we now run into some long-standing general issues around sharing software. There are several legal issues (liability, ownership, copyright, etc.) which, although considered since the start of the Free Software Movement, take on subtle implications now that we are building infrastructure. We want to have a framework which makes it not only easy for collaborators at institutes to contribute their code, but downright attractive.
Why should an institute working on orchestration for a compute service, a cloud platform, a data infrastructure, a science gateway, or an identity management system wish to contribute that code to a repository ? This brings to light the idea of the “commons”[1].
Why are we doing all this ?
There is a tug towards opening up access to, and participation in, the products of publicly-funded research, and parts of the Open Science movement are due to this tug. This often addresses scholarly communication and research objects like data sets and other research outputs. The H2020 guidelines on dissemination of research outputs state that grant winners have to deposit (publish) the data, software, and other output… but they do not say anything about the actual infrastructure that is built. If we consider that our infrastructure is code, it’s not a huge leap to expect that code to be deposited and published too.
What does e-IRG say ?
The 2016 e-IRG Roadmap states in section 5.4 :
Another area of legal issues concerns software licenses. Software licences should allow free use (but not necessarily free of charge) by different scientific communities via the communication (network) infrastructure. Other areas have been dealt with in the past and progress has been made with the recent laws in the areas of State Aid, Data Protection and Network Regulation developed at European level. Scientific software offered by national, regional and disciplinary centres is (and should be) under special licenses tailored to the local requirements negotiated by software licence providers.
Software tools provided as a service under the e-Infrastructure Commons must either have a public license or should be offered under special license conditions for the European scientific community.
This statement aligns our activity to build and publish an executable expression of infrastructure with the roadmap that the e-IRG has envisioned for the European Open Science Commons.
Public infrastructure with public funds
First of all, let’s try to narrow down what kind of code we are talking about. We are not referring to the software of the application - the actual data storage service, the actual HPC application, the actual local resource manager, the actual identity store. We are referring to how that software is deployed - the configuration, monitoring, integration, etc. The intellectual property of the person, team or institute which developed the service is not in question, but rather the ability to reproduce and reliably deliver that service in an arbitrary environment. In a research environment, usually created with public funds, this is almost an obligation.
Extending and scaling infrastructure
The second issue we are trying to address is that of interoperability and scalability of services in an environment which has too few humans. We face a severe lack of resources of all kinds in Africa, so wasting the most precious ones (humans) on needlessly installing and tuning services by hand is evidently a bad idea. If instead we could deliver code which had already been developed for certain environments, tested against them, inspected and reviewed by peers, and could be executed in order to build e-Infrastructure, this would be a much better investment of effort[2]. It would also make it much easier to extend infrastructure, wherever resources could be found - at new sites (universities), or at commercial providers.
Resilience to change
Finally, adopting this paradigm of infrastructure as code, and providing a framework in which it is feasible and attractive to share that code makes it run. Our platforms and middleware stacks are changing rapidly. The concept of infrastructure is also changing, and so are the expectations of research communities. In a way, this is a good thing - many more opportunities are open to many more people now - but with these changes sometimes comes an erosion of the desire to collaborate and contribute to a common effort.
Instead of one middleware stack to integrate fairly similar resources, we now have choices to make all the way from the OS and kernel to the authentication mechanism. It can be overwhelming, especially in the absence of strong leadership and community.
E Pluribus Unum ?
Having a single entity in Africa which owns, operates and arbitrates e-Infrastructure is neither feasible nor desirable. On the other hand, making “every provider a king” ends in chaos. There is clearly a tradeoff to be reached. Some complexity is required in order to offer relevant services and allow for some creativity. Some diversity in compute frameworks and platforms is desirable in order to satisfy various workflow demands, but too much makes it confusing to the user and creates too much overhead for the operator.
Nor, in my opinion, can the right balance be determined a priori.
Touching the commons
Experience is probably the best guide when determining what to support and integrate and what to consider “outside” the commons. However, it should be agreed that there is a commons, and it should be tangible.
The big idea is to have a library of infrastructure as code, which :
- describes what we have all built
- is expressible in some executable way
- is attributed to the correct people
- can be easily shared
- is attractive to contributors
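As a rough illustration of what one entry in such a library might record, here is a hedged sketch in Python. The `CommonsEntry` type, its field names and the `missing_properties` helper are assumptions made for this example, not an existing schema. The check covers only the properties that can be verified mechanically; whether an entry is attractive to contributors is not something a script can decide.

```python
from dataclasses import dataclass, field


@dataclass
class CommonsEntry:
    """One item in a hypothetical library of infrastructure code."""
    name: str
    description: str = ""   # describes what was built
    entrypoint: str = ""    # how to execute it, e.g. a path to a playbook or script
    authors: list = field(default_factory=list)  # who to attribute it to
    license: str = ""       # terms under which it can be shared


def missing_properties(entry):
    """List which of the commons properties this entry does not yet document."""
    checks = {
        "description": bool(entry.description.strip()),
        "executable entrypoint": bool(entry.entrypoint.strip()),
        "attribution": len(entry.authors) > 0,
        "license": bool(entry.license.strip()),
    }
    return [label for label, ok in checks.items() if not ok]
```

An entry with a description, an entrypoint and a license but no authors would be reported as missing only "attribution", which gives a reviewer a simple checklist before accepting a contribution.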
This could provide the basis for all kinds of other scholarly and technical output. It would be a tangible representation of things that are so often hidden in data centres, trenches or the deep blue sea.
Legal, License, and Community Conduct Issues
As soon as we start talking about sharing and contributing, we run into intellectual property issues. These are not by any means intractable, and are governed by the license that is chosen to stipulate how and when people can use the software, as well as what liability the copyright holder has towards the end-user. The end user in this case is an institute or group participating in the infrastructure by providing resources.
Since we are building code for infrastructure, we want to :
- Leave ownership of the code with the contributor
- Waive liability
- Ensure a certain level of quality
- Accept appropriate contributions
- Define a form of community maintenance responsibility
Some of these can be covered by a license, whilst others can be encapsulated in the form of a Code of Conduct in the repository.
So, which license should we choose ? The obvious place to start would be choosealicense.com. We had a bit of a discussion about which license to choose on the forum. But what about other infrastructure projects ? After a brief scan of some of their repos, I came up with the list below[3] :
- DataCite : MIT
- EUDAT : GPL ?
- EGI : Apache-2.0
- ORCID : MIT
- Zenodo : GPLv2
- OpenAIRE : MIT / Apache-2.0
- GEANT : MIT / Apache-2.0
We have a solid base of code to start from - AAROC/DevOps. This is where we want to encourage contributions, but it is not yet clear how or why people should contribute. The first steps are :
- Clarify the contributor agreement and code of conduct, by writing a statement of principles on the AAROC public web page
- Add a clear call for contributions to that page
- Write a code of conduct in the repository, outlining how contributions are accepted, reviewed and published.
- Create a “contributors” file to acknowledge the ownership and contribution of various people and institutes.
Eventually, we could curate these contributions via a dedicated collection or repository.
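The “contributors” file mentioned in the steps above could be generated rather than maintained by hand. The following is only a sketch under assumptions: the record format (person, institute, contribution), the `render_contributors` function and the plain-text layout are all hypothetical, and no such script is claimed to exist in AAROC/DevOps.

```python
from collections import defaultdict


def render_contributors(records):
    """Render (person, institute, contribution) records grouped by institute."""
    by_institute = defaultdict(list)
    for person, institute, contribution in records:
        by_institute[institute].append(f"  - {person}: {contribution}")

    lines = ["CONTRIBUTORS", ""]
    for institute in sorted(by_institute):
        lines.append(institute)
        lines.extend(sorted(by_institute[institute]))
        lines.append("")  # blank line between institutes
    # Normalise to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"
```

Grouping by institute keeps both kinds of acknowledgement visible: the person who wrote the code and the institute which supported the work.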
Acknowledgements for discussion of ideas and background go to :
- Mozilla Science Lab, particularly their Working Open Workshop
- A lot of background came from reading the work of Nadia Eghbal, and in particular the contributing template
- Martin Fenner from DataCite
- The Github Open Source Guide
How to choose a license ?
The following folks and articles have informed much of what I’ve written here :
- Sam Halliday’s presentation at scalasphere[4]
- “A Quick Guide to Software Licensing for the Scientist-Programmer”[5]
- “How to choose a license for your software”[6]
References and Footnotes.
[3] This is neither an exhaustive nor an authoritative list. In some cases, I took the most-used license, when there were several.
[5] Morin A, Urban J, Sliz P (2012) “A Quick Guide to Software Licensing for the Scientist-Programmer”. PLOS Computational Biology 8(7): e1002598. https://doi.org/10.1371/journal.pcbi.1002598
[6] Sufi, S. (2015, September). “How to choose a license for your software”. Zenodo.