tools | iridium

iridium – ‘core’ institutional research data management plan development

February 15, 2013 by Lindsay Wood 2 Comments

Research data management plan authoring is a key part of our draft institutional RDM policy and good practice. Most RCUK funders (apart from EPSRC, currently) require a formal RDMP. NERC now require a pre- and post award RDMP. These types of templates are available in the DMP Online system.

We wanted to write an institutional RDMP within iridium for research projects that do not have a Funder mandated template (a fair proportion, ~66% of research projects?). This was to be as easy to complete by end user as possible (i.e. low time burden for researchers) and to be used across Faculties (disciplines) if possible.

What are the ‘essential’ RDMP questions (a ‘core plan’, Donnelley, 2012)? We reviewed several RCUK RDMP templates from different disciplines for similarities, but also distinctive and pertinent questions. Also we had project specific criteria together with good practice from DATUM RDMP template with strong actions and review (‘active plan‘) emphasis.

It was decided to pursue a post-award RDMP template approach for projects without a mandate plan, as less it was burden to write ‘core’ plans for projects that were not awarded in the end and maximise uptake. We noted need for key aspects of RDMP planning to be brought forward in pre-award processes and RIM systems (such as ethics which is already strongly monitored institutionally, but also including RDM costs/atypical data volume size (plus extended curation duration?)). For example, recommending for questions/planning to be a ‘flag’/check-box in a ‘minimal’ (‘ultra-minimal’?) RDMP check-list in existing RIM systems/pre-award Faculty peer review process.

—- —- —

iridium institutional template post-award RDMP v5 [DRAFT]

This template is for projects that DO NOT have a Funder mandated research data management (RDM) plan. Funding body requirements relating to the creation of a research data management plan are available from …

{ Our institution RIM system MyProjects contains research project administration data (see below). In the long term it would be useful to have this imported and auto-populated into a RDMP direct from RIM system. This aligns to the ‘header’ information in the DMP Online template }

Reference:
Proposal Type:
Proposal Title:
Proposal Short Title:
… …. … …. etc.

Contact details of named individuals (Role/Name/Unit):

MyProjects Owner:

Date of creation of this plan:
Plan version/supersedes:

Aims and purpose of plan: … …

[SCOPE NOTES: Guidance on completion of this plan is available from …. ‘DCC 1.x references link to additional guidance provide by the Digital Curation Centre]

1	Introduction and Context
1.1	Introduction and Context
	[DCC 1.2]: Short description of the project’s fundamental aims and purpose

	[DCC 1.3.2(re-worded)]: Describe how you have considered the Newcastle University RDM institutional policy and any Faculty/research group guidelines, together with any other policy-related dependencies:
	[From RC template] Document the RDM advice you have sought on planning your proposed project, including any consultation with projects using similar methods.

	[DCC 10.2]: Glossary of terms

2	Data Types, Formats, Standards and Capture Methods
2.1	Data Types, Formats, Standards and Capture Methods
	[SCOPE NOTE – for further guidance on ‘data’ definitions and the capture of non-digital data, please see XYZ]
	[DCC 2.1]: Give a short overview description of the data being generated or reused in this research
	[SCOPE NOTE – for further guidance on ‘open’ file formats, please see …]
	[DCC 2.3.3(re-worded)]: Which open file formats will you use, and why?

	DCC 2.3.4: What criteria and/or procedures will you use for Quality Assurance/Management?
	[SCOPE NOTE – for further guidance on ‘Quality Assurance/Management, please see …]
	DCC 2.5.1: Are the datasets which you will be capturing/creating self-explanatory, or understandable in isolation?

	[DCC 2.5.2]: If you answered No to [DCC 2.5.1], what contextual details are needed to make the data you capture or collect meaningful?

	[DCC 2.5.3]: How will you create or capture these metadata?

	[DCC 2.5.4]: What form will the metadata take?
3A	Ethics
3A	Ethics
	HAVE YOU COMPLETED A NEWCASTLE UNIVERSITY ETHICS APPLICATION?[YES] [NO] [NOT APPLICABLE] REFERENCE NUMBER:{ We already have strong RIM/institutional check points for ethics, we don’t want to duplicate information gathering, thus this section is brief. }

3B	Intellectual Property
3B	Intellectual Property
	[SCOPE NOTE – for further guidance Intellectual Property/licensing, please see …]
	[DCC 3.2.1]: Will the dataset(s) be covered by copyright or the Database Right? If so give details in DCC 3.2.2, below.

	[DCC 3.2.2]: If you answered Yes to [DCC 3.2.1], Who owns the copyright and other Intellectual Property?

	[DCC 3.2.3]: If you answered Yes to [DCC 3.2.1], How will the dataset be licensed?

4	Access, Data Sharing and Re-Use
4.1	Access, Data Sharing and Re-Use
	[From Research Council template] Are there issues of consent, confidentiality (including commercial), anonymisation and other ethical considerations?
	[From RC templates] What are the main risks to data security/ confidentiality?

	[DCC 4.2.3]: Are there any embargo periods for political/commercial/patent reasons?

	[DCC 4.2.4]: If you answered Yes to DCC 4.2.3, Please give details.

	[DCC 4.3.1]: Which groups or organisations are likely to be interested in the data that you will create/capture?

	[DCC 4.3.2]: How do you anticipate your new data being reused?

	[DCC 5.3.2]: How will you implement permissions, restrictions and/or embargoes?

	[DCC 4.1.1]: Are you under obligation or do you have plans to share all or part of the data you create/capture?

	[DCC 4.1.3]: If you answered Yes to DCC 4.1.1, How will you make the data available?

	[DCC 4.1.4]: If you answered Yes to DCC 4.1.1, When will you make the data available?

	[DCC 4.1.5]: If you answered Yes to DCC 4.1.1, What is the process for gaining access to the data?

	[From RC template] What will be the responsibilities of data sets users (for example as detailed in a ‘Statement of Agreement’)?
	[SCOPE NOTE – for further guidance responsibilities of data sets users and ‘Statement of Agreement’ wording, please see ….]
	[DCC 4.1.6]: Will access be chargeable?

5	Short-Term Storage and Data Management
5.1	Short-Term Storage and Data Management
	[DCC 5.1.1]: Where (physically) will you store the data during the project’s lifetime?

	[DCC 5.1.2]: What media will you use for primary storage during the project’s lifetime?

	[From RC template] What is the anticipated (‘ballpark’ figure) of data volume that will be collected? Will this vary after processing?

	[DCC 5.2.1]: How will you back-up the data during the project’s lifetime?

	[DCC 5.2.2]: How regularly will back-ups be made?

	Has the back-up process been tested and successfully validate?

	Who is responsible for back-up process?

	[DCC 5.3.1]: How will you manage access restrictions and data security during the project’s lifetime?

6	Deposit and Long-Term Preservation
6.1	Deposit and Long-Term Preservation
	[DCC 6.1]: What is the long-term strategy for maintaining, curating and archiving the data?
	[SCOPE NOTE – for further guidance curation and archiving of data sets, please see …]

	[DCC 6.2.1]: Will or should data be kept beyond the life of the project?

	What is your deletion policy? Will data sets be deleted? When, by whom and how will they be identified?

	[DCC 6.2.2]: If you answered Yes to DCC 6.2.1, How long will or should data be kept beyond the life of the project?

	[DCC 6.2.3]: If you answered Yes to DCC 6.2.1, What data centre/ repository/ archive have you identified as the long-term place of deposit?

	What is the anticipated (‘ballpark’ figure) of data volume that will be archived?

	[DCC 6.2.7]: Will transformations be necessary to prepare data for preservation and/or data sharing?

	[SCOPE NOTE – for further guidance data set transformations, please see …]

	[DCC 6.2.8]: If you answered Yes to DCC 6.2.7, what transformations will be necessary to prepare data for preservation / future re-use?

	[DCC 6.3.3]: Will you include links to published materials and/or outcomes?
	[SCOPE NOTE – for further guidance on include links to published materials and/or outcomes, including the Research Data Catalogue, please see …]

	[DCC 6.3.4]: If you answered Yes to [DCC 6.3.3], please give details.

	[DCC 6.3.5]: How will you address the issue of persistent citation?]
	[SCOPE NOTE – for further guidance persistent citation, please see …]

	[DCC 6.4.1]: Who will have responsibility over time for decisions about the data once the original personnel have gone?

7	Resourcing
7.1	Resourcing
	[DCC 7.1]: Outline the staff/organisational roles and responsibilities for research data management

	[DCC 7.2]: How will data management activities be funded during the project’s lifetime?

	[DCC 7.3]: How will longer-term data management activities be funded after the project ends?

	Describe how funding for RDM has been specifically been costed into funding application (where appropriate).
	[SCOPE NOTE – for further guidance on costings for RDM, please see …]

8	Adherence and Review
8.1	Adherence and Review
	[DCC 8.1.1]: How will adherence to this data management plan be checked or demonstrated?

	[DCC 8.1.2]: Who will check this adherence?

	[DCC 8.2.1]: When will this data management plan be reviewed?
	[SCOPE NOTE – for further guidance on review points for for RDM plans, please see …]
	[DCC 8.2.2]: Who will carry out reviews?

9	Actions Required
9.1	Actions Required
	Please list actions and timelines against named individuals identified as a result of completing this RDM plan.
	For example please indicate additional hardware, software and relevant technical expertise, support and training that is likely to be needed and how it will be acquired.
	For any deferred or unanswered questions outline how you plan to seek advice.

	Action: / Responsibility: / Review Date:-: / -: / -:-: / -: / -:

Signature			Date
Print name			Role/Institution

Signature			Date
Print name			Role/Institution

Signature			Date
Print name			Role/Institution

[Attribution]

DMPOnline: https://dmponline.dcc.ac.uk/

— — — —-

We are currently evaluating end user acceptance of this draft plan, time required to complete and support required to assist with writing.

Filed under human support, policy, reflection, tools Tagged with core, costing, DATUM, DMP, DMPonline, institutional template, minimal, RDM planning, RDMP

iridium – evaluation of DataStage and DataBank research data management tools from DataFlow project

February 14, 2013 by Lindsay Wood Leave a comment

DataFlow project background:

DCC catalogue record: http://www.dcc.ac.uk/resources/external/datastage

Two tools

(a) DataStage, for researchers to manage their research data locally.

“DataFlow lets researchers save their work to a DataStage file system that appears as a mapped drive on their computer, a lightweight system requiring them to install no special software on their computers.“

More details: http://www.dataflow.ox.ac.uk/index.php/datastage/users/researchers

(b) DataBank, to preserve and publish valuable research.

“DataStage is a secure personalized ‘local’ file management environment for use at the research group level, appearing as a mapped drive on the end-user’s computer.“

More details: http://www.dataflow.ox.ac.uk/index.php/databank

Firstly, it’s great that the DataFlow team have released this system openly for re-use. Below are some of our findings.

From a local technical infrastructure assessment:

Ubuntu is not our standard Linux platform (which is Red Hat/CentOS). It would almost certainly be possible to port the Dataflow packages to CentOS (and feed this back to the main project) or use Ubuntu as an appliance (but this would mean that the systems used for this would not be managed by our standard configuration system). Either option comes with a reasonably significant cost.

The feeling that we got from installation (testing prior to 24 July 2012) is that the system is in the early stages of its lifecycle and our assessment is that Dataflow is not yet of sufficient maturity to deploy in production at Newcastle. It would be worth re-evaluating this decision at a later time, this would be prioritised against end users who have tried the system i.e. the more that they liked it, the more worthwhile putting resources into trying it again/working with the DataStage developers.

In terms of initial user testing (in early August 2012/and on ‘v0.3.1rc2’ Oxford installation), initial feedback was:

User testing – DataStage:

Users liked the feature specification of what it offered as a tool (desktop integration through mapped drives, web access aiding working from home, do not need a designated computer for their research work, setting of different access writes (private, public, and collaborative) and the ‘invite to share’ options. System interface is fine, basic yet functional and could be ‘skinned’ to institutional brand. Uploading documents/data files is straightforward.

My opinion was if an institution had no existing RDM systems, it would be a very useful ‘bootstrap’ system providing a simple functional system.

Seamless integration of a data file staging system/VRE with the user desktop (ideally through ‘drag & drop’/mapping over existing user networked drives) and through web access are key features that are top of an ‘average’ researchers wish list.

Making sure research data sets can be appended with an appropriate level of metadata in ‘data staging’ RDM tools (or perhaps later in lifecycle as practical?), so that metadata can flow through to an eventual data catalogue/or national repository is important RDM requirement. Thus, making sure that this function is provided to researchers is important to flag and DataStage/DataBank are a good approach to this.

I thought more data file re-use metadata capture would have been an option in DataStage (noting manifest/Zip package upload feature), pulling in automatically from individual data file itself (that’s probably me being simplistic on technical aspects?) ahead of the DataBank stage?

We noted that not all users are comfortable or had success in Windows drive mapping (network path errors), so some end user support would be needed. Users have high expectations on usability – ‘as easy as DropBox’.

Error messages while testing – access forbidden, 505/405, ‘submit as data package’ – where an entered/saved password was looping? (more helpful customisation of error messages, such as ‘this problem normally occurs because of x, y or z – wrong password, wrong file path, etc.’. (rather than ‘Error 505’/’Error 404’ would be helpful.

User testing – DataBank

Liked:

– Simple, clean functional interface – again could be ‘skinned’ to instituitional brand.

– Current search/’on-off’ filters was good

– Assigning a DOI/RDF were useful RDM specific features.

– Licensing/embargo fields

– Simple admin interface

– CSV/JSON exports are useful

– Rest API was documented

Suggestions:

– Clarifying, who was intended user audience for DataBank? Researcher or archivist?

– Terminology – not understood by user testers – ‘Silo’, ‘Mediator’, ‘Aggregate’ – obviously this could be changed easy.

– RDF and click through access to XML schema was confusing for our testers (they were not archivist, librarians, metadata experts – who would probably appreciate this function – i.e. package/manifest upload/explore)

– A basic tagging interface/fields to populate the RDF/XML for none specialists would be more friendly

– Again frequent error messages (404 not found/ 500 Internal Server Error, ‘Add manifest’ gives 505)

Documentation for DataStage/DataFlow researcher end users:

User documentation for researchers seemed a little sparse (I think the project/developers noted it is a work in progress i.e. https://github.com/dataflow/RDFDatabank/wiki). More end user documentation would facilitate wider take up. To note, technical installation documentation was more detailed with screen shares, etc.

We look forward to further DataFlow project developments.

DataFlow user forum is at: https://groups.google.com/forum/?fromgroups=#!forum/dataflow-users

Filed under evaluation, tools Tagged with DataBank, DataFlow, DataStage, testing, Tools, user testing

iridium – early findings on research data management planning (approaches, tools and writing plans)

October 6, 2012 by Lindsay Wood 3 Comments

Below is brief summary of some resources, findings and discussions on research data management plans (DMP) that have been noted along the way since the project start up. This has been collected from several activities and events such as iridium support team use of the MANTRA RDM online training package, project RDM tools assessment, together with attendance at the JISC Meeting (Disciplinary) Challenges in Research Data Management Planning Workshop and the DCC Roadshow North East.

Definitions

“Research data management refers to all aspects of creating, housing, delivering, maintaining, and archiving and preserving data. It is one of the essential areas of responsible conduct of research.” – MANTRA

“Plans typically state what data will be created and how, and outline the plans for sharing and preservation, noting what is appropriate given the nature of the data and any restrictions that may need to be.” – DCC

Purpose:

to assist in planning the research data management (RDM) aspects of your research

to assist you in making RDM decisions

to identify the RDM actions required

to highlight areas that need further thought

to provide a record of decisions made and actions taken

http://www.northumbria.ac.uk/static/5007/ceispdf/dmpguide.pdf

Attribution: Northumbria University School of Computing, Engineering & Information Sciences, 2012. CC-BY-SA

Benefits:

The benefits of managing your data include:

Meeting funding body grant requirements.

Ensuring research integrity and reproducibility.

Increasing your research efficiency.

Ensuring research data and records are accurate, complete, authentic and reliable.

Saving time and resources in the long run.

Enhancing data security and minimising the risk of data loss.

Preventing duplication of effort by enabling others to use your data.

Complying with practices conducted in industry and commerce.

– MANTRA

Local DMP practice/DAF survey results

From our survey (128 projects), findings were were 23% of projects have a formal research data management plan for institutional as a whole, with further 33% having a partial RDM plan (by Faculty split suggested a slightly higher proportion in line with a likely higher proportion of Research Council awards). I expect this is similar across sector? Open Exeter project reported ‘few researchers have experience of completing a data management plan ‘ from their DAF survey.

Policy

Institutional policies on DMP, some examples (see also DCC website):

Edinburgh, point 3: http://www.ed.ac.uk/schools-departments/information-services/about/policies-and-regulations/research-data-policy

Lincoln (draft), point 4: https://github.com/lncd/RDM-Policy/blob/master/Lincoln%20RDM%20Policy.md

Warwick , point 7: http://www2.warwick.ac.uk/services/rss/researchgovernance_ethics/research_code_of_practice/datacollection_retention/reseatch_data_mgt_policy

Funder polices on DMP:

Various requirements at application and funded project stages. For exampe:

ESRC: http://www.esds.ac.uk/create/esrc/dataman/and http://ukdaresearchdatamanagement.blogspot.co.uk/

NERC: http://www.nerc.ac.uk/research/sites/data/dmp.asp?cookieConsent=A

MRC: http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/DMPs/index.htm

See also DCC mappings across 6 funder policies to generic DCC Checklist (July 2011)

Training and guidance on research data management planning

External institutional support pages guidance:

http://www.admin.ox.ac.uk/rdm/dmp/plans/

http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/why-research-data-policy

http://www.gla.ac.uk/services/datamanagement/creatingyourdata/dataplanning/

MANTRA training package covers DMP.

Advocacy for why DMP is important:

“… the role of data management for a new researcher as being one of those essential skills that you really ought to get at the same time as you learn how to handle your references, as you understand methodology, as you get to grips with the theory that is going to set the frame by which you do your research. And it sits alongside those and it’s equal to them …” – Professor Jeff Haywood, Vice Principal, CIO & Librarian, University of Edinburgh talks about the role of of data management for PhD students and early career researchers

“… it actually gives you a really good framework and for my postgrads now I am pointing them towards that and saying hey, you know, take a look at that because it will help you to think about how you’re going to gather your data and how you are going to look after it from the beginning to the end of the project. It gives you a framework to deal with it rather than realizing too late that you haven’t done some things that you should have done and therefore you’ve made your life and perhaps actually cause problems for you with the use of data subsequently or sharing your data is made that more difficult.” – Professor Jeff Haywood, Vice Principal, CIO & Librarian, University of Edinburgh talks about the role of data management for PhD students and early career researchers

Attribution: EDINA and Data Library, University of Edinburgh. Research Data MANTRA [online course]. http://datalib.edina.ac.uk/mantra

Also, available as a video.

DCC resources:

http://www.dcc.ac.uk/resources/data-management-plans
Checklist, guidance, and an online tool.
DCC 8 key headings of a DMP (leaflet) – http://www.dcc.ac.uk/webfm_send/371
Practical guidance: http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples

Discussion on DMP approaches and reviewing the styles of questions, format, how ‘active’ in approach

Oxford DMPOnline Project wrote on and discussed the detail of research data management plans – very interesting reading. They discussed the concept of ‘plan questions’, ‘project questions’, ‘data questions’ and common issues they found when reviewing DMPs – such as compound questions, duplicates, and individual plan unique questions [link to table XLS]. On DMP style they noted – discursive versus concise, ‘metadata’ versus ‘data’ questions, option to add possible responses and overall gaps in DMP scope and plans that lead to quantified expected data sizes/acquisition rates (resulting in actionable identification of requirements that can be report to a central service provider, as a result of the plan).

The conclusions should be read:

“.. difficult work, since there are many possible questions ..”

“.. avoid asking ambiguous questions ..”

“.. avoid asking for the same or similar information multiple times..”

“..unique questions not covered by the DMPonline ..”

“.. all of the available question sets have drawbacks ..”

“.. in terms of comprehensiveness, the best may be the enemy of the good enough..”

“.. devise and standardize the best possible set of questions for different constituencies of user ..”

[and more …]

http://datamanagementplanning.wordpress.com/2012/03/27/dmp-questions-comparisons-and-conclusions/

DMPs for different audiences – from targeted plans to template author background ‘bias’/priorities

Life-cycle stage specific: from (conception?), pre-award, post-award, to post-project.

Postgrad research project versus PI bidding for new funding.

Curator/archiver versus researcher orientated.

DMP online authoring tools or offline Word/PDF templates

Online tool has many useful advanced staging, customisation & collaboration features.

Online systems:

DMPOnline – pre-award, post-award, post-project, templates for Research Council/major funders, default templates, post-grad, etc. Features – add additional questions from DCC checklist, save, share/collaborate, copy, export to Office files, etc.

DCC/DMPOnline:

Introduction and Context

Data Types, Formats, Standards and Capture Methods

Ethics and Intellectual Property

Access, Data Sharing and Re-Use

Short-Term Storage and Data Management

Deposit and Long-Term Preservation

Resourcing

Adherence and Review

– DCC website/DMPOnline

DMPOnline tools training:

http://www.dcc.ac.uk/webfm_send/879
http://www.dcc.ac.uk/webfm_send/881
http://www.screenr.com/Syo

DMPOnline advocated by MRC (ref 14), etc.

Institutional customisation or tailoring for local use available.
GitHub code: https://github.com/DigitalCurationCentre/DMPOnline (Ruby on Rails/MySQL)

Offline templates (Word/PDF format):

Some users do not like online systems, are overwhelmed by array of features/customisation options and just want a ready to go familiar Office document to type into.

DATUM: http://www.northumbria.ac.uk/sd/academic/ceis/re/isrc/themes/rmarea/datum/action/outputs/?view=Standard

Shotton 20 Questions http://datamanagementplanning.wordpress.com/2012/03/07/twenty-questions-for-research-data-management/ [CC:BY 3.0]

Bath 360 (postgrad-specific): http://blogs.bath.ac.uk/research360/2012/03/postgraduate-dmp-template-first-draft/

DMPTPsych(York) (postgrad?): http://www.dmtpsych.york.ac.uk/docs/pdf/dmpt_guidance.pdf

MRC: http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC008617

Wellcome Trust: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Guidance-for-researchers/index.htm

Wider DMP discussions

extent of pre-population of template with default institutional information to aid researcher versus reducing actual thinking/planning for RDM
experience in information banks of DMPs, shared pool, ‘successful’ DMP
DMP online tools – metadata transfer protocols between systems/integration in existing RIM systems
DMP training needs, online/in person, embedding with training – ‘dual service engagement‘ (i.e. Monash), see DCC ‘support researchers with DMP‘
DMP embedding in existing institutional processes, internal peer review, funder review, DMP fields (i.e. data size) resulting a RIM system flag or automatic central service trigger
time requirements for writing a plan – minimal plans/resources required to support/advise/review DMP
DMP auditing – institutional, Funding Council, etc.
wider use as knowledge/information base for forward institutional planning, storing DMP (or parts of ) with an archived data set, re-use to support metadata population

JISC Research Data Management Planning Projects

Strand B: DATUM, DMPSPsych, History DMP, etc.

Next steps for iridium project:

Reporting on initial user testing of DMPOnline and other templates, authoring a local DMP template and hosting options.

Filed under evaluation, human support, policy, postgraduate, reflection, tools Tagged with "Wellcome Trust", award process, conception, data management plan, data managment planning, DATUM, DMP, DMPonline, embedding, lifecycles, MRC, proccess, RDMP, template

SWORD v2 – From clueless to claymore

October 4, 2012 by Andrew Martin 3 Comments

What follows is a summary of my steps along the path of investigating what the sword technology is, through to being able to actually start to code something useful; I should probably point out that the beginning of this post can be consumed by less technical persons as a quick overview, but the later section assumes that you…

Have some knowledge of coding java
Have worked with java server containers (e.g. tomcat) before
Can place the libraries in an IDE like netbeans/eclipse to do “something” with them

(Since my investigations centered around sword in conjunction with Sakai and e-science central my language of choice therefore is Java).

I should also declare that I still don’t fully understand all of the implementation but this should help you along your way if you’re just starting out!

Taking the Sword course

My first port of call was the SWORD website itself, which will point you to some useful videos and slides to give you insight into what the technology is and what it can be used for. In short, this is what the “Sword Course” will teach you…

An Introduction To SWORD (Video/Slides)

What it is:

The “Simple Webservice Offering Repository Deposit” technology (or SWORD for short) intentionally only deals with how to get data into a repository, nothing else, and is complementary to something like dublin core used to describe stuff in a repository; it also does not deal with packaging, metadata, authentication or authorisation

Existing implementations can be found in:

DSpace
Eprints
Fedora
Intralibrary
Zentity

SWORD Use Cases (Video/Slides)

Use cases sword is trying to enable:

Deposit from a desktop machine
Deposit to multiple repositories (For example to allow depositing once and ending up in an institution’s repository, funder’s repository and a subject specific repository)
Deposit from a piece of lab equipment (non-human depositing data)
Deposit from one repository to another (For example Institutional repository to National repository which may have differing visibility of the data…. Dark and light repositories, dark = can’t be seen private repositories, light = can be seen public repository)
Deposit from external publisher/publishing system to long term storage (For example from OJS to your own institution’s repository)

How SWORD Works (Video/Slides)

SWORD is in the form of an “XML REST webservice to put things in a repository”
It has built on the resource creation aspects of the ATOM Pub standard which is for publishing content to the web
SWORD is an extension, or “profile”, of the ATOM Pub spec and existing ATOM Pub clients can be made to work if the relevant extensions are added
SWORD version 2 now includes full CRUD

When you use a sword client, this is basically what happens…

The client asks a repository to describe itself
The server returns a “Service document” to describe what you can do with the repository and how to do that. The service document is typically hidden by Basic Auth Authentication (AM: I think this is crying out for an OAuth implementation!) but once authenticated the web service will customise the service document to what you are allowed to / should do with the system. The server can also describe what data formats you want to accept, where it will go and how long you will store it etc… this is your “collection policy”
The client then uses the service description to format your data and then deposit it

What sword adds to ATOM Pub:

Accept Packaging – Tells the client what types of data the server accepts
Mediated Deposit – Allows you to deposit “as” / “on behalf of” someone else, a repository can say whether it allows this or not
Developer features – You can state that you want verbose output to say what happened (v1.3 featured a dry run feature called “no-op” that does not actually deposit or do anything. NOTE: this does not appear to be in v2 anymore)
Nested Service document – Where there may be many repositories for the service, the top level document provides links to sub documents, instead of repeating the same or similar definitions in one enormous file.

SWORD clients (Video/Slides)

There are generally three sorts of client:

Machine to machine – for very specific automated deposits (lab equipment)
General – human would use, talks to any repository
Specific – for depositing certain data into certain repositories in a given way that has an extra context of a general client, i.e. depositing specific journals, depositing data for a particular project

Interesting possibilities for deposit scenarios:

Depositing by email (see the vimeo video 20:46->25:00)
Depositing from word
Amongst others…..

Writing something useful

My next stop on the journey through sword looked at actual code, how it’s laid out and what you need to do in order to start doing something useful.

As mentioned previously, I have been basing my investigations around the Java client and server libraries but I’d strongly recommend you also get a good grounding in the workings of ATOM Pub (HTML Version) and the Sword Profile specifications themselves. If you’ve ever read specification documents before you’ll know they can make quite dry reading, however, since ATOM Pub and sword are relatively straight forward technologies and the specs only reach into 30-50 odd pages it’s really worth a browse through.

How SWORD works in java

Firstly, it’s probably best to understand a few basic concepts you will need to deal with, the main outputted concepts/objects are:

IRI‘s – unique identifiers to a resource
Entry – A deposit, has IRI’s/metadata
Media – An entry for media (word docs/pdfs/images), can be linked to in an Entry
Collection Entries – ATOM Pub collection of entries (member resources)
Collections – A set of deposited entries represented by an Atom Feed document, you can distinguish a collection feed from “a.n.other feed” by the presence of a collection IRI in service document
Workspaces – A compartmentilisation concept for repositories, has a name but doesn’t have IRI’s or processing models
Service Documents – Groups “Collections” in “Workspaces”, can indicate accepted media types and categories of collections

Next, I found you learn the most by studying the server libraries, what you get is a bundle of java and some set up files for your container. We’ll firstly look at the setup in web.xml

Setup and Servlet mappings (web.xml)

The main servlets (i.e ultimately your rest endpoints) that are defined are…

servicedocument

Class: org.swordapp.server.servlets.ServiceDocumentServletDefault
URL: http://<yourServer>/<yourWebapp>/servicedocument/*
Purpose: Listing workspaces and collections

collection

Class: org.swordapp.server.servlets.CollectionServletDefault
URL: http://<yourServer>/<yourWebapp>/collection/*
Purpose: Retrieveing and deposting to/from collections/feeds and entries

mediaresource

Class: org.swordapp.server.servlets.MediaResourceServletDefault
URL: http://<yourServer>/<yourWebapp>/edit-media/*
Purpose: To edit media entries

container

Class: org.swordapp.server.servlets.ContainerServletDefault
URL: http://<yourServer>/<yourWebapp>/edit/*
Purpose: To retrieve/add/edit entries and metadata

statement

Class: org.swordapp.server.servlets.StatementServletDefault
URL: http://<yourServer>/<yourWebapp>/statement/*
Purpose: ?

The code makes heavy use of interfaces to allow the implementer more freedom to create functionality using the server library in the way they want to, in order to tell the server libraries what code we want to instantiate and what auth. mechanism we will be using, you must set some context parameters to define those settings and implementations used at runtime:

auth

param-name: authentication-method
param-value: Basic or None (default: “Basic”)

You can set an Authorization header and base64 encode user:password (e.g. try going to a basic auth encoding website and encode “user:password” in plain text box) and send as a header… “Authorization” “Basic dXNlcjpwYXNzd29yZA==”, or, if you prefer set the param-value to “None” for no authorisation. I found (at the time of writing) the default code actually has a bug which means turning the auth off doesn’t work correctly. I found the best way of correcting this was in my own war project (which includes the server libraries as a dependancy) I created a org.swordapp.sever package (where I was implementing the objects needed for the interfaces) and dropped in a copy of the SwordAPIEndpoint.java to override the implementation in the library, I then changed the getAuthCredentials to…

protected AuthCredentials getAuthCredentials(HttpServletRequest request, boolean allowUnauthenticated) throws SwordAuthException
{
   AuthCredentials auth = null;
   String authType = this.config.getAuthType();
   String obo = "";
   this.log.info("Auth type = "+authType);
   //If we are insisting on "a" form of authentication that is not of type "none"
   if(!allowUnauthenticated && !authType.equalsIgnoreCase("none"))
   {
      // Has the user passed authentication details
      String authHeader = request.getHeader("Authorization");
      // Is there an On-Behalf-Of header?
      obo = request.getHeader("On-Behalf-Of");
      // Which authentication scheme do we recognise (should only be Basic)
      boolean isBasic = authType.equalsIgnoreCase("basic");

      if(isBasic && (authHeader == null || authHeader.equals("")))
      {
         throw new SwordAuthException(true);
      }
      // decode the auth header and populate the authcredentials object for return
      String[] userPass = this.decodeAuthHeader(authHeader);
      auth = new AuthCredentials(userPass[0], userPass[1], obo);
   }
   else
   {
      log.debug("No Authentication Credentials supplied/required");
      auth = new AuthCredentials(null, null, obo);
   }
   return auth;
}

The following context parameters set the implementations of interfaces used to implement functionality in the endpoints. However, you are not given default implementations for each of these, so in your war you need to create a new class that implements the respective interface and fill in your functionality….

collection-list-impl

param-value: org.swordapp.server.CollectionListManagerImpl
Interface it implements: org.swordapp.server.CollectionListManager

service-document-impl

param-value: org.swordapp.server.ServiceDocumentManagerImpl
Interface it implements: org.swordapp.server.ServiceDocumentManager

collection-list-impl

param-value: org.swordapp.server.CollectionListManagerImpl
Interface it implements: org.swordapp.server.CollectionListManager

collection-deposit-impl

param-value: org.swordapp.server.CollectionDepositManagerImpl
Interface it implements: org.swordapp.server.CollectionDepositManager

media-resource-impl

param-value: org.swordapp.server.MediaResourceManagerImpl
Interface it implements: org.swordapp.server.MediaResourceManager

container-impl

param-value: org.swordapp.server.ContainerManagerImpl
Interface it implements: org.swordapp.server.ContainerManager

statement-impl

param-value: org.swordapp.server.StatementManagerImpl
Interface it implements: org.swordapp.server.StatementManager

config-impl

param-value: org.swordapp.server.SwordConfigurationDefault (Yes, this one does have a default implementation in the library you can use)
Interface it implements: org.swordapp.server.SwordConfiguration

Endpoint Servlet classes (org.swordapp.server.servlets.*)

Let’s now have a look at the servlets themselves, each servlet contains interfaces which are implemented by loading the classes specified in the web.xml (see above).

All servlets used in the server library extend the “SwordServlet” which (obviously) extends the HttpServlet. Since the SwordServlet contains the server configuration object, all servlets (through inheritance) also hold an implementation of the server configuration (i.e. SwordConfiguration) and a method to allow servlets to load classes from the configuration….

SwordServlet encapsulates:

SwordConfiguration interface, instantiated using config-impl
loadImplClass() method used for loading implementing classes from tomcat context params

CollectionServletDefault extends SwordServlet and encapsulates:

CollectionListManager interface, instantied using collection-list-impl
CollectionDepositManager interface, instantied using collection-deposit-impl
CollectionAPI object

ServiceDocumentServletDefault extends SwordServlet and encapsulates:

ServiceDocumentManager interface, instantiated using service-document-impl
ServiceDocumentAPI object

MediaResourceServletDefault extends SwordServlet and encapsulates:

MediaResourceManager interface, instantiated using media-resource-impl
MediaResourceAPI object

ContainerServletDefault extends SwordServlet and encapsulates:

ContainerManager interface, instantiated using container-impl
StatementManager interface, instantiated using statement-impl
ContainerAPI object

StatementServletDefault extends SwordServlet and encapsulates:

StatementManager interface, instantiated using statement-impl
StatementAPI object

Endpoint Servlet dependant classes (org.swordapp.server.*)

Those with a keen eye will have noticed that each servlet is also holding an “API” object, these objects fill out the standard Get/Post/Put/Delete HttpServlet methods that the servlets override by taking the configuration object and any interfaces that have been implemented and combine them to do something useful. Similarly to the servlets, they all extend a Sword API super class called “SwordAPIEndpoint”, which holds a SwordConfiguration implementation. The hierarchy (and interfaces they encapsulate) looks like this…

SwordAPIEndpoint

SwordConfiguration interface

CollectionAPI extends SwordAPIEndpoint

CollectionListManager interface
CollectionDepositManager interface

ServiceDocumentAPI extends SwordAPIEndpoint

ServiceDocumentManager interface

MediaResourceAPI extends SwordAPIEndpoint

MediaResourceManager interface

ContainerAPI extends SwordAPIEndpoint

ContainerManager interface
StatementManager interface

StatementAPI extends SwordAPIEndpoint

StatementManager interface

Implementations of interfaces (org.swordapp.server.*)

I keep mentioning the objects that implement the interfaces, I thought it might be useful to go through in “slightly” more detail what the content of those objects are intended for. Apologies once again, this is not exhaustive as I have not worked my way through what all the methods are intended for:

SwordConfigurationDefault implements org.swordapp.server.SwordConfiguration

This is the default object which holds the configuration for the server

CollectionListManagerImpl implements org.swordapp.server.CollectionListManager

Method: listCollectionContents()
Accessed via: GET http://<yourServer>/<yourWebapp>/Collection%5B/your collection]
Returns: org.apache.abdera.model.Feed
Purpose: Lists the collection entries in a collection

CollectionDepositManagerImpl implements org.swordapp.server.CollectionDepositManager

Method: createNew()
Accessed via: POST http://<yourServer>/<yourWebapp>/Collection%5B/your collection]
Returns: org.swordapp.server.DepositReceipt
Purpose: Deposits to a collection

ServiceDocumentManagerImpl implements org.swordpapp.server.ServerDocumentManager

Method: getServiceDocument()
Accessed via: GET http://<yourServer>/<yourWebapp>/servicedocument/
Returns: org.swordapp.server.ServiceDocument
Purpose: Serves service documents (xml that explains the contents and deposit policies for the repository(/ies)

MediaResourceManagerImpl implements org.swordapp.server.MediaResourceManager

Method: getMediaResourceRepresentation()
Accessed via: GET http://<yourServer>/<yourWebapp>/edit-media/
Returns: org.swordapp.server.MediaResource
Purpose: ?

Method: replaceMediaResource()
Accessed via: PUT http://<yourServer>/<yourWebapp>/edit-media/
Returns: org.swordapp.server.DepositReceipt
Purpose: Swap a media resource (pdf/doc etc….) in the repository with the one being “PUT’ed”

Method: deleteMediaResource()
Accessed via: DELETE http://<yourServer>/<yourWebapp>/edit-media/
Purpose: Delete a media resource (pdf/doc etc….)from the repository

Method: addResource()
Accessed via: POST http://<yourServer>/<yourWebapp>/edit-media/
Returns: org.swordapp.server.DepositReceipt
Purpose: Add a media resource (pdf/doc etc….) to the repository

ContainerManagerImpl implements org.swordapp.server.ContainerManager

Method: getEntry()
Accessed via: GET http://<yourServer>/<yourWebapp>/edit/
Returns: org.swordapp.server.DepositReceipt
Purpose: Gets an entry from the repository

Method: replaceMetadata()
Accessed via: PUT http://<yourServer>/<yourWebapp>/edit/
Returns: org.swordapp.server.DepositReceipt
Purpose: Replaces metadata for an entry

Method: replaceMetadataAndMediaResource()
Accessed via: PUT http://<yourServer>/<yourWebapp>/edit/
Returns: org.swordapp.server.DepositReceipt
Purpose: Replaces metadata and media associated with an entry

Method: addMetadataAndResources()
Accessed via: Does not appear to be “directly” accessible via any specific HTTP request
Returns: org.swordapp.server.DepositReceipt
Purpose: Not used by the ContainerAPI yet, but presumably it would be for adding a series of entries and associated metadata

Method: addMetadata()
Accessed via: POST http://<yourServer>/<yourWebapp>/edit/
Returns: org.swordapp.server.DepositReceipt
Purpose: Adding metadata to an entry

Method: addResources()
Accessed via: Does not appear to be “directly” accessible
Returns: org.swordapp.server.DepositReceipt
Purpose: Not used by the ContainerAPI yet, but presumably it would be for adding a series of entries

Method: deleteContainer()
Accessed via: DELETE http://<yourServer>/<yourWebapp>/edit/<your container>
Purpose: Delete a container of entries

Method: useHeaders()
Accessed via: POST http://<yourServer>/<yourWebapp>/edit/
Returns: org.swordapp.server.DepositReceipt
Purpose: Used when depositing only information specified in the HTTP headers, no entry/ies (i.e. no content body to the POST) will have been specified

Method: isStatementRequest()
Accessed via: GET http://<yourServer>/<yourWebapp>/edit/
Returns: java.lang.boolean
Purpose: ?

StatementManagerImpl implements org.swordapp.server.StatementManager

Method: getStatement()
Accessed via: GET http://<yourServer>/<yourWebapp>/statement/
Returns: org.swordapp.server.Statement
Purpose: ?

And finally…

Once you have all that setup (and I’d recommend just creating skeleton override methods for the objects implementing interfaces for the time being whilst you figure the code out), you can then start coding the abdera / sword code and try make the client do something. The client itself comes with a handy cli driven (SwordCLI) interface that you can point at your newly created server instance and test the various example methods. I would recommend though, that you comment out the entire list of method references in the main method and go through the list iteratively to slowly make each part of your server work…

As a brief example, if we were to try and get a basic service document to work, try adding this code to your ServiceDocumentManagerImpl.java….

    public ServiceDocument getServiceDocument(String sdUri, AuthCredentials auth, SwordConfiguration config) throws SwordError, SwordServerException, SwordAuthException
    {
        //Our test service document
        ServiceDocument sd = new ServiceDocument();
        //sd.setVersion("2.0");
        sd.setMaxUploadSize(1000000000);

        //Our test workspace
        SwordWorkspace sw = new SwordWorkspace();
        sw.setTitle("TestWorkspace");

        //Our test collection
        SwordCollection sc = new SwordCollection();
        sc.setHref("http://<yourServer>/<yourWebapp>/collection/TestCollection");
        sc.setTitle("TestCollection");
        sc.addAccepts("*/*");
        sc.setCollectionPolicy("TestCollectionPolicy");
        sc.setAbstract("TestCollectionAbstract");
        sc.setMediation(false);
        sc.setTreatment("A human-readable statement describing treatment the deposited resource has received or a URI that dereferences to such a description.");
        sc.addAcceptPackaging("http://purl.org/net/sword/package/SimpleZip");
        sc.addAcceptPackaging("http://purl.org/net/sword/package/METSDSpaceSIP");
        sc.addAcceptPackaging("http://www.ncl.ac.uk/terms/package/html");
        sc.setLocation("http://<yourServer>/<yourWebapp>/collection/TestCollection");
        sc.setMediation(true);
        List iris = new ArrayList();
        iris.add(new IRI("http://<yourServer>/<yourWebapp>/collection/TestCollection/TestSubService"));
        sc.setSubServices(iris);

        //Add collection to workspace
        sw.addCollection(sc);
        //Add workspace to service document
        sd.addWorkspace(sw);

        return sd;
    }

Browsing to http://<yourServer>/<yourWebapp>/collection/ should return something, and removing your comment in your client for the line…

    cli.trySwordServiceDocument()

…should now yield some results (If you just get errors try temporarily turning off the auth. requirement on the server for the purposes of testing).

And that’s the basic principle, you then take the specs and implement the returning and unpackaging of ATOM using the abdera/SWORD objects and link what’s passed/returned to the content found in the system you are trying to integrate.

iridium – reporting on identification of available external and internal RDM tools

October 3, 2012 by Lindsay Wood 1 Comment

Some while back we undertook an exercise as part of our tools and systems workpackages to review external RDM tools to supplement our own internal tools that had RDM application in support of the developing draft policy.

i) Firstly, our iridium PG support team used the very useful DCC website, which had been recently updated, with a DCC catalogue of RDM tools:

http://www.dcc.ac.uk/resources/external/tools-services

This is divided into tools for ‘Curators’ and for ‘Researchers’

Curators – depositing, archiving/preserving, administering, etc.:

http://www.dcc.ac.uk/resources/external/tools-services/depositing-and-ingesting-digital-objects
http://www.dcc.ac.uk/resources/external/tools-services/archiving-and-preserving-information
http://www.dcc.ac.uk/resources/external/tools-services/managing-and-administering-repositories

Researchers – managing, active data, sharing outputs, tracking, etc:
http://www.dcc.ac.uk/resources/external/tools-services/managing-active-research-data
http://www.dcc.ac.uk/resources/external/tools-services/sharing-output-and-tracking-impact

There is a useful search function to the tools catalogue. At that time, we tabulated about 45 different tools, such as Dataverse, MyExperiment, Research gate, Curator workbench, NLNZ Metadata Extraction Tool, PREMIS in METS Toolbox, KEEP Emulation Framework, OpenDOAR, etc. and how they maybe useful to our own project or for future reference in response to specific researcher needs/questions.

ii) We also noted the DCC own tools, such as:

CARDIO – ‘assess your data management support and infrastructure’

DMPOnline – ‘develop data management plans that meet research council and funding body mandates’ – (*technical/initial user testing blog to follow)

DAF (Digital asset framework) – ‘identify researchers’ current data management activity’

iii) We then went on to search wider for other external ‘tools’/tool themes that would meet in scope requirements and could potentially be used for RDM:

Zentity
DataStage/DataBank (from DataFlow) – (*technical/initial user testing blog to follow)
Microsoft SharePoint (with possible future Virtual Research Environment toolkits.
File re-namer tools
EduServ cloud
SWORD2 protocol – (*technical blog to follow)
ViDass
Brisskit (also CiviCRM, REDCap, OBiBa Onyx, i2b2)
KRDS (Keeping Research Data Safe)/Beagrie value tool chain
MyLabBook
DDI metadata explorer
‘Security’/encryption tools

After evaluation (a series of blog posts will follow on outcomes of this), regarding supporting policy implementation and selection of tools with general institutional application and suitability for local technical infrastructure this list was reduced.

iv) We also had a few specific tools that our requirement gathering survey reported and were flagged by researchers/data managers – some, but not all, of these were perhaps more particular research data analysis/methods tools compared to our own definition of a ‘research data management tool’. Generally though, our survey did not report high levels of RDM tools use by researchers.

CARMEN – neuroscience code, repository, modelling
rsnapshot

OMERO – microscopy discipline specific software allowing organisation, cataloging, open file formats, etc.

Datashield (data aggregation)

CasaXPS
ArcGIS
Dotmatics Browser and Gateway
Document management software, study/sample management applications

Dropbox.com
Newcastle Drop-off service

v) We then outlined our local institutional tools/systems with possible RDM utility (noting that we had existing RIM systems of MyProject/MyImpact).

At the first, initial scoping stage these included:

Sakai Virtual Research Environment (existing tool, RDM modifications possible if available)
e-Science Central (custom install, RDM/authentication/data transfer protocol modifications possible)
MyProjects, MyImpact, local ePrints systems/feeds – with a tool to connect? (see later ’Research Data Catalogue‘ tool)
Sharepoint (that could be modified for RDM)
Talend
Google Search Appliance

This list has been later refined following investigations (blog series to follow on tools going forward and initial technical testing/user evaluation).

vi) We also intend to keep an eye out for updates on above and new tools that come to light:

I’ve noted the following discussed within the MRD Programme and wider community recently:

CKAN seems a mature ‘data management’ system (see recent ORBITAL/Lincoln blogs) with some useful features
DataCite system – minting DOIs
ORCID – ‘aims to solve the author/contributor name ambiguity problem in scholarly communications’
Archivematica – ‘open-source digital preservation system … to maintain standards-based, long-term access to collections of digital objects’
Bazaar – version control system
OwnCloud – ‘universal access to your files through a web interface or WebDAV’
Sparkleshare
JISC MRD Hackday winner and discussion of an ‘Academic dropbox‘

Q. Are there any others tools/systems recommended from Programme we should know about?

Research metrics/citations tools are topical in wider MRD community:

Research Gate

altmetrics

ColWiz

Filed under deliverable, evaluation, tools Tagged with deliverable, external tools, internal tools, managing research data, metadata extraction, RDM tools, tools scoping

iridium – requirements for a research data catalogue and proof-of-concept development

October 2, 2012 by Lindsay Wood Leave a comment

Background

A brief recap on background to the development a proof-of-concept Research Data Catalogue tool:

From the iridium JISC project proposal:

Funding councils are now expecting researchers to make their data available to others. This

requires them to:

• make their data available into the future in a safe and secure manner;

• provide appropriate tagging of the data such that its significance can be determined;

• provide provenance information over the data so that the prior uses and access to the data can be determined.

There is an ever-growing expectation that data curation processes should be handled at the institutional level. For example the EPSRC Policy Framework on Research Data and expectations are clearly stated. Points highlighted below are of particular relevance to the the development of a data catalogue/institutional metadata record:

“… processes to maintain effective internal awareness of their publicly funded research data holdings and of requests by third parties to access such data …”

“… organisations will ensure that appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet …”

“… metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it …”

“… where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier …”

“… where access to the data is restricted the published metadata should also give the reason and summarise the conditions which must be satisfied for access to be granted. For example ‘commercially confidential’ data, in which a business organisation has a legitimate interest, might be made available to others subject to a suitable legally enforceable non-disclosure agreement …”

The timelines for this are:

a roadmap by May 2012,
full compliance May 2015

The iridium project is developing and consulting on a draft research data management policy, as are many Universities, in response to the above. For a national review of policies published, see the DCC website. Across the sector there is general consensus on key principles that need to be present (many originating from the RCUK principles and of the ‘aspirational’ policy type pioneered by Edinburgh University). For example:

accountability, responsibilities and roles
data location, treatment and security
data access mechanisms linked to publications
internal availability of research data
internal deposit of data/discipline repositories
curation mechanisms
training and support provision
embargo periods, ethical considerations and protecting commercial opportunities

A research data catalogue tool, to support implementation of the draft policy, is an obvious need for several of the above themes.

Existing local institutional research information management systems

Our institution has several major existing systems for Research Information Management (RIM):

MyProjects – used to support and record University projects, particularly funded projects, throughout their life-cycle (with information additionally fed in from MyProjectsProposals on costing, pricing and sign-off )
MyImpact – a person based system in which information on publications and activities are collected for day to day use by the academic community (with information sent to Newcastle ePrints repository of peer-reviewed institutional research)

The iridium project proposal outlined an early RDM (tool) specification for research data to be made available to others through a definable access policy, including tagging of the data and along with recording of provenance. This was proposed to be handled in a semi-automated fashion such that as much of the information as possible can be provided automatically, and that information which cannot be provided automatically can be prompted for.

Research Data Catalogue (RDC) proof-of-concept specification

Early specification work outlined:

this should use functionalities which already exist within the University, or which need to be brought in, and which can be integrated to provide a research data management system
to fulfil these requirements, the Research Data Catalogue (RDC) proof-of-concept will join up the MyProjects and MyImpact systems and facilitate the manual attachment of a small pre-defined set of metadata. It uses the Institutional Data Feeds Service (IDFS) (see earlier JISC-funded IDMAPS project)
this tools will derive appropriate high quality metadata from the existing project and publications records, then augmented and exported to an open searchable web-based interface (API) (machine and person readable).

Schematic below outlines the RDC and institutional RIM system data flows [click for larger image]

iridium Research Data Catalogue proof-of-concept data flows

RIM data flows – iridium Research Data Catalogue proof-of-concept – [Click for larger image]

Research Data Catalogue Basic User Interface

The Basic User Interface of the proof-of-concept RDC has been designed, is functional on a test server and has been demonstrated/tested within project team.

The following summarises the user interface.

Home screen – after the user logs in through Shibboleth authentication, the home welcome screen presents user with tabs for Project, Publications and Data Sources.

Welcome tab – iridium Research Data Catalogue proof-of-concept [click for larger image]

Project tab – presents user with research project(s) owned by a Principle Investigator and are pre-populated from their Newcastle ID using MyProjects data feed.

Projects tab – iridium Research Data Catalogue proof-of-concept [click for larger image]

Publications tab – presents user with research publications owned by a Principle Investigator and are pre-populated from their Newcastle ID using MyImpact data feed.

Publications tab – iridium Research Data Catalogue proof-of-concept [click for larger image]

Data Source tab– allows the addition of a novel data source metadata record to a publication and for this link to be saved in catalogue record system, together with additional user added metadata/tagging.

Data source tab – iridium Research Data Catalogue proof-of-concept [click for larger image]

The working metadata scheme for the proof-of-concept RDC Data Source record has been blogged here. This scheme will be reviewed as project progresses.

We note discussions in the MRD Programme on the DataCite scheme which is gaining popularity, and which is broadly similar and also findings from some previous projects on what fields are actually *used* in data catalogues/repositories for re-use in practice.

The RDC, if implemented institutionally, would bring us closer to CERIF compliance, which is a longer term aim outside the project, but with many institutional advantages.

RDC ‘roadmap’ and within project proof-of-concept development plan priorities

The process will allow for prototyping, iterative development/feedback, user testing and follow-up development work. Plain English user documentation will be written to fully support tools by the human support workpackage.

Phase 1: Basic UI is the implementation of:

A basic User Interface for linking Publications, Projects and Data Sources (status: working demonstration completed and tested).
an Admin Interface for assigned roles and records approval process (ideally inherited from existing RIM systems) (initiated and is in progress)
simple reporting feature for researchers/research groups piloting the RDC (initiated and is in progress)

This proof-of-concept work is on track for late 2012. The second phase falls within a ‘roadmap’ of potential developments based on project priorities:

Phase 2a: Web Servies & Reporting is the future implementation of:

the fully documented web service interface (an API, systems can push data to it or read data from it for internal, or external, reporting purposes), including pluggable interfaces using LTI/Widget/Portlet technologies, interoperability standards (i.e. OAI-PMH, SWORD, etc.) allowing embedding in and exchange with other tools/systems [not required for proof-of-concept]
advanced reporting [not required for proof-of-concept]

Phase 2B: Data (metadata) Harvesting is the implementation of harvesting plugins:

use of ‘harvester’ plugins are being looked at as part of tools/systems workpackages to enable researchers to have their information automatically updated from e-Science Central, SAKAI, web services and other repositories. SWORD protocol is desirable. [scoping work initiated looking at Talend, RapidMiner, etc. options) – implementation not required for proof-of-concept].

Developing ‘Data Catalogues’ and future points of discussion

Below are some development and testing team discussion points and questions we have had, which would be useful to share more widely to JISC MRD Programme, as they are likely to be of interest to those developing systems with similar/overlapping functionality (some of these do not need to be resolved for this stage in our proof-of-concept development and have been deferred):

Comments would be welcome on all of above from JISC MRD Programme, especially if you have tackled any of these issues already or have lessons learned on what works well (or not) in your own institutional systems or initial testing.

We are also trying to keep up to date (the blog bundle is very useful!) with the outputs from other JISC MRD02 projects (such as DataFinder work from Oxford/DaMaRo, RD@Essesx and Open Exeter, etc.) on data catalogues/data repositories technical development.

Filed under policy, reflection, systems, tools Tagged with discoverable, metadata record, outputs, RDC, RDM, research data catalogue, Systems, technical development, tool

CERIF in Practice Workshop

December 9, 2011 by niallco 5 Comments

I attended the (ARMA UK) CERIF in Practice workshop on Wednesday (7th Dec) and had interesting conversation with Keith Jeffrey from euroCRIS about what sort of information we should attach as metadata to our data in order to maximise its discoverability and utility and he suggested an interesting 3 level taxonomy.

1. The top level relates to the Discoverability of the Resource and could be based around the 15 Dublin Core elements
2. The second level is contextual and essentially covers the elements in cerif (notably; outcomes – publications, patents etc, funding; research council grant, person; project team members, organisation- Uni, collaborators.
3. Finally a specific level giving the minutiae.

This is quite appealing as we already collect much of the information in the first two levels through our current systems (MyProjects, e-prints and MyImpact) so the main additional input we’d be requiring from the academic would be at the third level.

Useful links:
http://dublincore.org/documents/usageguide/elements.shtml
http://www.eurocris.org

Filed under JISCMRD02, systems, tools Tagged with cerif, dublin core, MetaData

Research Data Management at Euro Sakai 2011

November 22, 2011 by Andrew Martin Leave a comment

I recently visited Euro Sakai 2011 at the Pakhuis De Zwijger in the Eastern Dockland section of Amsterdam, the main purpose of the visit was to find how best we can make the most out of our VRE software (Sakai) and where it is going, but some presentation strands on research data management also caught my interest. It would seem that the sakai crowd are quite intimately intertwined with R.D.M. and I counted at least 2 research data focused presentations and a couple of others mentioning it.

The most useful was from Hull’s Chris Awre, who described Hull’s approach to managing data through its whole lifecycle. They have built a “Fedora” based, versioned, object repository (i.e. we’re NOT talking about the version of red hat linux but an open source object repository… http://fedora-commons.org/) that was developed and managed using “Duraspace” (http://www.duraspace.org/). They claimed this approach was scalable, standards based, content agnostic and allowed the recording of the relationships between objects.

The history of project stemmed from JISC projects:

CTREP, a Cambridge University based project exploring sakai/fedora/DSpace running from 2007-2009 https://camtools.cam.ac.uk/wiki/jisc-ctrep/Home.html http://www.jisc.ac.uk/whatwedo/programmes/reppres/sue/tetracam.aspx
CLIF, a collaboration between Hull and King’s College London to look at data lifecycle management and how to store research data running from 2009-2011 http://www.jisc.ac.uk/whatwedo/programmes/inf11/sue2/clif.aspx http://www2.hull.ac.uk/discover/clif.aspx, this project was also prefaced by the RepoMMAN – http://www.hull.ac.uk/esig/repomman/ and REMAP – http://www.jisc.ac.uk/whatwedo/programmes/preservation/remap.aspx projects.

To summarise, their implementation uses sakai 2.6.4 and Fedora 3.4 (although they also made an integration to talk to sharepoint) and re-uses the sakai resource section as a GUI driver for the fedora repository, squashing the data objects down into a file and directory “view” inside sakai; all the standard CRUD operations are translated through sakai into the repository. The code for all of this is hosted on github. Looking forward, they would like to create an OAE integration with annotation capture on original documents that gets directly packaged up as metadata into the fedora repository.

The main lessons learned were to stay standards based, it makes everything much easier throughout the entire project; and draw up strong policies around what repositories are for and how they are to be used at the very start.

Presentation hosted by:

Chris Awre – c.awre@hull.ac.uk

Useful links/people:

https://github.com/uohull

http://www2.hull.ac.uk/discover/clif.aspx

https://edocs.hull.ac.uk/muradora/objectView.action?pid=hull:4194 (Final report of the CLIF project)

simon.waddington@k cl.ac.uk (King’s College contact for the CLIF project)

Other interesting stuff worth mentioning:

Chris Awre also made mention of the Hydra Project, which is an attempt to standardise data object repositories structures in order to enable and aid interoperability of repositories.

Another presentation by the University of Amsterdam and Edia (A private company that helps with sakai integrations and also the conference organisers) discussed the Fluor research data tool; to briefly summarise sections that differed from Hull’s implementation, they have create the “Fluor tool” inside sakai that talks to their library’s Fedora object repository and have also attached the Fedora Generic Search Service to a SOLR implementation in order to allow searching of the repository (although they also cited the possibility of using Lucene or Zebra; SOLR is apparently based on Lucene but is easier to use and supports REST, JSON and XML). Their implementation works with sakai 2.5 up and allows a fine grain access model on a per object basis so data can be as open or closed as is necessary, all data streams holding object data is encrypted (unlike Hull’s) and even their backups are encrypted.

Presentation host by:

Roland Groen (Edia) – http://www.edia.nl/en/edia/founders

Casper Treijtel (UvA) – dpcmedewerkers-uba@uva.nl

Useful links:

http://www.slideshare.net/RolandGroen/fluor-sakai-la-2011

https://confluence.sakaiproject.org/display/CONF2011/Fluor+-+Your+connection+to+the+Fedora+Digital+Objects+Repository

Here’s a rough representation of my understanding of the model university’s are taking when integrating R.D.M. soutions with the V.R.E (interspersed with a couple of my own ideas)….

To briefly explain, starting at the bottom left, you define structures for the context of your data (so how should a medics repository, or a mathematics repository, or a geography repository look?) this helps with organisation of the contents of the repository (and potentially comparison between repositories); then you can use defined transport standards to interact with your repository and wrap it with search and discover functionality and/or general input/output interactions.

In order to make all of this usable you then need to integrate your research focused tools (your VRE, ELN, profile or research project systems, third party tools etc….) to your repository system via a custom link that caters for the connecting tool (e.g. making the objects appear as files/folders for a VRE system) or front it with some sort of service based system that would offer a defined API to talk to your tools.

Andrew Martin

Research and Collaborative Services

Filed under blog report, dissemination, implementation, infrastructure, systems, tools Tagged with eln, fedora, sakai, vre

iridium

iridium – ‘core’ institutional research data management plan development

iridium – evaluation of DataStage and DataBank research data management tools from DataFlow project

iridium – early findings on research data management planning (approaches, tools and writing plans)

SWORD v2 – From clueless to claymore

Taking the Sword course

An Introduction To SWORD (Video/Slides)

SWORD Use Cases (Video/Slides)

How SWORD Works (Video/Slides)

SWORD clients (Video/Slides)

Writing something useful

How SWORD works in java

Setup and Servlet mappings (web.xml)

Endpoint Servlet classes (org.swordapp.server.servlets.*)

Endpoint Servlet dependant classes (org.swordapp.server.*)

Implementations of interfaces (org.swordapp.server.*)

And finally…

Further reading

iridium – reporting on identification of available external and internal RDM tools

iridium – requirements for a research data catalogue and proof-of-concept development

CERIF in Practice Workshop

Research Data Management at Euro Sakai 2011

iridium website

Funded by

Search blog

Recent blog posts

iridium tweets

Archive

Categories

Subscribe

Follow Blog via Email

Blogroll

Share this:

Share this:

Share this:

Taking the Sword course

An Introduction To SWORD (Video/Slides)

SWORD Use Cases (Video/Slides)

How SWORD Works (Video/Slides)

SWORD clients (Video/Slides)

Writing something useful

How SWORD works in java

Setup and Servlet mappings (web.xml)

Endpoint Servlet classes (org.swordapp.server.servlets.*)

Endpoint Servlet dependant classes (org.swordapp.server.*)

Implementations of interfaces (org.swordapp.server.*)

And finally…

Further reading

Share this:

Share this:

Share this:

Share this:

Share this:

iridium website

Funded by

Search blog

Recent blog posts

Archive

Categories

Subscribe

Follow Blog via Email

Blogroll