Bioschemas brings data closer to researchers

With every new generation of sequencers, mass spectrometers, and other lab equipment producing richer and cheaper data, even a small biology lab can become a big-data generator. With this flood of data, finding the right dataset or the right analysis tool is proving to be more and more difficult.

Bioschemas is an open community supported by ELIXIR to improve the discoverability of biological information and help researchers to easily find data, tools, training materials and other information they need for their research.

From cooking recipes to protein annotation

The principle of Bioschemas is to add a snippet of structured information to web pages that present biological objects, describing the objects themselves. This information is hidden from typical website users, but it helps search engines structure and summarise search results.

In the same way that Google can display summary information about a particular cooking recipe (e.g. calories, ingredients), Bioschemas allows search engines or specialised registries to harvest protein annotations, biological samples or genomics datasets.

Bioschemas is developing a collection of guidelines for describing those biological objects and the relations between them.
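
To make the idea of a hidden snippet concrete, here is a minimal sketch in Python that builds such a snippet as JSON-LD, the format commonly used for schema.org markup. The function name, the example dataset and its URL are hypothetical, and the handful of properties shown is illustrative only; it is not taken from any published Bioschemas specification.

```python
import json


def build_jsonld_snippet(name, description, url, date_modified):
    """Return an HTML <script> element carrying schema.org-style JSON-LD.

    The property set is an illustrative assumption, not a Bioschemas profile.
    """
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "dateModified": date_modified,
    }
    body = json.dumps(record, indent=2)
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset, used only to show the shape of the output.
snippet = build_jsonld_snippet(
    name="Demo proteomics dataset",
    description="Mass spectrometry runs from a small biology lab.",
    url="https://example.org/datasets/demo",
    date_modified="2017-12-04",
)
print(snippet)
```

A browser ignores this element when rendering the page, while a search engine or registry that understands JSON-LD can read the fields directly.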

Rafael Jimenez, from the ELIXIR Hub and one of the leaders of the Bioschemas community, explains: "Even though many resources already provide detailed information about their data, they are often hard to find or are not accessible by computers.

"For example, the date of the last modification of a dataset may not be an important piece of information for researchers, but it is critical for developers and operators of other tools and resources in the life sciences. However, it is simply not feasible to constantly monitor the webpage of the dataset to check this value. The Bioschemas specification exposes all this information in a structured way that is easy to read by search engines and other software. This means that developers can access this data programmatically and build other tools and resources on top of the existing ones."
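
As a rough illustration of the programmatic access described here, the sketch below reads the last-modification date from structured markup embedded in a page, using only the Python standard library. It assumes the page carries JSON-LD in a script element with a dateModified property; the class name, function name and sample page are hypothetical, and a real harvester would also need to fetch pages and handle nested or list-valued markup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed markup rather than abort the whole page.


def last_modified(html):
    """Return the first dateModified value found in the page, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for record in parser.records:
        if isinstance(record, dict) and "dateModified" in record:
            return record["dateModified"]
    return None


# Hypothetical page, standing in for a dataset landing page.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Demo dataset", "dateModified": "2017-12-04"}
</script>
</head><body><h1>Demo dataset</h1></body></html>"""

print(last_modified(PAGE))
```

The point of the design is that the tool reads one well-defined field instead of scraping the visible page layout, which can change at any time.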

Raising awareness

While the Bioschemas specification is a good practice for data stewards and managers of biological resources, the benefits for life-science researchers become visible once it is adopted by many different biological resources and data archives.

One of the main activities of Bioschemas is to engage with the life-science community and encourage its members to adopt and use the Bioschemas specifications. To do that, the Bioschemas initiative organises regular workshops for developers and operators of biological resources to help them test and implement the Bioschemas specifications.

The latest such workshop was organised in October 2017 at the Wellcome Genome Campus in Hinxton, UK. "We had participants representing 33 different biological resources, including major international resources like UniProt and PDBe. Each of them tested and adopted at least one of the Bioschemas specifications," says Carole Goble, ELIXIR UK Head of Node and one of the Bioschemas leaders.

Bioschemas is also expanding its scope, developing specifications for different kinds of life-science data. "We now have 12 different specifications, including sample, protein, dataset, tool and laboratory protocol," continues Goble. "During our October workshop in Hinxton, we started working on a new specification for bio-chemical entity."

Another task is the integration of Bioschemas specifications with repositories for life-science tools and services. This will facilitate validation and benchmarking of the tools that life scientists use to analyse their data.

Towards a sustainable community

Bioschemas operates as an open community and welcomes any individual or organisation to join. Speaking about the plan for 2018, Carole Goble explains:

"The plans are to really ramp up on the solid foundations we have built: the life-science community will begin to see the benefits of the Bioschemas markup and we will expand our reach to new data types and new data sets.

"ELIXIR aggregation portals such as TeSS, FAIRsharing and Identifiers.org are already benefiting from automated updates of their records backed by Bioschemas markup. By getting Core and Node-supported data resources to embed markup in their sites ELIXIR will build a crucial component of our FAIR metadata infrastructure; enough to be indexable by search engines."

Bioschemas is supported by ELIXIR as an Implementation Study. It is led by Carole Goble, Alasdair Gray (both ELIXIR UK) and Rafael Jimenez (ELIXIR Hub).

www.bioschemas.org

 
Mon 4 December 2017