…from:
http://opendata.cern.ch/about
http://www.engadget.com/2016/04/23/cern-300tb-large-hadron-collider-data/
http://techcrunch.com/2016/04/22/cern-releases-300tb-of-large-hadron-collider-data-into-open-access/

CERN Open Data Portal

The CERN Open Data portal is the access point to a growing range of data produced through the research performed at CERN. It disseminates the preserved output from various research activities, including the accompanying software and documentation needed to understand and analyse the data being shared.

The portal adheres to established global standards in data preservation and Open Science: the products are shared under open licenses, and they are issued with a digital object identifier (DOI) to make them citable objects in scientific discourse.

LHC Data

Data produced by the LHC experiments are usually categorised into four levels (DPHEP Study Group, 2009). The Open Data portal focuses on the release of data from levels 2 and 3; a sketch of how to query the portal programmatically follows the list below.

  • Level 1 data comprises data directly related to publications, providing documentation for the published results
  • Level 2 data includes simplified data formats for analysis in outreach and training exercises
  • Level 3 data comprises reconstructed data and simulations, as well as the analysis-level software needed to perform a full scientific analysis
  • Level 4 covers basic raw-level data (where not already covered as level 3 data) and its associated software, giving access to the full potential of the experimental data
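For programmatic access, here is a minimal sketch in Python. It assumes the portal exposes an Invenio-style JSON search API at /api/records taking q/page/size parameters; the endpoint and the response layout are assumptions, so check the portal's own documentation for the current interface.

    # Minimal sketch: search the CERN Open Data portal for records.
    # ASSUMPTION: an Invenio-style JSON API at /api/records taking
    # q/page/size query parameters and returning hits/hits/metadata.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    BASE_URL = "http://opendata.cern.ch/api/records"  # assumed endpoint

    def search_records(query, page=1, size=5):
        """Return the decoded JSON response for a search query."""
        params = urlencode({"q": query, "page": page, "size": size})
        with urlopen(f"{BASE_URL}?{params}") as resp:
            return json.load(resp)

    if __name__ == "__main__":
        results = search_records("CMS primary dataset 2011")
        for hit in results.get("hits", {}).get("hits", []):
            meta = hit.get("metadata", {})
            print(meta.get("title"), "-", meta.get("doi"))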

CERN releases 300TB of Large Hadron Collider data into open access

CERN just dropped 300 terabytes of collider data on the world.

Kati Lassila-Perini, a physicist who works on the Compact Muon Solenoid (!) detector, gave a refreshingly straightforward explanation for this huge release.

“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” she said in a news release accompanying the data. “The benefits are numerous, from inspiring high school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data preservation coordinator, this is a crucial part of ensuring the long-term availability of our research data.”

Amazing that this perspective is not more widely held — though I suspect it is, by the scientists at least, if not the publishers and department heads who must think of the bottom line.

The data itself is from 2011, much of it from protons colliding at 7 TeV (teraelectronvolts, you know) and producing those wonderful fountains of rare particles we all love to fail to understand. All told, it’s about half the total data collected by the CMS detector, and makes up about 2.5 inverse femtobarns. But who’s counting?
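For the curious: an inverse femtobarn is a unit of integrated luminosity, and multiplying it by a process's cross section gives the expected number of events of that kind in the dataset. A back-of-the-envelope sketch (the ~170 pb top-pair cross section at 7 TeV is an approximate figure supplied here for illustration, not a number from the article):

    # Expected events = cross section x integrated luminosity.
    # The 170 pb ttbar cross section at 7 TeV is approximate and
    # used purely for illustration.
    integrated_luminosity_fb = 2.5                 # fb^-1, this CMS release
    cross_section_pb = 170.0                       # pb, rough ttbar at 7 TeV
    cross_section_fb = cross_section_pb * 1000.0   # 1 pb = 1000 fb

    expected_events = cross_section_fb * integrated_luminosity_fb
    print(f"~{expected_events:,.0f} expected top-pair events")  # ~425,000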

There’s both the raw data from the detectors (so you can verify the results) and “derived” datasets that are easier to work with. And don’t worry, CERN is providing the tools to do so as well: there’s a whole CERN Linux environment ready for booting up in a virtual machine, and a bunch of scripts and apps (some are on GitHub, too).
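Grabbing one of those files needs nothing fancier than plain HTTP. A minimal sketch; the record number and filename below are hypothetical placeholders, since the real URLs are listed in each record's metadata on the portal:

    # Minimal sketch: stream a portal file to disk over HTTP.
    # The URL is a hypothetical placeholder -- look up real file
    # locations in a record's metadata on opendata.cern.ch.
    from urllib.request import urlopen

    FILE_URL = "http://opendata.cern.ch/record/0000/files/example.root"

    def download(url, dest, chunk_size=1 << 20):
        """Copy a remote file to disk in 1 MiB chunks."""
        with urlopen(url) as resp, open(dest, "wb") as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)

    download(FILE_URL, "example.root")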

Just messing around in the same computing environment used by researchers plumbing the depths of the universe would be an interesting way to spend a few labs in a college physics course. There are even “masterclasses”: datasets and tools specially curated for high school kids.

This is only the latest of several data dumps, but it’s also by far the largest. A more detailed explanation of the types of data and how they can be accessed is available on the CERN Open Data portal.