ZiBaT => Peter Levinsky => Big Data=> exercise

Mandatory Assignment

Updated: 2017-03-27


Mandatory Assignment in Big Data

Idea:

To analyse figures provided by Keld Mortensen (Environmental Science - ENVS).

The domain is a data system for monitoring air quality, based on measurements from Northern Greenland.

Background material:

Deadlines and Delivery:

You can work either individually or in a two-person group.

Date 29th March

Each individual / group must be announced to the teacher before 29th March at lunchtime (see the Google document).

Each individual / group must also define the goal (what you will try) of your experiments - also to be announced on 29th March.

Date 29th March to 26th April

Work on the experiment in the period 29th March to 26th April.

Date 26th April

Demonstrate the results of your work in class on 26th April between 12:30 PM and 3 PM.

In addition, you must hand in the material in Wiseflow.

 

Detailed description:

The work is divided into two parts:

  1. Analysing data (some CSV- or XML-formatted files) using Hadoop, Spark or similar
  2. Analysing data (arriving as a stream of measurements) using, among others, Hadoop NiFi, Solr and more
Part 1: Analysing CSV- or XML-formatted files

The data files you can get (see Filedescriptions.pdf for an overview of the data files):

Hint: to read XML files, you can either convert them into CSV (see e.g. http://www.convertcsv.com/xml-to-csv.htm ) or use a mapper (see Streaming XML Files).
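The conversion hint above can also be done programmatically. Below is a minimal sketch using only the Python standard library; the `<measurement>` layout and the column names are made up for illustration - check Filedescriptions.pdf for the real structure of the ENVS files.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_csv(xml_text, record_tag, out_file):
    """Flatten repeated record elements of an XML document into CSV rows.

    Assumes each record's child elements hold scalar values; the child
    tag names become the CSV header. Returns the number of rows written.
    """
    root = ET.fromstring(xml_text)
    rows = [{child.tag: child.text for child in rec}
            for rec in root.iter(record_tag)]
    if not rows:
        return 0
    writer = csv.DictWriter(out_file, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

# Demo with a made-up measurement layout:
sample = """<measurements>
  <measurement><time>2017-03-01T00:00</time><no2>4.2</no2></measurement>
  <measurement><time>2017-03-01T01:00</time><no2>3.9</no2></measurement>
</measurements>"""
buf = io.StringIO()
rows_written = xml_to_csv(sample, "measurement", buf)
```

Once converted, the resulting CSV files can be uploaded to HDFS like any other data file.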

 

Find dependencies in the data; you can try to find dependencies between one or more of these measurements:

 

You should set up a system where you upload the appropriate files to HDFS, and then use Spark or similar to perform the analysis.

For more information on Spark:

 

Part 2: Analysing Stream Data

To be provided ...