ZiBaT => Peter Levinsky => Big Data=> exercise

Mandatory Assignment

Updated: 2017-03-27


Mandatory Assignment in Big Data

Idea:

To analyse figures provided by Keld Mortensen (Environmental Science - ENVS).

The domain is a data system for monitoring air quality, based on measurements from Northern Greenland.

Background material:

Deadlines and Delivery:

You can work either individually or in a two-person group.

Date 29th March

Each individual / group must be announced to the teacher before 29th March at lunchtime (see the Google document).

Each individual / group must also define the goal (what you will try) of your experiments - also to be announced on 29th March.

Date 29th March to 26th April

Work on the experiment in the period 29th March to 26th April.

Date 26th April

Demonstrate the results of your work in class on 26th April between 12:30 PM and 3 PM.

In addition, you must hand in the material in Wiseflow.

 

Detailed description:

The work is divided into two parts:

  1. Analysing data (some CSV- or XML-formatted files) using Hadoop, Spark or similar
  2. Analysing data (arriving as a stream of measurements) using, among others, Hadoop NiFi, Solr and more
Part 1: Analysing CSV- or XML-formatted files

The data files you can get (see Filedescriptions.pdf for an overview of the data files):

Hint: to read XML files, you can either convert them into CSV (see e.g. http://www.convertcsv.com/xml-to-csv.htm ) or use a mapper (see Streaming XML Files).
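The conversion hint above can also be done programmatically. Below is a minimal sketch using only the Python standard library; the `<measurement>` layout and the column names are made up for illustration - check Filedescriptions.pdf for the real structure of the ENVS files.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_csv(xml_text, record_tag, out_file):
    """Flatten repeated record elements of an XML document into CSV rows.

    Assumes each record's child elements hold scalar values; the child
    tag names become the CSV header. Returns the number of rows written.
    """
    root = ET.fromstring(xml_text)
    rows = [{child.tag: child.text for child in rec}
            for rec in root.iter(record_tag)]
    if not rows:
        return 0
    writer = csv.DictWriter(out_file, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

# Demo with a made-up measurement layout:
sample = """<measurements>
  <measurement><time>2017-03-01T00:00</time><no2>4.2</no2></measurement>
  <measurement><time>2017-03-01T01:00</time><no2>3.9</no2></measurement>
</measurements>"""
buf = io.StringIO()
rows_written = xml_to_csv(sample, "measurement", buf)
```

Once converted, the resulting CSV files can be uploaded to HDFS like any other data file.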

 

Find dependencies in the data; you can try to find dependencies between one or more of these measurements:

 

You should set up a system where you upload the appropriate files to HDFS, and then use Spark or similar to perform the analysis.

For more information on Spark:

 

Part 2: Analysing Stream Data

To be provided ...