Newly Emerging Best Practices for Big Data

An interesting angle on low-latency data is the desire to begin serious analysis on the data as it streams in, possibly long before the data transfer completes. There is significant interest in streaming analysis systems that allow SQL-like queries to process the data as it flows into the system. In some use cases, when the results of a streaming query surpass a threshold, the analysis can be halted without running the job to the bitter end. An academic effort, known as Continuous Query Language (CQL), has made impressive progress in defining the requirements for streaming data processing, including clever semantics for dynamically moving time windows on the streaming data.
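
The idea of a moving time window with early halting can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CQL itself and not any particular streaming product: the function name first_window_over_threshold, the (timestamp, value) event shape, and the 60-second window are all hypothetical.

```python
from collections import deque
from typing import Iterable, Optional, Tuple


def first_window_over_threshold(
    events: Iterable[Tuple[float, float]],
    window_seconds: float,
    threshold: float,
) -> Optional[Tuple[float, float]]:
    """Return (timestamp, window_sum) the first time the moving-window sum
    exceeds the threshold, or None if the stream ends first.

    Assumes events arrive in timestamp order (hypothetical event shape).
    """
    window = deque()  # (timestamp, value) pairs currently inside the window
    running_sum = 0.0
    for timestamp, value in events:
        window.append((timestamp, value))
        running_sum += value
        # Evict events that have slid out of the moving time window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            running_sum -= old_value
        if running_sum > threshold:
            # Halt here; the rest of the stream is never read.
            return timestamp, running_sum
    return None


def sample_stream():
    # Made-up (seconds, amount) pairs standing in for arriving data.
    for t, v in [(0, 5), (10, 5), (20, 5), (70, 5), (75, 20), (80, 1)]:
        yield float(t), float(v)


result = first_window_over_threshold(sample_stream(), window_seconds=60, threshold=20)
print(result)
```

Because the events are consumed from a generator, the event at second 80 is never pulled once the threshold is crossed, which is the point of not running the job to the bitter end.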

.. For example, a single tweet “Wow! That is awesome!” may not seem to contain anything worth dimensionalizing, but with some analysis we can often get:

  • customer (or citizen or patient),
  • location,
  • product (or service or contract or event),
  • marketplace condition,
  • provider,
  • weather,
  • cohort group (or demographic cluster),
  • session,
  • triggering prior event,
  • final outcome,
  • and the list goes on

One of the charms of big data is putting off declaring data structures at the time of loading into Hadoop or a data grid.

.. This best practice conflicts with traditional RDBMS methodologies, which put a lot of emphasis on modeling the data carefully before loading.
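
A minimal Python sketch of deferring structure until query time follows. It is an illustration only and does not use Hadoop or any data grid API: the helper names load_raw and read_with_schema, the JSON-lines format, and the sample records are all assumptions made for the example.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def load_raw(lines: Iterable[str]) -> List[str]:
    """Load step: keep every record exactly as received, with no declared schema."""
    return [line for line in lines if line.strip()]


def read_with_schema(
    raw: Iterable[str], fields: Dict[str, type]
) -> Iterator[Dict[str, Any]]:
    """Query step: project and cast only the fields this analysis needs.

    Fields missing from a record come back as None instead of failing the load.
    """
    for line in raw:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if record.get(name) is not None else None)
            for name, cast in fields.items()
        }


# Made-up records with differing shapes; nothing was modeled before loading.
raw_store = load_raw([
    '{"user": "a17", "amount": "19.90", "city": "Kokomo"}',
    '{"user": "b42", "clicks": 3}',
])

# The structure is declared only now, by the query that needs it.
rows = list(read_with_schema(raw_store, {"user": str, "amount": float}))
print(rows)
```

A different analysis can later read the same raw store with a different field list, which is what a carefully modeled load-time schema would not allow without reloading.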

.. Contrary to popular belief, there are not just 24 time zones around the world, but hundreds! Much of the complexity comes from daylight saving time rules. For example, most of Indiana is in the Eastern time zone while a handful of counties are in the Central time zone, and until 2006 some Indiana counties observed daylight saving time while others did not. You need a list of Indiana counties to know what time it is in Kokomo!
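
As a rough illustration of why a named zone is needed rather than a fixed offset, the sketch below uses Python's standard zoneinfo module, which reads the IANA time zone database. The helper utc_offset_hours is hypothetical, the two Indiana zones are picked only as examples, and the mapping of Kokomo to America/Indiana/Indianapolis is an assumption. The code also assumes time zone data is installed on the machine.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, available_timezones


def utc_offset_hours(zone_name: str, utc_moment: datetime) -> float:
    """Return the local UTC offset, in hours, for a named zone at a given instant."""
    local = utc_moment.astimezone(ZoneInfo(zone_name))
    return local.utcoffset().total_seconds() / 3600


# The same instant, viewed from two zones inside one state.
moment = datetime(2007, 7, 1, 12, 0, tzinfo=timezone.utc)

# Assumption: Kokomo follows the zone that covers most of Indiana.
eastern_indiana = utc_offset_hours("America/Indiana/Indianapolis", moment)
# Perry County (Tell City) is one of the Indiana counties on Central time.
central_indiana = utc_offset_hours("America/Indiana/Tell_City", moment)

print(eastern_indiana, central_indiana)
print(len(available_timezones()), "named zones available")
```

Storing the zone name alongside a UTC timestamp, instead of a precomputed local time, lets the offset be recomputed if the rules for that county change.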