Undefined By Data: A Survey of Big Data Definitions
Jonathan Stuart Ward and Adam Barker
School of Computer Science, University of St Andrews, UK

{jonthan.stuart.ward, adam.barker}@st-andrews.ac.uk

arXiv:1309.5821v1 [cs.DB] 20 Sep 2013

ABSTRACT
The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

1. BIG DATA

Since 2011 interest in an area known as big data has increased exponentially [10]. Unlike the vast majority of computer science research, big data has received significant public and media interest. Headlines such as “Big data: the greater good or invasion of privacy?” [7] and “Big data is opening doors, but maybe too many” [11] speak volumes as to the common perception of big data. From the outset it is clear that big data is intertwined with considerable technical and socio-technical issues, but an exact definition is unclear. Early literature using the term has come from numerous fields. This shared provenance has led to multiple, ambiguous and often contradictory definitions. In order to further research goals and eliminate ambiguity, a concrete definition is necessary.

Anecdotally, big data is predominantly associated with two ideas: data storage and data analysis. Despite the sudden interest in big data, these concepts are far from new and have long lineages. This, therefore, raises the question as to how big data is notably different from conventional data processing techniques. For rudimentary insight as to the answer to this question one need look no further than the term big data. “Big” implies significance, complexity and challenge. Unfortunately the term “big” also invites quantification, and therein lies the difficulty in furnishing a definition.

Amongst the most cited definitions is that included in a META Group (now Gartner) report from 2001 [9]. The Gartner report makes no mention of the phrase “big data” and predates the current trend. However, the report has since been co-opted as a key definition. Gartner proposed a threefold definition encompassing the “three Vs”: Volume, Velocity, Variety. This is a definition rooted in magnitude. The report remarks upon the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. As is common throughout big data literature, the evidence presented in the Gartner definition is entirely anecdotal. No numerical quantification of big data is afforded. This definition has since been reiterated by NIST [3] and by Gartner in 2012 [6], and expanded upon by IBM [5] and others to include a fourth V: Veracity. Veracity includes questions of trust and uncertainty with regards to data and the outcome of analysis of that data.

Oracle avoids employing any Vs in offering a definition. Instead Oracle [8] contends that big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data. Such new sources include blogs, social media, sensor networks, image data and other forms of data which vary in size, structure, format and other factors. Oracle therefore asserts a definition which is one of inclusion: big data is the inclusion of additional data sources to augment existing operations. Notably, and perhaps unsurprisingly, the Oracle definition is focused upon infrastructure. Unlike those offered by others, Oracle's definition places emphasis upon a set of technologies including NoSQL, Hadoop, HDFS, R and relational databases. In doing so Oracle presents both a definition of big data and a solution to big data. While this definition is somewhat more easily applied than others, it similarly lacks quantification. Under the Oracle definition it is not clear exactly when the term big data becomes applicable; rather, it provides a means to “know it when you see it”.

Intel is one of the few organisations to provide concrete figures in their literature. Intel links big data to organisations “generating a median of 300 terabytes (TB) of data weekly” [2]. Rather than providing a definition as per the aforementioned organisations, Intel describes big data through quantifying the experiences of its business partners. Intel suggests that the organisations which were surveyed deal extensively with unstructured data and place an emphasis on performing analytics over their data, which is produced at a rate of up to 500 TB per week. Intel asserts that the most common data type involved in analytics is business transactions stored in relational databases (consistent with Oracle’s definition), followed by documents, email, sensor data, blogs and social media.

Microsoft provides a notably succinct definition: “Big data is the term increasingly used to describe the process of applying serious computing power - the latest in machine learning and artificial intelligence - to seriously massive and often highly complex sets of information” [4]. This definition states in no uncertain terms that big data requires the application of significant compute power. This is alluded to in previous definitions but not stated outright. Furthermore, this definition introduces two technologies, machine learning and artificial intelligence, which have been overlooked by previous definitions. This, therefore, introduces the concept of there being a set of related technologies which form crucial parts of a definition.

A definition, or at least an indication of related technologies, can be obtained through an investigation of related terms. Google Trends provides the following terms in relation to big data [10], from most to least frequent: data analytics, Hadoop, NoSQL, Google, IBM, and Oracle. From these terms a number of trends are evident. Firstly, big data is intrinsically related to data analytics and the discovery of meaning from data. Secondly, it is clear that there are a number of related technologies, as alluded to by the Microsoft definition, namely NoSQL and Apache Hadoop. Finally, it is evident that there are a number of organisations, specifically industrial organisations, which are linked with big data.

As suggested by Google Trends, there is a set of technologies which are frequently suggested as being involved in big data. NoSQL stores including Amazon Dynamo, Cassandra, CouchDB, MongoDB et al. play a critical role in storing large volumes of unstructured and highly variable data. Related to the use of NoSQL data stores there is a range of analysis tools and methods including MapReduce, text mining, NLP, statistical programming, machine learning and information visualisation. The application of one of these technologies alone is not sufficient to merit the use of the term big data. Rather, trends suggest that it is the combination of a number of technologies and the use of significant data sets that merit the term.
These trends present big data as a technical movement which incorporates ideas new and old and, unlike other definitions, provide little commentary as to social and business implications.

While the previously mentioned definitions rely upon a combination of size, complexity and technology, a less common definition relies purely upon complexity. The Method for an Integrated Knowledge Environment (MIKE2.0) project, frequently cited in the open source community, introduces a potentially contradictory idea: “Big Data can be very small and not all large datasets are big” [1]. This is an argument in favour of complexity and not size as the dominant factor. The MIKE project argues that it is a high degree of permutations and interactions within a dataset which defines big data.

The idea expressed latterly in the MIKE2.0 definition, that big data is not easily handled by conventional tools, is a common anecdotal definition. This idea is supported by the NIST definition, which states that big data is data which “exceed(s) the capacity or capability of current or conventional methods and systems” [3]. Given the constantly advancing nature of computer science this definition is not as valuable as it may initially appear. The assertion that big data is data that challenges current paradigms and practices is nothing new. This definition suggests that data is “big” relative to the current standard of computation. The application of additional computation, or indeed the advancing of the status quo, promises to shrink big data. This definition can only serve as a set of continually moving goalposts and suggests that big data has always existed, and always will.

Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably, all definitions make at least one of the following assertions:

Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
Technologies: the tools and techniques which are used to process a sizable or complex dataset are a critical factor.

The definitions surveyed here all encompass at least one of these factors; most encompass two. An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
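
The logical shape of the proposed definition can be made explicit with a small sketch. The Python below is illustrative only and is not part of the original survey: the names `Workload`, `fits_definition` and `NAMED_TECHNIQUES` are hypothetical, and reading "a series of techniques" as "at least one technique" is an assumption. Whether a data set counts as large or complex is left to the caller, because the definition, like those surveyed, supplies no numerical threshold.

```python
from dataclasses import dataclass

# Techniques named in the closing definition. The definition says the list is
# "not limited to" these, so this set is for illustration, not a whitelist.
NAMED_TECHNIQUES = frozenset({"NoSQL", "MapReduce", "machine learning"})


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a storage-and-analysis workload."""
    is_large: bool    # judged by the caller; no size threshold is defined
    is_complex: bool  # judged by the caller; likewise unquantified
    techniques: frozenset = frozenset()


def fits_definition(w: Workload) -> bool:
    """Check the structure of the proposed definition:
    (large and/or complex data) AND (some processing techniques in use).
    Assumption: one or more techniques counts as "a series of techniques"."""
    return (w.is_large or w.is_complex) and len(w.techniques) > 0


# A small but complex data set, in the spirit of the MIKE2.0 remark that
# "Big Data can be very small".
small_but_complex = Workload(
    is_large=False,
    is_complex=True,
    techniques=frozenset({"NoSQL", "MapReduce"}),
)
print(fits_definition(small_but_complex))  # → True
```

Under this reading, techniques applied to data that is neither large nor complex do not qualify, which matches the earlier observation that the application of a technology alone does not merit the term.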

2. REFERENCES

[1] Big Data Definition - MIKE2.0, the open source methodology for Information Development. http://mike2.openmethodology.org/wiki/Big Data Definition.
[2] Intel Peer Research on Big Data Analysis. http://www.intel.com/content/www/us/en/bigdata/data-insights-peer-research-report.html.
[3] NIST Big Data Working Group (NBD-WG). http://bigdatawg.nist.gov/home.php.
[4] The Big Bang: How the Big Data Explosion Is Changing the World - Microsoft UK Enterprise Insights Blog - Site Home - MSDN Blogs. http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04 big-bang-how-the-big-data-explosion-is-changing-theworld.aspx.
[5] IBM. What is big data? - Bringing big data to the enterprise. http://www-01.ibm.com/software/data/bigdata/, July 2013.
[6] M. A. Beyer and D. Laney. The importance of big data: A definition. Stamford, CT: Gartner, 2012.
[7] P. Chatterjee. Big data: the greater good or invasion of privacy? http://www.guardian.co.uk/commentisfree/2013/mar/12/bigdata-greater-good-privacy-invasion, 2013.
[8] J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
[9] L. Douglas. 3D data management: Controlling data volume, velocity and variety. Gartner. Retrieved, 6, 2001.
[10] Google. Google Trends for Big Data, 2013.
[11] S. Lohr. Big Data Is Opening Doors, but Maybe Too Many. https://www.nytimes.com/2013/03/24/technology/bigdata-and-a-renewed-debate-overprivacy.html?pagewanted=all& r=0.

