


Undefined By Data: A Survey of Big Data Definitions
Jonathan Stuart Ward and Adam Barker
School of Computer Science University of St Andrews, UK

{jonathan.stuart.ward, adam.barker}@st-andrews.ac.uk

arXiv:1309.5821v1 [cs.DB] 20 Sep 2013

ABSTRACT
The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

1. BIG DATA

Since 2011, interest in an area known as big data has increased exponentially [10]. Unlike the vast majority of computer science research, big data has received significant public and media interest. Headlines such as "Big data: the greater good or invasion of privacy?" [7] and "Big data is opening doors, but maybe too many" [11] speak volumes as to the common perception of big data. From the outset it is clear that big data is intertwined with considerable technical and socio-technical issues, but an exact definition is unclear. Early literature using the term has come from numerous fields. This shared provenance has led to multiple, ambiguous and often contradictory definitions. In order to further research goals and eliminate ambiguity, a concrete definition is necessary.

Anecdotally, big data is predominantly associated with two ideas: data storage and data analysis. Despite the sudden interest in big data, these concepts are far from new and have long lineages. This therefore raises the question of how big data is notably different from conventional data processing techniques. For rudimentary insight into this question one need look no further than the term itself: "big" implies significance, complexity and challenge. Unfortunately the term "big" also invites quantification, and therein lies the difficulty in furnishing a definition.

Amongst the most cited definitions is that included in a Meta (now Gartner) report from 2001 [9]. The Gartner report makes no mention of the phrase "big data" and predates the current trend; however, it has since been co-opted as a key definition. Gartner proposed a threefold definition encompassing the "three Vs": Volume, Velocity and Variety. This is a definition rooted in magnitude. The report remarks upon the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. As is common throughout the big data literature, the evidence presented in the Gartner definition is entirely anecdotal: no numerical quantification of big data is afforded. This definition has since been reiterated by NIST [3] and by Gartner in 2012 [6], and expanded upon by IBM [5] and others to include a fourth V: Veracity. Veracity includes questions of trust and uncertainty with regard to data and the outcome of analysis of that data.

Oracle avoids employing any Vs in offering a definition. Instead, Oracle [8] contends that big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data. Such new sources include blogs, social media, sensor networks, image data and other forms of data which vary in size, structure, format and other factors. Oracle therefore asserts a definition of inclusion: big data is the incorporation of additional data sources to augment existing operations. Notably, and perhaps unsurprisingly, the Oracle definition is focused upon infrastructure. Unlike those offered by others, Oracle places emphasis upon a set of technologies including NoSQL, Hadoop, HDFS, R and relational databases. In doing so Oracle presents both a definition of big data and a solution to big data. While this definition is somewhat more easily applied than others, it similarly lacks quantification: under the Oracle definition it is not clear exactly when the term big data becomes applicable; rather, it provides a means to "know it when you see it".

Intel is one of the few organisations to provide concrete figures in its literature. Intel links big data to organisations "generating a median of 300 terabytes (TB) of data weekly" [2]. Rather than providing a definition as per the aforementioned organisations, Intel describes big data by quantifying the experiences of its business partners. Intel suggests that the organisations surveyed deal extensively with unstructured data and place an emphasis on performing analytics over data which is produced at a rate of up to 500 TB per week. Intel asserts that the most common data type involved in analytics is business transactions stored in relational databases (consistent with Oracle's definition), followed by documents, email, sensor data, blogs and social media.

Microsoft provides a notably succinct definition: "Big data is the term increasingly used to describe the process of applying serious computing power - the latest in machine learning and artificial intelligence - to seriously massive and often highly complex sets of information" [4]. This definition states in no uncertain terms that big data requires the application of significant compute power, something alluded to in previous definitions but not stated outright. Furthermore, this definition introduces two technologies, machine learning and artificial intelligence, which have been overlooked by previous definitions. This introduces the concept of a set of related technologies forming a crucial part of a definition.

A definition, or at least an indication of related technologies, can be obtained through an investigation of related terms. Google Trends provides the following terms in relation to big data [10], from most to least frequent: data analytics, Hadoop, NoSQL, Google, IBM and Oracle. From these terms a number of trends are evident. Firstly, big data is intrinsically related to data analytics and the discovery of meaning from data. Secondly, there is a number of related technologies, as alluded to by the Microsoft definition, namely NoSQL and Apache Hadoop. Finally, there is a number of organisations, specifically industrial organisations, which are linked with big data.

As suggested by Google Trends, a set of technologies is frequently suggested as being involved in big data. NoSQL stores including Amazon Dynamo, Cassandra, CouchDB and MongoDB play a critical role in storing large volumes of unstructured and highly variable data. Related to the use of NoSQL data stores is a range of analysis tools and methods including MapReduce, text mining, NLP, statistical programming, machine learning and information visualisation. The application of one of these technologies alone is not sufficient to merit the use of the term big data; rather, trends suggest that it is the combination of a number of technologies and the use of significant data sets that merits the term. These trends suggest big data as a technical movement which incorporates ideas new and old and which, unlike other definitions, provides little commentary on social and business implications.

While the previously mentioned definitions rely upon a combination of size, complexity and technology, a less common definition relies purely upon complexity. The Method for an Integrated Knowledge Environment (MIKE2.0) project, frequently cited in the open source community, introduces a potentially contradictory idea: "Big Data can be very small and not all large datasets are big" [1]. This is an argument in favour of complexity, not size, as the dominant factor: the MIKE2.0 project argues that it is a high degree of permutations and interactions within a dataset which defines big data.

The idea expressed latterly in the MIKE2.0 definition, that big data is not easily handled by conventional tools, is a common anecdotal definition. This idea is supported by the NIST definition, which states that big data is data which "exceed(s) the capacity or capability of current or conventional methods and systems" [3]. Given the constantly advancing nature of computer science, this definition is not as valuable as it may initially appear. The assertion that big data is data that challenges current paradigms and practices is nothing new. This definition suggests that data is "big" relative to the current standard of computation; the application of additional computation, or indeed the advancing of the status quo, promises to shrink big data. Such a definition can only serve as a set of continually moving goalposts, and suggests that big data has always existed, and always will.

Despite the range and differences existing within each of the aforementioned definitions, there are some points of similarity. Notably, all definitions make at least one of the following assertions:

Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
Technologies: the tools and techniques used to process a sizable or complex dataset are a critical factor.

The definitions surveyed here all encompass at least one of these factors; most encompass two. An extrapolation of these factors would therefore postulate the following: big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
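The MapReduce model named in this extrapolated definition can be illustrated with a minimal in-process sketch of its three phases (map, shuffle, reduce), using the canonical word-count example. The function names and sample documents below are illustrative, not drawn from the paper; frameworks such as Hadoop distribute these same phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (key, value) pairs: one ("word", 1) pair per occurrence.
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the grouped values for one key into a single result.
    return key, sum(values)

documents = ["big data is big", "data analysis at scale"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The point of the pattern is that each phase is independently parallelisable: map runs per document, reduce runs per key, and only the shuffle requires coordination.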

2. REFERENCES

[1] Big Data Definition - MIKE2.0, the open source methodology for Information Development. http://mike2.openmethodology.org/wiki/Big_Data_Definition.
[2] Intel Peer Research on Big Data Analysis. http://www.intel.com/content/www/us/en/bigdata/data-insights-peer-research-report.html.
[3] NIST Big Data Working Group (NBD-WG). http://bigdatawg.nist.gov/home.php.
[4] The Big Bang: How the Big Data Explosion Is Changing the World - Microsoft UK Enterprise Insights Blog. http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04/big-bang-how-the-big-data-explosion-is-changing-theworld.aspx.
[5] IBM. What is big data? - Bringing big data to the enterprise. http://www-01.ibm.com/software/data/bigdata/, July 2013.
[6] M. A. Beyer and D. Laney. The importance of big data: A definition. Stamford, CT: Gartner, 2012.
[7] P. Chatterjee. Big data: the greater good or invasion of privacy? http://www.guardian.co.uk/commentisfree/2013/mar/12/bigdata-greater-good-privacy-invasion, 2013.
[8] J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
[9] L. Douglas. 3D data management: Controlling data volume, velocity and variety. Gartner, 2001.
[10] Google. Google Trends for Big Data, 2013.
[11] S. Lohr. Big Data Is Opening Doors, but Maybe Too Many. https://www.nytimes.com/2013/03/24/technology/bigdata-and-a-renewed-debate-overprivacy.html?pagewanted=all&_r=0.

