
Undefined By Data: A Survey of Big Data Definitions
Jonathan Stuart Ward and Adam Barker
School of Computer Science University of St Andrews, UK

{jonthan.stuart.ward, adam.barker}@st-andrews.ac.uk

arXiv:1309.5821v1 [cs.DB] 20 Sep 2013

ABSTRACT
The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

1. BIG DATA

Since 2011 interest in an area known as big data has increased exponentially [10]. Unlike the vast majority of computer science research, big data has received significant public and media interest. Headlines such as “Big data: the greater good or invasion of privacy?” [7] and “Big data is opening doors, but maybe too many” [11] speak volumes as to the common perception of big data. From the outset it is clear that big data is intertwined with considerable technical and socio-technical issues, but an exact definition is unclear. Early literature using the term has come from numerous fields. This shared provenance has led to multiple, ambiguous and often contradictory definitions. In order to further research goals and eliminate ambiguity, a concrete definition is necessary.

Anecdotally, big data is predominantly associated with two ideas: data storage and data analysis. Despite the sudden interest in big data, these concepts are far from new and have long lineages. This raises the question of how big data differs notably from conventional data processing techniques. For rudimentary insight into the answer one need look no further than the term big data itself. “Big” implies significance, complexity and challenge. Unfortunately the term “big” also invites quantification, and therein lies the difficulty in furnishing a definition.

Amongst the most cited definitions is that included in a META Group (now Gartner) report from 2001 [9]. The Gartner report makes no mention of the phrase “big data” and predates the current trend. However, the report has since been co-opted as a key definition. Gartner proposed a threefold definition encompassing the “three Vs”: Volume, Velocity and Variety. This is a definition rooted in magnitude. The report remarks upon the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. As is common throughout big data literature, the evidence presented in the Gartner definition is entirely anecdotal. No numerical quantification of big data is afforded. This definition has since been reiterated by NIST [3] and, in 2012, by Gartner [6], and expanded upon by IBM [5] and others to include a fourth V: Veracity. Veracity includes questions of trust and uncertainty with regard to data and the outcome of analysis of that data.

Oracle avoids employing any Vs in offering a definition. Instead Oracle [8] contends that big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data. Such new sources include blogs, social media, sensor networks, image data and other forms of data which vary in size, structure, format and other factors. Oracle therefore asserts a definition which is one of inclusion: big data is the inclusion of additional data sources to augment existing operations. Notably, and perhaps unsurprisingly, the Oracle definition is focused upon infrastructure. Unlike those offered by others, Oracle places emphasis upon a set of technologies including NoSQL, Hadoop, HDFS, R and relational databases. In doing so they present both a definition of big data and a solution to big data. While this definition is somewhat more easily applied than others, it similarly lacks quantification. Under the Oracle definition it is not clear exactly when the term big data becomes applicable; rather, it provides a means to “know it when you see it”.

Intel is one of the few organisations to provide concrete figures in their literature. Intel links big data to organisations “generating a median of 300 terabytes (TB) of data weekly” [2]. Rather than providing a definition as per the aforementioned organisations, Intel describes big data through quantifying the experiences of its business partners. Intel suggests that the organisations which were surveyed deal extensively with unstructured data and place an emphasis on performing analytics over their data, which is produced at a rate of up to 500 TB per week. Intel asserts that the most common data type involved in analytics is business transactions stored in relational databases (consistent with Oracle’s definition), followed by documents, email, sensor data, blogs and social media.

Microsoft provides a notably succinct definition: “Big data is the term increasingly used to describe the process of applying serious computing power - the latest in machine learning and artificial intelligence - to seriously massive and often highly complex sets of information” [4]. This definition states in no uncertain terms that big data requires the application of significant compute power. This is alluded to in previous definitions but not stated outright. Furthermore, this definition introduces two technologies, machine learning and artificial intelligence, which have been overlooked by previous definitions. This introduces the concept of a set of related technologies which form crucial parts of a definition.

A definition, or at least an indication of related technologies, can be obtained through an investigation of related terms. Google Trends provides the following terms in relation to big data [10], from most to least frequent: data analytics, Hadoop, NoSQL, Google, IBM, and Oracle. From these terms a number of trends are evident. Firstly, big data is intrinsically related to data analytics and the discovery of meaning from data. Secondly, it is clear that there are a number of related technologies, as alluded to by the Microsoft definition, namely NoSQL and Apache Hadoop. Finally, it is evident that there are a number of organisations, specifically industrial organisations, which are linked with big data.

As suggested by Google Trends, there is a set of technologies which are frequently suggested as being involved in big data. NoSQL stores including Amazon Dynamo, Cassandra, CouchDB, MongoDB et al. play a critical role in storing large volumes of unstructured and highly variable data. Related to the use of NoSQL data stores there is a range of analysis tools and methods including MapReduce, text mining, NLP, statistical programming, machine learning and information visualisation. The application of one of these technologies alone is not sufficient to merit the use of the term big data. Rather, trends suggest that it is the combination of a number of technologies and the use of significant data sets that merit the term.
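As an illustration of one of the analysis methods named above, the following is a minimal single-process sketch of the MapReduce pattern applied to word counting. It is not drawn from any of the surveyed definitions and does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data is big"]))
```

In a distributed setting the map and reduce steps would run on many machines and the shuffle would move data between them; here all three run in one process purely to show the shape of the computation.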
These trends suggest big data as a technical movement which incorporates ideas new and old and which, unlike other definitions, provides little commentary as to social and business implications.

While the previously mentioned definitions rely upon a combination of size, complexity and technology, a less common definition relies purely upon complexity. The Method for an Integrated Knowledge Environment (MIKE2.0) project, frequently cited in the open source community, introduces a potentially contradictory idea: “Big Data can be very small and not all large datasets are big” [1]. This is an argument in favour of complexity and not size as the dominant factor. The MIKE2.0 project argues that it is a high degree of permutations and interactions within a dataset which defines big data.

The idea expressed latterly in the MIKE2.0 definition, that big data is not easily handled by conventional tools, is a common anecdotal definition. This idea is supported by the NIST definition, which states that big data is data which: “exceed(s) the capacity or capability of current or conventional methods and systems” [3]. Given the constantly advancing nature of computer science this definition is not as valuable as it may initially appear. The assertion that big data is data that challenges current paradigms and practices is nothing new. This definition suggests that data is “big” relative to the current standard of computation. The application of additional computation, or indeed the advancing of the status quo, promises to shrink big data. This definition can only serve as a set of continually moving goalposts and suggests that big data has always existed, and always will.

Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:

Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
Technologies: the tools and techniques which are used to process a sizable or complex dataset are a critical factor.

The definitions surveyed here all encompass at least one of these factors; most encompass two. An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
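The logical structure of the extrapolated definition can be sketched as a small predicate. This is an illustrative reading only, not part of the original survey. The names Workload and matches_definition are hypothetical. Whether a dataset counts as large or complex is left to the caller, because none of the surveyed definitions supplies a threshold. The default of two techniques is an assumption that follows the earlier remark that one technology alone is not sufficient.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a data workload (illustrative only)."""
    is_large: bool    # Size factor, judged by the caller; no threshold is implied
    is_complex: bool  # Complexity factor, judged by the caller
    techniques: frozenset = field(default_factory=frozenset)  # e.g. {"NoSQL", "MapReduce"}


def matches_definition(workload, min_techniques=2):
    """Return True if the workload fits the extrapolated definition.

    The data must be large and/or complex, and at least `min_techniques`
    techniques must be applied. The default of 2 is an assumption.
    """
    data_qualifies = workload.is_large or workload.is_complex
    tooling_qualifies = len(workload.techniques) >= min_techniques
    return data_qualifies and tooling_qualifies


if __name__ == "__main__":
    small_but_complex = Workload(
        is_large=False,
        is_complex=True,
        techniques=frozenset({"NoSQL", "machine learning"}),
    )
    print(matches_definition(small_but_complex))
```

Under this reading a small but complex dataset can qualify, which matches the MIKE2.0 position, while a large dataset processed with a single conventional tool does not.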

2. REFERENCES

[1] Big Data Definition - MIKE2.0, the open source methodology for Information Development. http://mike2.openmethodology.org/wiki/Big_Data_Definition.
[2] Intel Peer Research on Big Data Analysis. http://www.intel.com/content/www/us/en/bigdata/data-insights-peer-research-report.html.
[3] NIST Big Data Working Group (NBD-WG). http://bigdatawg.nist.gov/home.php.
[4] The Big Bang: How the Big Data Explosion Is Changing the World - Microsoft UK Enterprise Insights Blog - Site Home - MSDN Blogs. http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04 big-bang-how-the-big-data-explosion-is-changing-theworld.aspx.
[5] IBM What is big data? - Bringing big data to the enterprise. http://www-01.ibm.com/software/data/bigdata/, July 2013.
[6] M. A. Beyer and D. Laney. The importance of big data: A definition. Stamford, CT: Gartner, 2012.
[7] P. Chatterjee. Big data: the greater good or invasion of privacy? http://www.guardian.co.uk/commentisfree/2013/mar/12/bigdata-greater-good-privacy-invasion, 2013.
[8] J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
[9] D. Laney. 3D data management: Controlling data volume, velocity and variety. META Group (now Gartner), 2001.
[10] Google. Google Trends for Big Data, 2013.
[11] S. Lohr. Big Data Is Opening Doors, but Maybe Too Many. https://www.nytimes.com/2013/03/24/technology/bigdata-and-a-renewed-debate-overprivacy.html?pagewanted=all& r=0.

