Publications of year 2005
Conference articles
Abstract: Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design - including the separation into two phases, the form of the programming language, and the properties of the aggregators - exploits the parallelism inherent in having data and computation distributed across many machines.
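The two-phase model described in the abstract can be illustrated with a minimal Python sketch. This is not Sawzall code and not the authors' implementation: the record fields (`host`, `bytes`), the function names, and the choice of a sum aggregator are all assumptions made for illustration. The filter looks at one record at a time and emits key/value pairs; the aggregator combines whatever was emitted.

```python
from collections import defaultdict

# Hypothetical log records; the field names are assumptions for illustration.
records = [
    {"host": "a.example", "bytes": 120},
    {"host": "b.example", "bytes": 300},
    {"host": "a.example", "bytes": 80},
    {"host": "c.example", "bytes": 0},
]


def filter_phase(record):
    """Examine a single record in isolation and emit (key, value) pairs."""
    if record["bytes"] > 0:  # skip empty transfers
        yield record["host"], record["bytes"]


def aggregate_phase(emitted):
    """Sum the emitted values per key (a simple sum aggregator)."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)


def run(shards):
    """Apply the filter to every record of every shard, then aggregate."""
    emitted = (
        pair
        for shard in shards
        for record in shard
        for pair in filter_phase(record)
    )
    return aggregate_phase(emitted)


# Split the input into two shards, as if it were stored on two machines.
result = run([records[:2], records[2:]])
print(result)
```

Because the filter never looks at more than one record and addition does not depend on order, the result in this sketch is the same however the records are split into shards; that is the kind of property that makes such a computation easy to distribute.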
@inproceedings{pike2005idp,
  Author = {Pike, Rob and Dorward, Sean and Griesemer, Robert and Quinlan, Sean},
  Title = {Interpreting the data: Parallel analysis with Sawzall},
  Booktitle = {Scientific Programming},
  Volume = {13},
  Number = {4},
  Pages = {277--298},
  Publisher = {IOS Press},
  Year = {2005},
  Note = {Reserved for Gabriel Beaulieu}
}
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
The documents contained in these directories are made available by the contributing authors to ensure the timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all other rights are retained by the authors and by the copyright holders, even though they present their work here in electronic form. Persons copying this information must adhere to the terms and constraints covered by each author's copyright. These works may not be made available elsewhere without the explicit permission of the copyright holder.