Analysis of Big Data Technologies and Methods
By Ted Garcia, Amazon.com (Firm)
Synopsis
Querying large datasets has become easier with Big Data technologies such as Hadoop's MapReduce. Large public datasets are increasingly available, many of them hosted on the Amazon Web Services (AWS) cloud. In particular, Web Data Commons has extracted RDF quads from the Common Crawl Corpus, an AWS-hosted crawl of over five billion web pages, and posted them on AWS. Technologies and methods for processing and querying these large web RDF datasets are still in their infancy. For example, only within the last few years have AWS and Elastic MapReduce (EMR) offered processing of large files with parallelization and a distributed file system. RDF technologies and methods have existed for some time, and tools are available both commercially and as open source. RDF parsers and databases are used successfully with moderately sized datasets; however, the use and analysis of RDF tools against large datasets, especially in a distributed environment, is relatively new. To assist the Big Data developer, this work explores several open source parsing tools and how they perform in Hadoop on the Amazon cloud. Apache Any23, Apache Jena RIOT/ARQ, and SemanticWeb.com's NxParser are open source parsers that can process the RDF quads contained in the Web Data Commons files. To achieve the highest performance, the datasets are processed without preprocessing them or importing them into a database; parsing and querying are therefore done on the raw Web Data Commons files. Since not all of the parsers support querying, they are compared on extract-and-parse functionality only. This work also reports challenges and lessons learned from using these parsing tools in the Hadoop and Amazon cloud environments and suggests areas for future research.
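Each Web Data Commons record is a single N-Quads line: subject, predicate, object, and graph terms followed by a terminating dot. As a rough illustration of the input the parsers above consume, here is a minimal, hypothetical quad splitter in Java. The class name and logic are illustrative assumptions, not part of the work described; production parsers such as NxParser or Jena RIOT handle escape sequences, blank nodes, and literal datatypes far more robustly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: splits one well-formed N-Quads line into its
// four terms (subject, predicate, object, graph). Assumes IRIs in <...>,
// literals in "..." with an optional datatype or language tag, and no
// embedded escape sequences.
public class QuadSplit {
    static List<String> split(String line) {
        List<String> terms = new ArrayList<>();
        int i = 0, n = line.length();
        while (i < n) {
            char c = line.charAt(i);
            if (c == ' ' || c == '\t') { i++; continue; }
            if (c == '.') break;                            // end-of-statement dot
            int start = i;
            if (c == '<') {                                 // IRI term
                i = line.indexOf('>', i) + 1;
            } else if (c == '"') {                          // literal term
                i = line.indexOf('"', i + 1) + 1;
                while (i < n && line.charAt(i) != ' ') i++; // @lang or ^^<datatype>
            } else {                                        // blank node, etc.
                while (i < n && line.charAt(i) != ' ') i++;
            }
            terms.add(line.substring(start, i));
        }
        return terms;
    }

    public static void main(String[] args) {
        String q = "<http://example.org/s> <http://example.org/p> "
                 + "\"v\"@en <http://example.org/g> .";
        for (String term : QuadSplit.split(q)) {
            System.out.println(term);
        }
    }
}
```

In a MapReduce setting, logic like this (or a real parser library) would sit inside the map function, with each input line arriving as one record from the distributed file system.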
