Thread: CS614 Data Warehousing Assignment No. 5 Idea Solution (Spring, June 2011)

    Question # 1 [20 marks]

    Read the article thoroughly and then answer the following questions.

    1. What three ways are highlighted by the author to handle scalability issues and get high performance at low cost?

    2. How are the three approaches to parallelism used for better performance, and which of the three do you think is the most suitable? Give reasons to support your ideas.

    Note:

    i) Do not cut, copy, and paste from the research paper. Write in your own words. If you cut, copy, and paste from the research paper, you will get zero marks with no leniency.

    ii) Explain your answer briefly and precisely. Extra, i.e. unrequired, explanation will lead to a deduction of marks. Write your answer in point form.




    1. What three ways are highlighted by the author to handle scalability issues and get high performance at low cost?

    According to the author, scalability and high performance at low cost come from running a shared-nothing architecture on commodity hardware together with two software tactics that offer the possibility of dramatically better performance. These two tactics, discussed below, can be used individually or together.

    Vertical partitioning via column-oriented database architectures [6]: Existing shared-nothing
    databases partition data “horizontally” by distributing the rows of each table across both multiple
    nodes and multiple disks on each node. Recent research has focused on an interesting
    alternative: partitioning data vertically so that different columns in a table are stored in different
    files. While still providing an SQL interface to users, these “column-oriented” databases,
    particularly when coupled with horizontal partitioning in a shared-nothing architecture, offer
    tremendous performance advantages.

    For example, in a typical data warehouse query that accesses only a few columns from each
    table, the DBMS need only read the desired columns from disk, ignoring the other columns that
    do not appear in the query. In contrast, a conventional, row-oriented DBMS must read all
    columns whether they are used in the query or not. In round numbers, this will mean that a column store reads 10 to 100 times less data from disk, resulting in a dramatic performance
    advantage relative to a row store, both running on the same shared-nothing commodity
    hardware.
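
    To make the I/O saving concrete, here is a rough back-of-the-envelope sketch in Python (my own illustration, not from the article): the table size, column count, and value width below are all assumed figures, and the query is taken to touch only two columns.

        # Hypothetical figures: bytes read for a query that touches 2 of 20 columns
        # in a row store (must read every column) versus a column store (reads only
        # the columns named in the query).
        NUM_ROWS = 100_000_000        # rows in the fact table (assumed)
        NUM_COLUMNS = 20              # columns in the fact table (assumed)
        BYTES_PER_VALUE = 8           # assume fixed-width 8-byte values
        COLUMNS_IN_QUERY = 2          # the query reads only two columns

        row_store_bytes = NUM_ROWS * NUM_COLUMNS * BYTES_PER_VALUE
        column_store_bytes = NUM_ROWS * COLUMNS_IN_QUERY * BYTES_PER_VALUE

        print("row store reads    : %.1f GB" % (row_store_bytes / 1e9))
        print("column store reads : %.1f GB" % (column_store_bytes / 1e9))
        print("reduction factor   : %dx" % (row_store_bytes // column_store_bytes))

    With these assumed numbers the column store reads 10 times less data; with wider tables and narrower queries the factor approaches the 100x end of the range quoted above.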
    Compression-aware databases [1]: It is clear to any observer of the computing scene that
    CPUs are getting faster at an incredible rate. Moreover, CPUs will increasingly come packaged
    with multiple cores, possibly 10 or more, in the near future. Hence, the cost of computation is
    plummeting. In contrast, disks are getting much bigger and much cheaper in cost per byte, but
    they are not getting any faster in terms of the bandwidth between disk and main memory.
    Hence, the cost of moving a byte from disk to main memory is getting increasingly expensive,
    relative to the cost of processing that byte. This suggests that it would be smart to trade some
    of the cheap resource (CPU) to save the expensive resource (disk bandwidth). The clear way
    to do this is through data compression.

    A multitude of compression approaches, each tailored to a specific type and representation of
    data, have been developed, and there are new database designs that incorporate these
    compression techniques throughout query execution. In round numbers, a system that uses
    compression will yield a database that is one third the size (and that needs one third the disks).
    More importantly, only one-third the number of bytes will be brought into main memory,
    compared to a system that uses no compression. This will result in dramatically better I/O
    performance.

    However, there are two additional points to note. First, some systems, such as Oracle and
    SybaseIQ, store compressed data on the disk, but decompress it immediately when it is brought
    into main memory. Other systems, notably Vertica, do not decompress the data until it must be
    delivered to the user. An execution engine that runs on compressed data is dramatically more
    efficient than a conventional one that doesn’t run on compressed data. The former accesses
    less data from main memory, and copies and/or writes less data to main memory, resulting in
    better L2 cache performance and fewer reads and writes to main memory.
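
    As a toy illustration of running on compressed data (a hedged sketch of the idea, not any vendor's actual mechanism), a column can be held as run-length-encoded (value, run length) pairs and an aggregate computed directly on the runs, without ever materializing the decompressed values:

        # Illustrative only: a column stored as run-length-encoded (value, count)
        # pairs, with a SUM computed directly on the compressed representation.
        rle_column = [(5, 1_000_000), (7, 250_000), (9, 4_000_000)]  # (value, run length)

        # SUM without decompressing: weight each value by its run length.
        total = sum(value * count for value, count in rle_column)
        print(total)  # 42,750,000

    The compressed form touches three tuples instead of millions of individual values, which is where the L2-cache and memory-bandwidth savings described above come from.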

    Second, a column store can compress data more effectively than a row store. The reason is
    that every data element on a disk block comes from a single column, and therefore is of the
    same data type. Hence, a column-based database execution engine only has to compress
    elements of a single data type, rather than elements from many data types, resulting in a three-
    fold improvement in compression over row-based database execution engines.
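
    The single-data-type argument can be checked with a generic byte-level compressor. The sketch below uses Python's standard zlib on made-up data; the exact ratios depend entirely on the data, but laying each column out contiguously typically compresses better than interleaving values of different types row by row:

        import struct
        import zlib

        # Made-up table: an integer column, a float column, and a short string column.
        rows = [(i % 50, i * 0.001, "city_%d" % (i % 10)) for i in range(100_000)]

        # Row-oriented layout: values of different types interleaved record by record.
        row_block = b"".join(
            struct.pack("i", a) + struct.pack("d", b) + c.encode() for a, b, c in rows
        )

        # Column-oriented layout: each block holds values of a single data type.
        int_col = b"".join(struct.pack("i", a) for a, _, _ in rows)
        float_col = b"".join(struct.pack("d", b) for _, b, _ in rows)
        str_col = b"".join(c.encode() for _, _, c in rows)

        row_size = len(zlib.compress(row_block))
        col_size = sum(len(zlib.compress(col)) for col in (int_col, float_col, str_col))

        print("row-wise compressed size   :", row_size)
        print("column-wise compressed size:", col_size)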

    What You Can Do

    The message to be taken away from this article is straightforward: You can obtain a scalable
    database system with high performance at low cost by using the following tactics.

    1) Use a shared-nothing architecture. Anything else will be much less scalable.

    2) Build your architecture from commodity parts. There is no reason why the cost of a grid
    should exceed $700 per (CPU, disk) pair. If you are paying more, then you are offering
    a vendor a guided tour through your wallet.

    3) Get a DBMS with compression. This is a good idea today, and will become an even
    better idea tomorrow. It offers about a factor of three performance improvement.

    4) Use a column-store database. These are 10 to 100 times faster than a row-store
    database on star-schema warehouse queries.

    5) Make sure your column-store database has an executor that runs on compressed data.
    Otherwise, your CPU costs can be an order of magnitude or more higher than in a
    traditional database.
    2. How are the three approaches to parallelism used for better performance, and which of the three do you think is the most suitable? Give reasons to support your ideas.

    Better Performance through Parallelism: Three Common Approaches

    There are three widely used approaches for parallelizing work over additional hardware:

    • shared memory
    • shared disk
    • shared nothing

    Shared memory: In a shared-memory approach, as implemented on many symmetric multi-processor (SMP) machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program: complex distributed locking and commit protocols are not needed, since the lock manager and buffer pool are both stored in the memory system, where they can be easily accessed by all the processors. The drawback is that the shared memory and its bus become a point of contention, so shared-memory systems do not scale well beyond a small number of processors.


    Shared disk: Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s and has been widely used by Sun Microsystems and Hewlett-Packard. Because every node reaches the data through the same interconnect and the same storage subsystem, and locks must still be coordinated across nodes, that shared path becomes the bottleneck as nodes are added.
    To make shared-disk technology work better, vendors typically implement a “shared-cache”
    design. Shared cache works much like shared disk, except that, when a node in a parallel
    cluster needs to access a disk page, it:
    1) First checks to see if the page is in its local buffer pool (“cache”)
    2) If not, checks to see if the page is in the cache of any other node in the cluster
    3) If not, reads the page from disk
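
    A minimal sketch of that three-step lookup (the helper names local_cache, peer_caches, and read_page_from_disk are hypothetical, not any real product's API):

        # Hedged sketch of the shared-cache page lookup described above.
        def get_page(page_id, local_cache, peer_caches, read_page_from_disk):
            # 1) First check this node's own buffer pool ("cache").
            if page_id in local_cache:
                return local_cache[page_id]

            # 2) If not, check the caches of the other nodes in the cluster.
            for peer_cache in peer_caches:
                if page_id in peer_cache:
                    page = peer_cache[page_id]
                    local_cache[page_id] = page   # keep a local copy for next time
                    return page

            # 3) If not, read the page from the shared disks (SAN/NAS).
            page = read_page_from_disk(page_id)
            local_cache[page_id] = page
            return page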
    Shared Nothing: In a shared-nothing approach, by contrast, each processor has its own set of
    disks. Data is “horizontally partitioned” across nodes, such that each node has a subset of the
    rows from each table in the database. Each node is then responsible for processing only the
    rows on its own disks. Such architectures are especially well suited to the star schema queries
    present in data warehouse workloads, as only a very limited amount of communication
    bandwidth is required to join one or more (typically small) dimension tables with the (typically
    much larger) fact table.
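
    As a toy Python illustration of this (my own sketch, not any DBMS's actual mechanism), rows can be horizontally partitioned across nodes by hashing a key, each "node" then aggregates only its own subset, and the small per-node results are merged at the end:

        NUM_NODES = 4

        # Made-up fact table rows.
        fact_rows = [{"store_id": i % 37, "sales": float(i % 100)} for i in range(1_000_000)]

        # Horizontal partitioning: each row is assigned to a node by hashing a key.
        partitions = [[] for _ in range(NUM_NODES)]
        for row in fact_rows:
            partitions[hash(row["store_id"]) % NUM_NODES].append(row)

        # Each node computes a partial aggregate over its own rows only.
        partial_totals = [sum(r["sales"] for r in part) for part in partitions]

        # Only the tiny per-node totals need to be shipped and merged.
        print("total sales:", sum(partial_totals))

    In a real shared-nothing system the partitions live on different machines and disks; only the small intermediate results cross the network, which is why so little communication bandwidth is needed for the star-schema queries described above.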

    In addition, every node maintains its own lock table and buffer pool, eliminating the need for
    complicated locking and software or hardware consistency mechanisms. Because shared
    nothing does not typically have nearly as severe bus or resource contention as shared-memory
    or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands
    of machines. Because of this, it is generally regarded as the best-scaling architecture [4].

    Shared-nothing clusters also can be constructed using very low-cost commodity PCs and
    networking hardware – as Google, Amazon, Yahoo, and MSN have all demonstrated. For
    example, Google’s search clusters reportedly consist of tens of thousands of shared-nothing
    nodes, each costing around $700. Such clusters of PCs are frequently termed “grid computers.”

    In summary, shared nothing dominates shared disk, which in turn dominates shared memory, in
    terms of scalability.
