Archive

Posts Tagged ‘Database’

Database engine running on a GPU

January 27, 2012

Alenka is a modern analytical database engine written to take advantage of vector-based processing and the high memory bandwidth of modern GPUs.

Features include:

  • Vector-based processing: the CUDA programming model allows a single operation to be applied to an entire set of data at once (see the sketch after this list).
  • Self-optimizing compression: ultra-fast compression and decompression performed directly on the GPU.
  • Column-based storage: minimizes disk I/O by accessing only the relevant data.
  • Fast database loads: data load times measured in minutes, not hours.
  • Open source and free.
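To make the vector-processing point concrete, here is a minimal sketch of the data-parallel model the CUDA bullet describes. It is not Alenka's code; it assumes Python with the Numba package and a CUDA-capable GPU, and all names in it are invented. One kernel launch applies the same operation to every element of a column, one GPU thread per element.

    # Illustrative sketch only -- not Alenka's implementation.
    # Assumes a CUDA-capable GPU, the NVIDIA driver, and Numba (pip install numba).
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale_column(col, factor, out):
        # Apply one operation (multiply by factor) to every element of a column.
        i = cuda.grid(1)          # absolute index of this GPU thread
        if i < col.shape[0]:      # guard threads that fall past the end of the data
            out[i] = col[i] * factor

    prices = np.random.rand(1_000_000).astype(np.float32)  # one "column" of data
    discounted = np.empty_like(prices)

    threads_per_block = 256
    blocks = (prices.size + threads_per_block - 1) // threads_per_block
    scale_column[blocks, threads_per_block](prices, np.float32(0.9), discounted)

Every element is handled by its own thread in a single launch, which is the "one operation over an entire set of data" idea that both CUDA and column-oriented engines exploit.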

Some benchmarks:

  • Alenka: Pentium E5200 (2 cores), 4 GB of RAM, 1x2TB hard drive, NVIDIA GTX 260
  • Current Top #10 TPC-H 300GB non-clustered performance result: MS SQL Server 2005 on a Hitachi BladeSymphony (8 CPU/8 Cores), 128 GB of RAM, 290x36GB 15K rpm drives
  • Current Top #7 TPC-H 300GB non-clustered performance result: MS SQL Server 2005 on an HP ProLiant DL585 G2 (4 CPU/8 Cores), 128 GB of RAM, 200x36GB 15K rpm drives

via ålenkå – Browse Files at SourceForge.net.


Extreme databases: The biggest and fastest

November 8, 2011

Calling something big or fast immediately invites the question, “Compared to what?” A “big” database for a small company is dwarfed by a national data repository growing by 28 petabytes per year, and a “fast” database that processes transactions for an e-commerce site is slow compared to one that delivers access times measured in milliseconds in order to execute stock trades automatically.

But even if your company isn’t in the running for the biggest or fastest database on the planet, the lessons in administering such databases may be applicable to your environment. It’s a sure bet that the trends in this realm are going to filter down to databases of all sizes.

via Extreme databases: The biggest and fastest.


Storing hundreds of millions of simple key-value pairs in Redis

November 2, 2011

We needed a solution that would:

  • Look up keys and return values very quickly
  • Fit the data in memory, and ideally within one of the EC2 high-memory types (the 17GB or 34GB, rather than the 68GB instance type)
  • Fit well into our existing infrastructure
  • Be persistent, so that we wouldn’t have to re-populate it if a server died

One simple solution would be to store the mappings as rows in a database, with “Media ID” and “User ID” columns. However, a SQL database seemed like overkill given that these IDs were never updated (only inserted), didn’t need to be transactional, and didn’t have any relations with other tables.

Instead, we turned to Redis, an advanced key-value store that we use extensively here at Instagram (for example, it powers our main feed). Redis is a key-value Swiss Army knife: rather than offering only plain “set key, get key” mechanics like Memcached, it provides powerful aggregate types such as sorted sets and lists. It has a configurable persistence model, saving to disk in the background at a specified interval, and can be run in a master-slave setup.
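As a rough illustration of that difference, here is a minimal sketch using the redis-py client; the key names and IDs are invented for this example and are not Instagram's actual schema.

    # Minimal sketch using the redis-py client (pip install redis).
    # Key names and IDs below are invented for illustration.
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    # Plain "set key, get key" -- the part Memcached also covers.
    r.set("media:1234567", 42)            # media ID -> owning user ID
    owner = int(r.get("media:1234567"))   # 42

    # Aggregate types are where Redis goes further, e.g. a sorted set
    # keeping one user's media ordered by upload timestamp.
    r.zadd("user:42:media", {"1234567": 1320192000, "1234568": 1320195600})
    latest = r.zrevrange("user:42:media", 0, 9)   # ten most recent media IDs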

via Instagram Engineering • Storing hundreds of millions of simple key-value pairs in Redis.

Master-Master Replication | Xeround Blog

October 29, 2011

Master-Master Replication

In Master-Master Replication, or Multi-Master Replication, there are multiple masters, and every master can be used both for reading and for writing data. Updates can be made on any of the masters and are propagated to all the others. Since masters are abundant and can (arguably) be added on a whim, write scalability becomes a non-issue.

Sounds too good to be true? It is.

Master-Master Replication is notoriously hard to set up and maintain, even before taking into account real-world facts such as communication latency, bandwidth limits and network failures. The main difficulty is effectively resolving the conflicting updates that multiple masters make possible. Because this setup allows different masters to make simultaneous, conflicting changes to the same data, a conflict resolution policy has to be implemented to protect against them. There are numerous approaches to conflict resolution, and choosing the right mix of resolution methods in a policy largely depends on the actual requirements of the application. Ironically, best practice and common sense suggest that the simplest and most effective conflict resolution strategy is to avoid conflicts altogether (and thus avoid having to handle them).
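To make “conflict resolution policy” concrete, here is a toy sketch of one of the simplest policies, last-writer-wins. The Update type and its fields are invented for illustration; real systems typically need richer policies or, as the excerpt notes, avoid conflicts in the first place.

    # Toy last-writer-wins resolver; the Update type and its fields are
    # hypothetical, purely to illustrate what a resolution policy decides.
    from dataclasses import dataclass

    @dataclass
    class Update:
        key: str
        value: str
        timestamp: float   # wall-clock time at the originating master
        master_id: int     # used only to break exact-timestamp ties

    def resolve(a: Update, b: Update) -> Update:
        """Pick the surviving update for one key under last-writer-wins."""
        if a.timestamp != b.timestamp:
            return a if a.timestamp > b.timestamp else b
        return a if a.master_id > b.master_id else b   # deterministic tie-break

    # Two masters accepted conflicting writes to the same row:
    u1 = Update("user:7:email", "old@example.com", 1319900000.0, master_id=1)
    u2 = Update("user:7:email", "new@example.com", 1319900003.5, master_id=2)
    assert resolve(u1, u2).value == "new@example.com"

Last-writer-wins depends on reasonably synchronized clocks and silently discards one of the writes, which is part of why avoiding conflicts altogether is the commonly recommended strategy.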

Putting aside the challenges involved in keeping a Master-Master gig going and conflict-free, the real issue is that the Master-Master Replication model rarely provides a viable, long-term solution for Write Scalability.

This is because every additional master that joins the gang also carries the price of having to be synchronized with every update. As all masters update all other masters, with N masters every write has to be propagated to the other N-1, and total inter-master traffic grows roughly quadratically. The marginal cost of every additional master therefore increases, pushing more communication onto what will eventually become the setup’s bottleneck: the interconnect.

By opting for Master-Master Replication to address the need for scalability, we merely transferred the problem from the domain of a single server to that of the network (after having added a lot of complexity along the way).

via Master-Master Replication | Xeround Blog.

Differences between memcached and redis?

October 27, 2011

Source: Adam D’Angelo on What are the differences between memcached and redis? – Quora.

Database as a Service (DBaaS) Product Directory

October 27, 2011

Project Voldemort

October 27, 2011

Voldemort is a distributed key-value storage system

  • Data is automatically replicated over multiple servers.
  • Data is automatically partitioned so that each server contains only a subset of the total data (a simplified sketch of this kind of partitioning follows the list).
  • Server failure is handled transparently.
  • Pluggable serialization is supported to allow rich keys and values, including lists and tuples with named fields, as well as integration with common serialization frameworks like Protocol Buffers, Thrift, Avro and Java Serialization.
  • Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system.
  • Each node is independent of other nodes, with no central point of failure or coordination.
  • Good single-node performance: you can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor.
  • Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.
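The partitioning bullet is the Dynamo-style idea of consistent hashing. The sketch below is a simplified illustration of that idea in Python, not Voldemort’s actual implementation, and all names in it are invented.

    # Simplified consistent-hashing sketch of Dynamo-style partitioning and
    # replication; illustrative only, not Voldemort's implementation.
    import bisect
    import hashlib

    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=64, replicas=2):
            self.replicas = replicas
            # Each node owns many points ("virtual nodes") on the hash ring.
            self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
            self.points = [h for h, _ in self.ring]

        def nodes_for(self, key):
            # Walk clockwise from the key's hash to find its replica nodes.
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            chosen = []
            while len(chosen) < self.replicas:
                node = self.ring[idx][1]
                if node not in chosen:
                    chosen.append(node)
                idx = (idx + 1) % len(self.ring)
            return chosen

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.nodes_for("user:42"))   # e.g. ['node-b', 'node-c']

Because each server owns only the keys that hash onto its segments of the ring, every node holds a subset of the data, and adding or removing a node moves only the keys on the affected segments.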

Project Voldemort.