I am trying to make a comparison between a system setup using Hadoop and HBase and achieving the same using Oracle DB as back end. I lack knowledge on the Oracle side of things so come to a fair comparison.
The work load and non-functional requirements are roughly this:
A) 12M transactions on two tables with one simple relation and multiple (non-text) indexes within 4 hours. That amounts to 833 transactions per second (TPS), sustained. This needs to be done every 8 hours.
B) Make sure that all writes are durable (so a running transaction survives a machine failure in case of a clustered setup) and have a decent level of availability? With a decent level of availability, I mean that regular failures such as disk and a single network interface / tcp connection drop should not require human intervention. Rare failures, may require intervention, but should be solved by just firing up a cold standby that can take over quickly.
C) Additionally add another 300 TPS, but have these happen almost continuously 24/7 across many tables (but all in pairs of two with the same simple relation and multiple indexes)?
Some context: this workload is 24/7 and the system needs to hold 10 years worth of historical data available for live querying. Query performance can be a bit worse than sub-second, but must be lively enough to consider for day-to-day usage. The ETL jobs are setup in such a way that there is little churn. Also in a relational setup, this workload would lead to little lock contention. I would expect index updates to be the major pain. To make a comparison as fair as possible, I would expect the loosest consistency level that Oracle provides.
I have no intention of bashing Oracle. I think it is a great database for many uses. I am trying to get a feeling for the tradeoff there is between going open source (and NoSQL) like we do and using a commercially supported, proven setup.
Nobody can answer this definitively.
When you go buy a car you can sensibly expect that its top speed, acceleration and fuel consumption will be within a few percent of values from independent testing. The same does not apply to software in general nor to databases in particular.
Even if you had provided exact details of the hardware, OS and data structures, along with full details of the amount of data stored as well as transactions, the performance could easily vary by a factor of 100 times depending on the pattern of usage (due to development of hot spots of record caching, disk fragmentation).
However, having said that there are a few pointers I can give:
1) invariably a nosql database will outperform a conventional DBMS - the reason d'etre for nosql databases is performance and parallelization. That does not mean that conventional DBMS's are redundant - they provide much greater flexibility for interacting with data
2) for small to mid range data volumes, Oracle is relatively slow in my experience compared with other relational databases. I'm not overly impressed with Oracle RAC as a scalable solution either.
3) I suspect that the workload would require a mid-range server for consistent results (something in the region of $8k+) running Oracle
4) While having a hot standby is a quick way to cover all sorts of outages, in a lot of cases, the risk/cost/benefit favours approaches such as RAID, multiple network cards, UPS rather than the problems of maintaining a synchronized cluster.
5) Support - have you ever bothered to ask the developers of an open source software package if they'll provide paid for support? IME, the SLAs / EULAs for commercial software are more about protecting the vendor than the customer.
So if you think its worthwhile considering, and cost is not a big issue, then the best answer would be to try it out for yourself.