Papers read:
- J. Gray, The Next Database Revolution. SIGMOD 2004, pp. 13-18, June 2004.
- Hans-Peter Kriegel, et al. Future trends in data mining. Data Mining and
  Knowledge Discovery [1384-5810] 2007 vol:15 iss:1 pg:87

The current limits of database technology [Gray, 2004] and of data-mining
technology [Kriegel, 2007] do not lie solely in the technological barriers of
current datasets with regard to memory, latency and storage, but also in
finding ways to process this data efficiently. [Gray, 2004] discusses the use
of interfaces to connect databases to their clients, e.g. by providing direct
interfaces to the clients using SOAP calls. The use of distributed databases,
however, goes unmentioned. But first things first:

= Data-mining & Usability =

Data-mining nowadays focuses on subset solutions, with little attempt to
generalize the effort for mass use. The tools and methods provided and used are
merely the building blocks for algorithms that focus on subset solutions, with
well-known datasets or a lot of sanitized and known meta-data.

With the ever-increasing amount of data gathered and stored, generating
human-understandable results (if any result at all) becomes harder and harder.
The underlying technique for generating results often cannot be explained by
logical human reasoning, making the results hard to justify or even explain and
leaving potentially good algorithms and strategies unused.

The (near) future should show whether we are capable of extracting results that
add value to our understanding of the process instead of merely showing
heuristics, allowing us to reason further about what is going on inside a
process.


= Memory based databases with a file based backend =

Reducing and eliminating the latency of access to database objects on specific
media has always been a major focus in the design of algorithms for database
query automation. Recent technological inventions and improvements have led to
developments that allow us to load any average small-size database fully into
memory. This reduces the access cost of every object within the database to an
equal level, making the latency decisions in algorithms obsolete and clearing
the path for a new type of algorithm design that focuses on spanning the whole
dataset as fast as possible.

Together with a full-memory database comes the task of designing the database
in such a way that it can be mirrored on persistent media for obvious reasons
(power failure, transport, backup, revisions). Instead of taking the
traditional block-level disk access approach, new disks come with the ability
to do clever queuing and latency-reducing actions on file-based objects. The
future will show whether block-based access (memory database) with file-based
storage will be one of the possibilities, and how best to cope with large
database sets.

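As a minimal sketch of the mirroring idea above, the following uses Python's
standard sqlite3 module to keep a database fully in memory and copy it to a
file on persistent media. SQLite is only an illustrative stand-in here; the
table name "samples", the file name "mirror.sqlite" and the helper functions
are assumptions made for this example and do not come from the papers.

```python
import os
import sqlite3
import tempfile


def build_memory_db() -> sqlite3.Connection:
    """Create a small database that lives entirely in memory."""
    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value REAL)")
    mem.executemany(
        "INSERT INTO samples (value) VALUES (?)",
        [(float(i),) for i in range(1000)],
    )
    mem.commit()
    return mem


def mirror_to_file(mem: sqlite3.Connection, path: str) -> None:
    """Copy the whole in-memory database to a file on persistent media."""
    disk = sqlite3.connect(path)
    try:
        mem.backup(disk)  # copies every page of the source database
    finally:
        disk.close()


def restore_from_file(path: str) -> sqlite3.Connection:
    """Load a mirrored file back into a fresh in-memory database."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    try:
        disk.backup(mem)
    finally:
        disk.close()
    return mem


def demo() -> tuple:
    """Mirror an in-memory database to disk, restore it, compare both."""
    query = "SELECT COUNT(*), SUM(value) FROM samples"
    mem = build_memory_db()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mirror.sqlite")  # hypothetical file name
        mirror_to_file(mem, path)
        restored = restore_from_file(path)
    before = mem.execute(query).fetchone()
    after = restored.execute(query).fetchone()
    mem.close()
    restored.close()
    return before, after
```

All queries run against the in-memory copy, so every object has roughly the
same access cost; the file is only touched when mirroring or restoring. A real
design would also have to decide how often to mirror and how to handle writes
that arrive between two mirrors, which this sketch leaves out.
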
= Distributed databases =

One area not covered by [Kriegel, 2007] and [Gray, 2004] is the development of
datasets of several petabytes (like the genome databases) that need to be
accessed by many concurrent clients throughout the world, so that link-layer
latencies come into the picture.

Finding ways of making these datasets available to all clients at an
acceptable/uniform access time is something that will gain major importance in
the future. Datasets are growing rapidly due to the development of new sensors
and image/video-based storage, and more of those datasets have a heavily shared
nature, as more research and business will be gathering and sharing data from
multiple (geographical) locations while still needing centralized query
interfaces.

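One way to picture a centralized query interface over geographically spread
datasets is a fan-out: the central point sends the same query to every site in
parallel and merges the answers. The sketch below only simulates this inside
one process; the site names, latencies, records and function names are all
hypothetical, and a real system would replace the sleep with network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sites: name -> (simulated link latency in seconds, local records)
SITES = {
    "amsterdam": (0.01, [("geneA", 3), ("geneB", 7)]),
    "tokyo": (0.03, [("geneA", 5), ("geneC", 1)]),
    "boston": (0.02, [("geneB", 2)]),
}


def query_site(name: str, key: str) -> list:
    """Simulate one remote query: wait for the link, then filter locally."""
    latency, records = SITES[name]
    time.sleep(latency)  # stands in for the network round trip
    return [(name, k, v) for k, v in records if k == key]


def central_query(key: str) -> list:
    """Fan the query out to every site in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        futures = [pool.submit(query_site, name, key) for name in SITES]
        merged = [row for future in futures for row in future.result()]
    return sorted(merged)
```

Because the sites are queried concurrently, the client waits roughly as long as
the slowest site rather than the sum of all link latencies. For example,
central_query("geneA") returns the matching rows from the two sites that hold
that key, sorted by site name.
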