Papers read:
- J. Gray, The Next Database Revolution. SIGMOD 2004, pp. 13-18, June 2004.
- Hans-Peter Kriegel, et al. Future trends in data mining. Data Mining and
  Knowledge Discovery [1384-5810] 2007 vol:15 iss:1 pg:87

The discussions of current database limits [Gray, 2004] and of current limits
in data-mining technology [Kriegel, 2007] do not focus solely on the
technological barriers of current datasets with regard to memory, latency and
storage, but also on how to process this data efficiently. [Gray, 2004] talks
about the use of interfaces to connect databases to clients, for example by
providing direct interfaces to the clients using SOAP calls. The use of
distributed databases, however, goes unmentioned. But first things first:

= Data-mining & Usability =

Data-mining nowadays focuses on subset solutions, with little attempt to
generalize the effort for mass use. The tools and methods provided and used are
merely the building blocks for algorithms that target such subset solutions,
relying on well-known datasets or on a lot of sanitized and known meta-data.

With the ever-increasing amount of data gathered and stored, generating
human-understandable results (if any result at all) becomes harder and harder.
The underlying technique for generating results often cannot be explained by
logical human reasoning, which makes the results hard to justify or even
explain and leaves potentially good algorithms and strategies unused.

The (near) future should show us whether we are capable of extracting results
which add value to understanding the process instead of showing heuristics,
allowing us to reason further about what is going on inside a process.

= Memory based databases with a file based backend =

Reducing and eliminating the latency of access to database objects on specific
media has always been a major focus in the design of algorithms for database
query automation. Recent technological inventions and improvements have led to
developments that allow us to run any average, small-sized database fully in
memory. This reduces access to every object within the database to an equal
level, making the latency decisions in algorithms obsolete and clearing the
path for a new type of algorithm design that focuses on spanning the whole
dataset as fast as possible.

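As an illustration of the idea above, here is a minimal sketch that assumes
Python's standard sqlite3 module as a stand-in for a memory database. The table
name `readings` and the helper functions are hypothetical and not taken from
either paper. The scan simply visits every row once, without any tuning for
where the data lives.

```python
import sqlite3


def build_in_memory_db(rows):
    """Create a SQLite database that lives entirely in RAM (":memory:")."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
    conn.executemany("INSERT INTO readings (id, value) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def full_scan_sum(conn):
    """Visit every row once; no index or block-placement tuning is involved."""
    total = 0.0
    for (value,) in conn.execute("SELECT value FROM readings"):
        total += value
    return total


# Hypothetical sample data: 1000 rows with values 0.0 .. 999.0.
conn = build_in_memory_db([(i, float(i)) for i in range(1000)])
scan_total = full_scan_sum(conn)
```

The design point is only that, once everything sits in memory, a plain pass
over the whole dataset is a reasonable default strategy.
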
Together with a full-memory database comes the process of designing the
database in such a way that it can be mirrored on persistent media for obvious
reasons (power failure, transport, backup, revisions). Instead of taking the
traditional block-level disk access approach, new disks come with the ability
to do clever queuing and latency-reducing actions on file-based objects. The
future will show whether block-based access (memory database) with file-based
storage will be one of the possibilities, and how best to cope with large
database sets.

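Mirroring a memory database onto persistent media could look roughly like the
sketch below. It assumes Python's standard sqlite3 module and its
`Connection.backup` method (available since Python 3.7). The table name
`objects`, the file name `snapshot.db` and the helper `mirror_to_disk` are
hypothetical.

```python
import os
import sqlite3
import tempfile


def mirror_to_disk(mem_conn, path):
    """Copy the whole in-memory database into a file-based database."""
    disk_conn = sqlite3.connect(path)
    try:
        mem_conn.backup(disk_conn)  # sqlite3.Connection.backup, Python 3.7+
    finally:
        disk_conn.close()


# Working copy held entirely in memory.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, payload TEXT)")
mem.executemany("INSERT INTO objects (payload) VALUES (?)",
                [("a",), ("b",), ("c",)])
mem.commit()

# Mirror it to a file in a temporary directory.
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.db")
mirror_to_disk(mem, snapshot_path)

# Reopen the file to check that the mirror is readable independently of memory.
restored = sqlite3.connect(snapshot_path)
restored_count = restored.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
restored.close()
```

A snapshot like this covers the backup and transport cases. Surviving a power
failure between snapshots would need additional logging, which this sketch does
not attempt.
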
= Distributed databases =

One area not covered by [Kriegel, 2007] and [Gray, 2004] is the development of
datasets of several petabytes (like the genome databases) that need to be
accessed by many concurrent clients throughout the world, so link-layer
latencies come into the picture.

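To give a feeling for why link latency matters here, the following
back-of-envelope sketch models access time as round trips plus transfer time.
The model and all figures (round-trip times, payload size, bandwidth) are
assumptions chosen purely for illustration, not measurements from either paper.

```python
def access_time_s(round_trips, rtt_s, payload_bytes, bandwidth_bytes_per_s):
    """Estimated time = latency cost of the round trips + transfer time."""
    return round_trips * rtt_s + payload_bytes / bandwidth_bytes_per_s


# Hypothetical figures: same query and same bandwidth, different distance.
nearby = access_time_s(round_trips=10, rtt_s=0.001,
                       payload_bytes=1_000_000,
                       bandwidth_bytes_per_s=100_000_000)
remote = access_time_s(round_trips=10, rtt_s=0.150,
                       payload_bytes=1_000_000,
                       bandwidth_bytes_per_s=100_000_000)
```

With these assumed numbers the nearby client waits about 0.02 s and the remote
client about 1.5 s, even though both transfer the same amount of data. The
difference comes entirely from the round trips.
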
Finding ways of making these datasets available to all clients at an
acceptable/uniform access time is something that will gain major importance in
the future, as datasets are growing rapidly due to the development of new
sensors and image/video-based storage. More of those datasets also have a
heavily shared nature, as more research and business will be gathering and
sharing data from multiple (geographical) locations while still needing
centralized query interfaces.