kb/data/en.wikipedia.org/wiki/ELKI-0.md

5.5 KiB

title chunk source category tags date_saved instance
ELKI 1/2 https://en.wikipedia.org/wiki/ELKI reference science, encyclopedia 2026-05-05T10:11:18.253498+00:00 kb-cron

ELKI (Environment for Developing KDD-Applications Supported by Index-Structures) is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally created by the database systems research unit at LMU Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

== Description == The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms perform clustering, outlier detection, and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types, distance functions, indexes, and evaluation measures. The Java just-in-time compiler optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the type safety of Java detects many programming errors at compile time. ELKI is a free tool for analyzing data, mainly focusing on finding patterns and unusual data points without needing labels. It's written in Java and aims to be fast and able to handle big datasets by using special structures. It's made for researchers and students to add their own methods and compare different algorithms easily. ELKI has been used in data science to cluster sperm whale codas, for phoneme clustering, for anomaly detection in spaceflight operations, for bike sharing redistribution, and traffic prediction.

== Objectives == The university project is developed for use in teaching and research. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms. As research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft (AGPL) license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience is students, researchers, data scientists, and software engineers.

== Architecture == ELKI is modeled around a database-inspired core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases). This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm, local outlier factor and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists. ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration. ELKI uses a service loader architecture to allow publishing extensions as separate jar files. ELKI uses optimized collections for performance rather than the standard Java API. For loops for example are written similar to C++ iterators:

In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced garbage collection improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g., nearest neighbor search) using such optimizations.

== Visualization == The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX. Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets (for larger data sets, only a subsample of the data is visualized by default).

== Awards == Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".

== Included algorithms == Select included algorithms: