Traditional file systems, such as NTFS and ext3, struggle to support complex analytics because they maintain only limited metadata and provide no global view of their contents. Answering aggregate or top-k queries over such systems typically requires an exhaustive scan or a pre-built index, both of which are time-consuming and resource-intensive. These limitations hinder efficient data management and analysis, especially on large-scale or unfamiliar file systems.
GW researchers have developed a just-in-time, sampling-based system that enables efficient and accurate analytics over large file systems without prior knowledge of their contents or extensive pre-processing. Using only a small number of disk accesses, the system applies two algorithms, FS_Agg for aggregate queries and FS_TopK for top-k queries, to produce statistically sound estimates. The approach is file-system agnostic, scales to billions of files, and eliminates the need for disk crawling or index building, offering a practical solution for real-time analytics in dynamic and expansive data environments.
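To illustrate the general idea behind a sampling-based aggregate estimator such as FS_Agg, the sketch below performs random descents through a directory tree and weights each visited directory's file count by the inverse of the probability of reaching it, yielding an unbiased estimate of the total file count. This is a minimal sketch of the standard random-descent technique, not the patented FS_Agg implementation; the function name `fs_agg_estimate` and its parameters are illustrative assumptions.

```python
import os
import random

def fs_agg_estimate(root: str, num_descents: int = 200) -> float:
    """Estimate the total number of files under `root` via random descents.

    Each descent walks from the root, choosing one subdirectory uniformly
    at random at every level. A directory reached with probability p
    contributes its file count weighted by 1/p, making each descent an
    unbiased estimator of the total (a Knuth-style tree estimator).
    """
    estimates = []
    for _ in range(num_descents):
        path, prob, total = root, 1.0, 0.0
        while True:
            try:
                entries = list(os.scandir(path))
            except OSError:
                break  # unreadable directory: keep what we have so far
            files = [e for e in entries if e.is_file(follow_symlinks=False)]
            subdirs = [e for e in entries if e.is_dir(follow_symlinks=False)]
            total += len(files) / prob       # inverse-probability weighting
            if not subdirs:
                break                        # leaf directory: descent ends
            choice = random.choice(subdirs)  # pick one branch uniformly
            prob /= len(subdirs)             # update probability of reaching it
            path = choice.path
        estimates.append(total)
    return sum(estimates) / len(estimates)   # average the unbiased samples

if __name__ == "__main__":
    print(f"Estimated file count: {fs_agg_estimate('/usr', num_descents=100):.0f}")
```

Averaging many cheap descents trades a small amount of accuracy for a large reduction in disk accesses, which is the core trade-off the system exploits; the actual algorithms add refinements (for example, handling skewed directory trees) beyond this sketch.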

Figure: A block diagram of the system architecture in accordance with a preferred embodiment of the invention
Advantages:
- Approximately 90% accuracy in query results while visiting only about 20% of the directories a full file system crawl would touch.
- Scalable to file systems containing up to one billion files and millions of directories.
- Eliminates the need for pre-built indexes or prior metadata crawls.
Applications:
- Real-time analytics over unfamiliar or hidden file systems without prior knowledge.
- Efficient data management and archiving in large-scale storage environments.
- Enhancement of mobile interfaces for accessing hidden databases through context-sensitive suggestions.