Traditional file systems, such as NTFS and ext3, struggle to support complex analytics because they maintain only limited metadata and provide no global view of their contents. Answering aggregate or top-k queries over such systems typically requires an exhaustive scan or a pre-built index, both of which are time-consuming and resource-intensive. These limitations hinder efficient data management and analysis, especially on large-scale or unfamiliar file systems.
GW researchers have developed a just-in-time, sampling-based system that enables efficient and accurate analytics over large file systems without prior knowledge of their contents or extensive pre-processing. Using only a small number of disk accesses, the system applies two algorithms, FS_Agg for aggregate queries and FS_TopK for top-k queries, to produce statistically sound estimates. The approach is file-system agnostic, scales to billions of files, and eliminates the need for disk crawling or index building, offering a practical solution for real-time analytics in dynamic and expansive data environments.
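To illustrate the general idea behind a sampling-based aggregate estimator such as FS_Agg, the sketch below performs random descents through a directory tree and weights each visited directory's file count by the inverse of the probability of reaching it, yielding an unbiased estimate of the total file count. This is a minimal sketch of the standard random-descent technique, not the patented FS_Agg implementation; the function name `fs_agg_estimate` and its parameters are illustrative assumptions.

```python
import os
import random

def fs_agg_estimate(root: str, num_descents: int = 200) -> float:
    """Estimate the total number of files under `root` via random descents.

    Each descent walks from the root, choosing one subdirectory uniformly
    at random at every level. A directory reached with probability p
    contributes its file count weighted by 1/p, making each descent an
    unbiased estimator of the total (a Knuth-style tree estimator).
    """
    estimates = []
    for _ in range(num_descents):
        path, prob, total = root, 1.0, 0.0
        while True:
            try:
                entries = list(os.scandir(path))
            except OSError:
                break  # unreadable directory: keep what we have so far
            files = [e for e in entries if e.is_file(follow_symlinks=False)]
            subdirs = [e for e in entries if e.is_dir(follow_symlinks=False)]
            total += len(files) / prob       # inverse-probability weighting
            if not subdirs:
                break                        # leaf directory: descent ends
            choice = random.choice(subdirs)  # pick one branch uniformly
            prob /= len(subdirs)             # update probability of reaching it
            path = choice.path
        estimates.append(total)
    return sum(estimates) / len(estimates)   # average the unbiased samples

if __name__ == "__main__":
    print(f"Estimated file count: {fs_agg_estimate('/usr', num_descents=100):.0f}")
```

Averaging many cheap descents trades a small amount of accuracy for a large reduction in disk accesses, which is the core trade-off the system exploits; the actual algorithms add refinements (for example, handling skewed directory trees) beyond this sketch.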

Figure: A block diagram of the system architecture in accordance with a preferred embodiment of the invention
Advantages:
- Approximately 90% accuracy in query results while visiting only about 20% of the directories a full file system crawl would touch.
- Scalable to file systems containing up to one billion files and millions of directories.
- Eliminates the need for pre-built indexes or prior metadata crawls.
Applications:
- Real-time analytics over unfamiliar or hidden file systems without prior knowledge.
- Efficient data management and archiving in large-scale storage environments.
- Enhancement of mobile interfaces for accessing hidden databases through context-sensitive suggestions.