File Inventory: Classifying the Content of Enterprise Development Directories

Semantic Designs provides tools for determining which files exist in support of an application. These tools list all files found, often verify that their content is what it is claimed to be, detect duplicate files, discover files containing rights restrictions (copyrights, etc.), and produce project files to enable further processing of the various file types.

Application Content Discovery

Application development for Enterprise systems is often a messier process than any of us wish to admit. We build source files, build and test scripts, test cases, and supporting documentation. We keep variant versions for debugging. We get backup files from our editors. Sometimes we make duplicate files on purpose, sometimes by accident, sometimes as an insurance policy. We get compiler-generated temporary files, binary files, and files produced by tools that we use rarely. We include third-party library sources and binary files, some relevant to application support, some not or no longer relevant. We get organization memos and temporary documents. Finally, people (and sometimes, tools) make mistakes and produce junk files. Over the years, lots of artifacts accumulate.

Consequence: Our development systems fill up with file artifacts. Modern computer systems aggravate this with their enormous disk capacities; they can store all the files we produce without complaint.

With good configuration management, only the essential files that make up the application are stored, and can be easily retrieved. However, in many cases some files are missed, or worse, configuration practices are not what they should be.

Eventually there comes a time when the application must be analyzed or modified as a whole, and a careful inventory of its elements must be made. Whether one has a squeaky clean set of configuration-managed files, or a woolly set of development directories, one still needs to analyze the file directories which contain the application, and classify all files found, producing a File Inventory. As part of this process, it is important to determine which files are of a specific type; in practice, file extensions are not a reliable indicator of content. It is important to discover exact duplicates; knowing which files are duplicates can save enormous time and energy by avoiding re-analysis. One needs to know which files are source texts and which are binary files produced by tools, and often one needs to know the character encoding of files; reading a UTF-8 file as ISO-8859-1 succeeds without error but garbles every non-ASCII character. Finally, it is important to know what rights are claimed for a file; this often indicates the corporate owner of the file. Those files whose corporate owner does not match the organization doing the inventory can reasonably be ignored (or, there may be a hidden rights violation).
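To make the duplicate-discovery step concrete, here is a minimal sketch in Python. It is not how the FileInventory tool is implemented; it only illustrates one common way to find exact duplicates, by grouping files on size and then on a SHA-256 content digest. The function names file_digest and find_exact_duplicates are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return sorted groups of paths under root whose contents are identical.

    Illustrative only: files are bucketed by size first, so that files
    with a unique size are never read at all.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Each returned group can then be analyzed once, with the result shared by every member of the group, which is where the savings from avoiding re-analysis come from.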

FileInventory

Semantic Designs' FileInventory tool will index a set of directories, and produce a report classifying every file found. It will:

Classify all files by domain (notational system used by that file type, often determined by file extension).
Determine which files are text, or are not text ("binary")
Determine which files are EBCDIC source versus ASCII or ASCII extensions
Discover files that are exact duplicates
Validate, where possible, that a file's content matches the domain of the file as determined by file extension/prefix content
Classify files by the rights declarations they contain
Manufacture lists of files for each domain, enabling processing of those domain files (often by other Semantic Designs tools).

The results of these analyses are collected into report documents.
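The text/binary and EBCDIC/ASCII distinctions listed above can be approximated with simple byte-level heuristics. The sketch below is an assumption about how such a check might look, not a description of the FileInventory tool. The function classify_sample, its category names, the 0.95 threshold, and the choice of code page cp037 as the EBCDIC variant are all hypothetical choices made for illustration.

```python
def _ratio(text, accept):
    """Fraction of characters in text for which accept(ch) is true."""
    return sum(1 for ch in text if accept(ch)) / len(text)


def _is_ascii_text_char(ch):
    # Printable ASCII plus tab, line feed, and carriage return.
    return " " <= ch <= "~" or ch in "\t\n\r"


def _is_text_char(ch):
    # Any printable Unicode character plus common whitespace controls.
    return ch.isprintable() or ch in "\t\n\r"


def classify_sample(sample: bytes, threshold: float = 0.95) -> str:
    """Guess whether a file prefix is ASCII, UTF-8, EBCDIC text, or binary.

    Heuristic sketch only. Short EBCDIC samples can occasionally be valid
    UTF-8, so larger samples give more trustworthy answers.
    """
    if not sample:
        return "empty"
    if b"\x00" in sample:
        return "binary"  # NUL bytes are rare in text files
    if sample.isascii():
        ok = _ratio(sample.decode("ascii"), _is_ascii_text_char) >= threshold
        return "ascii-text" if ok else "binary"
    try:
        text = sample.decode("utf-8")
    except UnicodeDecodeError:
        text = None
    if text is not None and _ratio(text, _is_text_char) >= threshold:
        return "utf8-text"  # one kind of "ASCII extension"
    # cp037 maps every byte value, so this decode cannot fail; real EBCDIC
    # text decodes to mostly printable ASCII characters.
    if _ratio(sample.decode("cp037"), _is_ascii_text_char) >= threshold:
        return "ebcdic-text"
    return "binary"
```

For example, classify_sample("MOVE A TO B.".encode("cp037")) yields "ebcdic-text", while the same statement encoded as ASCII yields "ascii-text".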

File Classification Process

File classification can be complex. The FileInventory tool considers:

file extension
a significant starting fragment of each file ("file prefix")
file text encoding
the file content to determine the exact classification

Initially the tool considers the file extension as an indication of the domain (as defined by DMS: a notational system, e.g., C++, which may have many file extensions associated with it, e.g., .cpp, .h, .cxx, ...). If the file has no extension, its starting fragment is inspected for well-known conventions such as the Unix sh marking (#!<domain>). In the absence of either, the FileInventory tool applies a few heuristics to guess at the domain (COBOL files tend to contain "MOVE" in caps, C++ files contain "#include" and "class", etc.).
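The order of checks described above (extension, then #! marking, then content heuristics) can be sketched as follows. This is an illustrative assumption, not the tool's actual logic: the EXTENSION_DOMAINS table, the guess_domain function, and the decision to fall through to the later checks when an extension is unrecognized are all hypothetical.

```python
import os
import posixpath
import re

# Hypothetical extension table; a real inventory would be far larger.
EXTENSION_DOMAINS = {
    ".cpp": "C++", ".cxx": "C++", ".h": "C++",
    ".cbl": "COBOL", ".cob": "COBOL",
}


def guess_domain(filename, prefix):
    """Return (domain, evidence) for a file name and its starting fragment."""
    # 1. File extension, when it is one we recognize.
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_DOMAINS:
        return EXTENSION_DOMAINS[ext], "extension"

    # 2. Unix "#!" marking on the first line.
    first_line = prefix.split("\n", 1)[0]
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            interpreter = posixpath.basename(parts[0])
            if interpreter == "env" and len(parts) > 1:
                interpreter = parts[1]  # e.g. "#!/usr/bin/env python3"
            return interpreter, "shebang"

    # 3. Content heuristics taken from the examples in the text.
    if re.search(r"\bMOVE\b", prefix):
        return "COBOL", "heuristic"
    if "#include" in prefix and re.search(r"\bclass\b", prefix):
        return "C++", "heuristic"

    return "unknown", "none"
```

Returning the evidence alongside the domain keeps a weak guess (a heuristic match) distinguishable from a stronger one (a recognized extension), which matters once a confidence level is attached to each classification.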

If the FileInventory tool can find a Semantic Designs file "sniffer" (LexemeExtractor as used by the SD Search Engine), the file content will be scanned by the sniffer to establish a confidence-of-classification level. Files that completely pass sniffers are treated as classified 100% correctly; if a sniffer encounters a small number of errors, the confidence goes down; if the sniffer encounters a large number of errors, the confidence goes close to zero. Absence of a sniffer produces a low confidence in file classification. This confidence level is provided with the file in the inventory list. SD has a wide variety of such file sniffers available, and can easily build more.
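The text does not give the formula that turns sniffer errors into a confidence level, so the following is only one plausible shape for it: an exponential decay in the error rate, with a fixed low value when no sniffer is available. The function classification_confidence, the constant NO_SNIFFER_CONFIDENCE, and the decay parameter are hypothetical.

```python
import math

# Assumed fixed "low confidence" value used when no sniffer exists.
NO_SNIFFER_CONFIDENCE = 0.1


def classification_confidence(errors, tokens, has_sniffer=True, decay=20.0):
    """Map a sniffer's error count to a confidence in the range [0, 1].

    Hypothetical scoring: zero errors gives exactly 1.0, a few errors
    lower the score slightly, and a high error rate drives it toward 0.
    """
    if not has_sniffer or tokens <= 0:
        # No sniffer, or nothing was scanned: no real evidence either way.
        return NO_SNIFFER_CONFIDENCE
    rate = min(errors / tokens, 1.0)
    return math.exp(-decay * rate)
```

With these assumed parameters, 2 errors in 1,000 lexemes keeps the confidence above 0.9, while 900 errors in 1,000 pushes it effectively to zero.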

Sample Inventory Reports

A set of typical reports produced by the FileInventory tool on a development system with over 100,000 files can be seen here:

Other languages

Semantic Designs can build language sniffers for almost any language. Contact us for your special case.