Overview of File System Processing
In this tutorial, I will give an overview of how to delete files and empty folders, including empty sub-folders, while improving system performance
Like my previous two posts, Real-time FileWatcher System Monitor using TPL DataFlow, ASP.NET Web API, SignalR, ASP.NET MVC and Angular JS and Using Spring.NET and Quartz.NET Job Scheduler, I will use asynchronous programming, the Task Parallel Library, TPL Dataflow and the Quartz.NET Job Scheduler
I will start by showing several ways of exploring the file system, and at the end of this tutorial I will talk about performance
Abstract
Our goal is to delete all files matching a criterion (created between two dates, or created n days ago). But we cannot delete a directory while it still contains subdirectories or files, because those subdirectories or files may not match our criterion.
So a better way is to order all directory paths in descending order by name, so that subdirectories are always processed before their parents
Suppose we have this file system
A way to explore our file system can be as follows
I. GET ALL ORDERED DIRECTORIES AND ITERATE THROUGH EACH OF THEM
Let's first create some configuration settings:
- DirectoryToProcess is the parent directory
- DateStart is the start date of the files to process
- DateEnd is the end date of the files to process
- NumberOfKeepingDays is the number of days to keep files: files created more than NumberOfKeepingDays days ago must be deleted
- SearchPattern is a criterion specifying which files or directories must be processed.
If NumberOfKeepingDays is not provided, we process the files created between DateStart and DateEnd. A sample configuration is shown below.
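Here is a sample App.config with illustrative values (the keys are the ones read by the code further down; the dates use the dd/MM/yyyy format it expects):

<appSettings>
  <!-- example values only; adjust to your environment -->
  <add key="DirectoryToProcess" value="Z:\DATA\DumpDir" />
  <add key="DateStart" value="01/01/2015" />
  <add key="DateEnd" value="31/12/2015" />
  <add key="NumberOfKeepingDays" value="30" />
  <add key="SearchPattern" value="*.*" />
</appSettings>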
Let's create a GetAllDirectoriesWays class
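A minimal sketch of such a class, assuming AbstractWays exposes the configuration settings as members and an abstract ProcessStep; MatchesCriteria is a hypothetical helper standing in for the date-range / keep-days check:

using System;
using System.IO;
using System.Linq;

public class GetAllDirectoriesWays : AbstractWays
{
    public GetAllDirectoriesWays(string directoryToProcess, int numberOfKeepingDays,
                                 DateTime dateStart, DateTime dateEnd, string searchPattern)
        : base(directoryToProcess, numberOfKeepingDays, dateStart, dateEnd, searchPattern)
    {
    }

    protected override void ProcessStep()
    {
        // Descending sort by name guarantees Z:\DATA\DumpDir\2\21 is visited
        // before Z:\DATA\DumpDir\2, so children are handled before their parents.
        var directories = Directory.GetDirectories(DirectoryToProcess, "*", SearchOption.AllDirectories)
                                   .OrderByDescending(d => d);

        foreach (var directory in directories)
        {
            foreach (var file in Directory.GetFiles(directory, SearchPattern))
            {
                if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
                    File.Delete(file);
            }

            // Delete the directory only if nothing is left inside it.
            if (!Directory.EnumerateFileSystemEntries(directory).Any())
                Directory.Delete(directory);
        }
    }
}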
II. ITERATE THROUGH ORDERED DIRECTORIES AND PROCESS ITEM BY ITEM
III. HANDLE DIRECTORIES RECURSIVELY
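Recursion gives us the child-before-parent ordering for free: if we recurse into subdirectories before touching the current one, a parent is only examined after all of its children. A minimal sketch, again with MatchesCriteria as a hypothetical helper:

using System.IO;
using System.Linq;

private void ProcessDirectory(string path)
{
    // Post-order traversal: handle all subdirectories first.
    foreach (var subDirectory in Directory.EnumerateDirectories(path))
        ProcessDirectory(subDirectory);

    foreach (var file in Directory.EnumerateFiles(path, SearchPattern))
    {
        if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
            File.Delete(file);
    }

    // By now every child has been processed, so this test is meaningful.
    if (!Directory.EnumerateFileSystemEntries(path).Any())
        Directory.Delete(path);
}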
IV. RUN PROCESSSTEP AS A TASK
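The same ProcessStep can be pushed onto the thread pool so the caller is not blocked; a minimal sketch:

using System;
using System.Threading.Tasks;

// Wrap the synchronous ProcessStep in a task.
var task = Task.Factory.StartNew(() => ProcessStep());

// Observe the outcome when the work finishes.
task.ContinueWith(t =>
{
    if (t.IsFaulted)
        Console.WriteLine("Processing failed: " + t.Exception.InnerException.Message);
    else
        Console.WriteLine("Processing completed.");
});

task.Wait();   // block here only if the caller needs the work done before continuing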
V. RUN PROCESSSTEP AS A TASK WITH CANCELLATION TOKEN AND REPORT PROGRESS
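A sketch of what this could look like, with GetOrderedDirectories and ProcessDirectory as hypothetical helpers: the CancellationToken stops the loop between directories, and IProgress<int> reports how many have been handled:

using System;
using System.Threading;
using System.Threading.Tasks;

public Task ProcessStepAsync(CancellationToken token, IProgress<int> progress)
{
    return Task.Factory.StartNew(() =>
    {
        var processed = 0;
        foreach (var directory in GetOrderedDirectories())   // hypothetical helper
        {
            token.ThrowIfCancellationRequested();   // honour a cancel request between directories
            ProcessDirectory(directory);
            progress.Report(++processed);           // report how many directories are done
        }
    }, token);
}

// Usage:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(count => Console.WriteLine(count + " directories processed"));
var task = ProcessStepAsync(cts.Token, progress);
// cts.Cancel();   // cancel at any time; the task then ends in the Canceled state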
VI. RUN PROCESSSTEP BY LOAD BALANCING USING TPL DATAFLOW
We want to just write the code and have the way we structure it guarantee that there are no synchronization issues, so we don't have to think about synchronization. In this world, each object has its own private thread of execution and only ever manipulates its own internal state.
Instead of one single thread executing through many objects by calling object methods, objects send asynchronous messages to each other.
If the object is busy processing a previous message, the message is queued. When the object is no longer busy, it processes the next message.
Fundamentally, if each object has only one thread of execution, then updating its own internal state is perfectly safe.
TPL Dataflow enables us to achieve this goal with building blocks. A block is essentially a message source, a target, or both. In addition to receiving and sending messages, a block represents an element of concurrency for processing the messages it receives.
Multiple blocks are linked together to produce networks of blocks. Messages are then posted asynchronously into the network for processing.
Let's first create a class that inherits from AbstractWays
Our system works as follows (a sketch of the network appears after this list):
- A TransformBlock transforms each directory path into a DirectoryInfo and posts it as a message to a BufferBlock
- The BufferBlock is linked to processorOne and processorTwo, so if processorOne is busy, processorTwo will process the message
- processorOne and processorTwo are TransformBlocks linked to processDirectoryBlock; processDirectoryBlock is a TransformBlock whose responsibility is to delete the files in the current directory that match the criteria
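Here is a minimal sketch of such a network. MatchesCriteria and orderedDirectoryPaths are hypothetical placeholders; for simplicity the terminal processDirectoryBlock is written as an ActionBlock rather than a TransformBlock. BoundedCapacity = 1 on the processors is what makes the BufferBlock load-balance: a busy processor declines the offered message, so the buffer hands it to the other one.

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var toDirectoryInfo = new TransformBlock<string, DirectoryInfo>(path => new DirectoryInfo(path));
var buffer = new BufferBlock<DirectoryInfo>();

var processorOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1 };
var processorOne = new TransformBlock<DirectoryInfo, DirectoryInfo>(di => di, processorOptions);
var processorTwo = new TransformBlock<DirectoryInfo, DirectoryInfo>(di => di, processorOptions);

var processDirectoryBlock = new ActionBlock<DirectoryInfo>(di =>
{
    foreach (var file in di.GetFiles(SearchPattern))
        if (MatchesCriteria(file))   // hypothetical criteria check
            file.Delete();
});

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
toDirectoryInfo.LinkTo(buffer, linkOptions);
buffer.LinkTo(processorOne, linkOptions);
buffer.LinkTo(processorTwo, linkOptions);
processorOne.LinkTo(processDirectoryBlock);
processorTwo.LinkTo(processDirectoryBlock);

// The terminal block completes only when both processors are done.
Task.WhenAll(processorOne.Completion, processorTwo.Completion)
    .ContinueWith(_ => processDirectoryBlock.Complete());

foreach (var path in orderedDirectoryPaths)   // hypothetical: paths ordered children-first
    toDirectoryInfo.Post(path);
toDirectoryInfo.Complete();
processDirectoryBlock.Completion.Wait();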
Consider the following file system
An execution of the previous code may produce the following output
The folder 3 (Z:\DATA\DumpDir\3) is empty but is not deleted, because the runtime tries to delete 3 before 33.
This is due to the parallelism of directory processing. Even if we wait for all tasks to terminate before executing the directory processing, the previous result may still happen.
So be careful with parallelism.
VII. RUN PROCESSSTEP RECURSIVELY USING TPL DATAFLOW
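One way to make Dataflow recursive is a block that posts the subdirectories it discovers back to itself, with an interlocked counter deciding when the block can complete. A sketch, with MatchesCriteria as a hypothetical helper:

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks.Dataflow;

ActionBlock<string> directoryBlock = null;
var pending = 1;   // directories posted but not yet finished (starts at 1 for the root)

directoryBlock = new ActionBlock<string>(path =>
{
    // Re-post every subdirectory into the same block: the recursion.
    foreach (var subDirectory in Directory.EnumerateDirectories(path))
    {
        Interlocked.Increment(ref pending);
        directoryBlock.Post(subDirectory);
    }

    foreach (var file in Directory.EnumerateFiles(path, SearchPattern))
        if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
            File.Delete(file);

    // Last one out completes the block.
    if (Interlocked.Decrement(ref pending) == 0)
        directoryBlock.Complete();
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

directoryBlock.Post(DirectoryToProcess);
directoryBlock.Completion.Wait();
// Note: empty directories still need a second, ordered pass to be removed,
// because a parent may be processed here before its children.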
VIII. RUN PROCESSSTEP USING PRODUCER CONSUMER
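A classic BlockingCollection pipeline fits here: one producer enumerates candidate files while several consumers delete them. A sketch with an arbitrary choice of four consumers, MatchesCriteria again being hypothetical:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

var queue = new BlockingCollection<string>(boundedCapacity: 1000);

// Producer: feed candidate files into the bounded queue.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var file in Directory.EnumerateFiles(DirectoryToProcess, SearchPattern, SearchOption.AllDirectories))
        queue.Add(file);
    queue.CompleteAdding();   // tell consumers no more items will arrive
});

// Consumers: several tasks drain the queue concurrently.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Factory.StartNew(() =>
{
    foreach (var file in queue.GetConsumingEnumerable())
        if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
            File.Delete(file);
})).ToArray();

Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());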
IX. RUN PROCESSSTEP USING ASYNCHRONOUS PARALLEL PROCESSING
X. OPTIMIZATION
Before optimizing, let's analyse the results first.
We need to run all the concrete classes on the same file system, so we use preprocessor directives to simulate the deletion of files and directories, as sketched below. We assume that our system needs 3 milliseconds to delete a file; you can increase or decrease this value according to your use case.
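A minimal sketch of the idea, assuming a conditional compilation symbol (here called SIMULATE, a name chosen for illustration):

private void DeleteFile(string path)
{
#if SIMULATE
    // Benchmark mode: pretend the delete costs ~3 ms instead of touching the disk,
    // so every concrete class runs against the same, intact file system.
    System.Threading.Thread.Sleep(3);
#else
    System.IO.File.Delete(path);
#endif
}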
We used an abstract class to define the skeleton of the algorithm in an operation (ProcessStep), letting subclasses redefine the steps of the algorithm without changing the algorithm's structure (the Template Method pattern).
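A minimal sketch of what AbstractWays might look like; the field names and the timing in Execute are assumptions based on how the class is used below:

using System;
using System.Diagnostics;

public abstract class AbstractWays
{
    protected readonly string DirectoryToProcess;
    protected readonly int NumberOfKeepingDays;
    protected readonly DateTime DateStart;
    protected readonly DateTime DateEnd;
    protected readonly string SearchPattern;

    protected AbstractWays(string directoryToProcess, int numberOfKeepingDays,
                           DateTime dateStart, DateTime dateEnd, string searchPattern)
    {
        DirectoryToProcess = directoryToProcess;
        NumberOfKeepingDays = numberOfKeepingDays;
        DateStart = dateStart;
        DateEnd = dateEnd;
        SearchPattern = searchPattern;
    }

    // Template method: fixed skeleton, variable step.
    public void Execute()
    {
        var stopwatch = Stopwatch.StartNew();
        ProcessStep();   // each subclass supplies its own strategy
        stopwatch.Stop();
        Console.WriteLine(GetType().Name + " took " + stopwatch.ElapsedMilliseconds + " ms");
    }

    protected abstract void ProcessStep();
}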
We instantiate the concrete classes like this
private static void Main(string[] args)
{
    var DirectoryToProcess = ConfigurationManager.AppSettings["DirectoryToProcess"];
    if (!Directory.Exists(DirectoryToProcess))
    {
        throw new Exception("The directory to process cannot be found");
    }

    DateTime DateStart;
    DateTime.TryParseExact(ConfigurationManager.AppSettings["DateStart"],
        "dd/MM/yyyy",
        CultureInfo.InvariantCulture,
        DateTimeStyles.None,
        out DateStart);
    if (DateStart == DateTime.MinValue)
    {
        throw new Exception("The start date is invalid");
    }

    DateTime DateEnd;
    DateTime.TryParseExact(ConfigurationManager.AppSettings["DateEnd"],
        "dd/MM/yyyy",
        CultureInfo.InvariantCulture,
        DateTimeStyles.None,
        out DateEnd);
    if (DateEnd == DateTime.MinValue)
    {
        throw new Exception("The end date is invalid");
    }

    var SearchPattern = ConfigurationManager.AppSettings["SearchPattern"];

    int numberOfKeepingDays;
    int.TryParse(ConfigurationManager.AppSettings["NumberOfKeepingDays"], out numberOfKeepingDays);

    AbstractWays concreteWays = new GetAllDirectoriesWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new IterateDirectoriesWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new RecursiveDirectoryWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new TaskFactoryWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new AsynchronousWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new AsynchronousParallelProcessFilesWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new ProducerConsumersWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new DirectoryLoadBalancerWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    concreteWays = new RecursiveDataflowWays(DirectoryToProcess, numberOfKeepingDays, DateStart, DateEnd, SearchPattern);
    concreteWays.Execute();

    Console.Read();
}
An execution of the code above may produce the following output
For our test, we generated 360,000 files in 37 folders
Directory.GetDirectories (or Directory.GetFiles) vs Directory.EnumerateDirectories (or Directory.EnumerateFiles)
When we use EnumerateDirectories (or EnumerateFiles), we can start enumerating the collection before the whole collection is returned.
But when we use GetDirectories (or GetFiles), we must wait for the whole array to be returned before we can access it.
Therefore, when we work with many files and directories, EnumerateFiles can be more efficient.
But in a broadcast system, where files arrive continuously, it is better to first take a snapshot of all the files to process (by using GetDirectories or GetFiles) so that the latest files are ignored (in our case, the latest files will not be deleted).
Using EnumerateDirectories or EnumerateFiles, the latest files may be processed, because while the system is processing an item, new items can be added to the directories, and therefore to the collection.
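The contrast in a few lines (root is a placeholder for the directory being scanned):

using System.Collections.Generic;
using System.IO;

// GetFiles returns only once the whole array has been built: a fixed snapshot.
string[] snapshot = Directory.GetFiles(root, "*.*", SearchOption.AllDirectories);

// EnumerateFiles yields lazily: iteration starts immediately, but files
// created while we iterate may also show up in the sequence.
IEnumerable<string> live = Directory.EnumerateFiles(root, "*.*", SearchOption.AllDirectories);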
Using Parallelism
We cannot parallelize directory processing, because the runtime may start deleting C:\DATA\DumpDir\2 before C:\DATA\DumpDir\2\21.
In that case, C:\DATA\DumpDir\2\21 would be deleted but not C:\DATA\DumpDir\2.
We could wait to process C:\DATA\DumpDir\2 until C:\DATA\DumpDir\2\21 has been processed, but at that moment we do not yet know whether C:\DATA\DumpDir\2\21 will be deleted or not (i.e., whether it still contains files or subdirectories).
So be careful with parallelism.
First, we need to ask whether a function can be parallelized at all.
Consider the algorithm to calculate the Fibonacci numbers
(1, 1, 2, 3, 5, 8, 13, 21, etc.). The next number in the sequence is the sum of the previous two numbers.
Therefore, to calculate the next number, we must have already calculated the previous two. This algorithm is inherently
sequential, so as much as we may try, it cannot be parallelized.
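The dependency is visible in the loop itself:

// Each iteration needs the results of the two previous ones,
// so there is no independent work to hand out to other threads.
static long Fibonacci(int n)
{
    long previous = 1, current = 1;
    for (var i = 2; i < n; i++)
    {
        var next = previous + current;   // depends on both earlier values
        previous = current;
        current = next;
    }
    return current;
}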
On the other hand, we can parallelize file processing, because files have no dependencies between them.
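So we keep the directory walk sequential and parallelize only the per-directory file deletions. A sketch, with orderedDirectories and MatchesCriteria as hypothetical placeholders:

using System.IO;
using System.Linq;
using System.Threading.Tasks;

foreach (var directory in orderedDirectories)   // directories stay sequential: children before parents
{
    // Files inside one directory are independent, so they can be deleted in parallel.
    Parallel.ForEach(Directory.EnumerateFiles(directory, SearchPattern), file =>
    {
        if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
            File.Delete(file);
    });

    if (!Directory.EnumerateFileSystemEntries(directory).Any())
        Directory.Delete(directory);
}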
An execution of the code above may produce the following output
Using GetFiles instead of EnumerateFiles
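Following the snapshot argument above, we can swap the lazy enumeration for an up-front array; this is the only change to the previous sketch (MatchesCriteria remains a hypothetical helper):

// Snapshot the files first, so items that arrive during processing are ignored.
var files = Directory.GetFiles(directory, SearchPattern);   // was: Directory.EnumerateFiles(...)
Parallel.ForEach(files, file =>
{
    if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
        File.Delete(file);
});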
An execution of the code above may produce the following output
Using Task.Factory.StartNew
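Another variant is to start one task per file with Task.Factory.StartNew and wait for the batch; again a sketch built on the same hypothetical helpers:

using System.IO;
using System.Linq;
using System.Threading.Tasks;

// One task per file; the directory loop itself stays sequential.
var tasks = Directory.GetFiles(directory, SearchPattern)
                     .Select(file => Task.Factory.StartNew(() =>
                     {
                         if (MatchesCriteria(new FileInfo(file)))   // hypothetical criteria check
                             File.Delete(file);
                     }))
                     .ToArray();
Task.WaitAll(tasks);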
An execution of the code above may produce the following output
XI. RUN PROCESS AS A JOB USING QUARTZ.NET JOB SCHEDULER
Please take a look at my previous post Using Spring.NET and Quartz.NET Job Scheduler.
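As a rough sketch of how the processing could be wrapped in a job (using the Quartz.NET 2.x synchronous API; the job and trigger names, and the nightly cron expression, are illustrative):

using Quartz;
using Quartz.Impl;

public class FileCleanupJob : IJob
{
    public void Execute(IJobExecutionContext context)
    {
        // Run the chosen strategy inside the scheduled job,
        // reading the same settings as in Main.
        AbstractWays ways = new RecursiveDirectoryWays(/* ... same arguments as in Main ... */);
        ways.Execute();
    }
}

// Schedule the job, e.g. every night at 2 AM.
var scheduler = StdSchedulerFactory.GetDefaultScheduler();
scheduler.Start();

var job = JobBuilder.Create<FileCleanupJob>().WithIdentity("fileCleanupJob").Build();
var trigger = TriggerBuilder.Create()
                            .WithIdentity("fileCleanupTrigger")
                            .WithCronSchedule("0 0 2 * * ?")
                            .Build();

scheduler.ScheduleJob(job, trigger);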
Source code will be available soon