The devil is in the details
Let us consider a real-life case in which we have a lot of CSV files in one specific directory. For example, we may have transaction files (containing cash flow tables, date schedule tables and parameters) produced daily by a front office system batch. These files are then used as a feed for third-party analytical software performing some intensive calculations. The only problem is that some of the information in the feed files is not in a form recognizable to the third-party software. In a nutshell, the feed files may contain specific strings which should be replaced with other strings.
The next question is how to perform this operation quickly and in a safe manner. Needless to say, there are pretty much as many approaches as there are people performing this heartbreaking operation. Personally, I have used
- a find-and-replace strategy with Notepad++, until the number of files to be processed and parameters to be replaced grew a bit too large.
- a custom batch script, until I realized that the script was sometimes not working as expected, either leaving some cases out of the replacement operations or producing incorrect replacements. The scariest part was that the script was actually working well most of the time.
- a custom PowerShell script (created by someone else), which used hard-coded configurations for all replacement strings and all strings to be replaced. All configurations (hosted in the source code file) needed to be set in a specific order. This actually worked well up to the point where those hard-coded configurations had to be changed to correspond to changes made in other systems. Moreover, execution times were a bit too high.
Finally, I decided to create my own program for handling this task.
United we fall, divided we stand
After learning a bit about parallel algorithms, I soon realized that this kind of scheme would be a suitable candidate for parallelism, since the tasks are completely independent of each other: parallel execution should improve program execution speed. Each source file (containing transaction information) can be processed separately, and all key-value pairs (string to be replaced and its corresponding replacement string) can be stored as string pairs in a concurrency-safe concurrent unordered map.
This header file consists of free functions for all required file handling operations.
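As a rough sketch of such free functions (the function names are my own, not necessarily those used in the actual header file), reading a whole file into a string and writing it back could look like this:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read the entire content of a file into a single string.
std::string readFileToString(const std::string& filePath) {
    std::ifstream in(filePath);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Overwrite a file with the given string content.
void writeStringToFile(const std::string& filePath, const std::string& content) {
    std::ofstream out(filePath, std::ios::trunc);
    out << content;
}
```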
First, there has to be a directory containing all the source files which are going to be processed by the file processing program. There also has to be a specific file containing the key-value pairs (key = string to be found and replaced, value = replacement string) for all desired string replacement cases. The main program creates file processors into a vector container (there will be as many processors as there are source files to be processed) and then executes all of them. A single file processor first reads the source file content into a string, loops through all key-value pairs checking whether any occurrences are found, performs the string replacements and finally writes the modified content back to the original source file.
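The replacement step of a single file processor can be sketched as follows (a simplified, sequential illustration; the function names are my own, and I use a plain std::unordered_map here since the pairs are only read during processing):

```cpp
#include <string>
#include <unordered_map>

// Replace every occurrence of 'from' with 'to' inside 'text'.
void replaceAll(std::string& text, const std::string& from, const std::string& to) {
    if (from.empty()) return;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size(); // continue after the inserted text to avoid infinite loops
    }
}

// One file processor pass: apply every key-value replacement to the file content.
std::string processContent(std::string content,
                           const std::unordered_map<std::string, std::string>& pairs) {
    for (const auto& kv : pairs)
        replaceAll(content, kv.first, kv.second);
    return content;
}
```

Note that advancing the search position past the freshly inserted text is important: without it, a replacement value containing its own key would loop forever.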
In this particular case, a picture is worth more than a thousand words.
A : arguments for the executable to be used with the command prompt
B : directory for the key-value pairs file
C : the content of the key-value pairs file
D : directory for the source files.
Finally, thanks a lot again for reading this blog. -Mike