February 16, 2018

Convert txt files to csv with Golang

Recently I had a client who needed to convert a set of txt files to csv - and he wanted the solution to be written in Go. Looking at the source files, I assume they are generated by a script or a tool that extracts the data from somewhere. The good thing is that the provided files were very simple, which makes them perfect for a tutorial like this.

The example repo can be found HERE.

Converting txt files to another format, especially a simple one like csv, is a standard beginner programming problem. While solving it for my client, I couldn't find a good tutorial on how to do this in Go (only a few code snippets), so here's one.

The files provided to me by the client (the ones found in /testdata) were encoded in UTF-16. Go strings are assumed to hold UTF-8, so the standard library's strings package (trimming, splitting) didn't behave well with the provided text files. Therefore, I stripped the extra spaces and stray characters using a regex.
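For comparison, here is a minimal sketch of decoding UTF-16 with the standard library before using the strings package. It is not what the tool in this post does. It assumes the input is little-endian (with an optional byte order mark), and the helper name decodeUTF16LE and the sample line are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
	"unicode/utf16"
)

// decodeUTF16LE converts little-endian UTF-16 bytes into a Go (UTF-8) string.
// Assumptions: the input is little-endian; a leading byte order mark is
// dropped; a trailing odd byte is ignored.
func decodeUTF16LE(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	if len(units) > 0 && units[0] == 0xFEFF {
		units = units[1:] // skip the byte order mark
	}
	return string(utf16.Decode(units))
}

func main() {
	// Byte order mark followed by "  12 .go" encoded as UTF-16LE.
	raw := []byte{0xFF, 0xFE, ' ', 0, ' ', 0, '1', 0, '2', 0, ' ', 0, '.', 0, 'g', 0, 'o', 0}
	line := decodeUTF16LE(raw)
	fmt.Printf("%q\n", strings.TrimSpace(line)) // "12 .go"
}
```

Once the text is decoded, strings.TrimSpace and strings.Split behave as expected, at the cost of reading and converting the bytes yourself.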

Everything is done using the standard library only - which is a great asset in Go. Coming from Java, it feels so good not needing third-party dependencies, even for the simplest things.

The code with tests is available on GitHub. Below are some simple comments to get you through it.

The first part of this tool reads two flags - the path to the text files and the name of the resulting csv. Flags are first-class citizens in Go, and it's quite easy to use them for basic types (booleans, integers, and strings).

var (
    // First parameter to flag.String ("p") is the flag's name on the command line
    // For example, -p=/users/ribice/code/txtfiles/
    // The second parameter holds the default value, if flag is not passed
    // The third parameter is flag's description, shown when -help is invoked.
    path        = flag.String("p", "./", "Directory path where files are located")
    csvFileName = flag.String("d", "result", "Resulting csv file name")
)
// flag.Parse parses the flags, and saves them in corresponding variables
flag.Parse()
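If you want to exercise the flag handling without touching the real command line, one option is a dedicated flag.FlagSet. The sketch below assumes the same two flags as above; the options struct, the parseArgs helper and the "txt2csv" name are hypothetical and not taken from the repo.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the parsed flag values (hypothetical helper type).
type options struct {
	dir string // value of -p
	out string // value of -d
}

// parseArgs parses args with its own FlagSet instead of the global one,
// which makes it callable with any slice of arguments.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("txt2csv", flag.ContinueOnError)
	var o options
	fs.StringVar(&o.dir, "p", "./", "Directory path where files are located")
	fs.StringVar(&o.out, "d", "result", "Resulting csv file name")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println(o.dir, o.out)
}
```

Both "-p testdata" and "-p=testdata" are accepted by the flag package.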

The flag parameters are saved as pointers. Since I was using the path variable multiple times, I copied its value for ease of use.

The files are later opened by concatenating the directory and the file name, so the directory needs a trailing path separator (ioutil.ReadDir() itself works with or without one). Go has a built-in filepath.Separator constant, which holds the correct separator for the OS the program is running on.

directory := *path
if !strings.HasSuffix(directory, string(filepath.Separator)) {
    directory += string(filepath.Separator)
}
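An alternative to appending the separator by hand is filepath.Join, which inserts the separator and cleans the result. This is only a sketch; the fullPath helper is a made-up name, and the post's code uses plain concatenation.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// fullPath joins a directory and a file name using the OS-specific separator.
// filepath.Join cleans the result, so a trailing separator on dir is harmless.
func fullPath(dir, name string) string {
	return filepath.Join(dir, name)
}

func main() {
	// Both calls yield the same path, with or without the trailing separator.
	fmt.Println(fullPath("testdata", "a.txt"))
	fmt.Println(fullPath("testdata/", "a.txt"))
}
```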

Next step is reading the files. I achieve that using ioutil.ReadDir():

// ReadDir reads all files and folders from given path and returns []os.FileInfo
files, err := ioutil.ReadDir(directory)
if err != nil {
    log.Fatal(err)
}

Next, filter ReadDir's result to include only .txt files and save their names in a slice of strings. os.FileInfo has the helper methods IsDir and Name, which are used to filter out directories and files that don't have the .txt extension.

func filterTxtFiles(f []os.FileInfo) []string {
    var txtFiles []string
    for _, v := range f {
        // Ignore all directories, used to skip *.txt named directories
        if !v.IsDir() {
            // Append only files with .txt extension
            if filepath.Ext(v.Name()) == ".txt" {
                txtFiles = append(txtFiles, v.Name())
            }
        }
    }
    return txtFiles
}
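On Go 1.16 and later, os.ReadDir is the suggested replacement for ioutil.ReadDir. It returns []os.DirEntry, which also has IsDir and Name, so the same filter carries over. The sketch below is an assumption-laden variant, not the repo's code: txtNames and makeDemoDir are hypothetical helpers, and the demo directory exists only to show the filter in action.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// txtNames returns the names of non-directory entries in dir ending in ".txt".
func txtNames(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir) // entries come back sorted by file name
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() && filepath.Ext(e.Name()) == ".txt" {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

// makeDemoDir creates a temporary directory holding a.txt, b.csv and a
// directory named c.txt, so the filter has something to reject.
func makeDemoDir() (string, error) {
	dir, err := os.MkdirTemp("", "txtdemo")
	if err != nil {
		return "", err
	}
	for _, name := range []string{"a.txt", "b.csv"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("x"), 0o644); err != nil {
			return "", err
		}
	}
	if err := os.Mkdir(filepath.Join(dir, "c.txt"), 0o755); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := makeDemoDir()
	if err != nil {
		log.Fatal(err)
	}
	names, err := txtNames(dir)
	os.RemoveAll(dir) // clean up before checking the error
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names) // [a.txt]
}
```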

Once the list of filenames is ready, I proceed to read their content and save it to a new csv file.

As the provided text files were encoded in UTF-16, the standard string functions did not work as expected (they assume UTF-8). If this wasn't the case, it would be easier to trim the extra spaces using the Trim or TrimSpace function from the strings package. For this case, as usual, a regex solves the problem.

// Regex matching every run of characters other than letters, digits, and dot
reg, err := regexp.Compile("[^a-zA-Z0-9.]+")
if err != nil {
    log.Fatal(err)
}
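To see why this pattern copes with UTF-16 input, here is a small sketch. It assumes a line such as "12 .go" (count and extension) stored as UTF-16LE, where every ASCII character is followed by a zero byte; the sample line and the clean helper are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// unwanted matches every run of characters that is not a letter, digit or dot.
var unwanted = regexp.MustCompile("[^a-zA-Z0-9.]+")

// clean removes spaces, zero bytes and other special characters from a line.
func clean(line string) string {
	return unwanted.ReplaceAllString(line, "")
}

func main() {
	// "12 .go" as UTF-16LE bytes: each ASCII byte is followed by 0x00.
	raw := "1\x002\x00 \x00.\x00g\x00o\x00"
	fmt.Printf("%q\n", clean(raw)) // "12.go"
}
```

The zero bytes fall outside the allowed character class, so they are removed together with the spaces. The trade-off is that any non-ASCII text in the files would be dropped as well.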

Now, create the csv file using os.Create():

// Create the resulting csv file, named after the -d flag (result.csv by default)
csvFile, err := os.Create(*csvFileName + ".csv")
if err != nil {
    log.Fatalf("Cannot create file: %v", err)
}
defer csvFile.Close()

Go's standard library contains the encoding/csv package, which simplifies working with csv files. Here I use a csv.Writer (created with csv.NewWriter(csvFile)) and its Flush method to write the data from the text files to the csv.

csvWriter := csv.NewWriter(csvFile)
defer csvWriter.Flush()
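csv.Writer buffers its output, so nothing is guaranteed to reach the underlying writer until Flush is called, and write errors surface through Error. Here is a minimal sketch that writes to an in-memory buffer instead of a file; the toCSV helper and the sample rows are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// toCSV writes rows through a csv.Writer into memory and returns the text.
func toCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	for _, row := range rows {
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush() // push buffered data to buf
	if err := w.Error(); err != nil {
		return "", err // reports any error from Write or Flush
	}
	return buf.String(), nil
}

func main() {
	out, err := toCSV([][]string{{"report", ".go", "12"}, {"report", "", "3"}})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
	// report,.go,12
	// report,,3
}
```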

Once the csv file is created, it’s time to write to it. Loop through all the files, and open them using os.Open(). Alternatively, ioutil’s ReadFile() could’ve been used.

file, err := os.Open(directory + v)
if err != nil {
    log.Fatalf("Cannot open file %s, due to error %v", v, err)
}
// Release the file handle when the surrounding function returns
defer file.Close()
// Get fileName, minus the dot and extension
fileName := []string{v[:len(v)-4]}

To actually read the file contents, I've used bufio's Scanner. Although it is less common in Go, an alternative could be used here as well (the ReadLine method of bufio.Reader - albeit in a slightly different way).

scanner := bufio.NewScanner(file)
// Skip first three title rows in every file
for i := 0; i < 3; i++ {
    scanner.Scan()
}
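The header-skipping loop can be tried on an in-memory string. The sketch below assumes three title lines followed by data rows; the skipHeader helper and the sample input are hypothetical. Unlike the snippet above, it checks the result of Scan so that a short file is detected.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// skipHeader returns the lines of input that follow the first n lines.
func skipHeader(input string, n int) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	for i := 0; i < n; i++ {
		if !scanner.Scan() {
			// Fewer than n lines, or a read error: nothing left to return.
			return nil, scanner.Err()
		}
	}
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines, scanner.Err()
}

func main() {
	input := "Title\n-----\nCount Ext\n12 .go\n3 .md\n"
	lines, err := skipHeader(input, 3)
	fmt.Println(lines, err) // [12 .go 3 .md] <nil>
}
```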

Finally, I loop through all the remaining lines in the file and strip the extra spaces before writing the data to the previously created writer.

for scanner.Scan() {
    // Replace all spaces and special characters with empty string using regex's ReplaceAllString() function.
    trimStr := reg.ReplaceAllString(scanner.Text(), "")
    // Skip empty lines
    if len(trimStr) > 0 {
        var row []string
        // Split into two parts by dot between them
        content := strings.Split(trimStr, ".")
        // The else branch handles lines without an extension name.
        if len(content) == 2 {
            row = append(row, "."+content[1], content[0])
        } else {
            row = append(row, "", content[0])
        }
        // csv writer expects a slice of strings.
        // Therefore I created a new slice of strings including:
        // - name of the file (as requested by client)
        // - row data (count and extension)
        data := append(fileName, row...)
        err := csvWriter.Write(data)
        if err != nil {
            log.Fatalf("Cannot write to csv file %v", err)
        }
    }
}
if err := scanner.Err(); err != nil {
    log.Fatal(err)
}
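The row-building step can be pulled into a small function, which makes the column order easy to check. This is a sketch mirroring the loop above; buildRow and the sample values are hypothetical and not part of the repo.

```go
package main

import (
	"fmt"
	"strings"
)

// buildRow turns a cleaned line such as "12.go" into a csv record of
// {file name, extension, count}. Lines without exactly one dot are treated
// as having no extension, as in the loop above.
func buildRow(name, cleaned string) []string {
	parts := strings.Split(cleaned, ".")
	if len(parts) == 2 {
		return []string{name, "." + parts[1], parts[0]}
	}
	return []string{name, "", parts[0]}
}

func main() {
	fmt.Printf("%q\n", buildRow("report", "12.go")) // ["report" ".go" "12"]
	fmt.Printf("%q\n", buildRow("report", "7"))     // ["report" "" "7"]
}
```

Note that a line with more than one dot also ends up in the second branch, so only the text before the first dot is kept.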

In the end, the tool produces a new csv file that contains the client's expected output, making both me and the client quite happy.

2018 © Emir Ribic - Some rights reserved; please attribute properly and link back. Code snippets are MIT Licensed
