Randomise column in CSV using awk

21 March 2019 baxters Comments 0 Comment

In packaging our learning analytics automation system I have provided test data in CSV files on the Virtualbox machine.

These CSV files contained actual student ID numbers from our College so I needed to replace them with random, non-repeating information. I have substituted these with Student(n) to represent the real students for this post.

Student1,0,0,0,0,0,0,0,0,0,0,0,0,100,100,100,0,Class_1
Student2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Class_1
Student3,0,0,0,0,0,0,1,0,0,0,0,0,100,100,100,0,Class_1
Student4,0,0,0,0,0,0,1,0,0,0,0,0,100,100,100,0,Class_1

I found an excellent AWK entry on the evergreen Stack Overflow which fit my particular use case.

awk -F", " -v OFS=", " '
    NR > 1 {$3 = ($3 in r ? r[$3] : r[$3]=int(rand()*100000000))} 
    {print}
' file

This is very close to what I need to do. The script above will replace the 3rd column in the CSV with a random number based on the value in column 3 so that if the same value was discovered again, the random* number would be the same.

The first entry is the provided file separator -F”, ” this is close but my file uses “,” (no space). The next entry -v OFS sets the target separator (OFS is an internal AWK variable). I set this to “,” because I don’t want that to change.

The next line NR > 1 basically says “for every line after line 1 do this”

The next part in the {} produces a random number between 1 and 100000000 keyed on the value of column 3 and sets the value of column 3 to the random number. {$3= ($3 in r ? r[$3] : r[$3]=int(rand()*100000)) }

That’s not much use for me because my column 3 is 0 usually so I am keying on the value of the studentid in column 1. This will prevent the same ‘random’ number from appearing because all of the student IDs are (should be!) unique.

At the end is the input filename. The result is shown below.

awk -F"," -v OFS="," 
 'NR > 1 {$1= ($1 in r ? r[$1] : r[$1]=int(rand()*100000000))  } {print}'  
Desktop/Analytics_20167_Week_3_Historic_HE.csv

I have tested having a repeating data row and the ‘random’ number generated is successfully keyed.

84018771,0,0,0,0,0,0,0,0,0,0,0,0,100,100,100,0,Class_1
39438292,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Class_1
78309922,0,0,0,0,0,0,1,0,0,0,0,0,100,100,100,0,Class_1
79844003,0,0,0,0,0,0,1,0,0,0,0,0,100,100,100,0,Class_1

*Random numbers are not random in awk so it would be possible for two data entries to have the same number were it not for the keying part.

Leave a Reply Cancel reply