The problem

Every semester, our university gets a lot of new students coming in. Usually, we get some kind of CSV from every school. We figure out the format of the CSV they are using, then either use an existing Python script or write a new one to import the students into the system.

Swarmed by Python scripts

Because everyone seems to have their own idea when it comes to spreadsheet format, pretty much every time I have to write a new script to deal with whatever CSV I’m importing.

After a while I find myself swarmed by dozens of Python scripts… and they are mostly only slightly different, because, you know, different CSV formats, different requirements, etc. But those scripts are pretty simple, so that’s not much of a problem. They really just parse the data in the CSV, then make a POST request to create/update the students’ group memberships.

But I don’t really want to write code… because then I’d have to debug the script, write tests, make test cases, write documentation, etc… and it’s often easy to make a mistake somewhere. Not to mention the headache of dealing with concurrency, handling errors, and other stuff.

Oh yeah. BTW, the scripts are all single-threaded. Each one loops through every single row in the CSV and updates the records on the server one after another. Not really that slow, but it can certainly be improved.

As lazy as I am, I recall seeing a post called Taco Bell Programming by Ted Dziuba… which is about using the simplest tool set to solve complex problems.

🌮🔔 programming and *nix magic ✨

Good old xargs

Finally fed up with all the chores, I decide to apply the techniques.

What I end up with is a dead-simple Python script, which really just makes a POST request to a hard-coded endpoint, with some arguments that I don’t want to have to type every single time.

I have this script, which takes just 2 arguments:

  • student email (string)
  • the ID of the group we’re adding them to (int)

Now, I’m not great at Linux. I just Google stuff a lot and (hopefully) eventually remember them. Let’s just see some CSVs.

CSV example #1

Say I have a CSV that only has 1 column, without a header, like this:

[email protected]
[email protected]
[email protected]

I know the group ID the users need to be added to, because my boss told me. Say it’s group 12345.

cat ./import.csv | tr '\n' '\0' | xargs -0 -n1 -P32 -I email ./ email 12345

Pretty simple, and we are running 32 processes in parallel.

32 concurrent parallel parsing processes and zero bullshit to manage. Requirement satisfied.

Taco Bell Programming (2010) by Ted Dziuba
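
Before pointing a pipeline like this at the real server, a dry run is cheap insurance. Here's a rough sketch of one, with a few assumptions: the script name `./add_to_group.py` and the `example.edu` addresses are made up for illustration, `echo` prints each command instead of running it, and `-P` is left out so the output order stays predictable.

```shell
# Dry run of the one-column import: print the commands instead of running them.
# ./add_to_group.py and the example.edu addresses are hypothetical stand-ins.
preview_one_column() {
  # $1 = CSV file with one email per line, $2 = group ID
  tr '\n' '\0' < "$1" | xargs -0 -I email echo ./add_to_group.py email "$2"
}

tmp=$(mktemp)
printf 'alice@example.edu\nbob@example.edu\n' > "$tmp"
preview_one_column "$tmp" 12345
rm -f "$tmp"
```

Once the printed commands look right, dropping the `echo` and adding `-P32` back turns the preview into the real thing.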

CSV example #2

This one is a little tricky. We got this nasty header row. It’s also got 2 comma-separated columns. We also have different groups they want the students to be added to.

[email protected],11111
[email protected],22222
[email protected],22222

No sweat, just sed the crap out 💩

sed 1d ./import.csv | tr '\n' '\0' | tr , '\0' | xargs -0 -n2 -P32 ./

💥 BAM, requirement satisfied, again.
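
To see how `-n2` pairs things up, here's a dry-run sketch of the same idea. The header names, the addresses, and the script name `./add_to_group.py` are all assumptions for illustration; `echo` just prints what would be executed.

```shell
# Dry run of the two-column import. The header names, addresses and
# ./add_to_group.py are hypothetical; echo prints what would be executed.
preview_two_columns() {
  # $1 = CSV file with a header row, then "email,group_id" rows
  sed 1d "$1" | tr '\n' '\0' | tr , '\0' | xargs -0 -n2 echo ./add_to_group.py
}

tmp=$(mktemp)
printf 'email,group_id\nalice@example.edu,11111\nbob@example.edu,22222\n' > "$tmp"
preview_two_columns "$tmp"
rm -f "$tmp"
```

After `sed 1d` drops the header, both commas and newlines become NUL separators, so `xargs` sees a flat stream of tokens and hands them to the command two at a time. That only holds up while no field contains a comma or a line break of its own; a quoted CSV field would throw the pairing off.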

CSV example #3

This is a small variation of example #2, but all the email addresses are missing the domain part, leaving just the usernames. It’s implied to be the university domain anyway.


Let’s see. I’m not great at seding, so here I’m just going to use another xargs with echo

sed 1d ./import.csv | tr ',' '\0' | xargs -0 -n1 -I username echo [email protected] | tr '\n' '\0' | xargs -0 -n2 ./

That works, but looks rather hacky. I quickly looked up Stack Overflow, then nixCraft, then finally the man page of sed, and tried again.

sed -e 1d -e s/,/,/ ./import.csv | tr ',' '\0' | tr '\n' '\0' | xargs -0 -n2 -P32 ./

🔥 Works like a charm!
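
Here's a dry-run sketch of the sed trick, assuming a hypothetical domain `example.edu`, a made-up header row, and a made-up script name `./add_to_group.py`. The second sed expression swaps the first comma on each line for the domain plus a comma, which turns a bare username into a full address.

```shell
# Dry run of the username-only import. example.edu, the header names and
# ./add_to_group.py are hypothetical; echo prints what would be executed.
preview_with_domain() {
  # $1 = CSV file with a header row, then "username,group_id" rows
  sed -e 1d -e 's/,/@example.edu,/' "$1" \
    | tr ',' '\0' | tr '\n' '\0' \
    | xargs -0 -n2 echo ./add_to_group.py
}

tmp=$(mktemp)
printf 'username,group_id\nalice,11111\nbob,22222\n' > "$tmp"
preview_with_domain "$tmp"
rm -f "$tmp"
```

Without the `g` flag, `s///` only touches the first match per line, so any later commas on the row are left alone.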

More work done with less coding

Now I more or less understand what Ted Dziuba was talking about in his Taco Bell Programming, when it comes to liability.

Functionality is an asset, but code is a liability. […] Every time you write code or introduce third-party services, you are introducing the possibility of failure into your system.

Taco Bell Programming (2010) by Ted Dziuba

Really appreciate the inspiration.