Thursday, May 04, 2006

Normalizing a Column

Recently I joined a networking group, and I received a spreadsheet containing their membership rolls. Of course the first thing I wanted to do was import it into MySQL. Then someone in the forums asked about the very same thing, so I thought I'd document this process here.

The first challenge was the fact that the membership list was on three separate worksheets, and each had a different column layout. I saved the worksheets into text files and imported them into Microsoft Access since it has the facility to create tables using columns derived from text in the first row of the file. I then imported these tables into a MySQL database.

From there I was able to juxtapose the columns correctly and merge these tables into one, with the addition of a new column indicating which worksheet each row originally came from. Then the fun began. There were two columns, one for current employer and another with a list of past employers.

Present employer was pretty easy. I created two additional tables, `companies` and `people_companies`, using the following two statements to initially populate them:
INSERT INTO companies (name)
SELECT company FROM people GROUP BY company;

INSERT INTO people_companies (personid, companyid)
SELECT p.id, c.id
FROM people AS p
JOIN companies AS c ON p.company = c.name;
I then created a stored function with a cursor to process the PriorCompanies column from the People table. This field presented a real challenge. People had entered data into this field any old way, by-and-large using commas as separators, with the occasional slash for good measure. So I replaced these characters with vertical bars, did some special handling of ", Inc" and then looked in the docs for a good element extracting function.

What I was looking for was something like the STR$ELEMENT or F$ELEMENT function I remember from my VMS days. F$ELEMENT is a DCL lexical function, F$ELEMENT(index,delimiter,string) which parses string into pieces based on delimiter as a separator, and returns the index'th part of it. I couldn't find one so I wrote my own:
create function element (idx int unsigned, del char(1),
str varchar(255)) returns varchar(255)
loop begin
declare pos int default 1;

set pos = locate(del, str, pos);
if idx = 0 then
if pos > 0 then
set str = left(str, pos-1);
end if;
return str;
end if;
if pos > 0 then
set str = mid(str, pos+1);
else
set str = NULL;
end if;
set idx = idx - 1;
end; end loop;;
Then all that remained was to create a stored procedure with a cursor and a couple of loops to pull out each prior company name and insert rows as necessary to the companies and people_companies tables:
create procedure normalize_prior () begin
declare done tinyint default 0;
declare c1 cursor for
SELECT Id, PriorCompanies
FROM people
WHERE PriorCompanies IS NOT NULL;

open c1;
loop1: repeat begin
declare pc varchar(255);
declare idx int default 0;
declare co varchar(255);
declare cid int;
declare pid int;
declare continue handler for sqlstate '02000'
set done = 1;

fetch c1 into pid, pc;
set pc = trim('"' FROM pc);
set pc = replace(pc, ', ', ',');
set pc = trim(',' FROM pc);
set pc = replace(pc, ',', '|');
set pc = replace(pc, '/', '|');
set pc = replace(pc, '|Inc', ', Inc');
loop2: loop
set co = element(idx, '|', pc);
if co IS NULL then
leave loop2;
end if;
set co = trim(' ' FROM co);
INSERT IGNORE INTO companies (name)
VALUES (co);
set cid = last_insert_id();
if cid = 0 then
SELECT Id INTO cid
FROM companies WHERE name = co;
end if;
if cid != 0 AND cid IS NOT NULL then
INSERT IGNORE INTO people_companies
(personid, companyid)
VALUES (pid, cid);
end if;
set idx = idx + 1;
end loop;
end; until done end repeat;
close c1;
end;;
Of course the real fun now was going over all the tables and cleaning up numerous issues, but the main work of organizing the lists was accomplished automatically.