Monday, 20 October 2014

Apache Web Log Analysis using PIG

Enter the Pig shell (Grunt) in local mode by running 'pig -x local'.

Load the log file into Pig using the LOAD command.

grunt> raw_logs = LOAD '/home/hadoop/work/input/apacheLog.log'
           USING TextLoader() AS (line:chararray);
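
Before parsing, it can help to confirm that the file was actually loaded. The following is a minimal sketch; the alias raw_sample is only an illustrative name and is not used in the rest of this post.

```pig
-- Print the schema of the loaded relation.
DESCRIBE raw_logs;

-- Look at a handful of raw lines before parsing them.
raw_sample = LIMIT raw_logs 5;
DUMP raw_sample;
```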

Parse each log line and assign its fields to named variables.

logs_base = FOREACH raw_logs GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"') )
AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray,  time: chararray, 
request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray);
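
Lines that do not fit the pattern will not yield usable fields, so it may be worth checking how many there are. The sketch below reuses the same pattern with the MATCHES operator, on the assumption that both MATCHES and REGEX_EXTRACT_ALL require the pattern to match the whole line. The aliases unparsed, unparsed_all, and unparsed_count are illustrative names only.

```pig
-- Keep only the lines that the parsing pattern does NOT match.
unparsed = FILTER raw_logs BY NOT (line MATCHES '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"');

-- Count them by collecting everything into a single group.
unparsed_all = GROUP unparsed ALL;
unparsed_count = FOREACH unparsed_all GENERATE COUNT(unparsed) AS bad_lines;
DUMP unparsed_count;
```

If every line matches, the relation unparsed is empty and the DUMP prints nothing.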

We only need the timestamp (time), the IP address (remoteAddr), and the user name (remoteLogname).
So we project these three fields from each record into a new relation.

logs =  FOREACH logs_base GENERATE remoteAddr,remoteLogname, time;

Now we need to find the number of hits and the number of unique users for each time value.
We can achieve this in Pig by grouping the records on a field or a combination of fields.
In our case, that field is time.

group_time = GROUP logs BY (time);
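
Grouping on the full timestamp gives one group per second, which is often too fine-grained. As a hedged variation, the sketch below groups by hour instead. It assumes the time field has the usual Apache layout (for example 10/Oct/2000:13:55:36 -0700), so its first 14 characters hold the date and the hour. The names logs_hourly, hour_bucket, group_hour, and hits_per_hour are illustrative only.

```pig
-- SUBSTRING(time, 0, 14) keeps the date and hour, e.g. 10/Oct/2000:13.
logs_hourly = FOREACH logs GENERATE remoteAddr, remoteLogname,
                  SUBSTRING(time, 0, 14) AS hour_bucket;

-- One group per hour instead of one group per second.
group_hour = GROUP logs_hourly BY hour_bucket;
hits_per_hour = FOREACH group_hour GENERATE group AS hour_bucket,
                    COUNT(logs_hourly) AS hits;
```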

Within each group, we need the number of hits and the number of unique users.
To get the number of hits, we simply count the records that fall in a given time group using COUNT.
To get the number of unique users, we apply DISTINCT to the remoteLogname values in the group and count the result.

Putting it all together, the following statement computes the number of hits and the number of
unique users for each time value. (In our case the unique-user count will always be 1, because
the user name is '-' in every record.)

X = FOREACH group_time {
        unique_users = DISTINCT logs.remoteLogname;
        GENERATE FLATTEN(group), COUNT(unique_users) AS UniqueUsers,
                 COUNT(logs) AS counts;
    }


(Results are in the form of Time, Unique Users, No. of Hits)
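
To actually see or keep the results, the relation has to be dumped or stored. Here is a minimal sketch; the output directory is a made-up example path, not one taken from this setup.

```pig
-- Print the result tuples to the console.
DUMP X;

-- Or write them out as tab-separated text. The output path is a
-- hypothetical example and must not already exist.
STORE X INTO '/home/hadoop/work/output/hits_by_time' USING PigStorage('\t');
```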

