1. Pandas

[1]:
!ls # magic command
Python basics - day 2.ipynb http.log
Python basics.ipynb         requirements.txt
Untitled.ipynb              venv
Untitled1.ipynb

1.1. Installing packages in jupyter-notebook

[53]:
!pip install pandas
!pip install matplotlib
Requirement already satisfied: pandas in ./venv/lib/python3.6/site-packages (0.25.1)
Requirement already satisfied: numpy>=1.13.3 in ./venv/lib/python3.6/site-packages (from pandas) (1.17.2)
Requirement already satisfied: python-dateutil>=2.6.1 in ./venv/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in ./venv/lib/python3.6/site-packages (from pandas) (2019.2)
Requirement already satisfied: six>=1.5 in ./venv/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/cf/a4/d5387a74204542a60ad1baa84cd2d3353c330e59be8cf2d47c0b11d3cde8/matplotlib-3.1.1-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (14.4MB)
    100% |████████████████████████████████| 14.4MB 1.0MB/s ta 0:00:011   37% |████████████                    | 5.4MB 1.8MB/s eta 0:00:05
Requirement already satisfied: numpy>=1.11 in ./venv/lib/python3.6/site-packages (from matplotlib) (1.17.2)
Collecting cycler>=0.10 (from matplotlib)
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2.1 in ./venv/lib/python3.6/site-packages (from matplotlib) (2.8.0)
Collecting kiwisolver>=1.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/49/5d/d1726d2a2fd471a69ef5014ca42812e1ccb8a13085c42bfcb238a5611f39/kiwisolver-1.1.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (113kB)
    100% |████████████████████████████████| 122kB 2.1MB/s ta 0:00:01
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib)
  Using cached https://files.pythonhosted.org/packages/11/fa/0160cd525c62d7abd076a070ff02b2b94de589f1a9789774f17d7c54058e/pyparsing-2.4.2-py2.py3-none-any.whl
Requirement already satisfied: six in ./venv/lib/python3.6/site-packages (from cycler>=0.10->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in ./venv/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (40.6.2)
Installing collected packages: cycler, kiwisolver, pyparsing, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.1 pyparsing-2.4.2
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

1.2. Importing pandas and matplotlib

[2]:
import pandas as pd
from matplotlib import pyplot as plt

1.3. File we will be working on

cleaned_access_log

1.4. Downloading using request

[31]:
!pip install requests
Collecting requests
  Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
  Using cached https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests)
  Using cached https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl
Installing collected packages: chardet, urllib3, idna, certifi, requests
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.3
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[33]:
import requests

url = 'https://python.variantcore.com/cleaned_access_log'
r = requests.get(url, allow_redirects=True)
open('access_log', 'wb').write(r.content)
[33]:
2468755
[34]:
!ls
Python basics - day 2.ipynb http.log
Python basics.ipynb         pandas.ipynb
Untitled.ipynb              requirements.txt
access_log                  venv
apache_logs.txt             xx.log
cleaned_access_log

1.5. Reading text file

[7]:
data = pd.read_csv('cleaned_access_log')
[10]:
data.head(5)
[10]:
ip time request status size referer user_agent
0 83.149.9.216 [17/May/2015:10:05:03 +0000] "GET /presentations/logstash-monitorama-2013/i... 200 203023.0 "http://semicomplete.com/presentations/logstas... "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1...
1 83.149.9.216 [17/May/2015:10:05:43 +0000] "GET /presentations/logstash-monitorama-2013/i... 200 171717.0 "http://semicomplete.com/presentations/logstas... "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1...
2 83.149.9.216 [17/May/2015:10:05:47 +0000] "GET /presentations/logstash-monitorama-2013/p... 200 26185.0 "http://semicomplete.com/presentations/logstas... "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1...
3 83.149.9.216 [17/May/2015:10:05:12 +0000] "GET /presentations/logstash-monitorama-2013/p... 200 7697.0 "http://semicomplete.com/presentations/logstas... "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1...
4 83.149.9.216 [17/May/2015:10:05:07 +0000] "GET /presentations/logstash-monitorama-2013/p... 200 2892.0 "http://semicomplete.com/presentations/logstas... "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1...

1.8. Simple EDA

[8]:
data.columns
[8]:
Index(['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'], dtype='object')
[57]:
data['time'].describe()
[57]:
count                            10000
unique                            4363
top       [19/May/2015:00:05:25 +0000]
freq                                 9
Name: time, dtype: object

1.7. Clean up

[11]:
from datetime import datetime
[67]:
datetime.strptime('[20/May/2015:21:05:28 +0000]', '[%d/%b/%Y:%H:%M:%S %z]')
[67]:
datetime.datetime(2015, 5, 20, 21, 5, 28, tzinfo=datetime.timezone.utc)
[15]:
data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-f20f25333a1e> in <module>
----> 1 data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))

~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4040             else:
   4041                 values = self.astype(object).values
-> 4042                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4043
   4044         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-15-f20f25333a1e> in <lambda>(x)
----> 1 data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))

~/.pyenv/versions/3.6.9/lib/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
    563     """Return a class cls instance based on the input string and the
    564     format string."""
--> 565     tt, fraction = _strptime(data_string, format)
    566     tzname, gmtoff = tt[-2:]
    567     args = tt[:6] + (fraction,)

~/.pyenv/versions/3.6.9/lib/python3.6/_strptime.py in _strptime(data_string, format)
    360     if not found:
    361         raise ValueError("time data %r does not match format %r" %
--> 362                          (data_string, format))
    363     if len(data_string) != found.end():
    364         raise ValueError("unconverted data remains: %s" %

ValueError: time data '(compatible;' does not match format '[%d/%b/%Y:%H:%M:%S %z]'
[17]:
data['time'][0:8899]
[17]:
0       [17/May/2015:10:05:03 +0000]
1       [17/May/2015:10:05:43 +0000]
2       [17/May/2015:10:05:47 +0000]
3       [17/May/2015:10:05:12 +0000]
4       [17/May/2015:10:05:07 +0000]
                    ...
8894    [20/May/2015:12:05:35 +0000]
8895    [20/May/2015:12:05:34 +0000]
8896    [20/May/2015:12:05:26 +0000]
8897    [20/May/2015:12:05:48 +0000]
8898                    (compatible;
Name: time, Length: 8899, dtype: object
[12]:
data['time'][8899:].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
[12]:
8899   2015-05-20 12:05:25+00:00
8900   2015-05-20 12:05:59+00:00
8901   2015-05-20 12:05:16+00:00
8902   2015-05-20 12:05:54+00:00
8903   2015-05-20 12:05:39+00:00
                  ...
9995   2015-05-20 21:05:28+00:00
9996   2015-05-20 21:05:50+00:00
9997   2015-05-20 21:05:00+00:00
9998   2015-05-20 21:05:56+00:00
9999   2015-05-20 21:05:15+00:00
Name: time, Length: 1101, dtype: datetime64[ns, UTC]
[19]:
data['time'][8898]
[19]:
'(compatible;'
[24]:
data.drop([8898], inplace=True)

1.7.1. Check if deleted

[33]:
8898 in data.index
[33]:
True
[132]:
data['time'][8898]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'time'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-132-9dcbd30e4101> in <module>
----> 1 data['time'][8898]

~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'time'
[38]:
data['time'] = data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
[136]:
data.index = data.index.tz_localize(None)
[137]:
data.index
[137]:
DatetimeIndex(['2015-05-17 10:05:03', '2015-05-17 10:05:43',
               '2015-05-17 10:05:47', '2015-05-17 10:05:12',
               '2015-05-17 10:05:07', '2015-05-17 10:05:34',
               '2015-05-17 10:05:57', '2015-05-17 10:05:50',
               '2015-05-17 10:05:24', '2015-05-17 10:05:50',
               ...
               '2015-05-20 21:05:11', '2015-05-20 21:05:29',
               '2015-05-20 21:05:34', '2015-05-20 21:05:15',
               '2015-05-20 21:05:01', '2015-05-20 21:05:28',
               '2015-05-20 21:05:50', '2015-05-20 21:05:00',
               '2015-05-20 21:05:56', '2015-05-20 21:05:15'],
              dtype='datetime64[ns]', name='time', length=9999, freq=None)
[39]:
data['time'][0]
[39]:
Timestamp('2015-05-17 10:05:03+0000', tz='UTC')
[49]:
data['user_agent'].apply(str.upper)
[49]:
0       "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
1       "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
2       "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
3       "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
4       "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
                              ...
9995            "TINY TINY RSS/1.11 (HTTP://TT-RSS.ORG/)"
9996            "TINY TINY RSS/1.11 (HTTP://TT-RSS.ORG/)"
9997    "MOZILLA/5.0 (COMPATIBLE; GOOGLEBOT/2.1; +HTTP...
9998    "MOZILLA/5.0 (WINDOWS NT 5.1; RV:6.0.2) GECKO/...
9999    "UNIVERSALFEEDPARSER/4.2-PRE-314-SVN +HTTP://F...
Name: user_agent, Length: 9999, dtype: object

1.8. Simple EDA

[51]:
import re
[47]:
data['user_agent'][911]
[47]:
'"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"'
[59]:
data['status'] = data.status.astype(int)
[60]:
data.status
[60]:
0       200
1       200
2       200
3       200
4       200
       ...
9995    200
9996    200
9997    200
9998    200
9999    200
Name: status, Length: 9999, dtype: int64
[55]:
data[data.user_agent.str.contains('Linux', regex= True, na=False, flags=re.IGNORECASE)]
[55]:
ip time request status size referer user_agent
23 24.236.252.67 2015-05-17 10:05:40+00:00 "GET /favicon.ico HTTP/1.1" 200 3638.0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26...
24 93.114.45.13 2015-05-17 10:05:14+00:00 "GET /articles/dynamic-dns-with-dhcp/ HTTP/1.1" 200 18848.0 "http://www.google.ro/url?sa=t&rct=j&q=&esrc=s... "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck...
25 93.114.45.13 2015-05-17 10:05:04+00:00 "GET /reset.css HTTP/1.1" 200 1015.0 "http://www.semicomplete.com/articles/dynamic-... "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck...
26 93.114.45.13 2015-05-17 10:05:45+00:00 "GET /style2.css HTTP/1.1" 200 4877.0 "http://www.semicomplete.com/articles/dynamic-... "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck...
27 93.114.45.13 2015-05-17 10:05:14+00:00 "GET /favicon.ico HTTP/1.1" 200 3638.0 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck...
... ... ... ... ... ... ... ...
9949 91.151.182.109 2015-05-20 21:05:13+00:00 "GET /images/web/2009/banner.png HTTP/1.1" 200 52315.0 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...
9950 91.151.182.109 2015-05-20 21:05:50+00:00 "GET /favicon.ico HTTP/1.1" 200 3638.0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...
9953 63.140.98.80 2015-05-20 21:05:27+00:00 "GET /projects/xdotool/ HTTP/1.1" 200 12292.0 "http://stackoverflow.com/questions/3983946/ge... "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...
9954 63.140.98.80 2015-05-20 21:05:58+00:00 "GET /images/jordan-80.png HTTP/1.1" 200 6146.0 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...
9955 63.140.98.80 2015-05-20 21:05:11+00:00 "GET /files/logstash/logstash-1.3.2-monolithic... 404 324.0 "-" "Chef Client/10.18.2 (ruby-1.9.3-p327; ohai-6....

2314 rows × 7 columns

[47]:
data[['ip', 'status', 'user_agent']].describe()
[47]:
ip status user_agent
count 10000 10000 9999
unique 1754 9 558
top 66.249.73.135 200 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...
freq 482 9125 1044
[56]:
data['status'].hist()
[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x120f6a5f8>
../_images/pandas_pandas_40_1.png
[66]:
data.set_index('time', inplace=True)
[75]:
min(data.index)
[75]:
Timestamp('2015-05-17 10:05:00+0000', tz='UTC')
[103]:
fig, ax = plt.subplots( nrows=1, ncols=1 )
../_images/pandas_pandas_43_0.png
[107]:
xa = data['size'].resample('H').mean()
[90]:
from matplotlib import pyplot as plt
[110]:
plt.xticks(rotation=90)
plt.plot(xa)
plt.savefig('myfig')
../_images/pandas_pandas_46_0.png
[114]:
xa = data['size'].resample('H').mean().plot()
/Users/kamil/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
  UserWarning,
../_images/pandas_pandas_47_1.png
[112]:
xa.plot()
/Users/kamil/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
  UserWarning,
[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x122415320>
../_images/pandas_pandas_48_2.png
[ ]:
data[data.user_agent.str.contains('Linux', regex= True, na=False, flags=re.IGNORECASE)]
[128]:
data.to_excel?
[129]:
!pip install openpyxl
Collecting openpyxl
  Downloading https://files.pythonhosted.org/packages/f5/39/942a406621c1ff0de38d7e4782991b1bac046415bf54a66655c959ee66e8/openpyxl-2.6.3.tar.gz (173kB)
    100% |████████████████████████████████| 174kB 1.8MB/s ta 0:00:01
Collecting jdcal (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/f0/da/572cbc0bc582390480bbd7c4e93d14dc46079778ed915b505dc494b37c57/jdcal-1.4.1-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Installing collected packages: jdcal, et-xmlfile, openpyxl
  Running setup.py install for et-xmlfile ... done
  Running setup.py install for openpyxl ... done
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-2.6.3
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[138]:
data.to_excel('report.xlsx')
[139]:
!ls
cleaned_access_log myfig.png          report.xlsx
foo.png            pandas.ipynb

1.9. Exercise - report creation

  • Report should say about different web browsers user are using,