1. Pandas¶
[1]:
!ls # magic command
Python basics - day 2.ipynb http.log
Python basics.ipynb requirements.txt
Untitled.ipynb venv
Untitled1.ipynb
1.1. Installing packages in jupyter-notebook¶
[53]:
!pip install pandas
!pip install matplotlib
Requirement already satisfied: pandas in ./venv/lib/python3.6/site-packages (0.25.1)
Requirement already satisfied: numpy>=1.13.3 in ./venv/lib/python3.6/site-packages (from pandas) (1.17.2)
Requirement already satisfied: python-dateutil>=2.6.1 in ./venv/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in ./venv/lib/python3.6/site-packages (from pandas) (2019.2)
Requirement already satisfied: six>=1.5 in ./venv/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting matplotlib
Downloading https://files.pythonhosted.org/packages/cf/a4/d5387a74204542a60ad1baa84cd2d3353c330e59be8cf2d47c0b11d3cde8/matplotlib-3.1.1-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (14.4MB)
100% |████████████████████████████████| 14.4MB 1.0MB/s ta 0:00:011 37% |████████████ | 5.4MB 1.8MB/s eta 0:00:05
Requirement already satisfied: numpy>=1.11 in ./venv/lib/python3.6/site-packages (from matplotlib) (1.17.2)
Collecting cycler>=0.10 (from matplotlib)
Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2.1 in ./venv/lib/python3.6/site-packages (from matplotlib) (2.8.0)
Collecting kiwisolver>=1.0.1 (from matplotlib)
Downloading https://files.pythonhosted.org/packages/49/5d/d1726d2a2fd471a69ef5014ca42812e1ccb8a13085c42bfcb238a5611f39/kiwisolver-1.1.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (113kB)
100% |████████████████████████████████| 122kB 2.1MB/s ta 0:00:01
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib)
Using cached https://files.pythonhosted.org/packages/11/fa/0160cd525c62d7abd076a070ff02b2b94de589f1a9789774f17d7c54058e/pyparsing-2.4.2-py2.py3-none-any.whl
Requirement already satisfied: six in ./venv/lib/python3.6/site-packages (from cycler>=0.10->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in ./venv/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (40.6.2)
Installing collected packages: cycler, kiwisolver, pyparsing, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.1 pyparsing-2.4.2
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
1.2. Importing pandas and matplotlib¶
[2]:
import pandas as pd
from matplotlib import pyplot as plt
1.3. File we will be working on¶
1.4. Downloading using request¶
[31]:
!pip install requests
Collecting requests
Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests)
Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
Using cached https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests)
Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests)
Using cached https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl
Installing collected packages: chardet, urllib3, idna, certifi, requests
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.3
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[33]:
import requests
url = 'https://python.variantcore.com/cleaned_access_log'
r = requests.get(url, allow_redirects=True)
open('access_log', 'wb').write(r.content)
[33]:
2468755
[34]:
!ls
Python basics - day 2.ipynb http.log
Python basics.ipynb pandas.ipynb
Untitled.ipynb requirements.txt
access_log venv
apache_logs.txt xx.log
cleaned_access_log
1.5. Reading text file¶
[7]:
data = pd.read_csv('cleaned_access_log')
[10]:
data.head(5)
[10]:
ip | time | request | status | size | referer | user_agent | |
---|---|---|---|---|---|---|---|
0 | 83.149.9.216 | [17/May/2015:10:05:03 +0000] | "GET /presentations/logstash-monitorama-2013/i... | 200 | 203023.0 | "http://semicomplete.com/presentations/logstas... | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1... |
1 | 83.149.9.216 | [17/May/2015:10:05:43 +0000] | "GET /presentations/logstash-monitorama-2013/i... | 200 | 171717.0 | "http://semicomplete.com/presentations/logstas... | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1... |
2 | 83.149.9.216 | [17/May/2015:10:05:47 +0000] | "GET /presentations/logstash-monitorama-2013/p... | 200 | 26185.0 | "http://semicomplete.com/presentations/logstas... | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1... |
3 | 83.149.9.216 | [17/May/2015:10:05:12 +0000] | "GET /presentations/logstash-monitorama-2013/p... | 200 | 7697.0 | "http://semicomplete.com/presentations/logstas... | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1... |
4 | 83.149.9.216 | [17/May/2015:10:05:07 +0000] | "GET /presentations/logstash-monitorama-2013/p... | 200 | 2892.0 | "http://semicomplete.com/presentations/logstas... | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1... |
1.8. Simple EDA¶
[8]:
data.columns
[8]:
Index(['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'], dtype='object')
[57]:
data['time'].describe()
[57]:
count 10000
unique 4363
top [19/May/2015:00:05:25 +0000]
freq 9
Name: time, dtype: object
1.7. Clean up¶
[11]:
from datetime import datetime
[67]:
datetime.strptime('[20/May/2015:21:05:28 +0000]', '[%d/%b/%Y:%H:%M:%S %z]')
[67]:
datetime.datetime(2015, 5, 20, 21, 5, 28, tzinfo=datetime.timezone.utc)
[15]:
data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-f20f25333a1e> in <module>
----> 1 data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4040 else:
4041 values = self.astype(object).values
-> 4042 mapped = lib.map_infer(values, f, convert=convert_dtype)
4043
4044 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-15-f20f25333a1e> in <lambda>(x)
----> 1 data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
~/.pyenv/versions/3.6.9/lib/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
563 """Return a class cls instance based on the input string and the
564 format string."""
--> 565 tt, fraction = _strptime(data_string, format)
566 tzname, gmtoff = tt[-2:]
567 args = tt[:6] + (fraction,)
~/.pyenv/versions/3.6.9/lib/python3.6/_strptime.py in _strptime(data_string, format)
360 if not found:
361 raise ValueError("time data %r does not match format %r" %
--> 362 (data_string, format))
363 if len(data_string) != found.end():
364 raise ValueError("unconverted data remains: %s" %
ValueError: time data '(compatible;' does not match format '[%d/%b/%Y:%H:%M:%S %z]'
[17]:
data['time'][0:8899]
[17]:
0 [17/May/2015:10:05:03 +0000]
1 [17/May/2015:10:05:43 +0000]
2 [17/May/2015:10:05:47 +0000]
3 [17/May/2015:10:05:12 +0000]
4 [17/May/2015:10:05:07 +0000]
...
8894 [20/May/2015:12:05:35 +0000]
8895 [20/May/2015:12:05:34 +0000]
8896 [20/May/2015:12:05:26 +0000]
8897 [20/May/2015:12:05:48 +0000]
8898 (compatible;
Name: time, Length: 8899, dtype: object
[12]:
data['time'][8899:].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
[12]:
8899 2015-05-20 12:05:25+00:00
8900 2015-05-20 12:05:59+00:00
8901 2015-05-20 12:05:16+00:00
8902 2015-05-20 12:05:54+00:00
8903 2015-05-20 12:05:39+00:00
...
9995 2015-05-20 21:05:28+00:00
9996 2015-05-20 21:05:50+00:00
9997 2015-05-20 21:05:00+00:00
9998 2015-05-20 21:05:56+00:00
9999 2015-05-20 21:05:15+00:00
Name: time, Length: 1101, dtype: datetime64[ns, UTC]
[19]:
data['time'][8898]
[19]:
'(compatible;'
[24]:
data.drop([8898], inplace=True)
1.7.1. Check if deleted¶
[33]:
8898 in data.index
[33]:
True
[132]:
data['time'][8898]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'time'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-132-9dcbd30e4101> in <module>
----> 1 data['time'][8898]
~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
2978 if self.columns.nlevels > 1:
2979 return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
2981 if is_integer(indexer):
2982 indexer = [indexer]
~/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'time'
[38]:
data['time'] = data['time'].apply(lambda x: datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]'))
[136]:
data.index = data.index.tz_localize(None)
[137]:
data.index
[137]:
DatetimeIndex(['2015-05-17 10:05:03', '2015-05-17 10:05:43',
'2015-05-17 10:05:47', '2015-05-17 10:05:12',
'2015-05-17 10:05:07', '2015-05-17 10:05:34',
'2015-05-17 10:05:57', '2015-05-17 10:05:50',
'2015-05-17 10:05:24', '2015-05-17 10:05:50',
...
'2015-05-20 21:05:11', '2015-05-20 21:05:29',
'2015-05-20 21:05:34', '2015-05-20 21:05:15',
'2015-05-20 21:05:01', '2015-05-20 21:05:28',
'2015-05-20 21:05:50', '2015-05-20 21:05:00',
'2015-05-20 21:05:56', '2015-05-20 21:05:15'],
dtype='datetime64[ns]', name='time', length=9999, freq=None)
[39]:
data['time'][0]
[39]:
Timestamp('2015-05-17 10:05:03+0000', tz='UTC')
[49]:
data['user_agent'].apply(str.upper)
[49]:
0 "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
1 "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
2 "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
3 "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
4 "MOZILLA/5.0 (MACINTOSH; INTEL MAC OS X 10_9_1...
...
9995 "TINY TINY RSS/1.11 (HTTP://TT-RSS.ORG/)"
9996 "TINY TINY RSS/1.11 (HTTP://TT-RSS.ORG/)"
9997 "MOZILLA/5.0 (COMPATIBLE; GOOGLEBOT/2.1; +HTTP...
9998 "MOZILLA/5.0 (WINDOWS NT 5.1; RV:6.0.2) GECKO/...
9999 "UNIVERSALFEEDPARSER/4.2-PRE-314-SVN +HTTP://F...
Name: user_agent, Length: 9999, dtype: object
1.8. Simple EDA¶
[51]:
import re
[47]:
data['user_agent'][911]
[47]:
'"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"'
[59]:
data['status'] = data.status.astype(int)
[60]:
data.status
[60]:
0 200
1 200
2 200
3 200
4 200
...
9995 200
9996 200
9997 200
9998 200
9999 200
Name: status, Length: 9999, dtype: int64
[55]:
data[data.user_agent.str.contains('Linux', regex= True, na=False, flags=re.IGNORECASE)]
[55]:
ip | time | request | status | size | referer | user_agent | |
---|---|---|---|---|---|---|---|
23 | 24.236.252.67 | 2015-05-17 10:05:40+00:00 | "GET /favicon.ico HTTP/1.1" | 200 | 3638.0 | "-" | "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26... |
24 | 93.114.45.13 | 2015-05-17 10:05:14+00:00 | "GET /articles/dynamic-dns-with-dhcp/ HTTP/1.1" | 200 | 18848.0 | "http://www.google.ro/url?sa=t&rct=j&q=&esrc=s... | "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck... |
25 | 93.114.45.13 | 2015-05-17 10:05:04+00:00 | "GET /reset.css HTTP/1.1" | 200 | 1015.0 | "http://www.semicomplete.com/articles/dynamic-... | "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck... |
26 | 93.114.45.13 | 2015-05-17 10:05:45+00:00 | "GET /style2.css HTTP/1.1" | 200 | 4877.0 | "http://www.semicomplete.com/articles/dynamic-... | "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck... |
27 | 93.114.45.13 | 2015-05-17 10:05:14+00:00 | "GET /favicon.ico HTTP/1.1" | 200 | 3638.0 | "-" | "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Geck... |
... | ... | ... | ... | ... | ... | ... | ... |
9949 | 91.151.182.109 | 2015-05-20 21:05:13+00:00 | "GET /images/web/2009/banner.png HTTP/1.1" | 200 | 52315.0 | "http://www.semicomplete.com/projects/xdotool/" | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5... |
9950 | 91.151.182.109 | 2015-05-20 21:05:50+00:00 | "GET /favicon.ico HTTP/1.1" | 200 | 3638.0 | "-" | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5... |
9953 | 63.140.98.80 | 2015-05-20 21:05:27+00:00 | "GET /projects/xdotool/ HTTP/1.1" | 200 | 12292.0 | "http://stackoverflow.com/questions/3983946/ge... | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5... |
9954 | 63.140.98.80 | 2015-05-20 21:05:58+00:00 | "GET /images/jordan-80.png HTTP/1.1" | 200 | 6146.0 | "http://www.semicomplete.com/projects/xdotool/" | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5... |
9955 | 63.140.98.80 | 2015-05-20 21:05:11+00:00 | "GET /files/logstash/logstash-1.3.2-monolithic... | 404 | 324.0 | "-" | "Chef Client/10.18.2 (ruby-1.9.3-p327; ohai-6.... |
2314 rows × 7 columns
[47]:
data[['ip', 'status', 'user_agent']].describe()
[47]:
ip | status | user_agent | |
---|---|---|---|
count | 10000 | 10000 | 9999 |
unique | 1754 | 9 | 558 |
top | 66.249.73.135 | 200 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK... |
freq | 482 | 9125 | 1044 |
[56]:
data['status'].hist()
[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x120f6a5f8>
[66]:
data.set_index('time', inplace=True)
[75]:
min(data.index)
[75]:
Timestamp('2015-05-17 10:05:00+0000', tz='UTC')
[103]:
fig, ax = plt.subplots( nrows=1, ncols=1 )
[107]:
xa = data['size'].resample('H').mean()
[90]:
from matplotlib import pyplot as plt
[110]:
plt.xticks(rotation=90)
plt.plot(xa)
plt.savefig('myfig')
[114]:
xa = data['size'].resample('H').mean().plot()
/Users/kamil/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
UserWarning,
[112]:
xa.plot()
/Users/kamil/code/cisco_python/venv/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py:1269: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.
UserWarning,
[112]:
<matplotlib.axes._subplots.AxesSubplot at 0x122415320>
[ ]:
data[data.user_agent.str.contains('Linux', regex= True, na=False, flags=re.IGNORECASE)]
[128]:
data.to_excel?
[129]:
!pip install openpyxl
Collecting openpyxl
Downloading https://files.pythonhosted.org/packages/f5/39/942a406621c1ff0de38d7e4782991b1bac046415bf54a66655c959ee66e8/openpyxl-2.6.3.tar.gz (173kB)
100% |████████████████████████████████| 174kB 1.8MB/s ta 0:00:01
Collecting jdcal (from openpyxl)
Downloading https://files.pythonhosted.org/packages/f0/da/572cbc0bc582390480bbd7c4e93d14dc46079778ed915b505dc494b37c57/jdcal-1.4.1-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl)
Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Installing collected packages: jdcal, et-xmlfile, openpyxl
Running setup.py install for et-xmlfile ... done
Running setup.py install for openpyxl ... done
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-2.6.3
You are using pip version 18.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[138]:
data.to_excel('report.xlsx')
[139]:
!ls
cleaned_access_log myfig.png report.xlsx
foo.png pandas.ipynb
1.9. Exercise - report creation¶
Report should say about different web browsers user are using,