Web Scraping: Requests And BeautifulSoup 🌐
Content Covered
The Web Scraping series of articles covers the various methods and libraries that achieve the end goal of crawling websites and extracting their data. Making use of a website's REST API was covered in previous articles; these articles focus on the methods available to us when we encounter a site without a provided application programming interface.
- The Requests & BeautifulSoup Libraries
- IPs, geolocation, downloading images
- Three examples of pagination
- DataFrame string presentation
The Basics Of The Requests Library
Requests Documentation |
All of Requests' functionality can be accessed through seven methods, each of which returns an instance of the Response object. For the purposes of this article we will focus mainly on the get and post methods and how they relate to the underlying request method via **kwargs.
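To make that relationship concrete: the seven verb helpers (get, options, head, post, put, patch and delete) are thin wrappers that forward their arguments to requests.request(). The sketch below is a simplification of the library's own api.py, not a verbatim copy, so treat it as illustrative only.

import requests

# Roughly what requests/api.py does (simplified):
#
#   def get(url, params=None, **kwargs):
#       return request("get", url, params=params, **kwargs)
#
# Consequently the two calls below produce equivalent requests.
r1 = requests.get("https://httpbin.org/get", headers={"accept": "application/json"})
r2 = requests.request("GET", "https://httpbin.org/get", headers={"accept": "application/json"})
print(r1.status_code, r2.status_code)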
The requests.request() Method
Documentation | Source Code |
import requests

def main():
    url = "https://httpbin.org/get"
    head = {"accept": "application/json"}
    res = requests.request('GET', url, headers=head)
    if res.ok:
        print(res.text)

if __name__ == "__main__":
    main()
{
"args": {},
"headers": {
"Accept": "application/json",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.24.0",
"X-Amzn-Trace-Id": "Root=1-61f8b284-21f6e43a74afc2812004b83e"
},
"origin": "138.199.21.31",
"url": "https://httpbin.org/get"
}
The requests.get() Method
Documentation | Source Code | RFC: GET Method |
Let's use one of the Console Services from the linked repository to demonstrate the get method's functionality. Continuing on from the requests.request() example, let's take the IP address it returned and geolocate its origin using ip-api.com.
import requests

def main():
    ip = "138.199.21.31"
    url = f"http://ip-api.com/json/{ip}"
    res = requests.get(url)
    if res.ok:
        for key, value in res.json().items():
            print(f"{key.title():<11}: {value}")

if __name__ == "__main__":
    main()
Status : success
Country : Japan
Countrycode: JP
Region : 13
Regionname : Tokyo
City : Tokyo
Zip : 102-0082
Lat : 35.6893
Lon : 139.6899
Timezone : Asia/Tokyo
Isp : Datacamp Limited
Org : Datacamp Limited
As : AS212238 Datacamp Limited
Query : 138.199.21.31
The requests.post() Method
Documentation | Source Code | RFC: POST Method |
import requests

def main():
    url = "https://httpbin.org/post"
    head = {"accept": "application/json"}
    res = requests.post(url, headers=head)
    if res.ok:
        print(res.text)

if __name__ == "__main__":
    main()
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "application/json",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "0",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.24.0",
"X-Amzn-Trace-Id": "Root=1-61f8b35e-17d43f54571b9c9e201b27d3"
},
"json": null,
"origin": "138.199.21.31",
"url": "https://httpbin.org/post"
}
Sessions & Persistence
Advanced | Documentation | Source Code |
import requests

def main():
    url = "https://google.com"
    # A Session object persists cookies (and other settings) across requests.
    with requests.Session() as s:
        res = s.get(url)
        if res.ok:
            for key, value in res.cookies.items():
                print(f"{key:<6}: {value}")

if __name__ == "__main__":
    main()
1P_JAR: 2022-02-01-05
NID : 511=GbFw1r7NlDnQm0sznSvZEv_CMReI3stJikhPuuxmzcVPG9XSQDZxrJnfluAMtA4wNNZYoX0uaO5VaKA1tO73H6bv0cQ1S784Qe4onYK40aUtj6jodxVcIid-Rwdzp0IE6RMT6NzYMb2TQI1CVFjWqqK_16kBGICqPZ8DB2TjtVI
import requests

def main():
    urlSet = "https://httpbin.org/cookies/set"
    urlDel = "https://httpbin.org/cookies/delete"
    param = {"hello": "cookie"}
    head = {"accept": "text/plain"}
    with requests.Session() as s:
        # The session stores the "hello" cookie set by the first request,
        # so the second request can then delete it.
        res = s.get(urlSet, params=param, headers=head)
        if res.ok:
            print(res.text)
        res = s.get(urlDel, params=param, headers=head)
        print(res.text)
        # urlGet = "https://httpbin.org/cookies"
        # head = {"accept": "application/json"}
        # res = s.get(urlGet, headers=head)
        # print(res.text)

if __name__ == "__main__":
    main()
{
"cookies": {
"hello": "cookie"
}
}
{
"cookies": {}
}
Downloading Images
import requests

def main():
    url = "https://httpbin.org/image/jpeg"
    head = {"accept": "image/jpeg"}
    res = requests.get(url, headers=head)
    if res.ok:
        with open('image.jpg', 'wb') as f:
            f.write(res.content)

if __name__ == "__main__":
    main()
import requests

def main():
    url = "https://httpbin.org/image/jpeg"
    head = {"accept": "image/jpeg"}
    # stream=True defers downloading the body until it is consumed, so large
    # files can be written to disk in chunks rather than held in memory at once.
    res = requests.get(url, headers=head, stream=True)
    if res.ok:
        with open('image.jpg', 'wb') as f:
            for chunk in res.iter_content(chunk_size=2**10):
                f.write(chunk)

if __name__ == "__main__":
    main()
The Basics of BeautifulSoup
BeautifulSoup Documentation |
<!DOCTYPE html>
<html lang="en-us">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<title>BS4 Example</title>
<style>
table,td,th{border:1px solid #000}
</style>
</head>
<body>
<h1>Header Example</h1>
<p>Paragraph Example</p>
<table style="width:50%">
<tr>
<th>Name</th>
<th>Age</th>
</tr>
<tr>
<td>Jack</td>
<td>35</td>
</tr>
<tr>
<td>Jill</td>
<td>25</td>
</tr>
</table>
</body>
</html>
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
from bs4 import BeautifulSoup

def main():
    html = """<!DOCTYPE html><html lang="en-us"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width,initial-scale=1"><title>BS4 Example</title><style>table,td,th{border:1px solid #000}</style></head><body><h1>Header Example</h1><p>Paragraph Example</p><table style="width:50%"><tr><th>Name</th><th>Age</th></tr><tr><td>Jack</td><td>35</td></tr><tr><td>Jill</td><td>25</td></tr></table></body></html>"""
    # Pass the document into the BeautifulSoup constructor to parse it.
    soup = BeautifulSoup(html, 'html.parser')
    # First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
    print(type(soup), soup)
    '''
    <class 'bs4.BeautifulSoup'> <!DOCTYPE html>
    <html lang="en-us"><head><meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/><title>BS4 Example</title><style>table,td,th{border:1px solid #000}</style></head><body><h1>Header Example</h1><p>Paragraph Example</p><table style="width:50%"><tr><th>Name</th><th>Age</th></tr><tr><td>Jack</td><td>35</td></tr><tr><td>Jill</td><td>25</td></tr></table></body></html>
    '''
    print(type(soup.head), soup.head)
    '''
    <class 'bs4.element.Tag'> <head><meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/><title>BS4 Example</title><style>table,td,th{border:1px solid #000}</style></head>
    '''
    contents = soup.head.contents
    print(type(contents), contents)
    for i in contents:
        print(i)
    '''
    <class 'list'> [<meta charset="utf-8"/>, <meta content="width=device-width,initial-scale=1" name="viewport"/>, <title>BS4 Example</title>, <style>table,td,th{border:1px solid #000}</style>]
    <meta charset="utf-8"/>
    <meta content="width=device-width,initial-scale=1" name="viewport"/>
    <title>BS4 Example</title>
    <style>table,td,th{border:1px solid #000}</style>
    '''
    children = soup.head.children
    print(type(children), children)
    for i in children:
        print(i)
    '''
    <class 'list_iterator'> <list_iterator object at 0x0000018C4A512400>
    <meta charset="utf-8"/>
    <meta content="width=device-width,initial-scale=1" name="viewport"/>
    <title>BS4 Example</title>
    <style>table,td,th{border:1px solid #000}</style>
    '''
    descendants = soup.head.descendants
    print(type(descendants), descendants)
    for i in descendants:
        print(i)
    '''
    <class 'generator'> <generator object Tag.descendants at 0x00000283F3996350>
    <meta charset="utf-8"/>
    <meta content="width=device-width,initial-scale=1" name="viewport"/>
    <title>BS4 Example</title>
    BS4 Example
    <style>table,td,th{border:1px solid #000}</style>
    table,td,th{border:1px solid #000}
    '''
    print(type(soup.body), soup.body)
    '''
    <class 'bs4.element.Tag'> <body><h1>Header Example</h1><p>Paragraph Example</p><table style="width:50%"><tr><th>Name</th><th>Age</th></tr><tr><td>Jack</td><td>35</td></tr><tr><td>Jill</td><td>25</td></tr></table></body>
    '''
    # A Tag object corresponds to an XML or HTML tag in the original document.
    # Every tag has a name, accessible as .name. If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup.
    tag = soup.p  # soup.body.p
    print(type(tag), tag, tag.text, tag.name, sep=", ")
    '''
    <class 'bs4.element.Tag'>, <p>Paragraph Example</p>, Paragraph Example, p
    '''
    # A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary.
    tag = soup.meta
    print(type(tag), tag.attrs, tag['charset'], sep=", ")
    '''
    <class 'bs4.element.Tag'>, {'charset': 'UTF-8'}, UTF-8
    '''
    tags = soup.find_all('meta')
    print(type(tags), tags)
    '''
    <class 'bs4.element.ResultSet'> [<meta charset="utf-8"/>, <meta content="width=device-width,initial-scale=1" name="viewport"/>]
    '''
    # Another common task is extracting all the text from a page.
    print(soup.get_text())
    '''
    BS4 ExampleHeader ExampleParagraph ExampleNameAgeJack35Jill25
    '''
    # If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
    # If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.
    h = soup.find('h1')
    print(h.string)
    '''
    Header Example
    '''

if __name__ == "__main__":
    main()
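The example above exercises the BeautifulSoup, Tag and NavigableString objects, but not Comment. Here is a minimal sketch, run on a throwaway snippet rather than the article's document, that shows all four object types side by side.

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

snippet = "<p>Hello<!-- a comment --></p>"
soup = BeautifulSoup(snippet, "html.parser")

print(type(soup))                # <class 'bs4.BeautifulSoup'>
print(type(soup.p))              # <class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))  # <class 'bs4.element.NavigableString'>
print(type(soup.p.contents[1]))  # <class 'bs4.element.Comment'>
# Comment is itself a subclass of NavigableString.
print(isinstance(soup.p.contents[1], NavigableString))  # True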
Example I: The Simplest Example Of Pagination
There are two ways we can go about implementing pagination. The first applies when we already know how many pages there are and the query string is easy to manipulate: /?p=1 becomes /?p=2, then /?p=3, and so on. https://nyaa.si/, a popular anime torrent site, is just such an example.
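Before inspecting the page markup, here is a quick standalone illustration (separate from the example program further down) of how Requests builds that query string from a params dictionary:

import requests

# requests encodes the params dict into the ?p=N query string,
# so paginating is just a matter of looping over the page number.
for page in (1, 2, 3):
    res = requests.get("https://nyaa.si/", params={'p': page})
    print(res.url)  # https://nyaa.si/?p=1, then ?p=2, then ?p=3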
Inspecting The Source Code
<ul class="pagination">
<li class="disabled"><a href="#">«</a></li>
<li class="active"><a href="#">1 <span class="sr-only">(current)</span></a></li>
<li><a href="/?p=2">2</a></li>
<li><a href="/?p=3">3</a></li>
<li><a href="/?p=4">4</a></li>
<li><a href="/?p=5">5</a></li>
<li><a href="/?p=6">6</a></li>
<li><a rel="next" href="/?p=2">»</a></li>
</ul>
The Main Program
import requests
from bs4 import BeautifulSoup

def main():
    url = "https://nyaa.si/"
    for page in range(1, 100+1):
        param = {'s': 'seeders', 'o': 'desc', 'p': page}
        res = requests.get(url, params=param)
        if not res.ok:
            raise requests.HTTPError(res.url)
        print(f"[*] Scraping: {res.url}")
        # Process Data
        soup = BeautifulSoup(res.text, 'html.parser')
        rows = soup.table.tbody.find_all('tr')
        for r, row in enumerate(rows, start=1):
            for t, td in enumerate(row.find_all('td'), start=1):
                print(f"[*] Analyzing <pg: {page}; tr: {r}; td: {t}>\n")
                print(td.prettify(), end="\n\n")
            # Only first row
            break
        # Only first page
        break

if __name__ == "__main__":
    main()
Example II: Pagination By Extracting Next
The second approach is the better one: instead of hard-coding the page count and iterating through that range, we extract the location of the next page so we do not have to guess. This is especially useful when dealing with websites whose design is confusing or complex and where the page count is obscured or unpredictable.
Inspecting The Source Code
In the last example, the “next” string was found in a hyperlink’s relation attribute. Now that we’ve made a query, we find that the “next” string is not in a rel attribute, but in a list item’s class attribute.
<ul class="pagination">
<li class="previous disabled unavailable"><a> « </a></li>
<li class="active"><a>1</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=2">2</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=3">3</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=4">4</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=5">5</a></li>
<li class="disabled"><a>...</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=13">13</a></li>
<li><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=14">14</a></li>
<li class="next"><a href="/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=2">»</a></li>
</ul>
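Before wiring this into the crawler, we can verify the extraction on markup like the above in isolation. A small standalone sketch, trimmed to the relevant list item:

from bs4 import BeautifulSoup

html = '<ul class="pagination"><li class="next"><a href="/?q=one+punch+man&s=seeders&o=desc&p=2">»</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
# The anchor we want sits inside the <li class="next"> element.
next_li = soup.find('li', {'class': 'next'})
print(next_li.a.get('href'))  # /?q=one+punch+man&s=seeders&o=desc&p=2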
The Main Program
import requests
from bs4 import BeautifulSoup

def main():
    base = "https://nyaa.si"
    link = "/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc"
    page = 1  # A simple counter
    while True:
        url = base if not link else base+link
        res = requests.get(url)
        if not res.ok:
            raise requests.HTTPError(res.url)
        print(f"[*] Scraping: {res.url}")
        # Process Data
        soup = BeautifulSoup(res.text, 'html.parser')
        rows = soup.table.tbody.find_all('tr')
        for r, row in enumerate(rows, start=1):
            for t, td in enumerate(row.find_all('td'), start=1):
                print(f"[*] Analyzing <pg: {page}; tr: {r}; td: {t}>\n")
                print(td.prettify(), end="\n\n")
            # Only first row
            break
        # Find link for next page
        try:
            next_ = soup.find('li', {"class": "next"})
            link = next_.a.get('href')
        except:
            data = soup.find('a', rel="next")
            link = data.get('href')
        # Handle last page
        if not link:
            print(f"[-] last page reached ...")
            break
        # Increase page counter
        page += 1
        # Only first 2 pages
        if page > 2:
            break

if __name__ == "__main__":
    main()
The Output
[*] Scraping: https://nyaa.si/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc
[*] Analyzing <pg: 1; tr: 1; td: 1>
<td>
<a href="/?c=1_2" title="Anime - English-translated">
<img alt="Anime - English-translated" class="category-icon" src="/static/img/icons/nyaa/1_2.png"/>
</a>
</td>
[*] Analyzing <pg: 1; tr: 1; td: 2>
<td colspan="2">
<a class="comments" href="/view/1329204#comments" title="22 comments">
<i class="fa fa-comments-o">
</i>
22
</a>
<a href="/view/1329204" title="[Judas] One Punch Man (Seasons 1-2 + OVAs + Specials) [BD 1080p][HEVC x265 10bit][Dual-Audio][Multi-Subs] (Batch)">
[Judas] One Punch Man (Seasons 1-2 + OVAs + Specials) [BD 1080p][HEVC x265 10bit][Dual-Audio][Multi-Subs] (Batch)
</a>
</td>
[*] Analyzing <pg: 1; tr: 1; td: 3>
<td class="text-center">
<a href="/download/1329204.torrent">
<i class="fa fa-fw fa-download">
</i>
</a>
<a href="magnet:?xt=urn:btih:31e236514cbba971c479b47bd09581ff56b8a93c&dn=%5BJudas%5D%20One%20Punch%20Man%20%28Seasons%201-2%20%2B%20OVAs%20%2B%20Specials%29%20%5BBD%201080p%5D%5BHEVC%20x265%2010bit%5D%5BDual-Audio%5D%5BMulti-Subs%5D%20%28Batch%29&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce">
<i class="fa fa-fw fa-magnet">
</i>
</a>
</td>
[*] Analyzing <pg: 1; tr: 1; td: 4>
<td class="text-center">
11.2 GiB
</td>
[*] Analyzing <pg: 1; tr: 1; td: 5>
<td class="text-center" data-timestamp="1610919900">
2021-01-17 21:45
</td>
[*] Analyzing <pg: 1; tr: 1; td: 6>
<td class="text-center">
200
</td>
[*] Analyzing <pg: 1; tr: 1; td: 7>
<td class="text-center">
38
</td>
[*] Analyzing <pg: 1; tr: 1; td: 8>
<td class="text-center">
7661
</td>
[*] Scraping: https://nyaa.si/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=2
[*] Analyzing <pg: 2; tr: 1; td: 1>
<td>
<a href="/?c=1_2" title="Anime - English-translated">
<img alt="Anime - English-translated" class="category-icon" src="/static/img/icons/nyaa/1_2.png"/>
</a>
</td>
[*] Analyzing <pg: 2; tr: 1; td: 2>
<td colspan="2">
<a href="/view/1111733" title="[Erai-raws] One Punch Man - 01 ~ 12 [1080p][Multiple Subtitle]">
[Erai-raws] One Punch Man - 01 ~ 12 [1080p][Multiple Subtitle]
</a>
</td>
[*] Analyzing <pg: 2; tr: 1; td: 3>
<td class="text-center">
<a href="/download/1111733.torrent">
<i class="fa fa-fw fa-download">
</i>
</a>
<a href="magnet:?xt=urn:btih:1b4894accb757b18b30959ac86bbd08303008f61&dn=%5BErai-raws%5D%20One%20Punch%20Man%20-%2001%20~%2012%20%5B1080p%5D%5BMultiple%20Subtitle%5D&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce">
<i class="fa fa-fw fa-magnet">
</i>
</a>
</td>
[*] Analyzing <pg: 2; tr: 1; td: 4>
<td class="text-center">
9.5 GiB
</td>
[*] Analyzing <pg: 2; tr: 1; td: 5>
<td class="text-center" data-timestamp="1548004730">
2019-01-20 17:18
</td>
[*] Analyzing <pg: 2; tr: 1; td: 6>
<td class="text-center">
7
</td>
[*] Analyzing <pg: 2; tr: 1; td: 7>
<td class="text-center">
1
</td>
[*] Analyzing <pg: 2; tr: 1; td: 8>
<td class="text-center">
2354
</td>
Example III: A More Thorough Example Of Pagination
In this example we will organize our code by making use of object orientation, splitting the functionality into two classes: one that handles networking via Requests, and another that handles extraction and output via BeautifulSoup. The latter will inherit from the former.
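At a glance, the division of labour looks like this. The bodies are elided here; the full definitions follow below.

# A rough map of the design described above:
class Nest:                                 # networking layer
    def __init__(self, url): ...            # stores the url and a page counter
    @classmethod
    def nyaa(cls, link=None): ...           # builds the default nyaa.si search URL
    def fetch(self, url=None): ...          # GETs a page, returns (soup, response)

class Spider(Nest):                         # extraction layer
    def crawl(self, max_pages=None): ...    # walks the pages, returns a DataFrame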
The Nest Class (Networking)
This class will be handling all networking functionality.
class Nest:

    def __init__(self, url):
        self.__url = url
        self.__page = 1

    @property
    def url(self):
        return self.__url

    @url.setter
    def url(self, value):
        if not isinstance(value, str):
            raise TypeError("Must be a string!")
        self.__url = value

    @property
    def page(self):
        return self.__page

    @page.setter
    def page(self, value):
        if not isinstance(value, int):
            raise TypeError("Must be an integer!")
        self.__page = value

    @classmethod
    def nyaa(cls, link=None):
        base = "https://nyaa.si"
        if not link:
            link = "/?" + urlencode({
                'f': '0',              # Filter
                'c': '0_0',            # Category
                'q': 'one punch man',  # Query
                's': 'seeders',        # Sort by
                'o': 'desc'            # Order
            }, doseq=True, quote_via=quote_plus)
        else:
            if not isinstance(link, str):
                raise TypeError("Must be a string!")
        return cls(base+link)

    def fetch(self, url=None):
        if url is not None:  # if url:
            self.url = url
        responseObject = requests.get(self.url)
        if not responseObject.ok:
            raise requests.HTTPError(responseObject.url)
        soup = BeautifulSoup(responseObject.text, 'html.parser')
        return soup, responseObject
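One detail worth calling out: the nyaa() constructor builds its default query string with urllib.parse.urlencode, and quote_plus encodes the spaces in the search query as +. A quick check of what that produces (not part of the article's classes):

from urllib.parse import urlencode, quote_plus

params = {'f': '0', 'c': '0_0', 'q': 'one punch man', 's': 'seeders', 'o': 'desc'}
print("/?" + urlencode(params, doseq=True, quote_via=quote_plus))
# /?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc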
The Spider Class (Extraction)
class Spider(Nest):

    def __init__(self, *args):
        super().__init__(*args)
        columns = ['category', 'name', 'magnet', 'size', 'date', 'seeds', 'leeches', 'downloads']
        self.column = dict(zip(range(1, len(columns)+1), columns))

    # Handle Table Data (Innermost)
    def __handle_data(self, row):
        for index, data in enumerate(row.find_all('td'), start=1):
            if index == 1:
                yield self.column[index], data.a.get('title')
            elif index == 2:
                category = self.column[index]
                name = data.find_all('a')[-1].text
                try:
                    comments = data.find('a', {'class': 'comments'}).text
                except:
                    comments = '0'
                nameAscii = ''.join([i for i in name.strip() if i.isascii()])
                yield category.strip(), nameAscii
                yield "comments", comments.strip()
            elif index == 3:
                yield self.column[index], data.find_all('a')[-1].get('href')
            else:
                yield self.column[index], data.text
{'category': 'Anime - English-translated',
'comments': '22',
'date': '2021-01-17 21:45',
'downloads': '7654',
'leeches': '27',
'magnet': 'magnet:?xt=urn:btih:31e236514cbba971c479b47bd09581ff56b8a93c&dn=%5BJudas%5D%20One%20Punch%20Man%20%28Seasons%201-2%20%2B%20OVAs%20%2B%20Specials%29%20%5BBD%201080p%5D%5BHEVC%20x265%2010bit%5D%5BDual-Audio%5D%5BMulti-Subs%5D%20%28Batch%29&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce',
'name': '[Judas] One Punch Man (Seasons 1-2 + OVAs + Specials) [BD '
'1080p][HEVC x265 10bit][Dual-Audio][Multi-Subs] (Batch)',
'seeds': '176',
'size': '11.2 GiB'}
    # Handle Table Rows (Outer)
    def __handle_rows(self, rows):
        for r, row in enumerate(rows, start=1):
            yield r, dict(self.__handle_data(row))
{1: {'category': 'Anime - English-translated',
'comments': '22',
'date': '2021-01-17 21:45',
'downloads': '7654',
'leeches': '27',
'magnet': 'magnet:?xt=urn:btih:31e236514cbba971c479b47bd09581ff56b8a93c&dn=%5BJudas%5D%20One%20Punch%20Man%20%28Seasons%201-2%20%2B%20OVAs%20%2B%20Specials%29%20%5BBD%201080p%5D%5BHEVC%20x265%2010bit%5D%5BDual-Audio%5D%5BMulti-Subs%5D%20%28Batch%29&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce',
'name': '[Judas] One Punch Man (Seasons 1-2 + OVAs + Specials) [BD '
'1080p][HEVC x265 10bit][Dual-Audio][Multi-Subs] (Batch)',
'seeds': '176',
'size': '11.2 GiB'}}
    # Handle Page Data (Outermost)
    def __handle_pages(self, max_pages):
        base = self.url[:15]  # The first 15 characters: "https://nyaa.si"
        link = None
        data = {}
        # Mainloop ("Spiderweb")
        while True:
            # Retrieve objects
            soup, res = self.fetch(base+link) if link else self.fetch()
            print(f"[*] Scraping: {res.url}")
            # Process soup data
            rows = soup.table.tbody.find_all('tr')
            data[self.page] = dict(self.__handle_rows(rows))
            # Find link for next page
            next_ = soup.find('li', {"class": "next"})
            link = next_.a.get('href')
            # Handle last page
            if not link:
                print(f"[-] last page reached ...")
                break
            # Increase page counter
            self.page += 1
            # Only first n pages
            if max_pages:
                if self.page > max_pages:
                    break
        return data
We can examine the data returned by __handle_pages() by extracting only the first row from each page and then taking a closer look at the structure. It is a dictionary of pages, each mapped to a dictionary of rows, which in turn holds a dictionary of table data.
{1: {1: {'category': 'Anime - English-translated',
'comments': '22',
'date': '2021-01-17 21:45',
'downloads': '7654',
'leeches': '26',
'magnet': 'magnet:?xt=urn:btih:31e236514cbba971c479b47bd09581ff56b8a93c&dn=%5BJudas%5D%20One%20Punch%20Man%20%28Seasons%201-2%20%2B%20OVAs%20%2B%20Specials%29%20%5BBD%201080p%5D%5BHEVC%20x265%2010bit%5D%5BDual-Audio%5D%5BMulti-Subs%5D%20%28Batch%29&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce',
'name': '[Judas] One Punch Man (Seasons 1-2 + OVAs + Specials) [BD '
'1080p][HEVC x265 10bit][Dual-Audio][Multi-Subs] (Batch)',
'seeds': '156',
'size': '11.2 GiB'}},
2: {1: {'category': 'Anime - English-translated',
'comments': '0',
'date': '2019-07-03 01:04',
'downloads': '10218',
'leeches': '0',
'magnet': 'magnet:?xt=urn:btih:b5d1c8d3e1a63055b6f3c126ff798161fb0953e8&dn=One%20Punch%20Man%20S02E12%20%281080p%20WEBRip%20x265%20HEVC%2010bit%20AAC%202.0%20theincognito%29%20%5BUTR%5D&tr=http%3A%2F%2Fnyaa.tracker.wf%3A7777%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce',
'name': 'One Punch Man S02E12 (1080p WEBRip x265 HEVC 10bit AAC 2.0 '
'theincognito) [UTR]',
'seeds': '8',
'size': '330.8 MiB'}}}
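Given that layout, any field can be addressed by page, then row, then column key. A quick illustration, using a trimmed stand-in for the dictionary shown above:

# Hypothetical navigation of the structure returned by __handle_pages():
# data[page][row][column]
data = {1: {1: {'name': '[Judas] One Punch Man ...', 'seeds': '156'}},
        2: {1: {'name': 'One Punch Man S02E12 ...', 'seeds': '8'}}}

print(data[1][1]['name'])   # first row of page 1
print(data[2][1]['seeds'])  # first row of page 2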
    def crawl(self, max_pages=None):
        data = self.__handle_pages(max_pages)

        # Combine each page's rows into one DataFrame
        def __format():
            counter = 1
            for page, rows in data.items():
                for row, td in rows.items():
                    yield counter, td
                    counter += 1

        df = pd.DataFrame.from_dict(data=dict(__format()), orient='index')
        return df
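The call to pd.DataFrame.from_dict() with orient='index' turns each key of that flattened dictionary into a row label and each inner dictionary into a row. A toy illustration, separate from the article's data:

import pandas as pd

# orient='index': outer keys become the row index, inner dicts become rows.
rows = {1: {'name': 'a', 'seeds': '5'},
        2: {'name': 'b', 'seeds': '3'}}
print(pd.DataFrame.from_dict(rows, orient='index'))
#   name seeds
# 1    a     5
# 2    b     3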
The Main Program
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode, quote_plus
import pandas as pd

def main():
    # Set url via @classmethod
    spider = Spider.nyaa()
    # Crawl 2 pages
    df = spider.crawl(2)
    # Display neatly
    print(
        df.to_string(
            justify='left',
            max_rows=16,
            max_colwidth=32,
            show_dimensions=True
        )
    )

if __name__ == "__main__":
    main()
The Output
[*] Scraping: https://nyaa.si/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc
[*] Scraping: https://nyaa.si/?f=0&c=0_0&q=one+punch+man&s=seeders&o=desc&p=2
category name comments magnet size date seeds leeches downloads
1 Anime - English-translated [Judas] One Punch Man (Seaso... 22 magnet:?xt=urn:btih:31e23651... 11.2 GiB 2021-01-17 21:45 162 28 7654
2 Anime - English-translated [HorribleSubs] One Punch Man... 23 magnet:?xt=urn:btih:b9dd7539... 909.7 MiB 2019-05-28 22:38 45 0 80178
3 Anime - English-translated [HorribleSubs] One Punch Man... 39 magnet:?xt=urn:btih:f29fda82... 852.6 MiB 2019-05-21 17:43 44 0 84523
4 Anime - English-translated [HorribleSubs] One Punch Man... 21 magnet:?xt=urn:btih:f2da25e0... 769.1 MiB 2019-06-18 17:40 44 0 80148
5 Anime - English-translated [HorribleSubs] One Punch Man... 22 magnet:?xt=urn:btih:610f81ab... 1.0 GiB 2019-06-25 17:47 43 0 78400
6 Anime - English-translated [HorribleSubs] One Punch Man... 45 magnet:?xt=urn:btih:7498c4c8... 835.8 MiB 2019-04-30 17:40 43 0 84956
7 Anime - English-translated [HorribleSubs] One Punch Man... 49 magnet:?xt=urn:btih:6959d8ee... 803.3 MiB 2019-04-09 18:36 42 1 84984
8 Anime - English-translated [HorribleSubs] One Punch Man... 23 magnet:?xt=urn:btih:e5c44615... 793.8 MiB 2019-04-16 17:44 42 0 82664
.. ... ... ... ... ... ... ... ... ...
143 Anime - English-translated [Erai-raws] One Punch Man (2... 0 magnet:?xt=urn:btih:7cc5a563... 1.2 GiB 2019-04-18 19:43 3 0 4144
144 Anime - English-translated One.Punch.Man.S02E12.Cleanin... 0 magnet:?xt=urn:btih:e49a8342... 519.8 MiB 2019-07-02 18:18 3 0 5673
145 Anime - English-translated [LostYears] One Punch Man - ... 1 magnet:?xt=urn:btih:5e4dfc14... 1.5 GiB 2019-11-03 22:15 3 0 1205
146 Anime - English-translated [SSA] One Punch Man (Season ... 2 magnet:?xt=urn:btih:55fae8bc... 1.8 GiB 2021-05-18 05:32 3 0 270
147 Anime - English-translated One.Punch.Man.S02E07.The.Cla... 1 magnet:?xt=urn:btih:a412fba6... 354.8 MiB 2019-05-21 18:07 3 0 10503
148 Anime - English-translated [HorribleRips] One-Punch Man... 3 magnet:?xt=urn:btih:d49fde98... 10.0 GiB 2020-10-23 22:25 3 1 665
149 Anime - English-translated [Erai-raws] One Punch Man (2... 1 magnet:?xt=urn:btih:2b8939f9... 361.8 MiB 2019-06-18 18:15 3 0 7067
150 Literature - English-translated [Pajeet] One Punch Man (Orig... 7 magnet:?xt=urn:btih:1221db3b... 417.8 MiB 2020-04-13 10:56 3 1 669
[150 rows x 9 columns]