Spiegel Scraper

I am learning data science, and one of its key skills is collecting and cleaning data.

One option is to extract content from a website, such as articles, columns, or other information. If the website offers no API, you have to extract the information yourself. This is called web scraping.

For this purpose I use the libraries BeautifulSoup4 and Requests, both available via pip.

"Spiegel Scraper" saves every single Theme published in the Newspaper, for every letter and character available. This can help you to identify which themes are available and after that automate a specific search for this particular theme.

    {
        "#": {
            "1. FC Heidenheim": "/thema/1_fc_heidenheim/",
            "1. FC Kaiserslautern": "/thema/1_fc_kaiserslautern/",
            "1. FC Köln": "/thema/1_fc_koeln/",
            "1. FC Nürnberg": "/thema/1_fc_nuernberg/",
            "1. FC Saarbrücken": "/thema/1_fc_saarbruecken/",
            "1. FFC Frankfurt": "/thema/1_ffc_frankfurt/",
            "100-Meter-Lauf der Männer": "/thema/100_meter_lauf_der_maenner/",
            "1000 Fragen": "/thema/1000_fragen/",
            "11 FREUNDE": "/thema/11_freunde/",
            "11 Wall Street": "/thema/kolumne_11_wall_street/",
            "11. September 2001": "/thema/terroranschlaege_vom_11_september_2001/",
            "16 Länder, 16 Leben": "/thema/16_laender_16_leben/",
            "1860 München": "/thema/1860_muenchen/",
            "2. Fußball-Bundesliga": "/thema/zweite_fussball_bundesliga/",
            "200-Meter-Lauf der Männer": "/thema/200_meter_lauf_der_maenne/",
            "2012 (Film)": "/thema/2012_film/",
            "2015 - Was wird aus...?": "/thema/2015_was_wird_aus/",
            "2020 - Die Zeitungsdebatte": "/thema/2020_die_zeitungsdebatte/",
            "21 unter 21": "/thema/21_unter_21/",
            "24 (Fernsehserie)": "/thema/24_fernsehserie/",
            "24-Stunden-Rennen von Le Mans": "/thema/24_stunden_rennen_von_le_mans/",
            "25 Jahre Mauerfall": "/thema/25_jahre_mauerfall/",
            "2raumwohnung": "/thema/2raumwohnung/",
            "3. Liga": "/thema/dritte_liga/",
            "32. America's Cup": "/thema/32_americas_cup/",
            "32c3": "/thema/32c3/",
            "33. America's Cup": "/thema/33_americas_cup/",
            "34. America's Cup": "/thema/34_americas_cup/",
            "37 Grad": "/thema/37_grad/",
            "3D": "/thema/3d/",
            "3D-Drucker": "/thema/3d_drucker/",
            "3M": "/thema/3m/",
            "4 um die Welt": "/thema/4_um_die_welt/",
            "50 Cent": "/thema/50_cent/",
            "50 Jahre Bundesliga": "/thema/50_jahre_bundesliga/",
            "60 Jahre Israel": "/thema/60_jahre_israel/",
            "60 deutsche Autos": "/thema/60_deutsche_autos/",
            "68er-Bewegung": "/thema/68er_bewegung/",
            "70 Jahre DER SPIEGEL": "/thema/70_jahre_der_spiegel/",
            "9 mal klug": "/thema/9_mal_klug/"
        }, (...)
    }
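The core of the scraper can be sketched as follows. This is a minimal sketch, not the project's actual code: it assumes the topic index is reachable at `https://www.spiegel.de/thema/` (a hypothetical entry URL) and that topic links follow the `/thema/<slug>/` pattern shown in the sample above.

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_topics(html):
    """Map topic names to their /thema/... paths found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    topics = {}
    # Treat every anchor whose href contains /thema/ as a topic link.
    for link in soup.find_all("a", href=True):
        name = link.get_text(strip=True)
        if "/thema/" in link["href"] and name:
            topics[name] = link["href"]
    return topics


def scrape_topics(index_url="https://www.spiegel.de/thema/"):
    """Fetch the topic index page and return a {name: path} dict."""
    response = requests.get(index_url, timeout=10)
    response.raise_for_status()
    return extract_topics(response.text)


if __name__ == "__main__":
    # Print the result in the same JSON shape as the sample above.
    print(json.dumps(scrape_topics(), ensure_ascii=False, indent=4))
```

The parsing step is kept separate from the HTTP request so it can be tested against a saved HTML page without hitting the live site.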
  

More info and the code are available on:

GitHub