How to Scrape Anything from Any Website
medium

For example, the headlines or the body of an article.

Today's project will serve as the foundation for many of our later programs. We will learn to collect whatever data we need from websites.

We have a working project built on Markov chains. A Markov chain, as we use it, is a simple algorithm that analyzes which words follow which in a given text and produces new text based on the old one. It looks like what neural networks do, but in reality it just walks through the words and strings them together with no regard for meaning.
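To make the idea concrete, here is a minimal sketch of such a word-level chain. It is not the code of the earlier project: the function names `build_chain` and `generate` and the sample sentence are made up for illustration, and the only assumption is that words are separated by whitespace.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map every word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return chain


def generate(chain, start, length, rng=random):
    """Walk the chain from `start`, picking a random follower at each step."""
    word = start
    result = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        result.append(word)
    return " ".join(result)


# Hypothetical sample text; the real project feeds in a whole book.
sample = "the cat sat on the mat and the cat ran"
chain = build_chain(sample)
print(generate(chain, "the", 8))
```

The output changes from run to run because each next word is picked at random from the words that followed the current one in the source text.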

For our first Markov chain projects we downloaded a book of Chekhov's short stories. The program analyzes how Chekhov's words combine and produces text in Chekhov's spirit (albeit meaningless).

But what if we want text not in the spirit of Chekhov but in the spirit of the magazine «Код»? Or in the spirit of some outlet designated a foreign agent? Or a generator of articles in the style of some blogger?

The solution is to write a program that goes through our articles on the site and pulls out all the meaningful text. The only thing it needs is a list of links to the articles, and we already collected those when we built the fortune-telling project based on Код articles.

How it works

The program will run in Python on a local machine. The algorithm:

  1. Import the libraries we need.
  2. Prepare a variable holding the list of pages to visit.
  3. Open the first page and fetch its source code.
  4. Specify where to take the text from.
  5. Save the text we found to a separate file.
  6. Repeat the loop until we have gone through every page in the list.
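The steps above can be sketched as one small loop before any real library is involved. Everything here is a stand-in chosen for illustration: the `scrape` function, the `fake_site` dictionary with example.com addresses, and the two helper functions are hypothetical, and an in-memory buffer plays the role of the output file so the sketch runs offline.

```python
import io


def scrape(urls, fetch, extract, out):
    """Steps 3-6: fetch each page, extract the text, write it to `out`."""
    for address in urls:
        source = fetch(address)      # step 3: get the page source
        text = extract(source)       # step 4: pick out the part we want
        out.write(text + "\n")       # step 5: save it on its own line


# Stand-ins so the sketch runs offline: a dict plays the role of the site.
fake_site = {
    "https://example.com/a": "<title>First</title>",
    "https://example.com/b": "<title>Second</title>",
}


def fake_fetch(address):
    """Return the stored source for an address instead of downloading it."""
    return fake_site[address]


def title_between_tags(source):
    """Naive extraction: the text between <title> and </title>."""
    start = source.index("<title>") + len("<title>")
    end = source.index("</title>")
    return source[start:end]


buffer = io.StringIO()
scrape(list(fake_site), fake_fetch, title_between_tags, buffer)
print(buffer.getvalue())
```

Passing `fetch` and `extract` in as arguments keeps the loop unchanged when the stand-ins are later swapped for real downloading and parsing.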

 👉 The key to projects like this is knowing how the page content is structured and understanding exactly where, and in which tags, the data you need lives.

To keep things simple, we'll start with a program that collects page titles. Once we're comfortable, we'll build something more complex.

Studying the page source

Before parsing (collecting) anything from a page, we need to find out where it sits and what encoding the page uses. We know that all Код articles are built from the same template, so looking at how one is structured is enough to understand them all.

Open the source code of any of our articles. We care about two things: the page encoding and the <title> tag. We need to make sure the tag contains the article's name.

First, the encoding:

[Screenshot: the line in the page source that declares the encoding]

This line means the page uses the UTF-8 encoding. Let's remember that.
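The same check can be done in code instead of by eye. The sketch below uses only the standard library and assumes the page declares its encoding with a `<meta charset="...">` tag; pages that declare it another way (for example, through an HTTP header) are not covered. The class name `CharsetFinder` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remember the value of the first <meta charset="..."> tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "meta" and self.charset is None:
            self.charset = dict(attrs).get("charset")


# Hypothetical page head; a real page source would be used here instead.
sample_html = '<html><head><meta charset="UTF-8"><title>Demo</title></head></html>'

finder = CharsetFinder()
finder.feed(sample_html)
print(finder.charset)
```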

Now scroll further down the source and find the <title> tag, which is responsible for the page title. Make sure it is there and that everything is fine with it:

[Screenshot: the <title> tag in the page source]

Libraries we need

The project needs two libraries: urllib and BeautifulSoup.

The first handles access to pages by their address, and from it we need only one call, urlopen().read(): urlopen() goes to the given address, and read() returns the entire source code of the page as raw bytes.
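A small sketch of that call, written so it runs without a network connection: instead of a real article address it uses a `data:` URL, which carries the "page" inside the address itself. The sample markup and variable names are made up for illustration; with a real site, only `address` would change.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical page content, percent-encoded into a data: URL.
# A real article address would go here instead.
page = "<title>Привет</title>"
address = "data:text/html;charset=utf-8," + quote(page)

with urlopen(address) as response:
    raw = response.read()          # bytes, exactly as they arrived

html_code = raw.decode("utf-8")    # text, decoded with the page's encoding

print(type(raw).__name__, html_code)
```

The split into `raw` and `html_code` shows why the encoding matters: read() gives bytes, and they only become readable text once decoded with the right encoding.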

The second library ships inside the larger bs4 package, which already contains all the commands for parsing raw HTML and working with tags. To install bs4, open a terminal and run:

pip3 install bs4

Writing the code

First, import all the libraries we need:

# import urlopen from the urllib.request module
from urllib.request import urlopen

# import the BeautifulSoup library
from bs4 import BeautifulSoup

Now let's declare the list of pages to visit and pull titles from. We already compiled such a list for the fortune-telling project based on Код articles, so we'll simply take it from there and adapt it for Python:

url = [

"https://thecode.media/is-not-defined-jquery/",

"https://thecode.media/arduino-projects-2/",

"https://thecode.media/10-raspberry/",

"https://thecode.media/easy-css/",

"https://thecode.media/to-be-front/",

"https://thecode.media/cryptex/",

"https://thecode.media/ali-coders/",

"https://thecode.media/po-glandy/",

"https://thecode.media/megaexcel/",

"https://thecode.media/chat-bot-generators/",

"https://thecode.media/wifi/",

"https://thecode.media/andri-oxa/",

"https://thecode.media/free-hosting/",

"https://thecode.media/hotwheels/",

"https://thecode.media/do-not-disturb/",

"https://thecode.media/dyno-ai/",

"https://thecode.media/snake-ai/",

"https://thecode.media/leet/",

"https://thecode.media/ninja/",

"https://thecode.media/supergirl/",

"https://thecode.media/vpn/",

"https://thecode.media/what-is-wordpress/",

"https://thecode.media/hardware/",

"https://thecode.media/division/",

"https://thecode.media/nuggets/",

"https://thecode.media/binary-notation/",

"https://thecode.media/bootstrap/",

"https://thecode.media/chat-bot/",

"https://thecode.media/myadblock3000/",

"https://thecode.media/trello/",

"https://thecode.media/python-time/",

"https://thecode.media/editor/",

"https://thecode.media/timer/",

"https://thecode.media/intro-bootstrap/",

"https://thecode.media/php-form/",

"https://thecode.media/hr-quiz/",

"https://thecode.media/c-sharp/",

"https://thecode.media/showtime/",

"https://thecode.media/uchtel-rasskazhi/",

"https://thecode.media/sshhhh/",

"https://thecode.media/marry-me-python/",

"https://thecode.media/haters-gonna-code/",

"https://thecode.media/speed-css/",

"https://thecode.media/fired/",

"https://thecode.media/zabuhal/",

"https://thecode.media/est-tri-shkatulki/",

"https://thecode.media/milk-that/",

"https://thecode.media/binary-mouse/",

"https://thecode.media/bowling/",

"https://thecode.media/dealership/",

"https://thecode.media/best-seller/",

"https://thecode.media/hr/",

"https://thecode.media/no-comments/",

"https://thecode.media/drakoni-yajca/",

"https://thecode.media/who-is-who/",

"https://thecode.media/get-a-room/",

"https://thecode.media/alps/",

"https://thecode.media/handshake/",

"https://thecode.media/choose-life/",

"https://thecode.media/high-voltage/",

"https://thecode.media/spy/",

"https://thecode.media/squirrelrrel/",

"https://thecode.media/so-agile/",

"https://thecode.media/wedding/",

"https://thecode.media/supper/",

"https://thecode.media/le-tarakan/",

"https://thecode.media/batareyki-besyat/",

"https://thecode.media/dr_jekyll/",

"https://thecode.media/everybody_lies/",

"https://thecode.media/electrician/",

"https://thecode.media/einstein/",

"https://thecode.media/bugz/",

"https://thecode.media/needforspeed/",

"https://thecode.media/be-smart/",

"https://thecode.media/bot-online/",

"https://thecode.media/microb/",

"https://thecode.media/jquery/",

"https://thecode.media/split-screen/",

"https://thecode.media/calculus/",

"https://thecode.media/big-data-sales/",

"https://thecode.media/ambient/",

"https://thecode.media/fatality/",

"https://thecode.media/biggest-loser/",

"https://thecode.media/wifi/",

"https://thecode.media/nosock/",

"https://thecode.media/variables/",

"https://thecode.media/start_python/",

"https://thecode.media/i-gonna-code/",

"https://thecode.media/sigi-est/",

"https://thecode.media/nes-game/",

"https://thecode.media/live-view/",

"https://thecode.media/remote/",

"https://thecode.media/arduino-code/",

"https://thecode.media/horses/",

"https://thecode.media/runinstein/",

"https://thecode.media/wp-template/",

"https://thecode.media/tilda/",

"https://thecode.media/todo/",

"https://thecode.media/telebot/",

"https://thecode.media/summator-2/",

"https://thecode.media/get-rich-coding/",

"https://thecode.media/content-manager/",

"https://thecode.media/vzrosly-stal/",

"https://thecode.media/py-install/",

"https://thecode.media/quantum/",

"https://thecode.media/dns/",

"https://thecode.media/practicum/",

"https://thecode.media/react/",

"https://thecode.media/1september/",

"https://thecode.media/summator/",

"https://thecode.media/vds/",

"https://thecode.media/made-in-china/",

"https://thecode.media/bar/",

"https://thecode.media/zodiac/",

"https://thecode.media/crc32/",

"https://thecode.media/css-links/",

"https://thecode.media/oop_battle/",

"https://thecode.media/be-combo/",

"https://thecode.media/unity/",

"https://thecode.media/data-science/",

"https://thecode.media/junior/",

"https://thecode.media/qc/",

"https://thecode.media/be-middle/",

"https://thecode.media/senior/",

"https://thecode.media/teamlead/",

"https://thecode.media/frontend/",

"https://thecode.media/lift/",

"https://thecode.media/be-fuzzy/",

"https://thecode.media/best-2020/",

"https://thecode.media/git/",

"https://thecode.media/stt-cloud/",

"https://thecode.media/matrix-pills/",

"https://thecode.media/na-stile/",

"https://thecode.media/no-coffee/",

"https://thecode.media/framelibs/",

"https://thecode.media/children/",

"https://thecode.media/balls-possibly/",

"https://thecode.media/le-meduza/",

"https://thecode.media/electricity/",

"https://thecode.media/tailored-swift/",

"https://thecode.media/objective/",

"https://thecode.media/host/",

"https://thecode.media/go-public/",

"https://thecode.media/how-internet-works-1/",

"https://thecode.media/domain/",

"https://thecode.media/this-is-object/",

"https://thecode.media/ole-ole-ole/",

"https://thecode.media/thousand/",

"https://thecode.media/average/",

"https://thecode.media/stt-python/",

"https://thecode.media/ping-pong/",

"https://thecode.media/pygames/",

"https://thecode.media/odobreno/",

"https://thecode.media/qwerty123/",

"https://thecode.media/neurocorrector/",

"https://thecode.media/neuro-cam/",

"https://thecode.media/10-jquery/",

"https://thecode.media/repeat/",

"https://thecode.media/assembler/",

"https://thecode.media/sublime-one-love/",

"https://thecode.media/zloy/",

"https://thecode.media/mariya-ivanovna/",

"https://thecode.media/ruby/",

"https://thecode.media/electron-password/",

"https://thecode.media/plane/",

"https://thecode.media/glitch/",

"https://thecode.media/security/",

"https://thecode.media/stupid-2019/",

"https://thecode.media/jquery-search/",

"https://thecode.media/pimp-my-pass/",

"https://thecode.media/text-ultimate/",

"https://thecode.media/hurry/",

"https://thecode.media/siri/",

"https://thecode.media/zero-cool/",

"https://thecode.media/small-talk/",

"https://thecode.media/die-hard/",

"https://thecode.media/le-piton/",

"https://thecode.media/hr-code/",

"https://thecode.media/nano-code/",

"https://thecode.media/the_question/",

"https://thecode.media/godlike/",

"https://thecode.media/be-logic/",

"https://thecode.media/snake-js/",

"https://thecode.media/be-mobile/",

"https://thecode.media/baboolya/",

"https://thecode.media/timelag/",

"https://thecode.media/doors/",

"https://thecode.media/phone-code/",

"https://thecode.media/snake-arduino/",

"https://thecode.media/css-intro/",

"https://thecode.media/le-timer/",

"https://thecode.media/oop_battle/",

"https://thecode.media/good-morning/",

"https://thecode.media/study-bot/",

"https://thecode.media/python-bot/",

"https://thecode.media/robot-quiz/",

"https://thecode.media/hacking-quiz/",

"https://thecode.media/lulz-quiz/",

"https://thecode.media/hard-quiz/",

"https://thecode.media/torrent/",

"https://thecode.media/travel/",

"https://thecode.media/le-snob/",

"https://thecode.media/no-spagetti/",

"https://thecode.media/house/",

"https://thecode.media/cryptorush/",

"https://thecode.media/coronarelax/",

"https://thecode.media/pure/",

"https://thecode.media/c-cpp/",

"https://thecode.media/machine-loving/",

"https://thecode.media/orwell/",

"https://thecode.media/darknet/",

"https://thecode.media/ai/",

"https://thecode.media/oop-class/",

"https://thecode.media/cookie/",

"https://thecode.media/malware/",

"https://thecode.media/ftp/",

"https://thecode.media/html/",

"https://thecode.media/java/",

"https://thecode.media/php-haters/",

"https://thecode.media/tor/",

"https://thecode.media/crack-safe/",

"https://thecode.media/epidemic/",

"https://thecode.media/hash-brown/",

"https://thecode.media/java-js/",

"https://thecode.media/js-types/",

"https://thecode.media/losers/",

"https://thecode.media/ssl/",

"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",

"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",

"https://thecode.media/uncaughttyperrror-cannot-read-property/",

"https://thecode.media/mobile-dev/",

"https://thecode.media/verevka/",

"https://thecode.media/speed/",

"https://thecode.media/buckwheat/",

"https://thecode.media/distance/",

"https://thecode.media/node-js/",

"https://thecode.media/pascal/",

"https://thecode.media/ill-be-clean/",

"https://thecode.media/to-be-back/",

"https://thecode.media/replaceable/",

"https://thecode.media/code-review/",

"https://thecode.media/gasoline/",

"https://thecode.media/to-be-test/",

"https://thecode.media/scala/",

"https://thecode.media/row-power/",

"https://thecode.media/percent/",

"https://thecode.media/things/",

"https://thecode.media/backend/",

"https://thecode.media/immortal-pong/",

"https://thecode.media/blind/",

"https://thecode.media/go-faster/",

"https://thecode.media/cpp/",

"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",

"https://thecode.media/stress-quiz/",

"https://thecode.media/secret-pong/",

"https://thecode.media/override/",

"https://thecode.media/whg/",

"https://thecode.media/profit/",

"https://thecode.media/memas/",

"https://thecode.media/digital-sound/",

"https://thecode.media/api/",

"https://thecode.media/be-math-2/",

"https://thecode.media/backup/",

"https://thecode.media/backup-master/",

"https://thecode.media/glvrd/",

"https://thecode.media/id/",

"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",

"https://thecode.media/ex-startup/",

"https://thecode.media/doom-everywhere/",

"https://thecode.media/template-one/",

"https://thecode.media/david-roganov/",

"https://thecode.media/spacex/",

"https://thecode.media/webstorm/",

"https://thecode.media/json/",

"https://thecode.media/treger/",

"https://thecode.media/ya-blitz/",

"https://thecode.media/radius/",

"https://thecode.media/xhr/",

"https://thecode.media/treger2/",

"https://thecode.media/raidemption/",

"https://thecode.media/chief-technical-officer/",

"https://thecode.media/summary/",

"https://thecode.media/ex-wallpaper/",

"https://thecode.media/soap/",

"https://thecode.media/decompose/",

"https://thecode.media/desc/",

"https://thecode.media/sprint/",

"https://thecode.media/bye-or-die/",

"https://thecode.media/who-win/",

"https://thecode.media/vladimir-olokhtonov/",

"https://thecode.media/lossless/",

"https://thecode.media/parse/",

"https://thecode.media/typeerror-is-not-an-abject/",

"https://thecode.media/backup-me/",

"https://thecode.media/stress-test/",

"https://thecode.media/syntaxerror-missing-formal-parameter/",

"https://thecode.media/start-fast/",

"https://thecode.media/halkechev/",

"https://thecode.media/halkechev2/",

"https://thecode.media/le-design/",

"https://thecode.media/syntaxerror-missing-after-property-id/",

"https://thecode.media/attrb-mthd/",

"https://thecode.media/headphones/",

"https://thecode.media/active-noise-cancelling/",

"https://thecode.media/remote-work-quiz/",

"https://thecode.media/garbage/",

"https://thecode.media/ubuntu-linux/",

"https://thecode.media/trie/",

"https://thecode.media/func/",

"https://thecode.media/laravel/",

"https://thecode.media/save-json/",

"https://thecode.media/syntaxerror-missing-after-formal-parameters/",

"https://thecode.media/recursion/",

"https://thecode.media/haskell/",

"https://thecode.media/gen/",

"https://thecode.media/db/",

"https://thecode.media/boosting/",

"https://thecode.media/pavel-sviridov/",

"https://thecode.media/mnogo/",

"https://thecode.media/sokr/",

"https://thecode.media/dbsm/",

"https://thecode.media/pik-balmera/",

"https://thecode.media/kanban/",

"https://thecode.media/check-list/",

"https://thecode.media/text-quiz/",

"https://thecode.media/mysql/",

"https://thecode.media/mysql/",

"https://thecode.media/rust/",

"https://thecode.media/manage-this/",

"https://thecode.media/altshuller/",

"https://thecode.media/interview/",

"https://thecode.media/fotorama/",

"https://thecode.media/tetris/",

"https://thecode.media/ai-tetris/",

"https://thecode.media/scrum/",

"https://thecode.media/speed-two/",

"https://thecode.media/quick-share/",

"https://thecode.media/stack/",

"https://thecode.media/mobile-first/",

"https://thecode.media/nosql/",

"https://thecode.media/narazves/",

"https://thecode.media/oop-class-2/",

"https://thecode.media/design-first/",

"https://thecode.media/arcanoid/",

"https://thecode.media/donut/",

"https://thecode.media/casino/",

"https://thecode.media/heap/",

"https://thecode.media/rust/",

"https://thecode.media/float/",

"https://thecode.media/markdown/",

"https://thecode.media/books/",

"https://thecode.media/daniil-popov/",

"https://thecode.media/android-developer/",

"https://thecode.media/symbols/",

"https://thecode.media/oauth/",

"https://thecode.media/kotlin/",

"https://thecode.media/todo/",

"https://thecode.media/plotly/",

"https://thecode.media/no-digit-code/",

"https://thecode.media/asymmetric/",

"https://thecode.media/qi/",

"https://thecode.media/vernam/",

"https://thecode.media/vernam-js/",

"https://thecode.media/shtykov/",

"https://thecode.media/memory/",

"https://thecode.media/ark/",

"https://thecode.media/7-oshibok-na-sobesedovanii/",

"https://thecode.media/dh/",

"https://thecode.media/typescript/",

"https://thecode.media/stark/",

"https://thecode.media/crypto/",

"https://thecode.media/zapusk-2/",

"https://thecode.media/fingerprint/",

"https://thecode.media/puzzle/",

"https://thecode.media/python-time-2/",

"https://thecode.media/no-chance/",

"https://thecode.media/lossy/",

"https://thecode.media/1wire/",

"https://thecode.media/pasha-flipper/",

"https://thecode.media/perl/",

"https://thecode.media/alexey-vasilev/",

"https://thecode.media/viasat/",

"https://thecode.media/podcast/",

"https://thecode.media/copy-ya-ru/",

"https://thecode.media/mircrosd/",

"https://thecode.media/bash/",

"https://thecode.media/rotation/",

"https://thecode.media/css-grid/",

"https://thecode.media/train/",

"https://thecode.media/grid-2/",

"https://thecode.media/za-proezd/",

"https://thecode.media/grid-3/",

"https://thecode.media/david/",

"https://thecode.media/alien-vs-predator/",

"https://thecode.media/grid-portfolio/",

"https://thecode.media/podcast-lavka/",

"https://thecode.media/it-start-2/",

"https://thecode.media/anastasiya-nikulina/",

"https://thecode.media/linter/",

"https://thecode.media/bomberman/",

"https://thecode.media/5-linters/",

"https://thecode.media/lineynaya-algebra-vektory/",

"https://thecode.media/no-nda/",

"https://thecode.media/code-swap/",

"https://thecode.media/how-to-start/",

"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",

"https://thecode.media/vectors-operations/",

"https://thecode.media/oop-abstract/",

"https://thecode.media/leonov/",

"https://thecode.media/lucky-strike/",

"https://thecode.media/browser/",

"https://thecode.media/2020/",

"https://thecode.media/3d-stars/",

"https://thecode.media/anna-leonova/",

"https://thecode.media/cold-fusion/",

"https://thecode.media/normalize/",

"https://thecode.media/hotkey/",

"https://thecode.media/oven/",

"https://thecode.media/vim/",

"https://thecode.media/draw/",

"https://thecode.media/visual-studio-code/",

"https://thecode.media/tetris-2/",

"https://thecode.media/start-now/",

"https://thecode.media/lapsha-1/",

"https://thecode.media/cubism/",

"https://thecode.media/lapsha-2/",

"https://thecode.media/zerocode/",

"https://thecode.media/cat/",

"https://thecode.media/static/",

"https://thecode.media/komm/",

"https://thecode.media/path-js/",

"https://thecode.media/cloudly/",

"https://thecode.media/haters-gonna-code-2/",

"https://thecode.media/csp/",

"https://thecode.media/mitin-says-no/",

"https://thecode.media/hire-js/",

"https://thecode.media/tableau/",

"https://thecode.media/impossible/",

"https://thecode.media/csp-on/",

"https://thecode.media/maze/",

"https://thecode.media/mix/",

"https://thecode.media/le-beton/",

"https://thecode.media/3d-print/",

"https://thecode.media/5-and-a-half/",

"https://thecode.media/lineynaya-zavisimost-vektorov/",

"https://thecode.media/fast-m1/",

"https://thecode.media/ninja-run/",

"https://thecode.media/matrix-101/",

"https://thecode.media/arm-x86/",

"https://thecode.media/piano-js/",

"https://thecode.media/10-swift/",

"https://thecode.media/travel-plane/",

"https://thecode.media/extention/",

"https://thecode.media/obratnaya-matritsa/",

"https://thecode.media/svg/",

"https://thecode.media/freelance/",

"https://thecode.media/brat-2/",

"https://thecode.media/angular/",

"https://thecode.media/rgb/",

"https://thecode.media/10-go/",

"https://thecode.media/coffee/",

]

Now we loop over every element of this list, using the full power of the libraries. Note the line where we fetch the page source: we immediately decode it using the encoding we identified in the previous step:

# open the text file for appending; UTF-8 keeps Cyrillic titles intact
file = open("zag.txt", "a", encoding="utf-8")

# go through every address in the list
for x in url:
    # fetch the source code of the next page (strip() drops stray spaces around the address)
    html_code = str(urlopen(x.strip()).read(), 'utf-8')
    # hand the page source to the library for processing
    soup = BeautifulSoup(html_code, "html.parser")

    # find the page title with the find() method
    s = soup.find('title').text

    # print it to the screen
    print(s)

    # save the title to the file and move to a new line
    file.write(s + '\n')

# close the file
file.close()
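A hand-assembled list like this one can contain repeated addresses (`wifi/` appears in it twice, for instance) or stray whitespace, and every repeat means the same title is fetched and written again. A minimal sketch of a cleanup pass, assuming that order should be preserved; the helper name `clean_urls` and the three-item excerpt are made up for illustration:

```python
def clean_urls(urls):
    """Strip stray whitespace and drop repeated addresses, keeping order."""
    stripped = (address.strip() for address in urls)
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(stripped))


# A short hypothetical excerpt with a repeat and a trailing space.
sample = [
    "https://thecode.media/wifi/",
    "https://thecode.media/vpn/ ",
    "https://thecode.media/wifi/",
]
print(clean_urls(sample))
```

Running the list through such a helper once, before the loop, would leave the loop itself unchanged.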

Run it and look at the result:

[Screenshot: the page titles printed by the program]

# import urlopen from the urllib.request module
from urllib.request import urlopen

# import the BeautifulSoup library
from bs4 import BeautifulSoup

url = [
"https://thecode.media/is-not-defined-jquery/",
"https://thecode.media/arduino-projects-2/",
"https://thecode.media/10-raspberry/",
"https://thecode.media/easy-css/",
"https://thecode.media/to-be-front/",
"https://thecode.media/cryptex/",
"https://thecode.media/ali-coders/",
"https://thecode.media/po-glandy/",
"https://thecode.media/megaexcel/",
"https://thecode.media/chat-bot-generators/",
"https://thecode.media/wifi/",
"https://thecode.media/andri-oxa/",
"https://thecode.media/free-hosting/",
"https://thecode.media/hotwheels/",
"https://thecode.media/do-not-disturb/",
"https://thecode.media/dyno-ai/",
"https://thecode.media/snake-ai/",
"https://thecode.media/leet/",
"https://thecode.media/ninja/",
"https://thecode.media/supergirl/",
"https://thecode.media/vpn/",
"https://thecode.media/what-is-wordpress/",
"https://thecode.media/hardware/",
"https://thecode.media/division/",
"https://thecode.media/nuggets/",
"https://thecode.media/binary-notation/",
"https://thecode.media/bootstrap/",
"https://thecode.media/chat-bot/",
"https://thecode.media/myadblock3000/",
"https://thecode.media/trello/",
"https://thecode.media/python-time/",
"https://thecode.media/editor/",
"https://thecode.media/timer/",
"https://thecode.media/intro-bootstrap/",
"https://thecode.media/php-form/",
"https://thecode.media/hr-quiz/",
"https://thecode.media/c-sharp/",
"https://thecode.media/showtime/",
"https://thecode.media/uchtel-rasskazhi/",
"https://thecode.media/sshhhh/",
"https://thecode.media/marry-me-python/",
"https://thecode.media/haters-gonna-code/",
"https://thecode.media/speed-css/",
"https://thecode.media/fired/",
"https://thecode.media/zabuhal/",
"https://thecode.media/est-tri-shkatulki/",
"https://thecode.media/milk-that/",
"https://thecode.media/binary-mouse/",
"https://thecode.media/bowling/",
"https://thecode.media/dealership/",
"https://thecode.media/best-seller/",
"https://thecode.media/hr/",
"https://thecode.media/no-comments/",
"https://thecode.media/drakoni-yajca/",
"https://thecode.media/who-is-who/",
"https://thecode.media/get-a-room/",
"https://thecode.media/alps/",
"https://thecode.media/handshake/",
"https://thecode.media/choose-life/",
"https://thecode.media/high-voltage/",
"https://thecode.media/spy/",
"https://thecode.media/squirrelrrel/",
"https://thecode.media/so-agile/",
"https://thecode.media/wedding/",
"https://thecode.media/supper/",
"https://thecode.media/le-tarakan/",
"https://thecode.media/batareyki-besyat/",
"https://thecode.media/dr_jekyll/",
"https://thecode.media/everybody_lies/",
"https://thecode.media/electrician/",
"https://thecode.media/einstein/",
"https://thecode.media/bugz/",
"https://thecode.media/needforspeed/",
"https://thecode.media/be-smart/",
"https://thecode.media/bot-online/",
"https://thecode.media/microb/",
"https://thecode.media/jquery/",
"https://thecode.media/split-screen/",
"https://thecode.media/calculus/",
"https://thecode.media/big-data-sales/",
"https://thecode.media/ambient/",
"https://thecode.media/fatality/",
"https://thecode.media/biggest-loser/",
"https://thecode.media/wifi/",
"https://thecode.media/nosock/",
"https://thecode.media/variables/",
"https://thecode.media/start_python/",
"https://thecode.media/i-gonna-code/",
"https://thecode.media/sigi-est/",
"https://thecode.media/nes-game/",
"https://thecode.media/live-view/",
"https://thecode.media/remote/",
"https://thecode.media/arduino-code/",
"https://thecode.media/horses/",
"https://thecode.media/runinstein/",
"https://thecode.media/wp-template/",
"https://thecode.media/tilda/",
"https://thecode.media/todo/",
"https://thecode.media/telebot/",
"https://thecode.media/summator-2/",
"https://thecode.media/get-rich-coding/",
"https://thecode.media/content-manager/",
"https://thecode.media/vzrosly-stal/",
"https://thecode.media/py-install/",
"https://thecode.media/quantum/",
"https://thecode.media/dns/",
"https://thecode.media/practicum/",
"https://thecode.media/react/",
"https://thecode.media/1september/",
"https://thecode.media/summator/",
"https://thecode.media/vds/",
"https://thecode.media/made-in-china/",
"https://thecode.media/bar/",
"https://thecode.media/zodiac/",
"https://thecode.media/crc32/",
"https://thecode.media/css-links/",
"https://thecode.media/oop_battle/",
"https://thecode.media/be-combo/",
"https://thecode.media/unity/",
"https://thecode.media/data-science/",
"https://thecode.media/junior/",
"https://thecode.media/qc/",
"https://thecode.media/be-middle/",
"https://thecode.media/senior/",
"https://thecode.media/teamlead/",
"https://thecode.media/frontend/",
"https://thecode.media/lift/",
"https://thecode.media/be-fuzzy/",
"https://thecode.media/best-2020/",
"https://thecode.media/git/",
"https://thecode.media/stt-cloud/",
"https://thecode.media/matrix-pills/",
"https://thecode.media/na-stile/",
"https://thecode.media/no-coffee/",
"https://thecode.media/framelibs/",
"https://thecode.media/children/",
"https://thecode.media/balls-possibly/",
"https://thecode.media/le-meduza/",
"https://thecode.media/electricity/",
"https://thecode.media/tailored-swift/",
"https://thecode.media/objective/",
"https://thecode.media/host/",
"https://thecode.media/go-public/",
"https://thecode.media/how-internet-works-1/",
"https://thecode.media/domain/",
"https://thecode.media/this-is-object/",
"https://thecode.media/ole-ole-ole/",
"https://thecode.media/thousand/",
"https://thecode.media/average/",
"https://thecode.media/stt-python/",
"https://thecode.media/ping-pong/",
"https://thecode.media/pygames/",
"https://thecode.media/odobreno/",
"https://thecode.media/qwerty123/",
"https://thecode.media/neurocorrector/",
"https://thecode.media/neuro-cam/",
"https://thecode.media/10-jquery/",
"https://thecode.media/repeat/",
"https://thecode.media/assembler/",
"https://thecode.media/sublime-one-love/",
"https://thecode.media/zloy/",
"https://thecode.media/mariya-ivanovna/",
"https://thecode.media/ruby/",
"https://thecode.media/electron-password/",
"https://thecode.media/plane/",
"https://thecode.media/glitch/",
"https://thecode.media/security/",
"https://thecode.media/stupid-2019/",
"https://thecode.media/jquery-search/",
"https://thecode.media/pimp-my-pass/",
"https://thecode.media/text-ultimate/",
"https://thecode.media/hurry/",
"https://thecode.media/siri/",
"https://thecode.media/zero-cool/",
"https://thecode.media/small-talk/",
"https://thecode.media/die-hard/",
"https://thecode.media/le-piton/",
"https://thecode.media/hr-code/",
"https://thecode.media/nano-code/",
"https://thecode.media/the_question/",
"https://thecode.media/godlike/",
"https://thecode.media/be-logic/",
"https://thecode.media/snake-js/",
"https://thecode.media/be-mobile/",
"https://thecode.media/baboolya/",
"https://thecode.media/timelag/",
"https://thecode.media/doors/",
"https://thecode.media/phone-code/",
"https://thecode.media/snake-arduino/",
"https://thecode.media/css-intro/",
"https://thecode.media/le-timer/",
"https://thecode.media/oop_battle/",
"https://thecode.media/good-morning/",
"https://thecode.media/study-bot/",
"https://thecode.media/python-bot/",
"https://thecode.media/robot-quiz/",
"https://thecode.media/hacking-quiz/",
"https://thecode.media/lulz-quiz/",
"https://thecode.media/hard-quiz/",
"https://thecode.media/torrent/",
"https://thecode.media/travel/",
"https://thecode.media/le-snob/",
"https://thecode.media/no-spagetti/",
"https://thecode.media/house/",
"https://thecode.media/cryptorush/",
"https://thecode.media/coronarelax/",
"https://thecode.media/pure/",
"https://thecode.media/c-cpp/",
"https://thecode.media/machine-loving/",
"https://thecode.media/orwell/",
"https://thecode.media/darknet/",
"https://thecode.media/ai/",
"https://thecode.media/oop-class/",
"https://thecode.media/cookie/",
"https://thecode.media/malware/",
"https://thecode.media/ftp/",
"https://thecode.media/html/",
"https://thecode.media/java/",
"https://thecode.media/php-haters/",
"https://thecode.media/tor/",
"https://thecode.media/crack-safe/",
"https://thecode.media/epidemic/",
"https://thecode.media/hash-brown/",
"https://thecode.media/java-js/",
"https://thecode.media/js-types/",
"https://thecode.media/losers/",
"https://thecode.media/ssl/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",
"https://thecode.media/uncaughttyperrror-cannot-read-property/",
"https://thecode.media/mobile-dev/",
"https://thecode.media/verevka/",
"https://thecode.media/speed/",
"https://thecode.media/buckwheat/",
"https://thecode.media/distance/",
"https://thecode.media/node-js/",
"https://thecode.media/pascal/",
"https://thecode.media/ill-be-clean/",
"https://thecode.media/to-be-back/",
"https://thecode.media/replaceable/",
"https://thecode.media/code-review/",
"https://thecode.media/gasoline/",
"https://thecode.media/to-be-test/",
"https://thecode.media/scala/",
"https://thecode.media/row-power/",
"https://thecode.media/percent/",
"https://thecode.media/things/",
"https://thecode.media/backend/",
"https://thecode.media/immortal-pong/",
"https://thecode.media/blind/",
"https://thecode.media/go-faster/",
"https://thecode.media/cpp/",
"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",
"https://thecode.media/stress-quiz/",
"https://thecode.media/secret-pong/",
"https://thecode.media/override/",
"https://thecode.media/whg/",
"https://thecode.media/profit/",
"https://thecode.media/memas/",
"https://thecode.media/digital-sound/",
"https://thecode.media/api/",
"https://thecode.media/be-math-2/",
"https://thecode.media/backup/",
"https://thecode.media/backup-master/",
"https://thecode.media/glvrd/",
"https://thecode.media/id/",
"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",
"https://thecode.media/ex-startup/",
"https://thecode.media/doom-everywhere/",
"https://thecode.media/template-one/",
"https://thecode.media/david-roganov/",
"https://thecode.media/spacex/",
"https://thecode.media/webstorm/",
"https://thecode.media/json/",
"https://thecode.media/treger/",
"https://thecode.media/ya-blitz/",
"https://thecode.media/radius/",
"https://thecode.media/xhr/",
"https://thecode.media/treger2/",
"https://thecode.media/raidemption/",
"https://thecode.media/chief-technical-officer/",
"https://thecode.media/summary/",
"https://thecode.media/ex-wallpaper/",
"https://thecode.media/soap/",
"https://thecode.media/decompose/",
"https://thecode.media/desc/",
"https://thecode.media/sprint/",
"https://thecode.media/bye-or-die/",
"https://thecode.media/who-win/",
"https://thecode.media/vladimir-olokhtonov/",
"https://thecode.media/lossless/",
"https://thecode.media/parse/",
"https://thecode.media/typeerror-is-not-an-abject/",
"https://thecode.media/backup-me/",
"https://thecode.media/stress-test/",
"https://thecode.media/syntaxerror-missing-formal-parameter/",
"https://thecode.media/start-fast/",
"https://thecode.media/halkechev/",
"https://thecode.media/halkechev2/",
"https://thecode.media/le-design/",
"https://thecode.media/syntaxerror-missing-after-property-id/",
"https://thecode.media/attrb-mthd/",
"https://thecode.media/headphones/",
"https://thecode.media/active-noise-cancelling/",
"https://thecode.media/remote-work-quiz/",
"https://thecode.media/garbage/",
"https://thecode.media/ubuntu-linux/",
"https://thecode.media/trie/",
"https://thecode.media/func/",
"https://thecode.media/laravel/",
"https://thecode.media/save-json/",
"https://thecode.media/syntaxerror-missing-after-formal-parameters/",
"https://thecode.media/recursion/",
"https://thecode.media/haskell/",
"https://thecode.media/gen/",
"https://thecode.media/db/",
"https://thecode.media/boosting/",
"https://thecode.media/pavel-sviridov/",
"https://thecode.media/mnogo/",
"https://thecode.media/sokr/",
"https://thecode.media/dbsm/",
"https://thecode.media/pik-balmera/",
"https://thecode.media/kanban/",
"https://thecode.media/check-list/",
"https://thecode.media/text-quiz/",
"https://thecode.media/mysql/",
"https://thecode.media/rust/",
"https://thecode.media/manage-this/",
"https://thecode.media/altshuller/",
"https://thecode.media/interview/",
"https://thecode.media/fotorama/",
"https://thecode.media/tetris/",
"https://thecode.media/ai-tetris/",
"https://thecode.media/scrum/",
"https://thecode.media/speed-two/",
"https://thecode.media/quick-share/",
"https://thecode.media/stack/",
"https://thecode.media/mobile-first/",
"https://thecode.media/nosql/",
"https://thecode.media/narazves/",
"https://thecode.media/oop-class-2/",
"https://thecode.media/design-first/",
"https://thecode.media/arcanoid/",
"https://thecode.media/donut/",
"https://thecode.media/casino/",
"https://thecode.media/heap/",
"https://thecode.media/rust/",
"https://thecode.media/float/",
"https://thecode.media/markdown/",
"https://thecode.media/books/",
"https://thecode.media/daniil-popov/",
"https://thecode.media/android-developer/",
"https://thecode.media/symbols/",
"https://thecode.media/oauth/",
"https://thecode.media/kotlin/",
"https://thecode.media/todo/",
"https://thecode.media/plotly/",
"https://thecode.media/no-digit-code/",
"https://thecode.media/asymmetric/",
"https://thecode.media/qi/",
"https://thecode.media/vernam/",
"https://thecode.media/vernam-js/",
"https://thecode.media/shtykov/",
"https://thecode.media/memory/",
"https://thecode.media/ark/",
"https://thecode.media/7-oshibok-na-sobesedovanii/",
"https://thecode.media/dh/",
"https://thecode.media/typescript/",
"https://thecode.media/stark/",
"https://thecode.media/crypto/",
"https://thecode.media/zapusk-2/",
"https://thecode.media/fingerprint/",
"https://thecode.media/puzzle/",
"https://thecode.media/python-time-2/",
"https://thecode.media/no-chance/",
"https://thecode.media/lossy/",
"https://thecode.media/1wire/",
"https://thecode.media/pasha-flipper/",
"https://thecode.media/perl/",
"https://thecode.media/alexey-vasilev/",
"https://thecode.media/viasat/",
"https://thecode.media/podcast/",
"https://thecode.media/copy-ya-ru/",
"https://thecode.media/mircrosd/",
"https://thecode.media/bash/",
"https://thecode.media/rotation/",
"https://thecode.media/css-grid/",
"https://thecode.media/train/",
"https://thecode.media/grid-2/",
"https://thecode.media/za-proezd/",
"https://thecode.media/grid-3/",
"https://thecode.media/david/",
"https://thecode.media/alien-vs-predator/",
"https://thecode.media/grid-portfolio/",
"https://thecode.media/podcast-lavka/",
"https://thecode.media/it-start-2/",
"https://thecode.media/anastasiya-nikulina/",
"https://thecode.media/linter/",
"https://thecode.media/bomberman/",
"https://thecode.media/5-linters/",
"https://thecode.media/lineynaya-algebra-vektory/",
"https://thecode.media/no-nda/",
"https://thecode.media/code-swap/",
"https://thecode.media/how-to-start/",
"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",
"https://thecode.media/vectors-operations/",
"https://thecode.media/oop-abstract/",
"https://thecode.media/leonov/",
"https://thecode.media/lucky-strike/",
"https://thecode.media/browser/",
"https://thecode.media/2020/",
"https://thecode.media/3d-stars/",
"https://thecode.media/anna-leonova/",
"https://thecode.media/cold-fusion/",
"https://thecode.media/normalize/",
"https://thecode.media/hotkey/",
"https://thecode.media/oven/",
"https://thecode.media/vim/",
"https://thecode.media/draw/",
"https://thecode.media/visual-studio-code/",
"https://thecode.media/tetris-2/",
"https://thecode.media/start-now/",
"https://thecode.media/lapsha-1/",
"https://thecode.media/cubism/",
"https://thecode.media/lapsha-2/",
"https://thecode.media/zerocode/",
"https://thecode.media/cat/",
"https://thecode.media/static/",
"https://thecode.media/komm/",
"https://thecode.media/path-js/",
"https://thecode.media/cloudly/",
"https://thecode.media/haters-gonna-code-2/",
"https://thecode.media/csp/",
"https://thecode.media/mitin-says-no/",
"https://thecode.media/hire-js/",
"https://thecode.media/tableau/",
"https://thecode.media/impossible/",
"https://thecode.media/csp-on/",
"https://thecode.media/maze/",
"https://thecode.media/mix/",
"https://thecode.media/le-beton/",
"https://thecode.media/3d-print/",
"https://thecode.media/5-and-a-half/",
"https://thecode.media/lineynaya-zavisimost-vektorov/",
"https://thecode.media/fast-m1/",
"https://thecode.media/ninja-run/",
"https://thecode.media/matrix-101/",
"https://thecode.media/arm-x86/",
"https://thecode.media/piano-js/",
"https://thecode.media/10-swift/",
"https://thecode.media/travel-plane/",
"https://thecode.media/extention/",
"https://thecode.media/obratnaya-matritsa/",
"https://thecode.media/svg/",
"https://thecode.media/freelance/",
"https://thecode.media/brat-2/",
"https://thecode.media/angular/",
"https://thecode.media/rgb/",
"https://thecode.media/10-go/",
"https://thecode.media/coffee/",
]

# open the text file that the titles will be appended to;
# the encoding is set explicitly so Cyrillic titles are saved correctly on any system
file = open("zag.txt", "a", encoding="utf-8")

# go through every address in the list
for x in url:
    # fetch the source code of the next page in the list
    html_code = str(urlopen(x).read(), 'utf-8')
    # hand the page source to the library for parsing
    soup = BeautifulSoup(html_code, "html.parser")

    # find the page title with the find() method
    s = soup.find('title').text

    # print it to the screen
    print(s)

    # save the title to the file, followed by a period and a space
    file.write(s + '. ')

# close the file
file.close()
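
With several hundred addresses in the list, a single page that fails to load or decode will stop the loop above with an exception. One possible way to guard against that is to move the download into a small helper that returns `None` on failure. This is only a sketch: the name `fetch_html`, the `opener` parameter and the 10-second timeout are assumptions made for illustration, not part of the original program.

```python
from urllib.request import urlopen


def fetch_html(address, opener=urlopen, timeout=10):
    """Return the page source as text, or None if the page could not be read."""
    try:
        # the context manager closes the connection when we are done
        with opener(address, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (OSError, UnicodeDecodeError) as error:
        # OSError covers URLError, HTTPError and timeouts;
        # report the problem and let the caller move on to the next address
        print("skipping", address, "-", error)
        return None
```

Inside the loop, `html_code = fetch_html(x)` followed by `if html_code is None: continue` would then skip the broken page and carry on with the rest of the list.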

What's next

The logical next step is a Markov chain program that generates headlines for Код articles based on our existing headlines.
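
To give a rough idea of what such a generator might look like, here is a minimal sketch. It is not the code from our Markov chain project: the function names `build_chain` and `generate` and the sample text are hypothetical, and in practice the text would come from the `zag.txt` file collected above.

```python
import random


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def generate(chain, length=8, rng=random):
    """Walk the chain from a random starting word, up to `length` words."""
    word = rng.choice(list(chain))
    result = [word]
    # stop early if we reach a word that nothing ever follows
    while len(result) < length and word in chain:
        word = rng.choice(chain[word])
        result.append(word)
    return " ".join(result)


# hypothetical sample standing in for the contents of zag.txt
sample = "How to parse a site. How to write a game. How to learn a language."
print(generate(build_chain(sample)))
```

Words that follow a given word more often appear more often in its list, so `rng.choice` picks them with proportionally higher probability without any extra counting.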

Editing: Максим Ильяхов

Illustration: Даня Берковский

Proofreading: Ирина Михеева

Layout: Мария Дронова

Social media: Олег Вешкурцев
