Сегодняшний проект послужит основой многих наших дальнейших программ. Мы научимся собирать с сайтов любые данные, которые нам нужны.
У нас есть рабочий проект на цепях Маркова. Цепи Маркова — это несложный алгоритм, который анализирует сочетаемость слов в заданном тексте и выдаёт новый текст на основе старого. Похоже на работу нейронок, но на самом деле это просто перебор слов и бессмысленное их сочетание.
Для работы наших первых проектов на цепях Маркова мы скачали книгу с рассказами Чехова. Программа анализирует сочетаемость чеховских слов и выдаёт текст в чеховском духе (хотя и бессмысленный).
Но что, если мы хотим сделать текст не в духе Чехова, а в духе журнала «Код»? Или в духе какого-нибудь издания-иноагента? Или сделать генератор статей в духе какого-нибудь блогера?
Решение — написать программу, которая посмотрит на сайте наши статьи и вытащит оттуда весь значимый текст. Единственное, что для этого понадобится, — список ссылок на статьи, но мы их уже собрали, когда делали проект с гаданием на статьях Кода.
Логика работы
Программа будет работать на Python на локальной машине. Алгоритм:
- Подключаем нужные библиотеки.
- Готовим переменную со списком сайтов для обхода.
- Заходим на первый сайт и получаем оттуда исходный код.
- Указываем, откуда брать текст.
- Сохраняем найденный текст в отдельном файле.
- Повторяем цикл до тех пор, пока не пройдём все страницы из списка.
👉 Главное в таких проектах — знать структуру содержимого страницы и понимать, где именно и в каких тегах находятся нужные для вас данные.
Чтобы было проще, на старте сделаем программу, которая собирает названия страниц. Как освоимся — сделаем что посложнее.
Изучаем исходный код страницы
Прежде чем заниматься парсингом (сбором) со страницы чего угодно, нужно выяснить, где это лежит и в какой кодировке. Мы знаем, что все статьи Кода созданы по одному и тому же шаблону, поэтому нам достаточно посмотреть, как устроена одна, чтобы понять их все.
Смотрим исходный код любой нашей статьи. Нас интересуют два момента — кодировка страницы и тег <title>
. Нам нужно убедиться, что в этом теге прописано название.
Сначала кодировка:
Эта строчка означает, что страница работает с кодировкой UTF-8. Запомним это.
Теперь пролистываем исходный код ниже и находим тег <title>
— именно он отвечает за заголовок страницы. Убеждаемся, что он есть и с ним всё в порядке:
Библиотеки для работы
В проекте нам понадобятся две библиотеки: urllib и BeautifulSoup.
Первая отвечает за доступ к страницам по их адресу, причём оттуда нам будет нужна только одна команда urlopen().read — она отправляется по указанному адресу и получает весь исходный код страницы.
Вторая библиотека входит в состав большой библиотеки bs4 — в ней уже собраны все команды для парсинга исходного HTML-кода и разбора тегов. Чтобы установить bs4, запускаем терминал и пишем:
pip3 install bs4
Пишем код
Сначала подключим все нужные библиотеки:
# подключаем urlopen из модуля urllib
from urllib.request import urlopen
# подключаем библиотеку BeautifulSoup
from bs4 import BeautifulSoup
Теперь объявим список страниц, которые нужно посетить и забрать оттуда заголовки. Мы уже составили такой список для проекта с гаданием на статьях Кода, поэтому просто возьмём его оттуда и адаптируем под Python:
url = [
"https://thecode.media/is-not-defined-jquery/",
"https://thecode.media/arduino-projects-2/",
"https://thecode.media/10-raspberry/",
"https://thecode.media/easy-css/",
"https://thecode.media/to-be-front/",
"https://thecode.media/cryptex/",
"https://thecode.media/ali-coders/",
"https://thecode.media/po-glandy/",
"https://thecode.media/megaexcel/",
"https://thecode.media/chat-bot-generators/",
"https://thecode.media/wifi/",
"https://thecode.media/andri-oxa/",
"https://thecode.media/free-hosting/",
"https://thecode.media/hotwheels/",
"https://thecode.media/do-not-disturb/",
"https://thecode.media/dyno-ai/",
"https://thecode.media/snake-ai/",
"https://thecode.media/leet/",
"https://thecode.media/ninja/",
"https://thecode.media/supergirl/",
"https://thecode.media/vpn/ ",
"https://thecode.media/what-is-wordpress/",
"https://thecode.media/hardware/",
"https://thecode.media/division/",
"https://thecode.media/nuggets/",
"https://thecode.media/binary-notation/",
"https://thecode.media/bootstrap/",
"https://thecode.media/chat-bot/",
"https://thecode.media/myadblock3000/",
"https://thecode.media/trello/",
"https://thecode.media/python-time/",
"https://thecode.media/editor/",
"https://thecode.media/timer/",
"https://thecode.media/intro-bootstrap/",
"https://thecode.media/php-form/",
"https://thecode.media/hr-quiz/",
"https://thecode.media/c-sharp/",
"https://thecode.media/showtime/",
"https://thecode.media/uchtel-rasskazhi/",
"https://thecode.media/sshhhh/",
"https://thecode.media/marry-me-python/",
"https://thecode.media/haters-gonna-code/",
"https://thecode.media/speed-css/",
"https://thecode.media/fired/",
"https://thecode.media/zabuhal/",
"https://thecode.media/est-tri-shkatulki/",
"https://thecode.media/milk-that/",
"https://thecode.media/binary-mouse/",
"https://thecode.media/bowling/",
"https://thecode.media/dealership/",
"https://thecode.media/best-seller/",
"https://thecode.media/hr/",
"https://thecode.media/no-comments/",
"https://thecode.media/drakoni-yajca/",
"https://thecode.media/who-is-who/",
"https://thecode.media/get-a-room/",
"https://thecode.media/alps/",
"https://thecode.media/handshake/",
"https://thecode.media/choose-life/",
"https://thecode.media/high-voltage/",
"https://thecode.media/spy/",
"https://thecode.media/squirrelrrel/",
"https://thecode.media/so-agile/",
"https://thecode.media/wedding/",
"https://thecode.media/supper/",
"https://thecode.media/le-tarakan/",
"https://thecode.media/batareyki-besyat/",
"https://thecode.media/dr_jekyll/",
"https://thecode.media/everybody_lies/",
"https://thecode.media/electrician/",
"https://thecode.media/einstein/",
"https://thecode.media/bugz/",
"https://thecode.media/needforspeed/",
"https://thecode.media/be-smart/",
"https://thecode.media/bot-online/",
"https://thecode.media/microb/",
"https://thecode.media/jquery/",
"https://thecode.media/split-screen/",
"https://thecode.media/calculus/",
"https://thecode.media/big-data-sales/",
"https://thecode.media/ambient/",
"https://thecode.media/fatality/",
"https://thecode.media/biggest-loser/",
"https://thecode.media/wifi/",
"https://thecode.media/nosock/",
"https://thecode.media/variables/",
"https://thecode.media/start_python/",
"https://thecode.media/i-gonna-code/",
"https://thecode.media/sigi-est/",
"https://thecode.media/nes-game/",
"https://thecode.media/live-view/",
"https://thecode.media/remote/",
"https://thecode.media/arduino-code/",
"https://thecode.media/horses/",
"https://thecode.media/runinstein/",
"https://thecode.media/wp-template/",
"https://thecode.media/tilda/",
"https://thecode.media/todo/",
"https://thecode.media/telebot/",
"https://thecode.media/summator-2/",
"https://thecode.media/get-rich-coding/",
"https://thecode.media/content-manager/",
"https://thecode.media/vzrosly-stal/",
"https://thecode.media/py-install/",
"https://thecode.media/quantum/",
"https://thecode.media/dns/",
"https://thecode.media/practicum/",
"https://thecode.media/react/",
"https://thecode.media/1september/",
"https://thecode.media/summator/",
"https://thecode.media/vds/",
"https://thecode.media/made-in-china/",
"https://thecode.media/bar/",
"https://thecode.media/zodiac/",
"https://thecode.media/crc32/",
"https://thecode.media/css-links/",
"https://thecode.media/oop_battle/",
"https://thecode.media/be-combo/",
"https://thecode.media/unity/",
"https://thecode.media/data-science/",
"https://thecode.media/junior/",
"https://thecode.media/qc/",
"https://thecode.media/be-middle/",
"https://thecode.media/senior/",
"https://thecode.media/teamlead/",
"https://thecode.media/frontend/",
"https://thecode.media/lift/",
"https://thecode.media/be-fuzzy/",
"https://thecode.media/best-2020/",
"https://thecode.media/git/",
"https://thecode.media/stt-cloud/",
"https://thecode.media/matrix-pills/",
"https://thecode.media/na-stile/",
"https://thecode.media/no-coffee/",
"https://thecode.media/framelibs/",
"https://thecode.media/children/",
"https://thecode.media/balls-possibly/",
"https://thecode.media/le-meduza/",
"https://thecode.media/electricity/",
"https://thecode.media/tailored-swift/",
"https://thecode.media/objective/",
"https://thecode.media/host/",
"https://thecode.media/go-public/",
"https://thecode.media/how-internet-works-1/",
"https://thecode.media/domain/",
"https://thecode.media/this-is-object/",
"https://thecode.media/ole-ole-ole/",
"https://thecode.media/thousand/",
"https://thecode.media/average/",
"https://thecode.media/stt-python/",
"https://thecode.media/ping-pong/",
"https://thecode.media/pygames/",
"https://thecode.media/odobreno/",
"https://thecode.media/qwerty123/",
"https://thecode.media/neurocorrector/",
"https://thecode.media/neuro-cam/",
"https://thecode.media/10-jquery/",
"https://thecode.media/repeat/",
"https://thecode.media/assembler/",
"https://thecode.media/sublime-one-love/",
"https://thecode.media/zloy/",
"https://thecode.media/mariya-ivanovna/",
"https://thecode.media/ruby/",
"https://thecode.media/electron-password/",
"https://thecode.media/plane/",
"https://thecode.media/glitch/",
"https://thecode.media/security/",
"https://thecode.media/stupid-2019/",
"https://thecode.media/jquery-search/",
"https://thecode.media/pimp-my-pass/",
"https://thecode.media/text-ultimate/",
"https://thecode.media/hurry/",
"https://thecode.media/siri/",
"https://thecode.media/zero-cool/",
"https://thecode.media/small-talk/",
"https://thecode.media/die-hard/",
"https://thecode.media/le-piton/",
"https://thecode.media/hr-code/",
"https://thecode.media/nano-code/",
"https://thecode.media/the_question/",
"https://thecode.media/godlike/",
"https://thecode.media/be-logic/",
"https://thecode.media/snake-js/",
"https://thecode.media/be-mobile/",
"https://thecode.media/baboolya/",
"https://thecode.media/timelag/",
"https://thecode.media/doors/",
"https://thecode.media/phone-code/",
"https://thecode.media/snake-arduino/",
"https://thecode.media/css-intro/",
"https://thecode.media/le-timer/",
"https://thecode.media/oop_battle/",
"https://thecode.media/good-morning/",
"https://thecode.media/study-bot/",
"https://thecode.media/python-bot/",
"https://thecode.media/robot-quiz/",
"https://thecode.media/hacking-quiz/",
"https://thecode.media/lulz-quiz/",
"https://thecode.media/hard-quiz/",
"https://thecode.media/torrent/",
"https://thecode.media/travel/",
"https://thecode.media/le-snob/",
"https://thecode.media/no-spagetti/",
"https://thecode.media/house/",
"https://thecode.media/cryptorush/",
"https://thecode.media/coronarelax/",
"https://thecode.media/pure/",
"https://thecode.media/c-cpp/",
"https://thecode.media/machine-loving/",
"https://thecode.media/orwell/",
"https://thecode.media/darknet/",
"https://thecode.media/ai/",
"https://thecode.media/oop-class/",
"https://thecode.media/cookie/",
"https://thecode.media/malware/",
"https://thecode.media/ftp/",
"https://thecode.media/html/",
"https://thecode.media/java/",
"https://thecode.media/php-haters/",
"https://thecode.media/tor/",
"https://thecode.media/crack-safe/",
"https://thecode.media/epidemic/",
"https://thecode.media/hash-brown/",
"https://thecode.media/java-js/",
"https://thecode.media/js-types/",
"https://thecode.media/losers/",
"https://thecode.media/ssl/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",
"https://thecode.media/uncaughttyperrror-cannot-read-property/",
"https://thecode.media/mobile-dev/",
"https://thecode.media/verevka/",
"https://thecode.media/speed/",
"https://thecode.media/buckwheat/",
"https://thecode.media/distance/",
"https://thecode.media/node-js/",
"https://thecode.media/pascal/",
"https://thecode.media/ill-be-clean/",
"https://thecode.media/to-be-back/",
"https://thecode.media/replaceable/",
"https://thecode.media/code-review/",
"https://thecode.media/gasoline/",
"https://thecode.media/to-be-test/",
"https://thecode.media/scala/",
"https://thecode.media/row-power/",
"https://thecode.media/percent/",
"https://thecode.media/things/",
"https://thecode.media/backend/",
"https://thecode.media/immortal-pong/",
"https://thecode.media/blind/",
"https://thecode.media/go-faster/",
"https://thecode.media/cpp/",
"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",
"https://thecode.media/stress-quiz/",
"https://thecode.media/secret-pong/",
"https://thecode.media/override/",
"https://thecode.media/whg/",
"https://thecode.media/profit/",
"https://thecode.media/memas/",
"https://thecode.media/digital-sound/",
"https://thecode.media/api/",
"https://thecode.media/be-math-2/",
"https://thecode.media/backup/",
"https://thecode.media/backup-master/",
"https://thecode.media/glvrd/",
"https://thecode.media/id/",
"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",
"https://thecode.media/ex-startup/",
"https://thecode.media/doom-everywhere/",
"https://thecode.media/template-one/",
"https://thecode.media/david-roganov/",
"https://thecode.media/spacex/",
"https://thecode.media/webstorm/",
"https://thecode.media/json/",
"https://thecode.media/treger/",
"https://thecode.media/ya-blitz/",
"https://thecode.media/radius/",
"https://thecode.media/xhr/",
"https://thecode.media/treger2/",
"https://thecode.media/raidemption/",
"https://thecode.media/chief-technical-officer/",
"https://thecode.media/summary/",
"https://thecode.media/ex-wallpaper/",
"https://thecode.media/soap/",
"https://thecode.media/decompose/",
"https://thecode.media/desc/",
"https://thecode.media/sprint/",
"https://thecode.media/bye-or-die/",
"https://thecode.media/who-win/",
"https://thecode.media/vladimir-olokhtonov/",
"https://thecode.media/lossless/",
"https://thecode.media/parse/",
"https://thecode.media/typeerror-is-not-an-abject/",
"https://thecode.media/backup-me/",
"https://thecode.media/stress-test/",
"https://thecode.media/syntaxerror-missing-formal-parameter/",
"https://thecode.media/start-fast/",
"https://thecode.media/halkechev/",
"https://thecode.media/halkechev2/",
"https://thecode.media/le-design/",
"https://thecode.media/syntaxerror-missing-after-property-id/",
"https://thecode.media/attrb-mthd/",
"https://thecode.media/headphones/",
"https://thecode.media/active-noise-cancelling/",
"https://thecode.media/remote-work-quiz/",
"https://thecode.media/garbage/",
"https://thecode.media/ubuntu-linux/",
"https://thecode.media/trie/",
"https://thecode.media/func/",
"https://thecode.media/laravel/",
"https://thecode.media/save-json/",
"https://thecode.media/syntaxerror-missing-after-formal-parameters/",
"https://thecode.media/recursion/",
"https://thecode.media/haskell/",
"https://thecode.media/gen/",
"https://thecode.media/db/",
"https://thecode.media/boosting/",
"https://thecode.media/pavel-sviridov/",
"https://thecode.media/mnogo/",
"https://thecode.media/sokr/",
"https://thecode.media/dbsm/",
"https://thecode.media/pik-balmera/",
"https://thecode.media/kanban/",
"https://thecode.media/check-list/",
"https://thecode.media/text-quiz/",
"https://thecode.media/mysql/",
"https://thecode.media/mysql/",
"https://thecode.media/rust/",
"https://thecode.media/manage-this/",
"https://thecode.media/altshuller/",
"https://thecode.media/interview/",
"https://thecode.media/fotorama/",
"https://thecode.media/tetris/",
"https://thecode.media/ai-tetris/",
"https://thecode.media/scrum/",
"https://thecode.media/speed-two/",
"https://thecode.media/quick-share/",
"https://thecode.media/stack/",
"https://thecode.media/mobile-first/",
"https://thecode.media/nosql/",
"https://thecode.media/narazves/",
"https://thecode.media/oop-class-2/",
"https://thecode.media/design-first/",
"https://thecode.media/arcanoid/",
"https://thecode.media/donut/",
"https://thecode.media/casino/",
"https://thecode.media/heap/",
"https://thecode.media/rust/",
"https://thecode.media/float/",
"https://thecode.media/markdown/",
"https://thecode.media/books/",
"https://thecode.media/daniil-popov/",
"https://thecode.media/android-developer/",
"https://thecode.media/symbols/",
"https://thecode.media/oauth/",
"https://thecode.media/kotlin/",
"https://thecode.media/todo/",
"https://thecode.media/plotly/",
"https://thecode.media/no-digit-code/",
"https://thecode.media/asymmetric/",
"https://thecode.media/qi/",
"https://thecode.media/vernam/",
"https://thecode.media/vernam-js/",
"https://thecode.media/shtykov/",
"https://thecode.media/memory/",
"https://thecode.media/ark/",
"https://thecode.media/7-oshibok-na-sobesedovanii/",
"https://thecode.media/dh/",
"https://thecode.media/typescript/",
"https://thecode.media/stark/",
"https://thecode.media/crypto/",
"https://thecode.media/zapusk-2/",
"https://thecode.media/fingerprint/",
"https://thecode.media/puzzle/",
"https://thecode.media/python-time-2/",
"https://thecode.media/no-chance/",
"https://thecode.media/lossy/",
"https://thecode.media/1wire/",
"https://thecode.media/pasha-flipper/",
"https://thecode.media/perl/",
"https://thecode.media/alexey-vasilev/",
"https://thecode.media/viasat/",
"https://thecode.media/podcast/",
"https://thecode.media/copy-ya-ru/",
"https://thecode.media/mircrosd/",
"https://thecode.media/bash/",
"https://thecode.media/rotation/",
"https://thecode.media/css-grid/",
"https://thecode.media/train/",
"https://thecode.media/grid-2/",
"https://thecode.media/za-proezd/",
"https://thecode.media/grid-3/",
"https://thecode.media/david/",
"https://thecode.media/alien-vs-predator/",
"https://thecode.media/grid-portfolio/",
"https://thecode.media/podcast-lavka/",
"https://thecode.media/it-start-2/",
"https://thecode.media/anastasiya-nikulina/",
"https://thecode.media/linter/",
"https://thecode.media/bomberman/",
"https://thecode.media/5-linters/",
"https://thecode.media/lineynaya-algebra-vektory/",
"https://thecode.media/no-nda/",
"https://thecode.media/code-swap/",
"https://thecode.media/how-to-start/",
"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",
"https://thecode.media/vectors-operations/",
"https://thecode.media/oop-abstract/",
"https://thecode.media/leonov/",
"https://thecode.media/lucky-strike/",
"https://thecode.media/browser/",
"https://thecode.media/2020/",
"https://thecode.media/3d-stars/",
"https://thecode.media/anna-leonova/",
"https://thecode.media/cold-fusion/",
"https://thecode.media/normalize/",
"https://thecode.media/hotkey/",
"https://thecode.media/oven/",
"https://thecode.media/vim/",
"https://thecode.media/draw/",
"https://thecode.media/visual-studio-code/",
"https://thecode.media/tetris-2/",
"https://thecode.media/start-now/",
"https://thecode.media/lapsha-1/",
"https://thecode.media/cubism/",
"https://thecode.media/lapsha-2/",
"https://thecode.media/zerocode/",
"https://thecode.media/cat/",
"https://thecode.media/static/",
"https://thecode.media/komm/",
"https://thecode.media/path-js/",
"https://thecode.media/cloudly/",
"https://thecode.media/haters-gonna-code-2/",
"https://thecode.media/csp/",
"https://thecode.media/mitin-says-no/",
"https://thecode.media/hire-js/",
"https://thecode.media/tableau/",
"https://thecode.media/impossible/",
"https://thecode.media/csp-on/",
"https://thecode.media/maze/",
"https://thecode.media/mix/",
"https://thecode.media/le-beton/",
"https://thecode.media/3d-print/",
"https://thecode.media/5-and-a-half/",
"https://thecode.media/lineynaya-zavisimost-vektorov/",
"https://thecode.media/fast-m1/",
"https://thecode.media/ninja-run/",
"https://thecode.media/matrix-101/",
"https://thecode.media/arm-x86/",
"https://thecode.media/piano-js/",
"https://thecode.media/10-swift/",
"https://thecode.media/travel-plane/",
"https://thecode.media/extention/",
"https://thecode.media/obratnaya-matritsa/",
"https://thecode.media/svg/",
"https://thecode.media/freelance/",
"https://thecode.media/brat-2/",
"https://thecode.media/angular/",
"https://thecode.media/rgb/",
"https://thecode.media/10-go/",
"https://thecode.media/coffee/",
]
Теперь перебираем все элементы этого массива в цикле, используя всю мощь библиотек. Обратите внимание на строчку, где мы получаем исходный код страницы — мы сразу конвертируем его в нужную кодировку, которую выяснили на предыдущем этапе:
# открываем текстовый файл, куда будем добавлять заголовки
file = open("zag.txt", "a")
# перебираем все адреса из списка
for x in url:
# получаем исходный код очередной страницы из списка
html_code = str(urlopen(x).read(),'utf-8')
# отправляем исходный код страницы на обработку в библиотеку
soup = BeautifulSoup(html_code, "html.parser")
# находим название страницы с помощью метода find()
s = soup.find('title').text
# выводим его на экран
print(s)
# сохраняем заголовок в файле и переносим курсор на новую строку
file.write(s + '\n')
# закрываем файл
file.close()
Запускаем и смотрим на результат:
# подключаем urlopen из модуля urllib
from urllib.request import urlopen
# подключаем библиотеку BeautifulSout
from bs4 import BeautifulSoup
url = [
"https://thecode.media/is-not-defined-jquery/",
"https://thecode.media/arduino-projects-2/",
"https://thecode.media/10-raspberry/",
"https://thecode.media/easy-css/",
"https://thecode.media/to-be-front/",
"https://thecode.media/cryptex/",
"https://thecode.media/ali-coders/",
"https://thecode.media/po-glandy/",
"https://thecode.media/megaexcel/",
"https://thecode.media/chat-bot-generators/",
"https://thecode.media/wifi/",
"https://thecode.media/andri-oxa/",
"https://thecode.media/free-hosting/",
"https://thecode.media/hotwheels/",
"https://thecode.media/do-not-disturb/",
"https://thecode.media/dyno-ai/",
"https://thecode.media/snake-ai/",
"https://thecode.media/leet/",
"https://thecode.media/ninja/",
"https://thecode.media/supergirl/",
"https://thecode.media/vpn/ ",
"https://thecode.media/what-is-wordpress/",
"https://thecode.media/hardware/",
"https://thecode.media/division/",
"https://thecode.media/nuggets/",
"https://thecode.media/binary-notation/",
"https://thecode.media/bootstrap/",
"https://thecode.media/chat-bot/",
"https://thecode.media/myadblock3000/",
"https://thecode.media/trello/",
"https://thecode.media/python-time/",
"https://thecode.media/editor/",
"https://thecode.media/timer/",
"https://thecode.media/intro-bootstrap/",
"https://thecode.media/php-form/",
"https://thecode.media/hr-quiz/",
"https://thecode.media/c-sharp/",
"https://thecode.media/showtime/",
"https://thecode.media/uchtel-rasskazhi/",
"https://thecode.media/sshhhh/",
"https://thecode.media/marry-me-python/",
"https://thecode.media/haters-gonna-code/",
"https://thecode.media/speed-css/",
"https://thecode.media/fired/",
"https://thecode.media/zabuhal/",
"https://thecode.media/est-tri-shkatulki/",
"https://thecode.media/milk-that/",
"https://thecode.media/binary-mouse/",
"https://thecode.media/bowling/",
"https://thecode.media/dealership/",
"https://thecode.media/best-seller/",
"https://thecode.media/hr/",
"https://thecode.media/no-comments/",
"https://thecode.media/drakoni-yajca/",
"https://thecode.media/who-is-who/",
"https://thecode.media/get-a-room/",
"https://thecode.media/alps/",
"https://thecode.media/handshake/",
"https://thecode.media/choose-life/",
"https://thecode.media/high-voltage/",
"https://thecode.media/spy/",
"https://thecode.media/squirrelrrel/",
"https://thecode.media/so-agile/",
"https://thecode.media/wedding/",
"https://thecode.media/supper/",
"https://thecode.media/le-tarakan/",
"https://thecode.media/batareyki-besyat/",
"https://thecode.media/dr_jekyll/",
"https://thecode.media/everybody_lies/",
"https://thecode.media/electrician/",
"https://thecode.media/einstein/",
"https://thecode.media/bugz/",
"https://thecode.media/needforspeed/",
"https://thecode.media/be-smart/",
"https://thecode.media/bot-online/",
"https://thecode.media/microb/",
"https://thecode.media/jquery/",
"https://thecode.media/split-screen/",
"https://thecode.media/calculus/",
"https://thecode.media/big-data-sales/",
"https://thecode.media/ambient/",
"https://thecode.media/fatality/",
"https://thecode.media/biggest-loser/",
"https://thecode.media/wifi/",
"https://thecode.media/nosock/",
"https://thecode.media/variables/",
"https://thecode.media/start_python/",
"https://thecode.media/i-gonna-code/",
"https://thecode.media/sigi-est/",
"https://thecode.media/nes-game/",
"https://thecode.media/live-view/",
"https://thecode.media/remote/",
"https://thecode.media/arduino-code/",
"https://thecode.media/horses/",
"https://thecode.media/runinstein/",
"https://thecode.media/wp-template/",
"https://thecode.media/tilda/",
"https://thecode.media/todo/",
"https://thecode.media/telebot/",
"https://thecode.media/summator-2/",
"https://thecode.media/get-rich-coding/",
"https://thecode.media/content-manager/",
"https://thecode.media/vzrosly-stal/",
"https://thecode.media/py-install/",
"https://thecode.media/quantum/",
"https://thecode.media/dns/",
"https://thecode.media/practicum/",
"https://thecode.media/react/",
"https://thecode.media/1september/",
"https://thecode.media/summator/",
"https://thecode.media/vds/",
"https://thecode.media/made-in-china/",
"https://thecode.media/bar/",
"https://thecode.media/zodiac/",
"https://thecode.media/crc32/",
"https://thecode.media/css-links/",
"https://thecode.media/oop_battle/",
"https://thecode.media/be-combo/",
"https://thecode.media/unity/",
"https://thecode.media/data-science/",
"https://thecode.media/junior/",
"https://thecode.media/qc/",
"https://thecode.media/be-middle/",
"https://thecode.media/senior/",
"https://thecode.media/teamlead/",
"https://thecode.media/frontend/",
"https://thecode.media/lift/",
"https://thecode.media/be-fuzzy/",
"https://thecode.media/best-2020/",
"https://thecode.media/git/",
"https://thecode.media/stt-cloud/",
"https://thecode.media/matrix-pills/",
"https://thecode.media/na-stile/",
"https://thecode.media/no-coffee/",
"https://thecode.media/framelibs/",
"https://thecode.media/children/",
"https://thecode.media/balls-possibly/",
"https://thecode.media/le-meduza/",
"https://thecode.media/electricity/",
"https://thecode.media/tailored-swift/",
"https://thecode.media/objective/",
"https://thecode.media/host/",
"https://thecode.media/go-public/",
"https://thecode.media/how-internet-works-1/",
"https://thecode.media/domain/",
"https://thecode.media/this-is-object/",
"https://thecode.media/ole-ole-ole/",
"https://thecode.media/thousand/",
"https://thecode.media/average/",
"https://thecode.media/stt-python/",
"https://thecode.media/ping-pong/",
"https://thecode.media/pygames/",
"https://thecode.media/odobreno/",
"https://thecode.media/qwerty123/",
"https://thecode.media/neurocorrector/",
"https://thecode.media/neuro-cam/",
"https://thecode.media/10-jquery/",
"https://thecode.media/repeat/",
"https://thecode.media/assembler/",
"https://thecode.media/sublime-one-love/",
"https://thecode.media/zloy/",
"https://thecode.media/mariya-ivanovna/",
"https://thecode.media/ruby/",
"https://thecode.media/electron-password/",
"https://thecode.media/plane/",
"https://thecode.media/glitch/",
"https://thecode.media/security/",
"https://thecode.media/stupid-2019/",
"https://thecode.media/jquery-search/",
"https://thecode.media/pimp-my-pass/",
"https://thecode.media/text-ultimate/",
"https://thecode.media/hurry/",
"https://thecode.media/siri/",
"https://thecode.media/zero-cool/",
"https://thecode.media/small-talk/",
"https://thecode.media/die-hard/",
"https://thecode.media/le-piton/",
"https://thecode.media/hr-code/",
"https://thecode.media/nano-code/",
"https://thecode.media/the_question/",
"https://thecode.media/godlike/",
"https://thecode.media/be-logic/",
"https://thecode.media/snake-js/",
"https://thecode.media/be-mobile/",
"https://thecode.media/baboolya/",
"https://thecode.media/timelag/",
"https://thecode.media/doors/",
"https://thecode.media/phone-code/",
"https://thecode.media/snake-arduino/",
"https://thecode.media/css-intro/",
"https://thecode.media/le-timer/",
"https://thecode.media/oop_battle/",
"https://thecode.media/good-morning/",
"https://thecode.media/study-bot/",
"https://thecode.media/python-bot/",
"https://thecode.media/robot-quiz/",
"https://thecode.media/hacking-quiz/",
"https://thecode.media/lulz-quiz/",
"https://thecode.media/hard-quiz/",
"https://thecode.media/torrent/",
"https://thecode.media/travel/",
"https://thecode.media/le-snob/",
"https://thecode.media/no-spagetti/",
"https://thecode.media/house/",
"https://thecode.media/cryptorush/",
"https://thecode.media/coronarelax/",
"https://thecode.media/pure/",
"https://thecode.media/c-cpp/",
"https://thecode.media/machine-loving/",
"https://thecode.media/orwell/",
"https://thecode.media/darknet/",
"https://thecode.media/ai/",
"https://thecode.media/oop-class/",
"https://thecode.media/cookie/",
"https://thecode.media/malware/",
"https://thecode.media/ftp/",
"https://thecode.media/html/",
"https://thecode.media/java/",
"https://thecode.media/php-haters/",
"https://thecode.media/tor/",
"https://thecode.media/crack-safe/",
"https://thecode.media/epidemic/",
"https://thecode.media/hash-brown/",
"https://thecode.media/java-js/",
"https://thecode.media/js-types/",
"https://thecode.media/losers/",
"https://thecode.media/ssl/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",
"https://thecode.media/uncaughttyperrror-cannot-read-property/",
"https://thecode.media/mobile-dev/",
"https://thecode.media/verevka/",
"https://thecode.media/speed/",
"https://thecode.media/buckwheat/",
"https://thecode.media/distance/",
"https://thecode.media/node-js/",
"https://thecode.media/pascal/",
"https://thecode.media/ill-be-clean/",
"https://thecode.media/to-be-back/",
"https://thecode.media/replaceable/",
"https://thecode.media/code-review/",
"https://thecode.media/gasoline/",
"https://thecode.media/to-be-test/",
"https://thecode.media/scala/",
"https://thecode.media/row-power/",
"https://thecode.media/percent/",
"https://thecode.media/things/",
"https://thecode.media/backend/",
"https://thecode.media/immortal-pong/",
"https://thecode.media/blind/",
"https://thecode.media/go-faster/",
"https://thecode.media/cpp/",
"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",
"https://thecode.media/stress-quiz/",
"https://thecode.media/secret-pong/",
"https://thecode.media/override/",
"https://thecode.media/whg/",
"https://thecode.media/profit/",
"https://thecode.media/memas/",
"https://thecode.media/digital-sound/",
"https://thecode.media/api/",
"https://thecode.media/be-math-2/",
"https://thecode.media/backup/",
"https://thecode.media/backup-master/",
"https://thecode.media/glvrd/",
"https://thecode.media/id/",
"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",
"https://thecode.media/ex-startup/",
"https://thecode.media/doom-everywhere/",
"https://thecode.media/template-one/",
"https://thecode.media/david-roganov/",
"https://thecode.media/spacex/",
"https://thecode.media/webstorm/",
"https://thecode.media/json/",
"https://thecode.media/treger/",
"https://thecode.media/ya-blitz/",
"https://thecode.media/radius/",
"https://thecode.media/xhr/",
"https://thecode.media/treger2/",
"https://thecode.media/raidemption/",
"https://thecode.media/chief-technical-officer/",
"https://thecode.media/summary/",
"https://thecode.media/ex-wallpaper/",
"https://thecode.media/soap/",
"https://thecode.media/decompose/",
"https://thecode.media/desc/",
"https://thecode.media/sprint/",
"https://thecode.media/bye-or-die/",
"https://thecode.media/who-win/",
"https://thecode.media/vladimir-olokhtonov/",
"https://thecode.media/lossless/",
"https://thecode.media/parse/",
"https://thecode.media/typeerror-is-not-an-abject/",
"https://thecode.media/backup-me/",
"https://thecode.media/stress-test/",
"https://thecode.media/syntaxerror-missing-formal-parameter/",
"https://thecode.media/start-fast/",
"https://thecode.media/halkechev/",
"https://thecode.media/halkechev2/",
"https://thecode.media/le-design/",
"https://thecode.media/syntaxerror-missing-after-property-id/",
"https://thecode.media/attrb-mthd/",
"https://thecode.media/headphones/",
"https://thecode.media/active-noise-cancelling/",
"https://thecode.media/remote-work-quiz/",
"https://thecode.media/garbage/",
"https://thecode.media/ubuntu-linux/",
"https://thecode.media/trie/",
"https://thecode.media/func/",
"https://thecode.media/laravel/",
"https://thecode.media/save-json/",
"https://thecode.media/syntaxerror-missing-after-formal-parameters/",
"https://thecode.media/recursion/",
"https://thecode.media/haskell/",
"https://thecode.media/gen/",
"https://thecode.media/db/",
"https://thecode.media/boosting/",
"https://thecode.media/pavel-sviridov/",
"https://thecode.media/mnogo/",
"https://thecode.media/sokr/",
"https://thecode.media/dbsm/",
"https://thecode.media/pik-balmera/",
"https://thecode.media/kanban/",
"https://thecode.media/check-list/",
"https://thecode.media/text-quiz/",
"https://thecode.media/mysql/",
"https://thecode.media/mysql/",
"https://thecode.media/rust/",
"https://thecode.media/manage-this/",
"https://thecode.media/altshuller/",
"https://thecode.media/interview/",
"https://thecode.media/fotorama/",
"https://thecode.media/tetris/",
"https://thecode.media/ai-tetris/",
"https://thecode.media/scrum/",
"https://thecode.media/speed-two/",
"https://thecode.media/quick-share/",
"https://thecode.media/stack/",
"https://thecode.media/mobile-first/",
"https://thecode.media/nosql/",
"https://thecode.media/narazves/",
"https://thecode.media/oop-class-2/",
"https://thecode.media/design-first/",
"https://thecode.media/arcanoid/",
"https://thecode.media/donut/",
"https://thecode.media/casino/",
"https://thecode.media/heap/",
"https://thecode.media/rust/",
"https://thecode.media/float/",
"https://thecode.media/markdown/",
"https://thecode.media/books/",
"https://thecode.media/daniil-popov/",
"https://thecode.media/android-developer/",
"https://thecode.media/symbols/",
"https://thecode.media/oauth/",
"https://thecode.media/kotlin/",
"https://thecode.media/todo/",
"https://thecode.media/plotly/",
"https://thecode.media/no-digit-code/",
"https://thecode.media/asymmetric/",
"https://thecode.media/qi/",
"https://thecode.media/vernam/",
"https://thecode.media/vernam-js/",
"https://thecode.media/shtykov/",
"https://thecode.media/memory/",
"https://thecode.media/ark/",
"https://thecode.media/7-oshibok-na-sobesedovanii/",
"https://thecode.media/dh/",
"https://thecode.media/typescript/",
"https://thecode.media/stark/",
"https://thecode.media/crypto/",
"https://thecode.media/zapusk-2/",
"https://thecode.media/fingerprint/",
"https://thecode.media/puzzle/",
"https://thecode.media/python-time-2/",
"https://thecode.media/no-chance/",
"https://thecode.media/lossy/",
"https://thecode.media/1wire/",
"https://thecode.media/pasha-flipper/",
"https://thecode.media/perl/",
"https://thecode.media/alexey-vasilev/",
"https://thecode.media/viasat/",
"https://thecode.media/podcast/",
"https://thecode.media/copy-ya-ru/",
"https://thecode.media/mircrosd/",
"https://thecode.media/bash/",
"https://thecode.media/rotation/",
"https://thecode.media/css-grid/",
"https://thecode.media/train/",
"https://thecode.media/grid-2/",
"https://thecode.media/za-proezd/",
"https://thecode.media/grid-3/",
"https://thecode.media/david/",
"https://thecode.media/alien-vs-predator/",
"https://thecode.media/grid-portfolio/",
"https://thecode.media/podcast-lavka/",
"https://thecode.media/it-start-2/",
"https://thecode.media/anastasiya-nikulina/",
"https://thecode.media/linter/",
"https://thecode.media/bomberman/",
"https://thecode.media/5-linters/",
"https://thecode.media/lineynaya-algebra-vektory/",
"https://thecode.media/no-nda/",
"https://thecode.media/code-swap/",
"https://thecode.media/how-to-start/",
"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",
"https://thecode.media/vectors-operations/",
"https://thecode.media/oop-abstract/",
"https://thecode.media/leonov/",
"https://thecode.media/lucky-strike/",
"https://thecode.media/browser/",
"https://thecode.media/2020/",
"https://thecode.media/3d-stars/",
"https://thecode.media/anna-leonova/",
"https://thecode.media/cold-fusion/",
"https://thecode.media/normalize/",
"https://thecode.media/hotkey/",
"https://thecode.media/oven/",
"https://thecode.media/vim/",
"https://thecode.media/draw/",
"https://thecode.media/visual-studio-code/",
"https://thecode.media/tetris-2/",
"https://thecode.media/start-now/",
"https://thecode.media/lapsha-1/",
"https://thecode.media/cubism/",
"https://thecode.media/lapsha-2/",
"https://thecode.media/zerocode/",
"https://thecode.media/cat/",
"https://thecode.media/static/",
"https://thecode.media/komm/",
"https://thecode.media/path-js/",
"https://thecode.media/cloudly/",
"https://thecode.media/haters-gonna-code-2/",
"https://thecode.media/csp/",
"https://thecode.media/mitin-says-no/",
"https://thecode.media/hire-js/",
"https://thecode.media/tableau/",
"https://thecode.media/impossible/",
"https://thecode.media/csp-on/",
"https://thecode.media/maze/",
"https://thecode.media/mix/",
"https://thecode.media/le-beton/",
"https://thecode.media/3d-print/",
"https://thecode.media/5-and-a-half/",
"https://thecode.media/lineynaya-zavisimost-vektorov/",
"https://thecode.media/fast-m1/",
"https://thecode.media/ninja-run/",
"https://thecode.media/matrix-101/",
"https://thecode.media/arm-x86/",
"https://thecode.media/piano-js/",
"https://thecode.media/10-swift/",
"https://thecode.media/travel-plane/",
"https://thecode.media/extention/",
"https://thecode.media/obratnaya-matritsa/",
"https://thecode.media/svg/",
"https://thecode.media/freelance/",
"https://thecode.media/brat-2/",
"https://thecode.media/angular/",
"https://thecode.media/rgb/",
"https://thecode.media/10-go/",
"https://thecode.media/coffee/",
]
# открываем текстовый файл, куда будем добавлять заголовки
file = open("zag.txt", "a")
# перебираем все адреса из списка
for x in url:
# получаем исходный код очередной страницы из списка
html_code = str(urlopen(x).read(),'utf-8')
# отправляем исходный код страницы на обработку в библиотеку
soup = BeautifulSoup(html_code, "html.parser")
# находим название страницы с помощью метода find()
s = soup.find('title').text
# выводим его на экран
print(s)
# сохраняем заголовок в файле и переносим курсор на новую строку
file.write(s + '. ')
# закрываем файл
file.close()
Что дальше
А дальше логичное продолжение — программа на цепях Маркова, которая будет генерировать заголовки для статей Кода на основе наших старых заголовков.