Coding/Python 2024. 3. 8. 23:51


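A program that merges English text with its translation and adds MD and HTML tags

When you study programming you end up hunting for reference material all the time. Most of it is in English, which is not especially hard to read, but it does take longer than Korean material. So I machine-translate English documents to get through them quickly. That works most of the time, but here and there the translation is hard to follow and I have to go back to the original.

So I built a document that shows the English and the translation side by side for easy comparison. Doing that by hand takes far too long, so I automated most of it with Python.

The html output of this process is the post below.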

https://summertrees.tistory.com/576


The process also produces a Markdown version. I go to the trouble of converting to Markdown first and then to html because Markdown is very easy to edit while reviewing the material.


Main program main()

It calls all the functions described below and chains them together.

There are two options: m writes the Markdown, and h builds the html.

import utility as u
from extractor import extractor
from binder import binder
from converter_md import converter_md
from convert_html import converter_html

def main(option):
    if option == 'm':
        extractor()
        input('Press enter after translation: ')
        binder()
        converter_md()
    elif option == 'h':
        converter_html()

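Workflow

1. Save the English material as a text file. (source)

2. Mark up the whole file into titles, body text, code blocks, blockquotes, lists, and tables.

3. extractor() produces a copy with the code blocks removed, ready for translation. (extract)

4. Run the code-free copy through a translator to get the translation. (translation)

5. binder() arranges the English and Korean in order. (combined file)

6. converter_md() lays out source and translation section by section and adds Markdown tags. The English is displayed in grey. (Markdown document)

7. Read through the material and fix clumsy translations and terminology.

8. converter_html() converts the final version to html. (html document)

All constants the project needs live in data.py, and the small utilities used here and there are loaded from utility.py. Since the code only transforms text, no special libraries are imported; it just takes a well-designed algorithm.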

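Creating the source file

The material below comes from the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto) in the python.org library documentation.

Headers and body text

Headers use the Markdown header format with a leading #. If no # is given, the program adds ## automatically.

In the body, put <!-- e --> wherever you want a paragraph break between the English and the Korean.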

blockquote

<!-- q -->
Abstract

This document is an introductory tutorial to using regular expressions in Python with the re module. It provides a gentler introduction than the corresponding section in the Library Reference.

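Special characters

Characters that need escaping, or parts that should be displayed as code, are wrapped in ` to mark them as code. This step is not critical. It can be done later while reading, as long as it is finished before the conversion to html.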

`. ^ $ * + ? { } [ ] \ | ( )`

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

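Tables

A table gets a t tag followed by options. The options have the form 'xxx y': xxx gives one alignment character per column, and y decides whether blank lines are removed. A table can contain empty cells, so you either decide case by case and tidy the table by hand, or, if the table needs no blanks, let the program remove them all automatically.

Alignment: l - left, r - right, c - center
Blank lines: if any character is present here, whatever it is, blank lines are removed automatically.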

<!-- t -->lll l
Step

Matched

Explanation

1

a

The a in the RE matches.

2

abcbd

The engine matches [bcd]*, going as far as it can, which is to the end of the string.

Code

Put the language right after the code section tag. (<!-- c -->py)

<!-- c -->py
>>>
import re
p = re.compile('ab*')
p
re.compile('ab*')

End of document

Put a <!-- x --> marker at the end of the document.
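
The tag convention used throughout the source file can be sketched as a small parser. This is only an illustration, not the code from the project: the function name parse_section_tag is made up here, and the offsets 5 and 10 are taken from the 'tags' entry of SECTIONS in data.py.

```python
def parse_section_tag(line):
    """Return (type_char, extra) for a tag line, or None for ordinary text.

    Hypothetical helper: the offsets follow the '<!-- t -->' layout,
    where the type character sits at index 5 and options start at index 10.
    """
    if not line.startswith('<!--') or len(line) < 10:
        return None
    type_char = line[5]        # h, e, c, q, l, t or x
    extra = line[10:].strip()  # e.g. 'lll l' for a table, 'py' for code
    return type_char, extra


print(parse_section_tag('<!-- t -->lll l'))
print(parse_section_tag('<!-- c -->py'))
print(parse_section_tag('Abstract'))
```

A table tag yields its alignment options as the second value, a code tag yields the language, and ordinary text yields None.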

Saving

This step does not take long, since most of it is plain copy and paste.

Section extractor `extractor()`

This program produces a copy with the code blocks removed, ready for translation.

Import the data and the utilities.

from data import Data as d
import utility as u

Each time a section tag is found, the type of the section collected so far is checked, and that section is saved only if it is not code.

def extractor():
    lines = u.file_to_list(d.FILES_SOURCE, sof=True, eof=True, strip='a')

    results = []

    sections = []
    section_type = ''

    for l in lines:
        # <!--
        if l.startswith(d.SECTIONS_TAGS[0]):
            # except code
            if not section_type == d.SECTIONS_TYPE_CHARS[2]:
                results = results + sections

            sections.clear()

            # add section header
            sections.append(l)

            # <!-- section_type -->
            section_type = l[d.SECTIONS_TAGS[2]]
        else:
            sections.append(l)

    # result.txt
    u.write_file(results, d.FILES_DESTINAION, True)
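
The same idea can be tried without data.py and utility.py. The sketch below is a simplified stand-in, not the function above: drop_code_sections is a hypothetical name, and the tag strings are written out literally instead of coming from the constants.

```python
def drop_code_sections(lines):
    """Keep every section except those opened by a '<!-- c -->' tag."""
    kept = []
    keep = True  # lines before the first tag are treated as body text
    for line in lines:
        if line.startswith('<!--'):
            # a new section starts: keep it unless it is a code section
            keep = not line.startswith('<!-- c')
        if keep:
            kept.append(line)
    return kept


sample = [
    '<!-- e -->', 'Compile a pattern first.',
    '<!-- c -->py', "p = re.compile('ab*')",
    '<!-- e -->', 'Then call its methods.',
]
for line in drop_code_sections(sample):
    print(line)
```

Unlike extractor(), this version decides line by line instead of buffering a whole section, which is enough when the only goal is to filter.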

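Creating the translation

When extractor() finishes, the program waits for user input.

Run the code-free copy through a translator to produce the translation. (translation) Another option is to save the source as a Word document and feed it to Google Translate, but the translation quality is rather poor.

Once the translation is ready, press Enter to move on to the next step.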

<!-- h -->
# 정규표현식 하우투

<!-- e -->
작가:
오전. 쿠츨링 <amk@amk.ca>

<!-- q -->
추상적인

이 문서는 re 모듈과 함께 Python에서 정규식을 사용하는 방법에 대한 입문 튜토리얼입니다. Library Reference의 해당 섹션보다 더 부드러운 소개를 제공합니다.

<!-- h -->
## 소개

<!-- e -->
정규식(RE, regexes 또는 regex 패턴이라고 함)은 본질적으로 Python에 내장된 작고 고도로 전문화된 프로그래밍 언어이며 re 모듈을 통해 사용할 수 있습니다. 이 작은 언어를 사용하여 일치시키려는 가능한 문자열 집합에 대한 규칙을 지정합니다. 이 세트에는 영어 문장, 이메일 주소, TeX 명령 또는 원하는 모든 것이 포함될 수 있습니다. 그런 다음 "이 문자열이 패턴과 일치합니까?" 또는 "이 문자열의 어느 위치에나 패턴과 일치합니까?"와 같은 질문을 할 수 있습니다. RE를 사용하여 문자열을 수정하거나 다양한 방법으로 분할할 수도 있습니다.

Combining English and Korean with `binder()`

The program has three parts.

`get_sources()` loads the data.

After loading the data, it checks whether the first line has a header tag, adds one if not, and returns the basic data needed.

def get_sources():
    '''
    -> sources, translations, sections, sections_trans, section_type
    '''
    # text.txt
    sources = u.file_to_list(d.FILES_SOURCE, sof=True, eof=True, strip='r')

    # trans.txt
    translations = u.file_to_list(d.FILES_TRANS, sof=True, eof=True, strip='r')

    # no header in 1st line
    if not sources[0].startswith(d.SECTIONS_TAGS[0]):
        # section type 'e'
        section_type = d.SECTIONS_TYPE_CHARS[1]

        # eng: header, grey
        sections = [d.SECTIONS_TAGS[1](d.SECTIONS_TYPE_CHARS[1]), d.SECTION_GREY_TAGS_MD[0]]

        # kor: header
        sections_trans = [d.SECTIONS_TAGS[1](d.SECTIONS_TYPE_CHARS[1])]

    else:
        section_type = sources[0][d.SECTIONS_TAGS[2]]

        # code: header
        if section_type == d.SECTIONS_TYPE_CHARS[2]:
            sections = [sources[0]]

        # eng: header, grey
        else:
            sections = [sources[0], d.SECTION_GREY_TAGS_MD[0]]

        # kor: header
        sections_trans = [translations[0]]

    return sources, translations, sections, sections_trans, section_type

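`add_trans()` inserts the translation.

At the point where the next section starts, it appends the translation and returns the reset translation section, the results, and the next starting index in the translation.

The indices of both the source and the translation have to be tracked, so a while loop is used.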

def add_trans(sections_trans, translations, results, ti):
    '''
    -> sections_trans, results, ti
    '''
    while ti < len(translations):
        # <!--
        if translations[ti].startswith(d.SECTIONS_TAGS[0]):
            # add kor
            results = results + sections_trans

            # init kor
            sections_trans = [translations[ti]]

            # break for eng
            ti += 1
            break
        else:
            # add item
            sections_trans = sections_trans + [translations[ti]]

        ti += 1

    return sections_trans, results, ti

The main function, `binder()`.

It calls get_sources() and add_trans(), wraps the English parts other than code in grey tags, and saves the result to a file.

Again the indices have to be tracked, so a while loop is used.

def binder():
    # load data
    sources, translations, sections, sections_trans, section_type = get_sources()

    results = []

    # start eng, kor
    i = 1; ti = 1

    # eng
    while i < len(sources):
        # <!--
        if sources[i].startswith(d.SECTIONS_TAGS[0]):
            # code
            if section_type == d.SECTIONS_TYPE_CHARS[2]:
                # add eng
                results = results + sections
            else:
                # add eng, grey
                results = results + sections + [d.SECTION_GREY_TAGS_MD[1]]

                # kor
                sections_trans, results, ti = add_trans(sections_trans, translations, results, ti)

            section_type = sources[i][d.SECTIONS_TAGS[2]]

            # code
            if section_type == d.SECTIONS_TYPE_CHARS[2]:
                # init for code [header]
                sections = [sources[i]]
            else:
                # init for eng [header, grey]
                sections = [sources[i], d.SECTION_GREY_TAGS_MD[0]]

        else:
            sections = sections + [sources[i]]

        i += 1

    u.write_file(results, d.FILES_BINDER)
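
The pairing that binder() performs can also be pictured as "split both files into sections, then walk them together". The sketch below assumes that view and leaves out the grey tags; split_sections and interleave are hypothetical names, not part of the project.

```python
def split_sections(lines):
    """Group lines into sections; each section starts at a '<!--' tag line."""
    sections = []
    for line in lines:
        if line.startswith('<!--') or not sections:
            sections.append([line])
        else:
            sections[-1].append(line)
    return sections


def interleave(source_lines, translated_lines):
    """Place each translated section right after its English counterpart."""
    translated = iter(split_sections(translated_lines))
    merged = []
    for section in split_sections(source_lines):
        merged.extend(section)
        # code sections were removed before translation, so they have no pair
        if not section[0].startswith('<!-- c'):
            merged.extend(next(translated, []))
    return merged


source = ['<!-- e -->', 'Hello', '<!-- c -->py', 'x = 1', '<!-- e -->', 'Bye']
trans = ['<!-- e -->', '안녕', '<!-- e -->', '잘 가']
for line in interleave(source, trans):
    print(line)
```

This relies on the translation having exactly one section for every non-code section of the source, which is the same assumption the index tracking in binder() makes.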

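Markdown conversion `converter_md()`

converter_md() converts the document to Markdown format.

get_source(): loading the source

It loads the file produced by binder(). It checks whether the first line has a tag; if there is none, one is added and the section type is set to body text.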

def get_source():
    '''
    -> sources, section_type, sections
    '''
    # load data without '\n'
    sources = u.file_to_list(d.FILES_BINDER, sof=True, eof=True, strip='r')

    # <!--
    if sources[0].startswith(d.SECTIONS_TAGS[0]):
        section_type = sources[0][d.SECTIONS_TAGS[2]]

    else:
        # type e
        section_type = d.SECTIONS_TYPE_CHARS[1]

    sections = [sources[0]]

    return sources, section_type, sections

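get_range(): setting the range

It takes the section data, determines whether it is English or Korean to decide where the font tags go, and returns the start and end indices of the body.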

def get_range(sections):
    '''
    -> results, last_result, start, end
    '''
    # eng
    if sections[1].startswith(d.SECTION_GREY_TAGS_MD[0]):
        start = 2
        end = len(sections) - 1
        results = ['', sections[1], '']
        last_result = ['', sections[-1], '']

    # kor
    else:
        start = 1
        end = len(sections)
        results = ['']
        last_result = ['']

    return results, last_result, start, end

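Header conversion modify_header()

From here on, the functions do the actual conversion to Markdown. Every conversion function removes the section tag of the source.

Headers are written with #, and the number of # characters corresponds to html's <h1>, <h2>, <h3>, and so on. All blank lines are removed. If no # is present, ## is added.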

def modify_header(sections):
    sections = u.strip_list(sections, start=1, end=-1)

    # grey
    results, last_result, start, end = get_range(sections)

    # eng
    for item in sections[start:end]:
        # skip blank lines
        if not item == '':
            # no leading #: add '## '
            if not item.startswith(d.SECTION_HEADER_TAG_MD):
                results = results + [f'{d.SECTION_HEADER_TAG_MD*2} {item}']
            # already written as a header: keep it unchanged
            else:
                results = results + [item]

    results = results + last_result

    return results

Body conversion modify_paragraph()

Only the tag is removed; the body is output as is.

def modify_paragraph(sections):
    sections = u.strip_list(sections, start=1, end=-1)

    results, last_result, start, end = get_range(sections)

    for item in sections[start:end]:
        results.append(item)

    results = results + last_result

    return results

Code block conversion modify_code()

Code blocks need little conversion either. The tag is removed, and ```python and ``` are placed above and below the block.

def modify_code(sections):
    sections = u.strip_list(sections, start=1, end=-1)

    # language: look up the code after the tag, fall back to the default
    try:
        key = sections[0][d.SECTIONS_TAGS[3]:]
        lang = d.SECTION_CODE_LANGUAGES[key]
    except Exception:
        lang = d.SECTION_CODE_LANGUAGE_DEFAULT

    results = u.add_tags(sections[1:], f'{d.SECTION_CODE_TAG_MD}{lang}', f'{d.SECTION_CODE_TAG_MD}')

    return results

List conversion modify_list()

Each list item gets '- ' in front of the text. All blank lines are removed.

def modify_list(sections):
    sections = u.strip_list(sections, start=1, end=-1)

    results, last_result, start, end = get_range(sections)

    for item in sections[start:end]:
        # remove \n
        if not item == '':
            # - ...
            results = results + [f'{d.SECTION_LIST_TAG_MD} {item}']

    results = results + last_result

    return results

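Table conversion modify_table()

Tables take the most work. A table is written with pipes (|), and alignment is marked with symbols such as :- (left).

The option value next to the tag determines the alignment and whether blank lines are removed. The number of columns is taken from the number of characters in the alignment option. The start and end indices are computed from the column count and the number of remaining lines. The header row is stored first, then the alignment row is built, and then the rest of the table is completed, with each row being closed and stored when its last column is reached.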

def modify_table(sections):
    sections = u.strip_list(sections, start=1)

    # options: '<aligns> <flag>', where the flag part is optional
    option = sections[0][d.SECTIONS_TAGS[3]:]
    als, _, remove_cr = option.partition(' ')

    # remove blank lines
    if len(remove_cr) > 0:
        sections = u.remove_cr(sections, remove_cr=True)

    # aligns
    aligns = []
    for al in als:
        for i, ch in enumerate(d.SECTION_TABLE_ALIGNS[2]):
            if ch == al:
                aligns = aligns + [d.SECTION_TABLE_ALIGNS[0][i]]
                # stop only after the matching alignment character is found
                break

    columns = len(als)

    results, last_result, start, end = get_range(sections)

    # drop leftover lines that do not fill a complete row
    end = len(sections[start:end]) // columns * columns + start

    # header
    string = d.SECTION_TABLE_TAG_MD
    for item in sections[start:start+columns]:
        string = string + item + d.SECTION_TABLE_TAG_MD
    results = results + [string]

    # aligns
    string = d.SECTION_TABLE_TAG_MD
    for al in aligns:
        string = string + al + d.SECTION_TABLE_TAG_MD
    results = results + [string]

    # |
    string = d.SECTION_TABLE_TAG_MD
    i = 1
    for item in sections[start+columns:end]:
        # ...|
        string = string + item + d.SECTION_TABLE_TAG_MD

        # last column
        if i%columns == 0:
            results = results + [string]

            # |
            string = d.SECTION_TABLE_TAG_MD

        i += 1

    results = results + last_result

    return results
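
The core of the table step, turning a flat list of cells into pipe rows, can be sketched on its own. This is a simplified illustration rather than modify_table() itself: cells_to_md_table and ALIGN_MARKS are hypothetical names, and the marks for right and center alignment are assumed to follow the usual Markdown table syntax.

```python
# Assumed mapping from the option characters to Markdown alignment marks.
ALIGN_MARKS = {'l': ':-', 'r': '-:', 'c': ':-:'}


def cells_to_md_table(cells, aligns):
    """Build Markdown table lines from one-cell-per-line input.

    cells  : flat list of cell texts, header cells first
    aligns : one of 'l', 'r', 'c' per column, e.g. 'lll'
    """
    columns = len(aligns)
    if columns == 0:
        return []
    usable = len(cells) // columns * columns  # drop an incomplete last row
    rows = [cells[i:i + columns] for i in range(0, usable, columns)]
    if not rows:
        return []
    lines = ['|' + '|'.join(rows[0]) + '|']
    lines.append('|' + '|'.join(ALIGN_MARKS[a] for a in aligns) + '|')
    lines += ['|' + '|'.join(row) + '|' for row in rows[1:]]
    return lines


cells = ['Step', 'Matched', 'Explanation',
         '1', 'a', 'The a in the RE matches.']
for line in cells_to_md_table(cells, 'lll'):
    print(line)
```

Slicing the cell list in steps of the column count replaces the running counter used in modify_table(), and leftover cells that do not fill a row are dropped in the same way.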

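Blockquote conversion modify_blockquote()

A blockquote puts '>' at the start of each line.

In the English part too, the first line is pulled out of the grey text so that it serves as the title of the block. A blank line without > ends the block, so the gap between the English and the Korean is removed. Double blank lines are collapsed into one. A new section list is built with all of this tidied up, and > is added at the end before returning.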

def modify_blockquote(sections):
    # remove cr
    sections = u.strip_list(sections, start=1, end=-1)

    new_sections = []
    # eng
    if sections[1].startswith(d.SECTION_GREY_TAGS_MD[0]):
        # [title, <font>, '']
        new_sections = [sections[2], sections[1], '']
        start = 3

    # kor
    else:
        start = 1
        new_sections = []

    new_sections = new_sections + sections[start:]

    # remove \n\n
    new_sections = u.remove_cr(new_sections, remove_double_cr=True)

    # > ...
    results = []
    for item in new_sections:
        # > ...
        results = results + [f'{d.SECTION_BLOCKQUOTE_TAG_MD} {item}']

    return results
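
The two rules that matter most here, "every line gets a >" and "double blank lines become one", fit in a few lines. A minimal sketch, assuming the input is a plain list of lines; to_blockquote is a hypothetical name and the title handling of modify_blockquote() is left out.

```python
def to_blockquote(lines):
    """Prefix every line with '>' and collapse runs of blank lines."""
    quoted = []
    previous_blank = False
    for line in lines:
        blank = line == ''
        if blank and previous_blank:
            continue  # second blank line in a row: skip it
        # a bare '>' keeps the quote block from being split at a blank line
        quoted.append('>' if blank else f'> {line}')
        previous_blank = blank
    return quoted


sample = ['Abstract', '', '', 'This document is an introductory tutorial.']
for line in to_blockquote(sample):
    print(line)
```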

Main Markdown conversion function converter_md()

It merges the data returned by each sub-function, removes double blank lines, and saves the result to the final file.

def converter_md():
    # load data
    sources, section_type, sections = get_source()

    # results
    results = []

    # append content section by section
    i = 0 # debug
    for item in sources[1:]:
        print(i) # debug
        # <!--
        if item.startswith(d.SECTIONS_TAGS[0]):
            # header
            if section_type == d.SECTIONS_TYPE_CHARS[0]:
                results = results + modify_header(sections)

            # code
            elif section_type == d.SECTIONS_TYPE_CHARS[2]:
                results = results +  modify_code(sections)

            # blockquote
            elif section_type == d.SECTIONS_TYPE_CHARS[3]:
                results = results +  modify_blockquote(sections)

            # table
            elif section_type == d.SECTIONS_TYPE_CHARS[5]:
                results = results +  modify_table(sections)

            # list
            elif section_type == d.SECTIONS_TYPE_CHARS[4]:
                results = results +  modify_list(sections)

            else:
                results = results +  modify_paragraph(sections)

            # init sections
            sections = [item]

            section_type = item[d.SECTIONS_TAGS[2]]
        else:
            sections = sections + [item]
        i += 1 # debug

    # remove \n\n
    results = u.remove_cr(results, remove_double_cr=True)

    u.write_file(results, d.FILES_DESTINAION)

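Study & revise

Before reading, add any images or links that are needed, checking against the original.

Then read the document in a web browser and fix things as you go, such as wrapping special characters in backticks (`) or adjusting line spacing. Markdown separates paragraphs with blank lines, so add a blank line wherever the text should break. Since Python has taken care of most of the work, you can concentrate on reading.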

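html conversion `converter_html()`

Once most of the corrections are done during study, what remains is to upload the document as html to a blog or similar for safekeeping. Blogs such as Tistory do support Markdown and convert it to html, but the result is not satisfying. Converting to html by hand is next to impossible.

Markdown is simple to produce, which makes it convenient for drafting documents quickly. html, on the other hand, takes a lot of work even when automated. I ended up writing 16 sub-functions.

get_section(): building a section

Given the source, it returns the section type, the section data, and the index. Some sections end at a terminator tag, and others are identified by a prefix.

With so many sections I had to distinguish the kinds of data. The source is sources, a section holds the section type and the section data, and the section data is named items.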

def get_section(sources: list[str], i: int):
    '''
    -> tuple[section_type: str, items: list[str]], i: int
    '''
    # default: plain paragraph (p), a single line
    section_type = d.SECTIONS_TYPES[9]
    group = False

    # find the section type whose prefix matches sources[i]
    for j in range(len(d.HEADERS)):
        if sources[i].startswith(d.HEADERS[j]):
            section_type = d.SECTIONS_TYPES[j]
            group = True
            break

    # add sources[i]
    items = [sources[i]]

    if group:
        i += 1

        # no terminator: collect consecutive lines with the same prefix
        if d.TERMINATOR[j] == '':
            while i < len(sources) and sources[i].startswith(d.HEADERS[j]):
                items = items + [sources[i]]

                i += 1

            # step back to the last line of this section
            i -= 1

        # terminator given: collect lines up to and including it
        else:
            while i < len(sources) and not sources[i] == d.TERMINATOR[j]:
                items = items + [sources[i]]

                i += 1

            # keep the terminator only if the data did not run out first
            if i < len(sources):
                items = items + [sources[i]]

    return [section_type, items], i

get_sections(): tracking the index and collecting sections

A while loop is used to track the index. The sections handed back by get_section() are collected and stored.

def get_sections(sources: list[str]):
    i = 0
    sections = []

    # a while loop is used to keep track of the index
    while i < len(sources):
        # remove blank line
        if not sources[i] == '':
            # get sections
            section, i = get_section(sources, i)

            sections = sections + [section]

        i += 1

    return sections

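modify_string(): tidying Markdown tags and characters that need care in html

A utility that cleans up characters that are not compatible between Markdown and html. It replaces '<' and '>' with '&lt;' and '&gt;', and removes the Markdown markers '>' and '#'.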

def modify_string(strings: list[str], angle=False, blockquote=False, header_mark=False):
    # The list is modified in place: assigning to strings[i] changes the caller's list even without a return.

    # < >
    if angle:
        for i, _ in enumerate(strings):
            strings[i] = strings[i].replace(d.SECTION_CODE_CHARACTERS[0][0], d.SECTION_CODE_CHARACTERS[1][0])
            strings[i] = strings[i].replace(d.SECTION_CODE_CHARACTERS[0][1], d.SECTION_CODE_CHARACTERS[1][1])

    # >
    if blockquote:
        for i, _ in enumerate(strings):
            strings[i] = strings[i].lstrip(d.SECTION_BLOCKQUOTE_TAG_MD).strip()

    # leading # marks only, so a '#' inside the header text is kept
    if header_mark:
        for i, _ in enumerate(strings):
            strings[i] = strings[i].lstrip(d.SECTION_HEADER_TAG_MD).strip()

    return strings
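
For the angle-bracket part, the standard library already has a helper. The sketch below uses html.escape as a swapped-in technique, not what modify_string() does: escape_angles is a hypothetical name, it returns a new list instead of changing the one passed in, and it also escapes '&', which the function above leaves alone.

```python
import html


def escape_angles(lines):
    """Return a new list with '<', '>' and '&' replaced by html entities."""
    # quote=False leaves " and ' untouched, which is enough for body text
    return [html.escape(line, quote=False) for line in lines]


original = ['if a < b:', 'use <code> & friends']
print(escape_angles(original))
print(original)  # the input list is unchanged
```

Returning a new list avoids the surprise noted in the comment of modify_string(), where the caller's list changes as a side effect.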

modify_special_char(): tidying special characters for html

It handles escape characters and backticks (`), wrapping the text between backticks in <code>...</code>.

def modify_special_char(strings):
    results = []
    for string in strings:
        # `` -> <code></code>
        i = 0
        j = 0
        result = ''
        code = False
        while i < len(string):
            j = string.find(d.SECTION_CODE_TAG_MD[0], i)

            if j == -1:
                result = result + string[i:].replace('\\\\', '\\')
                break
            else:
                # \` -> ` (j > 0 so that string[-1] is not read by mistake)
                if j > 0 and string[j-1] == '\\':
                    result = result + string[i:j-1].replace('\\\\', '\\') + string[j]

                else:
                    if code:
                        # </code>
                        result = result + string[i:j] + d.SECTION_CODE_TAGS_HTML[1][1]
                    else:
                        result = result + string[i:j].replace('\\\\', '\\')

                        # <code>
                        result = result + d.SECTION_CODE_TAGS_HTML[1][0]

                    code = not code
                i = j

            i += 1

        results.append(result)

    return results
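
The same transformation can be written with a regular expression, which is fitting given the document being translated. This is an alternative sketch, not the method above: backticks_to_code is a hypothetical name, and it assumes backticks come in pairs on one line. An unpaired backtick is simply left as it is.

```python
import re

# Either an escaped backtick/backslash, or a `code span` without backticks inside.
_TOKEN = re.compile(r'\\([\\`])|`([^`]*)`')


def backticks_to_code(text):
    """Turn `...` into <code>...</code> and unescape \\` and \\\\."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # drop the escaping backslash
        return f'<code>{match.group(2)}</code>'
    return _TOKEN.sub(repl, text)


print(backticks_to_code('call `re.compile` first'))
print(backticks_to_code('a literal \\` stays a backtick'))
```

Backslashes inside a code span are kept as written, so a pattern such as `\d+` survives the conversion.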

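convert_code(): converting code blocks

From here on, the functions handle one section type each.

There are 10 kinds of section: code, blockquote, grey, table, list, header, line, picture, link, and p. Of these, blockquote can contain code, grey, table, list, link, p, and so on, and grey can contain table, list, p, and so on. The remaining sections are standalone and contain no other sections.

Code is wrapped in tags of the form <pre class=python>...</pre>, and apart from escaping < and > the content is left untouched.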

def convert_code(items: list[str]):
    '''
    -> new_items
    '''
    # language: text after the opening ```, or the default if none is given
    lang = items[0][3:].strip()
    if not lang:
        lang = d.SECTION_CODE_LANGUAGES[d.SECTION_CODE_LANGUAGE_DEFAULT]

    # body
    new_items = modify_string(items[1:-1], angle=True)

    # header
    new_items[0] = f'{d.SECTION_CODE_TAGS_HTML[0][0](lang)}{d.SECTION_CODE_TAGS_HTML[1][0]}{new_items[0]}'

    # footer
    new_items[-1] = f'{new_items[-1]}{d.SECTION_CODE_TAGS_HTML[1][1]}{d.SECTION_CODE_TAGS_HTML[0][1]}'

    return new_items

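get_tables(): returning table data and alignment values

Tables use two functions. The first separates the table content from the alignment and returns both as a tuple, in preparation for the conversion. Markdown marks alignment in the :- style, so these are converted to values such as left.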

def get_tables(items):
    '''
    \nif not table: len(aligns) == 0
    \n-> tables, aligns
    '''
    items = modify_string(items, angle=True)

    # table header
    tables = [items[0]]

    # table alignment
    try:
        align_marks = u.string_to_list(items[1], d.SECTION_TABLE_TAG_MD)

        aligns = []
        if align_marks[0] in d.SECTION_TABLE_ALIGNS[0]:
            # alignments
            for mark in align_marks:
                for i, align in enumerate(d.SECTION_TABLE_ALIGNS[0]):
                    if align == mark:
                        aligns.append(d.SECTION_TABLE_ALIGNS[1][i])

            # body
            tables = tables + items[2:]
    except:
        # non table
        aligns = []

    return tables, aligns

convert_table(): converting a table to html tags

It converts the table data and alignment data it receives into html tags.

def convert_table(tables, aligns):
    '''
    -> items
    '''
    # table tag: <table><thead><tr>
    new_items = [f'{d.SECTION_TABLE_TAGS_HTML[0][0]}{d.SECTION_TABLE_TAGS_HTML[1][0]}{d.SECTION_TABLE_TAGS_HTML[2][0]}']

    # header
    items = u.string_to_list(tables[0], d.SECTION_TABLE_TAG_MD)
    for i, item in enumerate(items):
        # <th> header </th>
        new_items = new_items + [f'{d.SECTION_TABLE_TAGS_HTML[3][0](aligns[i])}{item}{d.SECTION_TABLE_TAGS_HTML[3][1]}']

    # header end - body start: </tr></thead><tbody>
    new_items = new_items + [f'{d.SECTION_TABLE_TAGS_HTML[2][1]}{d.SECTION_TABLE_TAGS_HTML[1][1]}{d.SECTION_TABLE_TAGS_HTML[4][0]}']

    # <tr><td>...</td></tr>
    # get_tables() has already dropped the alignment row, so body rows start at index 1
    for table in tables[1:]:
        items = u.string_to_list(table, d.SECTION_TABLE_TAG_MD)

        new_items = new_items + [d.SECTION_TABLE_TAGS_HTML[2][0]]

        # use each cell's own column index for its alignment
        for i, item in enumerate(items):
            new_items = new_items + [f'{d.SECTION_TABLE_TAGS_HTML[5][0](aligns[i])}{item}{d.SECTION_TABLE_TAGS_HTML[5][1]}']

        new_items = new_items + [f'{d.SECTION_TABLE_TAGS_HTML[2][1]}']

    # table tag: </tbody></table>
    new_items = new_items + [f'{d.SECTION_TABLE_TAGS_HTML[4][1]}{d.SECTION_TABLE_TAGS_HTML[0][1]}']

    new_items = modify_special_char(new_items)

    return new_items

convert_list(): converting lists

def convert_list(items):
    '''
    -> new_items
    '''
    items = modify_string(items, angle=True)

    # write list
    # header <ul>
    new_items = [d.SECTION_LIST_TAGS_HTML[0][0]]

    # body
    for item in items:
        item = item.removeprefix(d.SECTION_LIST_TAG_MD).strip()

        # <li>...</li>
        new_items = new_items + [f'{d.SECTION_LIST_TAGS_HTML[1][0]}{item}{d.SECTION_LIST_TAGS_HTML[1][1]}']

    # footer </ul>
    new_items = new_items + [d.SECTION_LIST_TAGS_HTML[0][1]]

    new_items = modify_special_char(new_items)

    return new_items

convert_header(): converting headers

It counts the # characters in the Markdown header and applies the matching tag: <h1>, <h2>, and so on.

def convert_header(items):
    '''
    -> new_items
    '''
    count_header = u.count_char(d.SECTION_HEADER_TAG_MD, ' ', items[0])

    items = modify_string(items, angle=True, header_mark=True)

    new_items = []
    for item in items:
        # <hx>...</hx>
        new_items = new_items + [f'{d.SECTION_HEADER_TAGS_HTML[0](count_header)}{item}{d.SECTION_HEADER_TAGS_HTML[1](count_header)}']

    new_items = modify_special_char(new_items)

    return new_items
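
Counting the leading # characters is all it takes to pick the tag. A standalone sketch, assuming one header per line; md_header_to_html is a hypothetical name, and html.escape stands in for modify_string().

```python
import html


def md_header_to_html(line):
    """Convert '## Title' to '<h2>Title</h2>' (levels clamped to 1..6)."""
    text = line.lstrip('#')
    level = len(line) - len(text)    # number of leading '#'
    level = max(1, min(level, 6))    # html only defines h1 to h6
    body = html.escape(text.strip(), quote=False)
    return f'<h{level}>{body}</h{level}>'


print(md_header_to_html('## Introduction'))
print(md_header_to_html('# C# <notes>'))
```

Only the leading # characters are stripped, so a # that belongs to the title text is preserved.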

convert_line(): converting horizontal lines

Pleasantly simple code. All it has to do is turn *** into an <hr> tag.

def convert_line():
    items = [d.SECTION_LINE_TAG_HTML]

    return items

convert_picture(): converting images

Also simple. It only has to rearrange the parts of a single-line string.

def convert_picture(items):
    new_items = []
    for item in items:
        i = item.find(d.SECTION_PICTURE_TAGS_MD[1]) # ]
        j = item.find(d.SECTION_PICTURE_TAGS_MD[2]) # )

        alt = item[2:i]
        url = item[i+2:j]

        # <img src="url" alt="alt">
        result = d.SECTION_PICTURE_TAG_HTML(url, alt)
        new_items.append(result)

    return new_items
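
The rearrangement can also be expressed as one regular-expression substitution. This is an alternative sketch under the assumption that the url contains no spaces or closing parentheses; md_image_to_html is a hypothetical name, and the attribute order follows the comment in the code above.

```python
import html
import re

# ![alt](url): alt may be empty, url has no spaces or ')'
_IMAGE = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')


def md_image_to_html(line):
    """Replace every ![alt](url) in the line with an <img> tag."""
    def repl(match):
        alt = html.escape(match.group(1), quote=True)
        url = html.escape(match.group(2), quote=True)
        return f'<img src="{url}" alt="{alt}">'
    return _IMAGE.sub(repl, line)


print(md_image_to_html('![logo](https://example.com/logo.png)'))
```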

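convert_link(): converting links

This was the hardest code for me. A Markdown link is [...](...), and the problem was how to tell whether each character really belongs to a link or is just a symbol in the sentence. Programming material sometimes contains strings like [[[[[]()]]]] to explain something. Designing the algorithm was hard. I settled on storing the position of each character in a list and comparing the positions, but I cannot tell what errors I have failed to anticipate.

The code turned out as long as the deliberation. I wrote several test scripts as well. This is exactly where Python shines: the freedom to test. Even an amateur like me can try out several algorithms quickly and without fuss.

In contrast to the effort, this is probably the least-used code. A document has one or two links at most, if any. The least-used section takes up the most time and the most lines.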

def convert_link(items: list[str]):
    new_items = []
    for i, item in enumerate(items):
        # start of the text not yet copied to the output
        k = 0

        j = 0
        positions_link = []

        # loop until the end of the string or until no more [ is found
        while j < len(item) and j > -1:
            position = []

            # search for [ and store its position
            j = item.find(d.SECTION_LINK_TAG_MD[0], j)
            position.append(j)

            # no more [
            if j == -1:
                break

            # search for [ ] ( ) after this [ and store their positions
            for tag in d.SECTION_LINK_TAG_MD:
                position.append(item.find(tag, j+1))

            # the characters must appear in the order [..](..) before the next [
            if position[0] < position[2] and position[3]-position[2] == 1 and position[3] < position[4] and (position[4] < position[1] or position[1] == -1):
                # store the link position
                positions_link = positions_link + [[k] + [position[0]] + position[2:]]

                # continue after the closing )
                k = position[4] + 1
                j = position[4] + 1

            # not a link and no further [ to try
            elif position[1] == -1:
                break

            # not a link: retry from the next [, which always moves forward
            else:
                j = position[1]

        # no link found: keep the line unchanged
        new_string = item
        if len(positions_link) > 0:
            new_string = ''
            for position in positions_link:
                text = item[position[1]+1:position[2]]
                link = item[position[3]+1:position[4]]

                new_string = new_string + item[position[0]:position[1]] + d.SECTION_LINK_TAG_HTML(link, text)

            # text after the last link
            if len(item) - 1 > positions_link[-1][-1]:
                new_string = new_string + item[positions_link[-1][-1]+1:]

        new_items = new_items + [new_string]

    new_items = modify_special_char(new_items)

    return new_items
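
For comparison, here is a regular-expression version of the same job. It is an alternative sketch, not a drop-in replacement for the position-based algorithm: md_links_to_html is a hypothetical name, the <a href> output format is an assumption about what SECTION_LINK_TAG_HTML produces, and urls containing parentheses or spaces are not handled.

```python
import html
import re

# [text](url): text without brackets, url without parentheses or spaces.
# (?<!!) skips image syntax, which starts with '!'.
_LINK = re.compile(r'(?<!!)\[([^\[\]]+)\]\(([^()\s]+)\)')


def md_links_to_html(text):
    """Replace Markdown links with <a> tags and leave stray brackets alone."""
    def repl(match):
        label = match.group(1)
        url = html.escape(match.group(2), quote=True)
        return f'<a href="{url}">{label}</a>'
    return _LINK.sub(repl, text)


print(md_links_to_html('See [the HOWTO](https://docs.python.org/3/howto/regex.html).'))
print(md_links_to_html('Brackets like [[[[[]()]]]] are left alone.'))
```

Requiring at least one character between the brackets and between the parentheses is what keeps strings such as [[[[[]()]]]] or a[0] from being mistaken for links.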

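convert_p: converting paragraphs

In contrast to links, the most heavily used code has a short name, a short body, and took little time to write. One catch: inside a blockquote, grey set by a section tag carries over to the next paragraph even after the section is closed, so paragraphs inside a blockquote get a separate black tag.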

def convert_p(items, black=False):
    items = modify_string(items, angle=True)

    new_items = []
    for item in items:
        if black:
            strings = [f'{d.SECTION_P_TAGS_HTML[1][0]}{item}{d.SECTION_P_TAGS_HTML[0][1]}']
        else:
            strings = [f'{d.SECTION_P_TAGS_HTML[0][0]}{item}{d.SECTION_P_TAGS_HTML[0][1]}']

        new_items = new_items + strings

    new_items = modify_special_char(new_items)

    return new_items

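convert_single_section(): dispatching standalone sections

It takes the type and data of a section, runs the sub-function for that type, and returns the result.

Normally this would take eight if..elif..elif branches, but here a list and lambda expressions keep the code short. Grouping the converters that take the same arguments shortened it a lot.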

def convert_single_section(items, black=False):
    '''
    \n convert code, table, list, p
    \n -> new_items
    '''
    # code, list, header, picture, link, p
    section_types = [0, 4, 5, 7, 8, 9]
    commands = [lambda items: convert_code(items), lambda items: convert_list(items), lambda items: convert_header(items), lambda items: convert_picture(items), lambda items: convert_link(items), lambda items: convert_p(items)]

    # table
    if items[0] == d.SECTIONS_TYPES[3]:
        tables, aligns = get_tables(items[1])
        new_items = convert_table(tables, aligns)
    # line
    elif items[0] == d.SECTIONS_TYPES[6]:
        new_items = convert_line()
    # p
    elif items[0] == d.SECTIONS_TYPES[9]:
        new_items = convert_p(items[1], black=black)
    else:
        for i, section_type in enumerate(section_types):
            if items[0] == d.SECTIONS_TYPES[section_type]:
                new_items = commands[i](items[1])
                break

    return new_items
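
Another common way to write this kind of dispatch is a dictionary that maps the section type straight to a function. The sketch below shows the idea with two stand-in converters; demo_p, demo_line, CONVERTERS, and convert_section are all hypothetical names, not the functions from this project.

```python
def demo_p(items):
    """Stand-in paragraph converter."""
    return [f'<p>{item}</p>' for item in items]


def demo_line(items):
    """Stand-in line converter; the content of the line is ignored."""
    return ['<hr>']


# section type -> converter; functions are stored directly, no lambda needed
CONVERTERS = {
    'p': demo_p,
    'line': demo_line,
}


def convert_section(section):
    """section is [section_type, items]; unknown types fall back to p."""
    section_type, items = section
    converter = CONVERTERS.get(section_type, demo_p)
    return converter(items)


print(convert_section(['line', ['***']]))
print(convert_section(['p', ['Hello']]))
```

Giving every converter the same signature is what makes the table work, which is the same observation as grouping the converters by how they take their arguments.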

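convert_grey(): compound section for English

The English is displayed in grey, so it is wrapped in a grey section tag. An English section can contain list, table, header, p, and so on, so the sub-sections inside it have to be identified and each one converted with its own converter.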

def convert_grey(items):
    sections = get_sections(items[1:-1])

    new_items = []
    for section in sections:
        new_items = new_items + convert_single_section(section)

    new_items = [d.SECTION_GREY_TAGS_HTML[0]] + new_items + [d.SECTION_GREY_TAGS_HTML[1]]

    return new_items

blockquote용 복합 섹션 convert_blockquote()

이것도 사용빈도가 높지 않을텐데 고민하게 만든 섹션이다. 영문 복합 섹션과 동일한 방식인데 더 광범위하다. code, picture, link 등이 더 들어갈 수 있다.

def convert_blockquote(items):
    '''
    -> new_items
    '''
    items = modify_string(items, blockquote=True)

    sections = get_sections(items)

    new_items = []
    for section in sections:
        # grey
        if section[0] == d.SECTIONS_TYPES[2]:
            new_items = new_items + convert_grey(section[1])
        else:
            new_items = new_items + convert_single_section(section, black=True)

    # tags
    new_items = [f'{d.SECTION_BLOCKQUOTE_TAGS_HTML[0]}'] + new_items + [f'{d.SECTION_BLOCKQUOTE_TAGS_HTML[1]}']

    return new_items

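Main html conversion function converter_html()

Finally, the main function. The closer to the top, the less there is to do. Again only grey and blockquote, which take the same arguments, are handled with a list and lambda expressions; everything else is handed to the sub-functions. The collected data is then saved to the result file.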

def converter_html():
    '''
    source > sections > section[section_type, items]

    '''
    # read source: list[str]
    sources = u.file_to_list(d.FILES_SOURCE, strip='r')

    #
    sections = get_sections(sources)

    section_types = [1, 2]
    commands = [lambda items: convert_blockquote(items), lambda items: convert_grey(items)]

    results = []
    for section in sections:
        # blockquote or grey
        if section[0] == d.SECTIONS_TYPES[1] or section[0] == d.SECTIONS_TYPES[2]:
            for i, section_type in enumerate(section_types):
                if section[0] == d.SECTIONS_TYPES[section_type]:
                    results = results + commands[i](section[1])
        else:
            results = results + convert_single_section(section)

    u.write_file(results, d.FILES_DESTINAION)

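Constant data data.py

The constants are defined as json-style dictionaries, exposed through a class, and imported by each program. I am not a professional, so I do not know whether this is the right approach. Simply defining them in a class might be easier. Both seem to have pros and cons.

The json style gives a clear overview when editing the data, and it is easy on the eyes when viewed in a split window while coding. The downside is that the data has to be defined twice. The dictionaries could be used directly, but that is very awkward to type. Splitting the data by section makes it very easy to edit and check, but tends to be confusing when writing code.

Defining a class is convenient when typing, but it always means doing the work twice.

I am curious how professionals manage constant data.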

SECTIONS = {
    'types':['code', 'blockquote', 'grey', 'table', 'list', 'header', 'line', 'picture', 'link', 'p'], # 9
    'type_chars':['h', 'e', 'c', 'q', 'l', 't', 'x'], # header, eng, code, blockquote, list, table, end # 6
    'tags':['<!--', lambda section_type: f'<!-- {section_type} -->', 5, 10], #[,, position of type, position of ext info] # 3
}
SECTION_BLOCKQUOTE = {
    'tag_md':'>',
    'tags_html':['<blockquote data-ke-style="style2">', '</blockquote>'],
}

class Data:
    FILES_SOURCE = FILES['source']
    FILES_DESTINAION = FILES['destinaion']
    FILES_TRANS = FILES['trans']
    FILES_BINDER = FILES['binder']
    FILES_LOG = FILES['log']
    FILES_TEST = FILES['test']
    FILES_ENCODING = FILES['encoding']
    SECTIONS_TYPES = SECTIONS['types']
    SECTIONS_TYPE_CHARS = SECTIONS['type_chars']
    SECTIONS_TAGS = SECTIONS['tags']
    SECTION_GREY_TAGS_MD = SECTION_GREY['tags_md']
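
One option that avoids defining the data twice is the standard library's Enum, which gives each constant a readable name and a value in a single place. This is only a sketch of that idea, not a recommendation drawn from the project: SectionType, section_tag, and tag_type are hypothetical names, and the characters are copied from 'type_chars' above.

```python
from enum import Enum


class SectionType(str, Enum):
    """Section type characters, named once and used everywhere."""
    HEADER = 'h'
    BODY = 'e'
    CODE = 'c'
    BLOCKQUOTE = 'q'
    LIST = 'l'
    TABLE = 't'
    END = 'x'


def section_tag(section_type):
    """Build the tag line for a section type, e.g. '<!-- c -->'."""
    return f'<!-- {section_type.value} -->'


def tag_type(line):
    """Read the section type back from a tag line."""
    return SectionType(line[5])


print(section_tag(SectionType.CODE))
print(tag_type('<!-- q -->'))
```

With names like SectionType.CODE the code no longer needs index lookups such as SECTIONS_TYPE_CHARS[2], and a typo in a name fails immediately with an AttributeError.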

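Utilities utility.py

While coding, anything that could be used outside this project goes into a separate utility module. These can later serve as snippets. Some were written as the need arose, and some are no longer needed. The utilities are simple, but keeping them around is very useful when coding and saves a lot of time. Odds and ends like these add up to an asset.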

import datetime as dt
from data import Data as d

def init_log():
    '''
    \n -> log
    '''
    log = open(d.FILES_LOG, 'a', encoding=d.FILES_ENCODING, newline='')
    log.write('\n' + '*'*5 + f'{dt.datetime.now()}')

    return log

def file_to_list(fp, sof=False, eof=False, strip=None):
    '''
    Read a text file into a list of lines.
    \n strip = 'a'll | 'r'ight | 'l'eft | None (lines keep their line endings)
    \n sof / eof = add the header tag at the top / the end tag at the bottom if missing
    \n -> list[str]
    '''
    with open(fp, 'r', encoding=d.FILES_ENCODING, newline='') as f:
        ls = []
        if strip is None:
            ls = f.readlines()
        else:
            for line in f:
                if strip == 'a':
                    ls.append(line.strip())
                elif strip == 'r':
                    ls.append(line.rstrip())
                elif strip == 'l':
                    ls.append(line.lstrip())

    # trim blank lines at the top and bottom
    ls = strip_list(ls, start=0, end=-1)

    header_tag = d.SECTIONS_TAGS[1](d.SECTIONS_TYPE_CHARS[0])  # '<!-- h -->'
    end_tag = d.SECTIONS_TAGS[1](d.SECTIONS_TYPE_CHARS[6])     # '<!-- x -->'

    if sof and not ls[0].startswith(header_tag):
        ls = [header_tag] + ls

    if eof and not ls[-1].startswith(end_tag):
        ls.append(end_tag)

    return ls

def count_type(items: list[str], header, type_position, type_code):
    '''
    \n -> count of tag lines of the given type
    '''
    count = 0
    for item in items:
        if item.startswith(header):
            if item[type_position] == type_code:
                count += 1

    return count

def write_file(items, fp, eof=False):
    '''
    \n -> write text file, one item per line
    '''
    with open(fp, 'wt', encoding=d.FILES_ENCODING, newline='') as f:
        for item in items:
            f.write(item + '\n')
        # append the end tag when requested and not already the last item
        end_tag = d.SECTIONS_TAGS[1](d.SECTIONS_TYPE_CHARS[6])
        if eof and (not items or items[-1] != end_tag):
            f.write(end_tag)

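def remove_cr(items, remove_cr=False, remove_double_cr=False):
    '''
    \n remove_cr = drop every blank line
    \n remove_double_cr = collapse runs of blank lines into one
    \n -> list[str]
    '''
    new_items = []
    if remove_cr:
        for item in items:
            if item != '':
                new_items.append(item)

    # elif: with both flags set, the second pass would append every item again
    elif remove_double_cr:
        for i, item in enumerate(items):
            # keep a blank line only if the previous line was not blank;
            # i == 0 is checked first so that items[-1] is never consulted
            if item != '' or i == 0 or items[i-1] != '':
                new_items.append(item)

    return new_items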

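def strip_list(items: list[str], start=0, end=-1):
    '''
    Remove blank items at index start (top) and index end (bottom), in place.
    \n start = -1 skips the top, end = 0 skips the bottom
    \n -> list[str]
    '''
    # remove top; the length check avoids an IndexError once the list runs short
    if not start == -1:
        while len(items) > start and items[start] == '':
            items.pop(start)

    # remove bottom; stop when the list is empty
    if not end == 0:
        while items and items[end] == '':
            items.pop(end)

    return items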

def count_char(char, breaker, string):
    count = 0
    for s in string:
        if s == breaker:
            break
        if s == char:
            count += 1

    return count

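def string_to_list(string: str, delimiter):
    '''
    Split string on delimiter; a leading delimiter is skipped.
    \n -> list
    '''
    i = 1 if string.startswith(delimiter) else 0

    l = []
    while i < len(string):
        j = string.find(delimiter, i)
        if j == -1:
            # no closing delimiter: keep the rest and stop (avoids an endless loop)
            l.append(string[i:])
            break
        l.append(string[i:j])
        i = j + 1

    return l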

def add_tags(sections, tag_start, tag_end, remove_blank=False):
    '''
    \n -> list[str] wrapped in tag_start / tag_end
    '''
    # assumed fallback: the original left results undefined for short input
    results = []
    if len(sections) > 1:
        sections = strip_list(sections, start=1, end=-1)

        results = [tag_start]
        for item in sections:
            if remove_blank:
                if not item == '':
                    results.append(item)
            else:
                results.append(item)

        results.append(tag_end)

    return results
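
A short usage sketch for two of the helpers above. To keep it runnable on its own, the block carries compact variants of `count_char()` and `string_to_list()`; handling a row without a trailing delimiter is my assumption about the intended behavior, and the sample strings are made up.

```python
def count_char(char, breaker, string):
    """Count `char` in `string`, stopping at the first `breaker`."""
    count = 0
    for s in string:
        if s == breaker:
            break
        if s == char:
            count += 1
    return count

def string_to_list(string, delimiter):
    """Split on `delimiter`, skipping one leading delimiter."""
    start = 1 if string.startswith(delimiter) else 0
    cells = []
    while start < len(string):
        end = string.find(delimiter, start)
        if end == -1:            # no trailing delimiter: take the rest
            cells.append(string[start:])
            break
        cells.append(string[start:end])
        start = end + 1
    return cells

print(string_to_list('| a | b |', '|'))   # [' a ', ' b ']
print(string_to_list('a|b|c', '|'))       # ['a', 'b', 'c']
print(count_char('#', ' ', '## Header'))  # 2
```

The first call shows the Markdown table row case, and the last one shows how the header level can be read by counting `#` up to the first space.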


Attribution, NonCommercial, NoDerivatives
