Getting the MD5 hash of large files in Python

itmemos 2023. 7. 1. 08:04

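I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine when I opened a file and passed its content to the hashlib.md5() function.

The problem is that the file size can exceed the RAM size.

How can I get the MD5 hash of a file without loading the whole file into memory?

You need to read the file in chunks of a suitable size: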

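import hashlib

def md5_for_file(f, block_size=2**20):
    # f must be a file object opened in binary mode
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)  # read at most block_size bytes (1 MiB by default)
        if not data:               # an empty result means end of file
            break
        md5.update(data)
    return md5.digest()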

Note: make sure you open the file with 'rb' in the call to open(); otherwise you will get the wrong result.
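As a minimal sketch of how the function above might be called, the following writes some sample data to a temporary file, hashes it chunk by chunk through a file object opened with 'rb', and compares the result with hashing the same bytes in one call. The names payload and sample_path and the 4096-byte block size are assumptions made for illustration only.

```python
import hashlib
import os
import tempfile


def md5_for_file(f, block_size=2**20):
    """Hash an already-open binary file object chunk by chunk."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()


# Hypothetical sample data, written to a temporary file for illustration.
payload = b"line one\r\nline two\r\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

try:
    # Binary mode: the bytes reach the hash exactly as stored on disk.
    with open(sample_path, "rb") as f:
        chunked_digest = md5_for_file(f, block_size=4096)
finally:
    os.remove(sample_path)

whole_digest = hashlib.md5(payload).digest()
print(chunked_digest == whole_digest)  # True
```

The sample data deliberately contains \r\n sequences, the kind of bytes that text mode may translate on some platforms, which is why binary mode matters here.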

So, to do the whole lot in one method, use something like this:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe. I tested it and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the jacksum tool.

jacksum -a md5 <filename>

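Break the file into 8192-byte chunks (or some other multiple of 128 bytes) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in fixed-size blocks (64 bytes according to hashlib's block_size attribute, and 8192 is 64×128). Since you are not reading the entire file into memory, this will not use much more than 8192 bytes of memory.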
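The block size that hashlib reports can be checked directly, and it is easy to confirm that the chunk size only affects memory use and speed, not the resulting hash. The sketch below is illustrative only: the helper md5_hex, the sample data and the chosen chunk sizes are assumptions, and an in-memory io.BytesIO stream stands in for a real file.

```python
import hashlib
import io

h = hashlib.md5()
print(h.block_size)   # 64: internal block size in bytes
print(h.digest_size)  # 16: length of the digest in bytes

# Hypothetical 25,600-byte payload standing in for a file's contents.
data = bytes(range(256)) * 100


def md5_hex(stream, chunk_size):
    """Hash a binary stream, reading at most chunk_size bytes at a time."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
    return md5.hexdigest()


# The same data hashed with very different chunk sizes.
results = {size: md5_hex(io.BytesIO(data), size) for size in (1, 100, 8192, 2**20)}
print(len(set(results.values())))  # 1: every chunk size gives the same hash
```

Even chunk sizes that are not a multiple of the block size produce the same digest; choosing a multiple is a matter of efficiency rather than correctness.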

In Python 3.8+ you can do:

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Python 3.7 and earlier

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): 
            h.update(chunk)
    return h.digest()

Python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        while chunk := f.read(chunk_num_blocks*h.block_size): 
            h.update(chunk)
    return h.digest()

Original post

If you want a more Pythonic (no while True) way of reading the file, check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

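Note that iter() needs an empty byte string as the sentinel for the returned iterator to halt at EOF, because read() on a file opened in binary mode returns b'' (not just '').

Here is my version of Piotr Czapla's method: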

import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        # read 128 hash blocks at a time until read() returns b''
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Using several of the comments and answers to this question, here is my solution:

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues
    Here I have blocks of 4096 octets (Default NTFS)
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()

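This is Pythonic.
This is a function.
It avoids implicit values: always prefer explicit ones.
It allows (very important) performance optimizations.

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you will be summing byte values.

To be portable between Python 2.7 and Python 3, you ought to use the io package, like this: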

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

If your files are big, you may prefer to read the file in chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

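The trick here is to use the iter() function with a sentinel (the empty byte string).

The iterator created in this case calls the callable (the lambda function) with no arguments on each call to its next() method; if the value returned is equal to the sentinel, StopIteration is raised, otherwise the value is returned.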
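The mechanics described above can be observed step by step. This is a small sketch rather than part of any answer: it uses an in-memory io.BytesIO stream as a stand-in for a file, and the 10-byte content and 4-byte read size are arbitrary choices for illustration.

```python
import io

# Hypothetical 10-byte in-memory "file".
stream = io.BytesIO(b"abcdefghij")

# Two-argument iter(): call the lambda until it returns the sentinel b"".
chunks = iter(lambda: stream.read(4), b"")

first = next(chunks)
second = next(chunks)
third = next(chunks)

try:
    next(chunks)  # read() now returns b"", which equals the sentinel
    stopped = False
except StopIteration:
    stopped = True

print(first, second, third, stopped)  # b'abcd' b'efgh' b'ij' True
```

A for loop performs these next() calls and handles StopIteration automatically, which is why the chunked md5sum functions need no explicit end-of-file check.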

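If your files are really big, you may also need to display progress information. You can do that by calling a callback function that prints or logs the number of bytes processed so far: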

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
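One possible way to drive the callback version is sketched below. The report function, the progress list and the temporary file are assumptions made for illustration; a real callback might instead write to a log or update a progress bar.

```python
import hashlib
import io
import os
import tempfile


def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    """Hash a file in chunks, reporting the running byte count to callback."""
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5


progress = []  # hypothetical sink that records each reported byte count


def report(done_bytes):
    progress.append(done_bytes)


# Hypothetical 10,000-byte sample file.
payload = b"x" * 10000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

try:
    result = md5sum(sample_path, report, length=4096)
finally:
    os.remove(sample_path)

print(progress[-1])  # 10000: the final report equals the file size
print(result.hexdigest() == hashlib.md5(payload).hexdigest())  # True
```

Note that md5sum returns the hash object itself, so the caller decides whether to use digest() or hexdigest().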

A remix of Bastien Semene's code that takes Hawkwing's comment about a generic hashing function into consideration...

import hashlib

def hash_for_file(path, algorithm='md5', block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues
    Here I have blocks of 4096 octets (Default NTFS)

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm name in hashlib.algorithms_available,
                   e.g. 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    # hashlib.algorithms exists only in Python 2.7; algorithms_available
    # is available in Python 2.7.9+ and 3.2+.
    if algorithm not in hashlib.algorithms_available:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms_available"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to hashlib documentation using new()
                                        # will be slower than calling using named
                                        # constructors, ex.: hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash

You can't get the MD5 without reading the full content, but you can use the update function to read the file's content block by block.

m.update(a); m.update(b) is equivalent to m.update(a+b).
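The equivalence stated above is what makes chunked hashing possible, and it can be checked in a few lines. The byte strings a and b below are arbitrary examples chosen for this sketch.

```python
import hashlib

# Arbitrary example data split into two parts.
a = b"first part, "
b = b"second part"

# Feed the parts one after the other.
incremental = hashlib.md5()
incremental.update(a)
incremental.update(b)

# Feed the concatenation in a single call.
one_shot = hashlib.md5()
one_shot.update(a + b)

print(incremental.hexdigest() == one_shot.hexdigest())  # True
```

Because the hash object keeps its internal state between calls, only one chunk needs to be held in memory at a time.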

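I think the following code is more Pythonic. Note that iterating over a file opened in binary mode yields newline-delimited lines, so the chunk sizes depend on where newline bytes fall in the file: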

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:
            m.update(chunk)
    return m.hexdigest()

I don't like loops. Based on Nathan Feger's answer:

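import functools
import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    # reduce() only drains the iterator; update() returns None each time
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()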

An implementation of Yuval Adam's answer for Django:

import hashlib
from django.db import models

class MyModel(models.Model):
    file = models.FileField()  # Any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

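I'm not sure there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz: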

FileHash = hashlib.md5(FileData).hexdigest()

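I could not detect any noticeable performance difference for file sizes ranging from 2 KB to 20 MB, and therefore saw no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do so at least as well as the average programmer could manage to keep it from doing so. As it happened, the problem had nothing to do with md5. If you are using MySQL, don't forget the md5() and sha1() functions that are already there.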

import hashlib

# Print the MD5 hash of each line of a text file (not of the file as a whole)
with open('/home/parrot/pass.txt', 'r') as opened:
    for line in opened:
        stripped = line.strip('\n')
        hash_object = hashlib.md5(stripped.encode())
        print(hash_object.hexdigest())

Source URL: https://stackoverflow.com/questions/1131220/get-the-md5-hash-of-big-files-in-python
