What is the default __hash__ in python?
What you can rely on: custom objects have a default hash() that is based in some way on the identity of the object, i.e. any object using the default hash will have a constant value for that hash over its lifetime, and different objects may or may not have different hash values.
You cannot rely on any particular relationship between the value returned by id() and the value returned by hash(). In the standard C implementation of Python 2.6 and earlier they were the same; in Python 2.7-3.2, hash(x)==id(x)/16.
Edit: originally I wrote that in releases 3.2.3 and later or 2.7.3 or later the hash value may be randomised and in Python 3.3 the relationship will always be randomised. In fact, that randomisation at present only applies to hashing strings, so the divide-by-16 relationship may continue to hold for now, but don't bank on it.
Hash collisions don't usually matter: in a dictionary lookup to find an object it must have the same hash and must also compare equal. Collisions only matter if you get a very high proportion of collisions such as in the denial of service attack that led to recent versions of Python being able to randomise the hash calculation.
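A small sketch of both points, for illustration only. The class name Token is hypothetical, and the collision between -1 and -2 is a CPython implementation detail rather than a language guarantee:

```python
class Token:
    """Hypothetical class that relies on the default, identity-based hash."""
    pass

t = Token()
hash_at_creation = hash(t)
t.label = "changed"              # mutating attributes does not affect the default hash
hash_after_change = hash(t)

# In CPython, -1 is reserved as an error code, so hash(-1) is reported as -2
# and collides with hash(-2). Other implementations may differ.
collides = hash(-1) == hash(-2)

# The lookup still works, because keys must also compare equal.
table = {-1: "minus one", -2: "minus two"}
print(hash_at_creation == hash_after_change, collides, table[-1], table[-2])
```

Even where the two keys share a hash, the dictionary keeps them apart through the equality check.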
What is the default hash of user defined classes?
The relevant function appears to be:
    Py_hash_t
    _Py_HashPointer(void *p)
    {
        Py_hash_t x;
        size_t y = (size_t)p;
        /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
           excessive hash collisions for dicts and sets */
        y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
        x = (Py_hash_t)y;
        if (x == -1)
            x = -2;
        return x;
    }
(that code comes from here, and is then used to be the tp_hash slot in type here.) The comment there seems to give a reason for not using the pointer (which is the same thing as the id) directly. Indeed, the commit that introduced that change to the function is here, and states that the reason for the change is:
Issue #5186: Reduce hash collisions for objects with no hash method by rotating the object pointer by 4 bits to the right.
which refers to this issue, which explains more why the change was made.
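To see what the rotation does, here is a rough Python model of the C function above. The name rotate_pointer_hash and the pointer_bits parameter are made up for this sketch, and a 64-bit pointer is assumed by default; it is not how CPython exposes the function:

```python
def rotate_pointer_hash(address, pointer_bits=64):
    """Model of rotating a pointer right by 4 bits (assumes 64-bit pointers)."""
    mask = (1 << pointer_bits) - 1
    # Rotate right by 4: the low 4 bits wrap around to the top.
    y = ((address >> 4) | (address << (pointer_bits - 4))) & mask
    # Reinterpret the unsigned value as a signed integer, like the C cast.
    if y >= 1 << (pointer_bits - 1):
        y -= 1 << pointer_bits
    # -1 is reserved to signal an error, so it is replaced by -2.
    if y == -1:
        y = -2
    return y

# Addresses aligned to 16 bytes differ only above the low 4 bits, so
# neighbouring objects get consecutive hashes instead of multiples of 16.
print(rotate_pointer_hash(0x10), rotate_pointer_hash(0x20), rotate_pointer_hash(0x30))
```

With 16-byte-aligned addresses the rotation is the same as dividing by 16, which matches the hash(x)==id(x)/16 observation.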
Using an object's id() as a hash value
The __hash__ method has to satisfy the following requirement in order to work:
For all x, y such that x == y, hash(x) == hash(y).
In your case your class does not implement __eq__, which means that x == y if and only if id(x) == id(y), and thus your hash implementation satisfies the above property.
Note however that if you do implement __eq__ then this implementation will likely fail.
Also: there is a difference between having a "valid" __hash__ and having a good hash. For example the following is a valid __hash__ definition for any class:
    def __hash__(self):
        return 1
A good hash should try to distribute the objects uniformly so as to avoid collisions as much as possible. Usually this requires a more complex definition. I'd avoid trying to come up with formulas and instead rely on Python's built-in hash function. For example, if your class has fields a, b and c then I'd use something like this as __hash__:
    def __hash__(self):
        return hash((self.a, self.b, self.c))
The definition of hash for tuples should be good enough for the average case.
Finally: you should not define __hash__ in classes that are mutable (in the fields used for equality). That's because modifying the instances will change their hash and this will break things.
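A minimal sketch of how that breakage can show up. The Point class is hypothetical, and the lookup behaviour described in the comments is what CPython's set does in practice:

```python
class Point:
    """Hypothetical mutable class whose hash depends on its fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same fields that __eq__ compares.
        return hash((self.x, self.y))

p = Point(1, 2)
seen = {p}
found_before = Point(1, 2) in seen   # an equal object is found

hash_before = hash(p)
p.x = 99                             # mutate a field used for equality
hash_after = hash(p)

# The set stored p under its old hash, so looking it up with the
# new hash no longer finds it, even though p is still in the set.
found_after = p in seen
print(found_before, found_after, len(seen))
```

The object is still stored in the set, but it can no longer be reached through a normal membership test.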
Python - Using the default __hash__ method in __hash__ method definition
To call the parent implementation use:
    super(Foo, self).__hash__()
Regarding the remark in the question, "It also occurred to me that I could rewrite it as return object.__hash__(self). This works, but seems even worse, as special methods are not intended to be called directly": you are overriding a magic method, so it's OK to call the parent's implementation directly.
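As an illustration, here is one way such a fallback might look. The key attribute and the equality rule are assumptions made for this sketch, not part of the original question; the Python 3 zero-argument super() is used instead of super(Foo, self):

```python
class Foo:
    """Hypothetical class: hashes by key when it has one, else by identity."""

    def __init__(self, key=None):
        self.key = key

    def __eq__(self, other):
        if not isinstance(other, Foo):
            return NotImplemented
        if self.key is None or other.key is None:
            # Without a key, fall back to identity comparison.
            return self is other
        return self.key == other.key

    def __hash__(self):
        if self.key is None:
            # Delegate to the default, identity-based implementation.
            return super().__hash__()
        return hash(self.key)

a, b, c = Foo("x"), Foo("x"), Foo()
print(hash(a) == hash(b), hash(c) == object.__hash__(c), len({a, b, c, Foo()}))
```

Note that in Python 3 a class that defines __eq__ without __hash__ becomes unhashable, so the explicit __hash__ is required here.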
hash function in Python 3.3 returns different results between sessions
Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.
You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.
Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.
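A quick way to check the effect of a fixed seed is to start fresh interpreters with PYTHONHASHSEED set. This is only a sketch; the helper name string_hash_in_fresh_interpreter is made up, and it assumes Python 3.7 or later for subprocess.run(..., capture_output=True):

```python
import os
import subprocess
import sys

def string_hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)

# With the same fixed seed, separate processes agree on the string hash.
first = string_hash_in_fresh_interpreter("spam", seed=42)
second = string_hash_in_fresh_interpreter("spam", seed=42)
print(first == second)
```

Passing seed="random" instead would restore the default behaviour, where the two values usually differ.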
If you were relying on the order of keys in a Python set, then don't. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.
Also see the object.__hash__() special method documentation:
Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.
Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.
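For example, a stable integer hash for strings could be derived from hashlib roughly like this. The function name stable_hash and the choice of SHA-256 truncated to 8 bytes are assumptions for the sketch, not something the answer prescribes:

```python
import hashlib

def stable_hash(text):
    """Return a 64-bit integer that is the same in every Python process."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits in 64 bits.
    return int.from_bytes(digest[:8], "big")

print(hex(stable_hash("abc")))  # 0xba7816bf8f01cfea
```

Unlike the built-in hash(), this value does not depend on PYTHONHASHSEED, at the cost of being slower to compute.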
Python - class __hash__ method and set
Your reading is incorrect. The __eq__ method is used for equality checks. The documents just state that the __hash__ value must also be the same for 2 objects a and b for which a == b (i.e. a.__eq__(b)) is true.
This is a common logic mistake: a == b being true implies that hash(a) == hash(b) is also true. However, an implication is not an equivalence: the converse, that hash(a) == hash(b) would mean that a == b, does not follow.
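A short sketch of the one-way implication, using a hypothetical class that fixes the hash but keeps the default identity-based equality:

```python
class SameHashOnly:
    """Hypothetical class: constant hash, default (identity) equality."""

    def __hash__(self):
        return 0

a, b = SameHashOnly(), SameHashOnly()
same_hash = hash(a) == hash(b)   # True: both hashes are 0
equal = a == b                   # False: different objects, no __eq__ defined
distinct_in_set = len({a, b})    # 2: equal hashes alone do not merge entries
print(same_hash, equal, distinct_in_set)
```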
To make all instances of MyClass compare equal to each other, you need to provide an __eq__ method for them; otherwise Python will compare their identities instead. This might do:
    class MyClass(object):
        def __hash__(self):
            return 0
        def __eq__(self, other):
            # another object is equal to self, iff
            # it is an instance of MyClass
            return isinstance(other, MyClass)
Now:
    >>> result = set()
    >>> result.add(MyClass())
    >>> result.add(MyClass())
    >>> len(result)
    1
In reality you'd base the __hash__ on those properties of your object that are used for __eq__ comparison, for example:
    class Person:
        def __init__(self, name, ssn):
            self.name = name
            self.ssn = ssn
        def __eq__(self, other):
            return isinstance(other, Person) and self.ssn == other.ssn
        def __hash__(self):
            # use the hashcode of self.ssn since that is used
            # for equality checks as well
            return hash(self.ssn)
    p = Person('Foo Bar', 123456789)
    q = Person('Fake Name', 123456789)
    print(len({p, q}))  # 1